
(1) University of Maryland, College Park; (2) University of Toronto; (3) Mila and Université de Montréal; (4) King Abdullah University of Science and Technology (KAUST)
Project page – https://github.com/schowdhury671/meerkat
Email: {sanjoyc,rhgao,dmanocha}@umd.edu, sayan.nag@mail.utoronto.ca, subhrajyoti.dasgupta@umontreal.ca, {jun.chen,mohamed.elhoseiny}@kaust.edu.sa

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

Sanjoy Chowdhury* (1), Sayan Nag (2), Subhrajyoti Dasgupta (3), Jun Chen (4), Mohamed Elhoseiny (4), Ruohan Gao (1), Dinesh Manocha (1)
Abstract

Leveraging Large Language Models’ remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench that unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.

Keywords:
Audio-Visual LLM AV Localization AVFIT Dataset
Figure 1: We present Meerkat, an audio-visual LLM that can effectively ground both spatially and temporally in image and audio. Our model is adept in tasks that require fine-grained understanding such as Audio Referred Image Grounding, Image Guided (IG) Audio Temporal Localization & Audio-Visual (AV) Fact-checking. It can also be extended to perform coarse-grained tasks like AVQA & AV Captioning.
Equal contribution. Equal advising.

1 Introduction

Large Language Models (LLMs) [6, 97, 19, 20, 80] have demonstrated remarkable performance in various natural language processing tasks, achieving human-level accuracies in comprehension and reasoning abilities. Furthermore, powered by the emergent instruction fine-tuning paradigm [69, 23, 73], these language models can be equipped to follow open-ended natural language instructions, or even combined with other modalities, especially vision [2, 118, 7, 112, 51, 41, 113, 89, 61, 59, 60, 33]. Audio, though often complementary to the associated visual scene, remains largely under-explored in the context of LLMs. Building Multi-modal LLMs (MLLMs) that can listen may enable new applications in multimedia content analysis, multi-modal virtual assistants, education and training, etc.

Only a few prior works (refer to Tab. 1) have incorporated audio in MLLMs [33, 70, 87]. However, they mostly focus on coarse-grained tasks such as captioning and question-answering, which are comparatively straightforward to subsume into an LLM interface [89, 60, 87, 112]. Although there have been some recent advancements in leveraging MLLMs for grounding [102, 116, 101, 109, 12, 13, 76], they either focus only on the visual modality [109, 12, 13, 76, 40], or struggle to capture fine-grained details occurring within audio-visual events due to insufficient joint modeling of the two modalities [60, 89, 112].

Our goal is to harness the power of LLMs for fine-grained audio-visual understanding. This is challenging mainly because: (i) there is a disparity of input and output formats across different tasks (e.g., image grounding from an audio query, image-guided audio temporal localization), (ii) no large-scale datasets exist for training audio-visual LLMs with grounding capabilities. Existing audio-visual LLMs [60, 89, 87] are restricted to coarse-grained tasks and do not incorporate cross-modality fusion, which is a crucial component for achieving fine-grained understanding and reasoning capabilities, as shown in [25, 46]. Although there exist individual models capable of handling image grounding (BuboGPT [116]) and temporal localization (TimeChat [83]) separately, they are either not suitable for open-domain audio (TimeChat) or are not trained in an end-to-end fashion (BuboGPT) (refer to Tab. 1).

In light of these challenges, we present Meerkat (meerkats are known for their strong spotting and listening abilities; see Fig. 1), the first unified audio-visual LLM framework that can effectively ground both spatially and temporally in image and audio, respectively. It has two crucial modules that are key to its strong capability in fine-grained understanding: a modality alignment module that learns the cross-modal alignment between image and audio patches in a weakly-supervised manner based on optimal transport, and a cross-modal attention module that is capable of enforcing consistency in the cross-attention heatmaps. Together, these two modules enable learning better joint audio-visual representations that subsequently enhance downstream tasks.

Model | Audio Types (Speech, Open-domain) | Output Image Grounding | Output Audio Grounding | End-to-end | Data Features (Convention, GPT-Prompted, Robustness)
VideoLlama [112]
Macaw-LLM [60]
PandaGPT [89]
AV LLM [87]
X-InstructBLIP [70]
TimeChat [83]
BuboGPT [116]
Meerkat (ours)
Table 1: Comparison of Meerkat with recent Audio-Visual LLMs. ‘Convention’ refers to a collection of publicly available data that has been transformed using templates, ‘GPT-Prompted’ signifies if the generated instructions are obtained/refined employing GPT, and ‘Robustness’ is the model’s ability to tackle negative samples. We compare our method against these approaches in Sec. 5.

To support Meerkat, we further introduce MeerkatBench, which unifies five different audio-visual tasks (shown in Tab. 2): audio referred image grounding, image-guided audio temporal localization, audio-visual fact-checking, audio-visual question answering, and audio-visual captioning (see Fig. 1 for examples). To enable training on these five tasks, we also curate a large dataset, AVFIT, which contains 3M instruction tuning samples of varying degrees of difficulty for learning fine-grained audio-visual semantics. Extensive experiments on these tasks demonstrate the effectiveness of our proposed model.

In summary, we make the following main contributions:

  • We present Meerkat, the first audio-visual LLM equipped with fine-grained spatio-temporal understanding that can ground in image and audio.

  • We introduce MeerkatBench that unifies five audio-visual learning tasks, and a new large instruction-tuning dataset AVFIT to enable learning fine-grained audio-visual semantics.

  • Evaluating on these five benchmark tasks, we set new state-of-the-art results on all of them, with a relative improvement of up to 37.12%.

2 Related Works

Multi-modal Large Language Models. Inspired by the success of the instruction-following capabilities of large language models [69, 19, 92], the community has recently started to leverage LLMs for understanding multi-modal content. Powered by high-quality multi-modal instructional data, recent methods [118, 51, 41, 89, 7, 74, 13, 2] extend LLMs to multi-modal learning. Some approaches, such as MiniGPT4 [118], X-LLM [7], and Video-ChatGPT [61], perform latent alignment between the pre-trained LLM and other modalities via a learned visual encoder, while other methods, such as Otter [41] and LLaMA-Adapter [113], insert learned cross-attention layers into the LLM to infuse multi-modal information. Prior works in the realm of LLMs predominantly either focus on visual-only inputs [41, 51, 118, 108] or tackle coarse-grained tasks [45, 61], leaving room for fine-grained audio-visual understanding. Unlike prior approaches, in this work we focus on equipping LLMs with strong audio-visual comprehension abilities.

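Task Granularity | Task Name | Dataset | # Samples (Train / Test) | Metrics
Fine | Audio Referred Image Grounding | Openimages-AudioSet | 1.07M / – | –
Fine | Audio Referred Image Grounding | Openimages-VGGSound | 180K / – | –
Fine | Audio Referred Image Grounding | AVSBench | 2.30K / 0.49K | cIOU, AUC
Fine | Audio Referred Image Grounding | VGGSS | – / 4.38K | cIOU, AUC
Fine | Audio Referred Image Grounding | PASCAL Sound | – / 0.56K | cIOU, AUC
Fine | Audio Referred Image Grounding | Flickr-Soundnet | – / 2.78K | cIOU, AUC
Fine | Image Guided Audio Temporal Localization | Openimages-AudioSet Strong | 96.5K / 24.1K | F1-score
Fine | Image Guided Audio Temporal Localization | LLP | – / 2.32K | F1-score
Fine | Audio-Visual Fact-checking | Openimages-AudioSet | 1.18M / 321K | F1-score
Coarse | AV Question Answering | AVQA | 40.4K / 16.9K | Accuracy
Coarse | AV Question Answering | Music AVQA | 25.7K / 7.36K | Accuracy
Coarse | AV Captioning | VALOR | 25.0K / 3.50K | B@4, M, R, C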
Table 2: Task-wise dataset distribution, dataset details, and metrics. We collect AVFIT, which is a collection of 12 datasets. We denote dataset-wise train/test usage. The visual grounding datasets contain spatial bounding box annotations while the audio temporal localization datasets contain time-interval annotations. We consider audio-visual fact-checking a fine-grained task as it requires an understanding of spatio-temporal grounding information (refer to Sec. 3 for more details). Here B@4: BLEU@4, M: METEOR, R: ROUGE, C: CIDEr. For all our experiments we consider F1@0.5. We obtain the bounding box from the segmentation maps.
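The caption notes that bounding boxes are obtained from segmentation maps. One plausible way to do this is to take the tightest axis-aligned box around the foreground pixels; the sketch below illustrates that idea and is not necessarily the exact procedure used for AVFIT. The function name `mask_to_bbox`, the inclusive pixel-index convention, and the toy mask are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def mask_to_bbox(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Tightest box (x_left, y_top, x_right, y_bottom) around nonzero pixels.

    Coordinates are inclusive pixel indices (an assumption of this sketch).
    Returns None when the mask has no foreground pixel.
    """
    # Rows that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


# Hypothetical 4x4 binary segmentation map.
toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
toy_box = mask_to_bbox(toy_mask)
```

The returned ordering follows the [x_Left, y_Top, x_Right, y_Bottom] convention used for bounding boxes in Sec. 3.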

Fine-grained Multi-modal Understanding. Of late, general-purpose multi-modal large language models have demonstrated their effectiveness in unifying a versatile array of vision-language or video-understanding tasks. These models, powered by LLMs [97, 98, 103, 104, 115, 93, 20] have superior reasoning and understanding capabilities. As a natural extension, MLLMs have been leveraged to unify region-based grounding tasks [74, 13, 12, 109, 101, 116, 40, 114, 102]. Despite significant strides, these models are still limited to fine-grained comprehension within a single modality. In this work, we propose Meerkat to precisely address this research gap under in-the-wild audio-visual event settings. To this end, we present a novel audio-visual task unification framework which promotes strong multi-modal reasoning and understanding capabilities.

LLM-guided Task Unification. The use of LLMs as an interface for task unification frameworks has advanced rapidly in recent times. Fuelled by the success of language models [107, 100, 57], the community has started to explore ways to unify generative and reasoning tasks under language models, leveraging their ease of accessibility. Various approaches [109, 12, 70, 45] present alternative ways to integrate new tasks within the scope of LLMs. Inspired by the success of these approaches, we present, to the best of our knowledge, the first approach to unifying fine-grained audio-visual tasks.

Audio-Visual Learning. Benefiting from the natural synchrony between the visual and the auditory modalities, audio-visual learning has opened up abundant applications including audio-visual sound source localization [64, 66, 90, 37], audio-visual sound separation [11, 91, 62], audio-visual segmentation [63, 117, 53], audio-visual question answering [106, 110, 42], and audio-visual captioning [16, 15, 94]. Different from these lines of work that focus on a single task, we aim to harness the power of LLMs to propose a multi-task learning setting by unifying five different audio-visual tasks, with the LLM serving as a common interface.

Figure 2: Overview of Meerkat. Our model is equipped with fine-grained audio-visual comprehension abilities. When fed with image $I$ and audio $A$ pairs, the Audio-Visual Optimal Transport alignment (AVOpT) module (B) learns the patch-wise image-audio association to facilitate weak alignment between the two modalities by minimizing the patch-level Wasserstein distance. Subsequently, the Audio-Visual Attention Consistency Enforcement (AVACE) module (A) maximizes the region-level alignment by confining the cross-modal attention maps around the objects of interest and minimizing the association with the background. After tokenizing the text instruction $T$, the modality-specific latents ($\tilde{z}_I, \tilde{z}_A, z_T$) are passed to the instruction-tuned Llama 2 model, which serves as a unified interface for the downstream tasks. We employ LoRA-based fine-tuning of the LLM.

3 Methodology

In this section, we introduce Meerkat. Fig. 2 provides an overview of our approach. We first discuss multi-modal feature extraction, then introduce our novel audio-visual feature alignment modules, then present the overall training objective, and finally elaborate on the numerical representations of the visual bounding boxes and time intervals.

Image Encoder. Given a batch of $k$ input images $\mathbf{I}=\{I_i\}_{i=1}^{k}$ with $I_i\in\mathbb{R}^{H\times W\times C}$, where $H$, $W$, and $C$ represent the height, width, and number of channels respectively, we employ a pretrained CLIP-ViT-B/16 [78] encoder $\mathcal{E}^{I}(\cdot)$ to extract the image embeddings. The $i^{\text{th}}$ image embedding is represented as $z_I\in\mathbb{R}^{\mathcal{S}_I\times\mathcal{D}_I}$, where $\mathcal{S}_I$ and $\mathcal{D}_I$ denote the number of image tokens and the hidden dimension respectively.

Audio Encoder. The audio encoder transforms the raw audio input into an audio embedding. We use the audio transformer backbone from CLAP [26] as our audio encoder due to its success in diverse audio tasks owing to its superior multi-modal alignment. We leverage this powerful pre-trained encoder ($\mathcal{E}^{A}(\cdot)$) to extract meaningful audio representations. We consider a batch of $k$ processed audio inputs $\mathbf{A}=\{A_i\}_{i=1}^{k}$ with $A_i\in\mathbb{R}^{F\times T}$, where $F$ is the number of spectral components (e.g. Mel bins) and $T$ is the number of time bins. The $i^{\text{th}}$ audio embedding is denoted as $z_A\in\mathbb{R}^{\mathcal{S}_A\times\mathcal{D}_A}$, where $\mathcal{S}_A$ and $\mathcal{D}_A$ are the number of audio tokens and the hidden dimension respectively.

LLM. Meerkat adopts the open-sourced Llama 2-Chat (7B) [97] as the large language model backbone. The pre-trained LLM's tokenizer projects the text sequence $T$ into embeddings $z_T\in\mathbb{R}^{\mathcal{S}_T\times\mathcal{D}_T}$, where $\mathcal{S}_T$ and $\mathcal{D}_T$ refer to the token length and hidden dimension respectively. Before the image and audio embeddings are passed into the LLM, they are transformed by additional linear layers so that the embedding dimensions remain consistent across modalities. Since the LLM serves as the unified interface for audio-visual inputs, we rely on the language tokens to carry out the individual tasks.
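The dimension-matching step described above can be sketched as follows. This is a minimal illustration in plain Python, assuming one linear layer per modality that maps the image and audio tokens to the text hidden dimension, and assuming the projected tokens are simply concatenated along the sequence axis in the order image, audio, text. The concatenation order, the helper names (`linear`, `build_llm_input`), and all toy sizes are assumptions for illustration and are not taken from the paper.

```python
import random
from typing import List

Matrix = List[List[float]]


def linear(x: Matrix, w: Matrix, b: List[float]) -> Matrix:
    """Apply y = x @ w + b to every token; x is (S x D_in), w is (D_in x D_out)."""
    return [
        [sum(xi * w[i][j] for i, xi in enumerate(row)) + b[j] for j in range(len(b))]
        for row in x
    ]


def build_llm_input(z_img: Matrix, z_aud: Matrix, z_txt: Matrix,
                    w_img: Matrix, b_img: List[float],
                    w_aud: Matrix, b_aud: List[float]) -> Matrix:
    """Project image and audio tokens to the text hidden size, then concatenate."""
    z_img_proj = linear(z_img, w_img, b_img)   # (S_I x D_T)
    z_aud_proj = linear(z_aud, w_aud, b_aud)   # (S_A x D_T)
    # Assumed ordering: image tokens, audio tokens, text tokens.
    return z_img_proj + z_aud_proj + z_txt


# Toy sizes (hypothetical): token counts S_* and hidden dimensions D_*.
S_I, D_I, S_A, D_A, S_T, D_T = 4, 6, 3, 5, 2, 8
rng = random.Random(0)


def rand(rows: int, cols: int) -> Matrix:
    """Random matrix standing in for encoder outputs or layer weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


seq = build_llm_input(
    rand(S_I, D_I), rand(S_A, D_A), rand(S_T, D_T),
    rand(D_I, D_T), [0.0] * D_T,
    rand(D_A, D_T), [0.0] * D_T,
)
```

After projection every token has dimension D_T, so the LLM receives a single sequence of S_I + S_A + S_T tokens.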

Inspired by the success of recent pre-training frameworks in grounding tasks [25, 12, 46], we equip our model with two different levels of supervision: weak supervision through the modality alignment module (AVOpT) and strong supervision through the audio-visual consistency enforcement module (AVACE). We follow a single-stage training strategy and empirically show that our method achieves performance similar to two-stage training (more details in the appendix).

Audio-Visual Optimal Transport Alignment Module (AVOpT). Weak supervision as a precursor to fine-grained supervision has been proven to be an effective training strategy in various tasks [25, 44]. Earth Mover's Distance based algorithms [111] involving Optimal Transport (OT) methods [14] have recently been leveraged for patch-level alignment between the query and the support images in a siamese network [111]. Furthermore, in the context of vision-language models, OT-based algorithms have been employed for patch-word alignment [18]. As the image (CLIP) and audio (CLAP) encoders are trained separately, their learned embeddings lie in different semantic spaces. Our intuition is that such a patch-level alignment can improve vision and audio semantic consistency [31]. We experimentally demonstrate that this patch-level weak guidance is superior to contrastive loss-based [34, 68] global supervision (more details in the appendix).

From a given image $I$ and audio $A$ pair, we obtain patch-level (local) feature embeddings $z_I$ and $z_A$, where $z_I=\mathcal{E}^{I}(I)$ and $z_A=\mathcal{E}^{A}(A)$. To model cross-modal relations by utilizing the inherent rich semantic structures in these feature representations, we generate two discrete distributions, represented by $\theta_I\in\mathbf{P}(\mathbb{Z}_I)$ and $\theta_A\in\mathbf{P}(\mathbb{Z}_A)$, for image and audio respectively:

$$\theta_I=\sum_{k=1}^{M}u_I(k)\,\delta_{z_I}(k);\qquad \theta_A=\sum_{l=1}^{N}u_A(l)\,\delta_{z_A}(l) \tag{1}$$

where $\sum_{k=1}^{M}u_I(k)=\sum_{l=1}^{N}u_A(l)=1$, with $u_I$ and $u_A$ being the respective weight vectors for the probability distributions $\theta_I$ and $\theta_A$. $\delta_z$ is the Dirac delta function placed at support point $z$ in the embedding space [8]. The goal is to find the optimal transport plan that matches these two distributions. Therefore, we compute the Wasserstein Distance (WD) between the probability distributions $\theta_I$ and $\theta_A$ while preserving the topological information during the cross-domain alignment process, given as follows:

$$\mathcal{L}_{\text{OT}}=\mathcal{D}_{\mathrm{Wasserstein}}(\theta_I,\theta_A)=\min_{\mathbf{\Omega}\in\Psi(u_I,u_A)}\sum_{k}\sum_{l}\mathbf{\Omega}_{kl}\cdot\phi(z_I(k),z_A(l)) \tag{2}$$

Here, $\Psi(u_I,u_A)=\{\mathbf{\Omega}\in\mathbb{R}_{+}^{M\times N}\mid\mathbf{\Omega}\mathbf{1}_N=u_I,\ \mathbf{\Omega}^{\top}\mathbf{1}_M=u_A\}$, $\phi(z_I(k),z_A(l))$ is the function computing the cosine distance between the cross-modal embedding pair, and $\mathbf{\Omega}$ is the transport plan, representing the amount of mass shifted from the distribution $\theta_I$ to the distribution $\theta_A$. An exact solution to the above expression leads to a sparse transport plan $\mathbf{\Omega}$ that has at most $(2\cdot\max(M,N)-1)$ non-zero elements, yielding an explainable and robust cross-modal alignment. We defer additional details to the appendix.
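To make Eq. 2 concrete, the sketch below computes a transport plan and the resulting alignment cost for toy embeddings. It uses entropy-regularized Sinkhorn iterations, which only approximate the exact minimizer and therefore give a dense rather than sparse plan; this solver is swapped in for illustration and is not claimed to be the one used in the paper. Uniform weights $u_I$ and $u_A$, the regularization strength `eps`, the iteration count, and all function names are assumptions.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """phi in Eq. 2: 1 - cosine similarity (assumes neither vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sinkhorn_plan(cost: Matrix, u_i: List[float], u_a: List[float],
                  eps: float = 0.1, n_iters: int = 200) -> Matrix:
    """Approximate transport plan with row marginals u_i and column marginals u_a."""
    m, n = len(u_i), len(u_a)
    kernel = [[math.exp(-cost[k][l] / eps) for l in range(n)] for k in range(m)]
    row_scale, col_scale = [1.0] * m, [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        row_scale = [u_i[k] / sum(kernel[k][l] * col_scale[l] for l in range(n))
                     for k in range(m)]
        col_scale = [u_a[l] / sum(kernel[k][l] * row_scale[k] for k in range(m))
                     for l in range(n)]
    return [[row_scale[k] * kernel[k][l] * col_scale[l] for l in range(n)]
            for k in range(m)]


def ot_loss(z_i: Matrix, z_a: Matrix) -> Tuple[float, Matrix]:
    """Transport cost sum_kl Omega_kl * phi(z_I(k), z_A(l)) under uniform weights."""
    m, n = len(z_i), len(z_a)
    u_i, u_a = [1.0 / m] * m, [1.0 / n] * n   # assumed uniform patch weights
    cost = [[cosine_distance(p, q) for q in z_a] for p in z_i]
    plan = sinkhorn_plan(cost, u_i, u_a)
    loss = sum(plan[k][l] * cost[k][l] for k in range(m) for l in range(n))
    return loss, plan


# Toy embeddings (hypothetical): M = 3 image patches, N = 2 audio patches.
z_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_aud = [[1.0, 0.0], [0.0, 1.0]]
loss, plan = ot_loss(z_img, z_aud)
```

In this toy case the first two image patches each align with one audio patch, while the third splits its mass between both, so most of the cost comes from the third patch.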

Audio-Visual Attention Consistency Enforcement Module (AVACE). Cross-modal interaction is essential for aligning the audio and visual modalities. Moreover, region-level supervision can encourage efficient localization. Inspired by the success of recent methods [25, 22, 86], we employ an adapter-based cross-attention strategy for efficient sound source localization. The modality-specific features in AVOpT lack awareness [38] of information from alternative modalities which can be infused through cross-modal attention. Therefore, to enable the audio-visual cross-modal reciprocity, we propose the AVACE module.

Although feature fusion through a cross-attention scheme is effective in attending to relevant objects in the image in a multi-modal context, inconsistencies may arise, such as attended regions being dispersed throughout the image, including background objects. This can be attributed to the quality of the interplay between the feature embeddings. Consider the CLAP audio encoder, pre-trained with examples such as ‘a man playing the violin’ (refer to Fig. 2) paired with the audio of a violin: the cross-modal knowledge in the audio representations encourages it to focus on both the man and the violin in the image. Therefore, to ensure superior region-level alignment, we confine the cross-modality attention map ($\mathcal{A}^c$) within the boundaries of the object of interest, denoted by the ground-truth bounding box. Considering a bounding box represented as $[x_{\text{Left}}, y_{\text{Top}}, x_{\text{Right}}, y_{\text{Bottom}}]$, we define a mask $\mathcal{M}$ such that $\mathcal{M}(y_{\text{Top}}:y_{\text{Bottom}},\ x_{\text{Left}}:x_{\text{Right}})=1$ and $0$ otherwise. Our goal is to maximize the attention within this bounding box and minimize it elsewhere. Therefore, we formulate the attention consistency objective $\mathcal{L}_{\text{AC}}$ as follows:

$$\mathcal{L}_{\text{AC}}=\lambda_1\left(1-\frac{\sum_{i,j}\mathcal{M}(i,j)\,\mathcal{A}^c(i,j)}{\sum_{i,j}\mathcal{M}(i,j)+\epsilon_1}\right)+\lambda_2\left(\frac{\sum_{i,j}\left(1-\mathcal{M}(i,j)\right)\mathcal{A}^c(i,j)}{\sum_{i,j}\left(1-\mathcal{M}(i,j)\right)+\epsilon_2}\right) \tag{3}$$

Here, $\mathcal{A}^c$ denotes the audio-visual cross-modality attention, $(i,j)$ represents the pixel location, $\lambda_1$ and $\lambda_2$ are the loss hyper-parameters (we keep $\lambda_1=\lambda_2=0.5$), and $\epsilon_1$ and $\epsilon_2$ are the stability factors. In Sec. 5, we demonstrate that $\mathcal{L}_{\text{AC}}$ encourages efficient localization and audio-visual alignment of the cross-attention maps, eventually leading to improved fine-grained cross-modal representations for downstream tasks.
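The attention consistency objective in Eq. 3 can be illustrated with a small sketch in plain Python. It builds the mask $\mathcal{M}$ from a bounding box and evaluates the two terms of the loss on a toy attention map. The half-open slice convention for the box, the values of $\epsilon_1$ and $\epsilon_2$, the function names, and the toy inputs are assumptions made for illustration; only $\lambda_1=\lambda_2=0.5$ comes from the text.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def bbox_mask(height: int, width: int, box: Tuple[int, int, int, int]) -> Matrix:
    """Mask M with ones inside [x_left, y_top, x_right, y_bottom], zeros elsewhere.

    Uses half-open ranges (y_top <= y < y_bottom, x_left <= x < x_right),
    mirroring slice notation; this convention is an assumption.
    """
    x_left, y_top, x_right, y_bottom = box
    return [
        [1.0 if (y_top <= y < y_bottom and x_left <= x < x_right) else 0.0
         for x in range(width)]
        for y in range(height)
    ]


def attention_consistency_loss(attn: Matrix, mask: Matrix,
                               lam1: float = 0.5, lam2: float = 0.5,
                               eps1: float = 1e-6, eps2: float = 1e-6) -> float:
    """Eq. 3: reward attention inside the box and penalize attention outside it."""
    pairs = [(m, a) for m_row, a_row in zip(mask, attn)
             for m, a in zip(m_row, a_row)]
    inside = sum(m * a for m, a in pairs)            # attention mass inside the box
    area_in = sum(m for m, _ in pairs)               # number of pixels inside the box
    outside = sum((1.0 - m) * a for m, a in pairs)   # attention mass outside the box
    area_out = sum(1.0 - m for m, _ in pairs)        # number of pixels outside the box
    term_in = 1.0 - inside / (area_in + eps1)
    term_out = outside / (area_out + eps2)
    return lam1 * term_in + lam2 * term_out


# Toy 4x4 example (hypothetical): the box covers the central 2x2 region.
mask = bbox_mask(4, 4, (1, 1, 3, 3))
focused_attn = [row[:] for row in mask]                        # all attention inside
dispersed_attn = [[1.0 - m for m in row] for row in mask]      # all attention outside
loss_focused = attention_consistency_loss(focused_attn, mask)
loss_dispersed = attention_consistency_loss(dispersed_attn, mask)
```

An attention map concentrated inside the box drives both terms toward zero, whereas attention spread over the background makes both terms large, which is the behaviour the objective is designed to encourage.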

Our overall training objective comprises a combination of three sub-objectives: cross-entropy loss ($\mathcal{L}_{\text{CE}}$), weak AV alignment loss ($\mathcal{L}_{\text{OT}}$), and attention consistency loss ($\mathcal{L}_{\text{AC}}$). These losses are added together to obtain the final training loss for Meerkat, given as:

$$\mathcal{L}_{\textsc{Meerkat}}=\mathcal{L}_{\text{CE}}+\lambda_{\text{OT}}\cdot\mathcal{L}_{\text{OT}}+\lambda_{\text{AC}}\cdot\mathcal{L}_{\text{AC}} \tag{4}$$

Here, $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$ are the loss weighting factors. We provide Algorithm 1 outlining the overall training procedure.

Algorithm 1 Meerkat: Training
1: Image: $I$; Audio: $A$; Textual Instruction: $T$; Pre-trained LLM: $\mathcal{E}^{\text{LLM}}(\cdot)$; LLM Tokenizer: $\tau^{\text{LLM}}(\cdot)$; Pre-trained Image Encoder: $\mathcal{E}^{I}(\cdot)$; Pre-trained Audio Encoder: $\mathcal{E}^{A}(\cdot)$; AVACE Module: $\text{AVACE}(\cdot,\cdot)$; Masks from GT Bounding-Boxes: $\mathcal{M}$; Loss Hyperparameters: $\lambda_{\text{OT}},\lambda_{\text{AC}}$; GT Tokens: $\phi_{\text{GT}}$.
2: Fine-tuned LLM: $\mathcal{E}^{T}(\cdot)$; Trained AVACE Module: $\text{AVACE}(\cdot,\cdot)$; Predicted Tokens: $\phi_{\text{pred}}$.
3: $z_{I}\leftarrow\mathcal{E}^{I}(I);\ z_{A}\leftarrow\mathcal{E}^{A}(A)$ ▷ Obtain Visual and Audio Embeddings.
4: $z_{T}\leftarrow\tau^{\text{LLM}}(T)$ ▷ Tokenize and Obtain Textual Encodings.
5: $\tilde{z}_{I},\tilde{z}_{A},\mathcal{A}^{c}\leftarrow\text{AVACE}(z_{I},z_{A})$ ▷ Obtain Audio-Visual Projections, Cross-Attn Map.
6: $z_{AVT}\leftarrow(\tilde{z}_{I}\parallel\tilde{z}_{A}\parallel z_{T})$ ▷ Concatenate Embeddings.
7: $\phi_{\text{pred}}\leftarrow\mathcal{E}^{\text{LLM}}(z_{AVT})$ ▷ LLM Output.
8: $\mathcal{L}_{\textsc{Meerkat}}\leftarrow\mathcal{L}_{\text{CE}}(\phi_{\text{pred}},\phi_{\text{GT}})+\lambda_{\text{OT}}\cdot\mathcal{L}_{\text{OT}}(z_{I},z_{A})+\lambda_{\text{AC}}\cdot\mathcal{L}_{\text{AC}}(\mathcal{A}^{c},\mathcal{M})$
9: Optimize model parameters to reduce $\mathcal{L}_{\textsc{Meerkat}}$ until convergence.
10: return $\mathcal{E}^{T}(\cdot)$, $\text{AVACE}(\cdot,\cdot)$, $\phi_{\text{pred}}$.
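As a minimal sketch of steps 6 and 8 of Algorithm 1, the Python snippet below concatenates the three token sequences and forms the weighted loss of Eq. (4). The function names, the list-of-lists embedding layout, and all numeric values are illustrative assumptions and do not come from the released implementation.

```python
def concat_tokens(z_img, z_aud, z_txt):
    """Step 6: join image, audio, and text token embeddings along the sequence axis.

    Each argument is a list of token embeddings (lists of floats).
    This layout is an assumption made for illustration.
    """
    tokens = list(z_img) + list(z_aud) + list(z_txt)
    widths = {len(tok) for tok in tokens}
    if len(widths) > 1:
        raise ValueError("all token embeddings must share one width")
    return tokens


def meerkat_total_loss(loss_ce, loss_ot, loss_ac, lambda_ot, lambda_ac):
    """Step 8 / Eq. (4): weighted sum of the three sub-objectives."""
    return loss_ce + lambda_ot * loss_ot + lambda_ac * loss_ac


# Placeholder numbers only; the loss weights are not stated in this section.
z_avt = concat_tokens([[0.0, 1.0]], [[2.0, 3.0], [4.0, 5.0]], [[6.0, 7.0]])
total = meerkat_total_loss(loss_ce=2.0, loss_ot=0.5, loss_ac=0.25,
                           lambda_ot=0.1, lambda_ac=0.2)
```

In an actual training loop the three loss terms would be tensors produced by the model, and the summed value would be passed to the optimizer in step 9.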

Representation of Box Location. We embed the location of bounding boxes with numerical values in the natural language sequence. A box is represented intuitively by its top-left and bottom-right corners, i.e., [$x_{\text{Left}}$, $y_{\text{Top}}$, $x_{\text{Right}}$, $y_{\text{Bottom}}$]. Notably, these values are normalized, with the normalization factors determined by the size of the image to which the box belongs. These coordinates may appear in either the input or the output sequences depending on the task. For instance, in the Audio Referred Image Grounding task, Meerkat predicts the bounding box of the object of interest, whereas, for the Audio-Visual Fact-checking task, the text input to Meerkat might contain the box coordinates.
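The box representation can be sketched as follows. This is a hedged illustration: dividing by image width and height to obtain values in [0, 1], the two-decimal precision, and the helper names `normalize_box` and `box_to_text` are all assumptions, since the exact text format is deferred to the appendix.

```python
def normalize_box(x_left, y_top, x_right, y_bottom, img_w, img_h):
    """Scale pixel coordinates by the image size (assumed normalization to [0, 1])."""
    if not (0 <= x_left <= x_right <= img_w and 0 <= y_top <= y_bottom <= img_h):
        raise ValueError("box must lie inside the image with left<=right, top<=bottom")
    return (x_left / img_w, y_top / img_h, x_right / img_w, y_bottom / img_h)


def box_to_text(box, ndigits=2):
    """Render a normalized box as [x_left, y_top, x_right, y_bottom] text."""
    return "[" + ", ".join(f"{v:.{ndigits}f}" for v in box) + "]"


box_text = box_to_text(normalize_box(64, 48, 320, 240, img_w=640, img_h=480))
# box_text == "[0.10, 0.10, 0.50, 0.50]"
```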

Representation of Time Segment. We embed the time interval information using numerical figures in the natural language expression. A time segment is intuitively represented by its start and end times, i.e., [tStart, tEnd], marking the onset and end of an event or an activity. Similar to boxes, these representations may appear in either the input or the output sequences depending on the task. For instance, in the Image Guided Audio Temporal Localization task, the model predicts the time interval within which the queried event might have occurred, while for Audio-Visual Fact-checking, the input sequence might contain a reference time window. We add more details on the instruction preparation formats in the appendix.
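A time segment can be written into, and read back from, a text sequence in a similar way. The sketch below is illustrative only: the unit (seconds), the one-decimal precision, and the regular expression are assumptions, not the format used by the authors.

```python
import re

# Matches "[start, end]" where both values are non-negative decimals (assumed format).
_SEGMENT = re.compile(r"\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]")


def segment_to_text(t_start, t_end, ndigits=1):
    """Render a time segment as [tStart, tEnd] text."""
    if t_start < 0 or t_end < t_start:
        raise ValueError("need 0 <= t_start <= t_end")
    return f"[{t_start:.{ndigits}f}, {t_end:.{ndigits}f}]"


def parse_segment(text):
    """Return the first (tStart, tEnd) pair found in text, or None if absent or invalid."""
    match = _SEGMENT.search(text)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    return (start, end) if start <= end else None
```

For example, `parse_segment("The dog barks in [3.0, 7.5].")` yields `(3.0, 7.5)`, while a response without a bracketed interval yields `None`.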

4 MeerkatBench: A Unified Benchmark Suite for Fine-grained Audio-Visual Understanding

Multi-modal conversation as an emergent ability is gaining prominence in the context of MLLMs. Although a line of research [109, 76, 12] addresses vision-language tasks, extension to other modalities such as audio is relatively underexplored. The difficulty escalates further when the task requires an intricate understanding of modality-specific information. Moreover, no publicly available dataset specifically facilitates such tasks. One of our primary contributions is to introduce a novel benchmark that unifies fine-grained audio-visual tasks. To this end, we present MeerkatBench comprising three fine-grained tasks: (i) audio referred image grounding, (ii) image guided audio temporal localization, (iii) audio-visual fact-checking, and two coarse-grained tasks: (iv) audio-visual question answering, (v) audio-visual captioning.

In this section, we present AVFIT, an AV instruction tuning dataset comprising 3M multi-modal dialogues for model training. AVFIT consists of samples collected in the following ways: (i) suitable adaptation of public datasets and (ii) instruction-tuning data generation via prompting GPT-3.5 [6]. Next, we discuss the data curation procedure:

Adaptation of Public Datasets. Depending on the task and availability of datasets, we either collect the image-audio pairs directly from the publicly available datasets (VGG-SS [9], AVSBench [117], Flickr-SoundNet [85], LLP [95], AVQA [106], MUSIC-AVQA [42], VALOR [15]) or follow a semi-automated strategy to prepare the pairs by forming matching image-audio pairs from large-scale datasets having visual grounding annotation such as Openimages [39], PASCAL [27] and audio event datasets like AudioSet/AudioSet Strong [30], VGG-Sound [10]. We retain the original category labels (‘Existential’, ‘Temporal’, etc.) from MUSIC-AVQA. To get similar insights in the AVQA dataset, we categorise every sample into one of the ‘Existential’, ‘Temporal’, ‘Localisation’, ‘Count’ and ‘World Knowledge’ categories. During the direct collection of pairs, we augment the audio snippet with a carefully chosen representative frame from the associated video. On the other hand, while forming pairs ourselves, we refer to a lookup table which we prepare beforehand by matching the corresponding class labels from the image and the audio datasets (more details in the appendix). We associate each image sample with its counterpart from the audio dataset. Finally, we supplement the image-audio pairs with the generated instructions as explained next. Task-wise dataset details can be found in Tab. 2.
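The lookup-table pairing described above could be sketched as follows. The label mapping, the sample identifiers, and the random choice among matching audio clips are hypothetical; the actual table and matching rule are described in the appendix.

```python
import random

# Hypothetical image-label -> audio-label lookup table (illustrative entries only).
LABEL_LOOKUP = {"Dog": "Bark", "Guitar": "Acoustic guitar"}


def pair_samples(image_samples, audio_samples, lookup, seed=0):
    """Pair each (image_id, image_label) with an audio clip of the matching class.

    image_samples: list of (image_id, image_label)
    audio_samples: list of (audio_id, audio_label)
    Images whose label has no matching audio clip are skipped.
    """
    rng = random.Random(seed)
    audio_by_label = {}
    for audio_id, label in audio_samples:
        audio_by_label.setdefault(label, []).append(audio_id)

    pairs = []
    for image_id, image_label in image_samples:
        candidates = audio_by_label.get(lookup.get(image_label), [])
        if candidates:
            # Picking one candidate at random is an assumption of this sketch.
            pairs.append((image_id, rng.choice(candidates)))
    return pairs


pairs = pair_samples(
    image_samples=[("img_001", "Dog"), ("img_002", "Guitar"), ("img_003", "Chair")],
    audio_samples=[("aud_101", "Bark"), ("aud_102", "Acoustic guitar")],
    lookup=LABEL_LOOKUP,
)
```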

GPT-Assisted Instruction Generation. Instruction tuning datasets [51, 58, 75, 35] have primarily focused on coarse-grained details like global image descriptions in the form of captioning or question answering without explicitly capturing fine-grained details. In this work, we aim to bridge this gap by introducing AVFIT, which promotes region-level and time-sensitive understanding in the following ways: (i) AVFIT includes spatial coordinates of objects of interest (bounding boxes) along with corresponding audio snippets, which leverages the synergy between audio-visual data. (ii) The designed dialogues include audio time intervals in the input, the output, or both. (iii) To generate high-quality instructions, we manually write a few example descriptions of each task and resort to GPT-3.5 [6] to create different variations. For further refinement of the generated dialogues we re-prompt GPT-4 [1] to ensure quality by reducing its context size. During training, we randomly pick one instruction for each sample. Fig. 2 illustrates a sample instruction from MeerkatBench. We use special tokens <image>, <audio>, <obj> which we later replace with instruction-guided image, audio and object categories respectively to generate prefix-based prompting.
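The random instruction selection and special-token substitution can be illustrated with a short sketch. The template strings and the function name are hypothetical, and only the <obj> token is filled in as text here; how <image> and <audio> are replaced by model inputs is left out as it is not specified in this section.

```python
import random

# Hypothetical instruction variations for one task (not taken from AVFIT).
TEMPLATES = [
    "<image> <audio> Locate the <obj> that is producing the sound.",
    "<image> <audio> Where in the image is the sounding <obj>?",
]


def build_prompt(templates, obj_category, rng):
    """Pick one instruction at random and substitute the object category."""
    template = rng.choice(templates)
    return template.replace("<obj>", obj_category)


prompt = build_prompt(TEMPLATES, "dog", random.Random(0))
```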

5 Experiments and Results

To the best of our knowledge, Meerkat is the first MLLM that unifies audio-visual spatial and temporal grounding, alongside possessing strong reasoning capabilities. We carefully choose the closest baseline for each task and suitably adapt them for fair comparisons. Owing to BuboGPT’s [116] spatial localization ability, we select it as our baseline for the audio referred image grounding task. Most similar in spirit to our image guided audio temporal localization task is TimeChat [83], which leverages the pre-trained VideoLlama model and suitably instruction-tunes it to tackle temporal grounding tasks. Owing to their audio-visual comprehension abilities, we use X-InstructBLIP [70], Macaw-LLM [60], PandaGPT [89], and VideoLlama [112] as baselines for the audio-visual fact-checking, AV question answering, and AV captioning tasks. Please refer to Tab. 1 for an overview of the characteristics of the generalist baselines. For specialist baselines, refer to the corresponding task tables. We finetune all baselines on our datasets, except that we do not use the Openimages-AudioSet and Openimages-VGGSound train splits from the audio referred image grounding task.

Audio Referred Image Grounding (ARIG). This task involves visual grounding by predicting the coordinates of a bounding box around the object of interest guided by the input audio. We prepare 1.2M image-audio-instruction pairs using the steps explained in Sec. 4. We add details of the input instruction format and model output in the appendix. Meerkat achieves superior performance on the sounding object localization task, setting a new benchmark as shown in Tab. 3.
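To give a sense of how a predicted box is scored against a reference box, the sketch below computes plain intersection-over-union. This is a simplification stated as an assumption: the cIoU metric reported in Tab. 3 follows the sound source localization literature and is not reproduced here.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_left, y_top, x_right, y_bottom)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


iou = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))  # overlap 1, union 7
```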

| Models | Generalist? | VGG-SS cIoU ↑ | VGG-SS AUC ↑ | Flickr-SoundNet cIoU ↑ | Flickr-SoundNet AUC ↑ | PascalSound cIoU ↑ | PascalSound AUC ↑ | AVSBench cIoU ↑ | AVSBench AUC ↑ |
|---|---|---|---|---|---|---|---|---|---|
| SSPL [88] | | 33.90 | 38.00 | 76.70 | 60.50 | 51.72 | 39.79 | 61.32 | 48.44 |
| EZ-VSL [65] | | 38.85 | 39.54 | 83.94 | 63.60 | 51.90 | 40.25 | 60.06 | 49.64 |
| SSL-TIE [52] | | 38.63 | 39.65 | 79.50 | 61.20 | 52.14 | 40.44 | 62.88 | 51.28 |
| SLAVC [64] | | 39.80 | - | 86.00 | - | 52.29 | 42.19 | 63.39 | 51.07 |
| MarginNCE [72] | | 39.78 | 40.01 | 85.14 | 64.55 | 53.61 | 45.52 | 65.85 | 52.92 |
| HearTheFlow [29] | | 39.40 | 40.00 | 84.80 | 64.00 | 55.48 | 47.40 | 67.49 | 54.39 |
| FNAC [90] | | 41.85 | 40.80 | 85.14 | 64.30 | 57.38 | 48.03 | 68.78 | 56.19 |
| Alignment [86] | | 42.64 | 41.48 | 82.40 | 64.60 | 58.34 | 49.86 | 71.57 | 57.52 |
| BuboGPT [116] | | 40.31 | 39.68 | 81.17 | 62.29 | 58.52 | 51.63 | 74.33 | 59.49 |
| Meerkat (ours) | | 48.51 | 45.62 | 88.35 | 67.88 | 65.23 | 56.10 | 79.82 | 65.35 |
| $\Delta_{\text{Meerkat}-\text{BuboGPT}}$ | | +20.34% | +14.97% | +8.85% | +8.97% | +11.47% | +8.66% | +7.39% | +9.85% |

Table 3: Audio referred image grounding results. For AVSBench we follow the same train/test splits for all methods. We use the VGG-SS, Flickr-SoundNet, and PascalSound datasets only for evaluation.
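The $\Delta$ rows in the result tables are consistent with a relative improvement over the closest baseline, i.e., (ours − baseline) / baseline × 100. The helper below is a small sketch of that arithmetic; the function name is ours, and the inputs are the VGG-SS cIoU scores from Tab. 3 and the VALOR-32K BLEU@4 scores from Tab. 6.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (ours - baseline) / baseline * 100.0


# VGG-SS cIoU: Meerkat 48.51 vs. BuboGPT 40.31
gain_vggss = round(relative_improvement(48.51, 40.31), 2)   # 20.34
# VALOR-32K BLEU@4: Meerkat 16.88 vs. X-InstructBLIP 12.31
gain_bleu4 = round(relative_improvement(16.88, 12.31), 2)   # 37.12
```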

| Models | Generalist? | LLP F1-score ↑ | AudioSet Strong F1-score ↑ |
|---|---|---|---|
| AVE [96] | | 35.47 | 37.42 |
| AVSDN [49] | | 37.15 | 41.48 |
| AVVP [95] | | 48.93 | 49.20 |
| TimeChat [83] | | 51.28 | 54.66 |
| Meerkat (ours) | | 54.96 | 56.85 |
| $\Delta_{\text{Meerkat}-\text{TimeChat}}$ | | +7.18% | +4.01% |

Table 4: Image guided audio temporal localization results. We report the segment level F1-scores and attribute our performance gain over specialist models to our multi-task learning strategy.

| Model | Type 1 F1-score ↑ | Type 2 F1-score ↑ | Type 3 F1-score ↑ | Type 4 F1-score ↑ |
|---|---|---|---|---|
| Macaw-LLM [60] | 0.65 | 0.70 | 0.56 | 0.77 |
| PandaGPT [89] | 0.67 | 0.70 | 0.66 | 0.70 |
| VideoLlama [112] | 0.71 | 0.72 | 0.72 | 0.78 |
| BuboGPT [116] | 0.72 | 0.66 | 0.67 | 0.70 |
| X-InstructBLIP [70] | 0.73 | 0.72 | 0.72 | 0.80 |
| TimeChat [83] | 0.74 | 0.76 | 0.74 | 0.82 |
| Meerkat (ours) | 0.85 | 0.83 | 0.84 | 0.88 |
| $\Delta_{\text{Meerkat}-\text{TimeChat}}$ | +14.86% | +9.21% | +13.51% | +7.32% |

Table 5: Audio-Visual fact-checking requires powerful reasoning capabilities across audio-visual modalities.
Figure 3: Qualitative results. We compare our method against its closest baselines on all downstream tasks. Meerkat aided by our novel design approach and instruction tuning datasets achieves superior performance on spatio-temporal grounding as well as coarse-grained tasks by outperforming prior approaches.

Image Guided Audio Temporal Localization (IGATL). When prompted to indicate a time interval within which a certain audio event occurs, Meerkat is capable of producing accurate time bounds in the form [tStart, tEnd], where tStart and tEnd are the start and end times, respectively. For all our experiments, we fix the audio duration to 30s. Different from prior visual grounding-based approaches [109, 76, 12], we present a new audio event localization task and set a new baseline for it. We attribute the superior performance of our method on the fine-grained audio temporal localization task to our specially designed AVOpT and AVACE modules, which ensure superior modality-specific guidance. Fig. 3 demonstrates that our model can locate a precise time interval associated with an audio event. Tab. 4 reports the quantitative comparison of our method against other baselines.
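A segment-level F1-score for a predicted interval can be sketched as below. The 30 s duration follows the text; the 1-second segment length and the rule that a segment is active when it overlaps the interval are assumptions of this sketch, not details confirmed by the paper.

```python
def segment_mask(interval, duration=30.0, step=1.0):
    """Mark each fixed-length segment that overlaps the [start, end] interval."""
    start, end = interval
    n_segments = int(round(duration / step))
    return [start < (i + 1) * step and end > i * step for i in range(n_segments)]


def segment_f1(pred, gt, duration=30.0, step=1.0):
    """F1 between predicted and ground-truth intervals, counted over segments."""
    pred_mask = segment_mask(pred, duration, step)
    gt_mask = segment_mask(gt, duration, step)
    tp = sum(p and g for p, g in zip(pred_mask, gt_mask))
    fp = sum(p and not g for p, g in zip(pred_mask, gt_mask))
    fn = sum(g and not p for p, g in zip(pred_mask, gt_mask))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Prediction [2, 6] vs. ground truth [4, 8]: 2 shared segments out of 4 each.
score = segment_f1((2.0, 6.0), (4.0, 8.0))
```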

Audio-Visual Fact-checking (AVFact). In this section we introduce a new suite of tasks that involves a strong comprehension of the audio-visual semantic information. These tasks broadly require the model to analyze and verify whether a given statement about an audio-visual scenario holds or not. Although we do not use GT spatio-temporal annotations to train the model, we classify this task under the fine-grained category as the task requires the model to attend to a specific region/time interval as passed in the query. To alleviate inconsistencies in evaluation, we restrict the model’s response to binary True/False only. We divide these tasks into the following 4 categories:

Type 1: Given an audio-image pair, verify if the object within the bounding box produces sound that corresponds to the input audio.

Type 2: Given an audio snippet, verify whether its visual counterpart is present in the image or not.

Type 3: Given an audio-image pair, verify if the object present within the provided bounding box produces sound that corresponds to the audio within a given time segment.

Type 4: Given an audio-image pair, verify if the supplied audio is related to the object within the provided bounding box.
In Tab. 5 we contrast the performance of other baselines against Meerkat on all four types of AVFact tasks.
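Since responses are restricted to True/False, scoring reduces to parsing each response and computing a binary F1-score. The sketch below shows one way to do this; the parsing rule, the treatment of unparseable responses as non-positive predictions, and the choice of True as the positive class are assumptions.

```python
def parse_verdict(response):
    """Map a model response to True/False, or None if it is neither."""
    text = response.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None


def binary_f1(predictions, labels):
    """F1 for the positive (True) class; None predictions count as not positive."""
    tp = sum(p is True and y for p, y in zip(predictions, labels))
    fp = sum(p is True and not y for p, y in zip(predictions, labels))
    fn = sum(p is not True and y for p, y in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


preds = [parse_verdict(r) for r in ["True", "False.", " true", "I am not sure"]]
f1 = binary_f1(preds, [True, True, False, True])
```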

| Model | Generalist? | AVQA Exist ↑ | AVQA Localis ↑ | AVQA Temp ↑ | MUSIC AVQA Exist ↑ | MUSIC AVQA Localis ↑ | MUSIC AVQA Temp ↑ | VALOR-32K BLEU@4 ↑ | VALOR-32K METEOR ↑ | VALOR-32K ROUGE ↑ | VALOR-32K CIDEr ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AVSD [84] | | 81.61 | 58.79 | 61.41 | - | - | - | - | - | - | - |
| PanoAVQA [110] | | 81.21 | 59.33 | 63.23 | - | - | - | - | - | - | - |
| ST-AVQA [42] | | 81.81 | 64.51 | 63.23 | - | - | - | - | - | - | - |
| CAD [67] | | 83.42 | 73.97 | 76.16 | - | - | - | - | - | - | - |
| AVST [42] | | - | - | - | 72.44 | 65.54 | 59.36 | - | - | - | - |
| LAVISH [50] | | - | - | - | 73.83 | 65.00 | 60.81 | - | - | - | - |
| LAST [54] | | - | - | - | 76.21 | 68.91 | 60.60 | - | - | - | - |
| SMPFF [17] | | - | - | - | - | - | - | 7.59 | 12.64 | 28.69 | 37.18 |
| VALOR [15] | | - | - | - | - | - | - | 8.97 | 14.88 | 30.86 | 55.73 |
| Macaw-LLM [60] | | 82.19 | 74.86 | 78.98 | 72.99 | 71.28 | 59.36 | 9.36 | 15.28 | 33.31 | 58.98 |
| PandaGPT [89] | | 83.38 | 76.81 | 79.11 | 78.48 | 73.12 | 65.85 | 10.35 | 16.92 | 34.88 | 61.22 |
| VideoLlama [112] | | 84.48 | 77.06 | 81.36 | 81.21 | 76.10 | 67.52 | 11.45 | 17.39 | 35.14 | 63.63 |
| X-InstructBLIP [70] | | 85.53 | 80.09 | 83.91 | 80.28 | 77.45 | 68.83 | 12.31 | 18.82 | 37.93 | 65.73 |
| Meerkat (ours) | | 88.24 | 86.65 | 86.55 | 83.62 | 80.51 | 73.33 | 16.88 | 23.18 | 45.67 | 76.84 |
| $\Delta_{\text{Meerkat}-\text{X-InstructBLIP}}$ | | +3.17% | +8.19% | +3.15% | +4.16% | +3.95% | +6.54% | +37.12% | +23.17% | +20.41% | +16.9% |

Table 6: Quantitative results on AVQA and AV captioning tasks. The reported numbers on AVQA dataset [106] are on the val split. For the MUSIC-AVQA dataset [42], results are reported on the balanced test set. Here, Exist: Existential, Localis: Localisation, Temp: Temporal. Evaluation for AV captioning is done on VALOR-32K [15] val set. Meerkat demonstrates strong coarse-grained understanding abilities.

Audio-Visual Question Answering (AVQA). Audio-visual question answering aims to answer questions encompassing both audio and visual modalities. We collect question-answer pairs from the AVQA [106] and MUSIC-AVQA [42] datasets and augment them with instruction tuning templates (details in appendix) to prepare the data samples. We contrast our method against SoTA generalist and specialist models on the AVQA task in Tab. 6. We report the evaluation results on other question categories such as Count and Comp in the appendix.

Audio-Visual Captioning (AVC). This task learns how to generate text tokens conditioned on audio-visual inputs. In contrast to image/audio-only captioning methods, this requires strong multi-modal understanding and reasoning capabilities. We note that Meerkat outperforms existing specialist and generalist models by a considerable margin and sets a new baseline on a recent benchmark dataset VALOR [15], as shown in Tab. 6.

We argue that the seamless extension of Meerkat to coarse-grained tasks is facilitated by the strong semantic understanding acquired by our model during training. This comprehension ability enables our model to effectively navigate and interpret the complexities inherent in coarse-grained tasks, showcasing the versatility and easy extensibility of our approach.

Weak vs. Strong Alignment. We ablate the quantitative effectiveness of our proposed weak and strong alignment modules in Tab. 7. Without the AVACE module, the method’s performance on the visual grounding task is considerably worse. For a similar reason, ablating this module in AVFact (Type 3), which requires region-level visual understanding, also shows inferior performance. For coarse-grained tasks (AV Captioning, AVQA), introducing $\mathcal{L}_{\text{OT}}$ boosts performance compared to the baseline. Overall, optimal performance is achieved when the two objective functions work in tandem with optimal weight factors.

| $\mathcal{L}_{\text{CE}}$ | $\mathcal{L}_{\text{OT}}$ | $\mathcal{L}_{\text{AC}}$ | VGGSS cIoU ↑ | LLP F1-score ↑ | AVFact(T3) F1-score ↑ | AVQA Avg ↑ | VALOR CIDEr ↑ |
|---|---|---|---|---|---|---|---|
| | | | 42.93 | 52.13 | 0.76 | 84.00 | 71.52 |
| | | | 43.75 | 53.41 | 0.78 | 85.91 | 73.49 |
| | | | 46.83 | 52.57 | 0.81 | 85.82 | 73.14 |
| | | | 48.51 | 54.96 | 0.84 | 87.14 | 76.84 |

Table 7: Ablation on different combinations of $\mathcal{L}_{\text{OT}}$ and $\mathcal{L}_{\text{AC}}$ (the first three columns indicate the training objective). Meerkat achieves optimal performance with a weighted linear combination of the 3 objective functions on all tasks. AVQA avg is calculated over Exist, Localis, and Temp.

Evaluation on Pre-training Tasks. To study the effect of unified pre-training, we evaluate our model under single-task vs. multi-task learning settings. We gradually add datasets for each task and assess the model’s performance. On quantitative evaluation, we note that the tasks in our multi-task setting indeed benefit from each other, leading to superior performance as shown in Tab. 8. While the model trained on fine-grained tasks performs well on the coarse-grained tasks, introducing the coarse-grained tasks in the training set does not have a considerable impact on ARIG, IGATL, and AVFact, underlining the importance of our collected fine-grained datasets.

Full vs. LoRA Finetuning. We conduct experiments on different modes of LLM fine-tuning. As shown in Fig. 4, LoRA [36] based fine-tuning with r=32 achieves optimal performance. Lower values of r (4, 16) perform poorly compared to r=32, and we empirically find that full fine-tuning performs slightly worse than LoRA (r=32). We add more ablation results in the appendix.
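To illustrate why the rank r matters, recall that LoRA freezes a weight matrix of shape d_out × d_in and learns two low-rank factors of shapes d_out × r and r × d_in. The sketch below counts trainable parameters per adapted matrix for the ranks compared in Fig. 4. The layer size 4096 × 4096 is a hypothetical example, not a value taken from the paper.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in, d_out, r):
    """Trainable parameters of the two LoRA factors (d_out x r and r x d_in)."""
    return r * (d_in + d_out)


D = 4096  # hypothetical hidden size, for illustration only
counts = {r: lora_params(D, D, r) for r in (4, 16, 32)}
fraction_r32 = counts[32] / full_params(D, D)  # 0.015625, i.e. about 1.6%
```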

Fig. 3 illustrates the comparison of Meerkat with its closest baseline on all downstream tasks. We observe that our model, powered by the combination of AVOpT and AVACE, has a finer region-level understanding than BuboGPT [116]. Similarly, on image guided audio temporal localization, our method outperforms TimeChat [83]. We attribute the excellent performance of Meerkat to the strong AV association learning backed by the instruction tuning data and multi-task learning set-up. For the AVQA task, the recently proposed X-InstructBLIP [70] achieves comparable results. We argue that, fuelled by a strong fine-grained understanding acquired through the pre-training stages, Meerkat can extract additional contextual information from the visual modality. Our training paradigm emphasizes both audio and visual modalities, facilitating more precise audio understanding by the model than VideoLlama [112]. Finally, on the AVFact tasks, our approach achieves superior performance due to its better multi-modal comprehension skills.

| ARIG | IGATL | AVFC | AVQA | AVC | VGG-SS cIoU ↑ | LLP F1-score ↑ | AVFact Avg F1-score ↑ | AVQA Avg Acc. ↑ | VALOR CIDEr ↑ |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | 47.53 | 18.73 | 0.71 | 77.22 | 67.82 |
| | | | | | 47.75 | 54.26 | 0.74 | 79.74 | 70.19 |
| | | | | | 48.17 | 54.65 | 0.83 | 81.11 | 72.13 |
| | | | | | 48.29 | 54.82 | 0.83 | 86.68 | 74.14 |
| | | | | | 48.51 | 54.96 | 0.85 | 87.14 | 76.84 |

Table 8: We systematically analyze the effect of multi-task learning (the first five columns indicate the pre-training tasks). Here ARIG: audio referred image grounding, IGATL: image guided audio temporal localization, AVFC: audio-visual fact-checking, AVQA: audio-visual question answering, and AVC: audio-visual captioning. AVQA avg accuracy calculated over Exist, Localis, and Temp.
[Plot: cIoU (y-axis) vs. number of training epochs (x-axis) for Full, LoRA r=32, LoRA r=16, and LoRA r=4 fine-tuning]
Figure 4: cIoU upper bound on VGG-SS for Full vs. LoRA based finetuning.

We train the model for 5 epochs and report results using the checkpoint with the best validation loss. We use 8 A100 GPUs for training with validation at the end of every epoch. Inspired by the recent success of Low-Rank Adaptation (LoRA) [36], we use it to finetune the LLM. Meerkat is trained using AdamW optimizer [56]. We use a gradient accumulation step of 3. Training our model takes around 52 hours for 5 epochs. We utilize DeepSpeed [82] for optimization during the training process. The model is trained with a learning rate of $3\times 10^{-5}$. The warmup ratio is 0.03, along with a cosine learning rate scheduler. We use FP16 precision for both training and inference.
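The learning rate schedule described above can be sketched in a few lines. The peak learning rate and warmup ratio follow the text; the linear shape of the warmup, the decay to zero, and the total step count used in the example are assumptions of this sketch.

```python
import math


def lr_at_step(step, total_steps, base_lr=3e-5, warmup_ratio=0.03):
    """Learning rate with (assumed linear) warmup followed by cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up to base_lr over the warmup steps.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Hypothetical run of 1000 optimizer steps: 30 warmup steps, then cosine decay.
schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
```

With these settings the rate reaches its peak of 3e-5 at the end of warmup and decreases monotonically afterwards.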

6 Conclusions and Future Works

We presented Meerkat, a powerful multi-modal large language model adept at processing audio-visual inputs to comprehend fine-grained spatio-temporal information. Our novel audio-visual alignment strategy, powered by the AVOpT and AVACE modules, instils strong compositional understanding into Meerkat, thereby making it suitable for challenging tasks like audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. To pave the way for future research in this direction, we collect AVFIT comprising 3M instruction tuning samples and introduce MeerkatBench that unifies five challenging audio-visual learning tasks. Extensive experiments demonstrate the effectiveness of our approach on a wide range of downstream tasks, consistently achieving state-of-the-art performance.

In future work, we plan to equip our model to address more challenging tasks like AV segmentation. We also plan to extend the model’s capability to operate on videos and handle associated tasks such as video temporal grounding, and video summarization. Future work can also focus on collecting video-centric multi-modal training data and reasoning benchmarks for evaluation at scale. Finally, our work opens up avenues to study robustness and compositional understanding of AV LLMs with fine-grained comprehension abilities.

References

  • [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  • [2] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
  • [3] Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: Proceedings of the IEEE international conference on computer vision. pp. 609–617 (2017)
  • [4] Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33, 12449–12460 (2020)
  • [5] Banerjee, S., Lavie, A.: Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. pp. 65–72 (2005)
  • [6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [7] Chen, F., Han, M., Zhao, H., Zhang, Q., Shi, J., Xu, S., Xu, B.: X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160 (2023)
  • [8] Chen, G., et al: Plot: Prompt learning with optimal transport for vision-language models. ICLR (2023)
  • [9] Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., Zisserman, A.: Localizing visual sounds the hard way. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16867–16876 (2021)
  • [10] Chen, H., Xie, W., Vedaldi, A., Zisserman, A.: Vggsound: A large-scale audio-visual dataset. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 721–725. IEEE (2020)
  • [11] Chen, J., Zhang, R., Lian, D., Yang, J., Zeng, Z., Shi, J.: iquery: Instruments as queries for audio-visual sound separation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14675–14686 (2023)
  • [12] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
  • [13] Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)
  • [14] Chen, L., Gan, Z., Cheng, Y., Li, L., Carin, L., Liu, J.: Graph optimal transport for cross-domain alignment. In: International Conference on Machine Learning. pp. 1542–1553. PMLR (2020)
  • [15] Chen, S., He, X., Guo, L., Zhu, X., Wang, W., Tang, J., Liu, J.: Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345 (2023)
  • [16] Chen, S., Li, H., Wang, Q., Zhao, Z., Sun, M., Zhu, X., Liu, J.: Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset. Advances in Neural Information Processing Systems 36 (2024)
  • [17] Chen, S., Zhu, X., Hao, D., Liu, W., Liu, J., Zhao, Z., Guo, L., Liu, J.: Mm21 pre-training for video understanding challenge: Video captioning with pretraining techniques. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 4853–4857 (2021)
  • [18] Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: European conference on computer vision. pp. 104–120. Springer (2020)
  • [19] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality (March 2023), https://lmsys.org/blog/2023-03-30-vicuna/
  • [20] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240), 1–113 (2023)
  • [21] Chowdhury, S., Nag, S., Joseph, K., Srinivasan, B.V., Manocha, D.: Melfusion: Synthesizing music from image and language cues using diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26826–26835 (2024)
  • [22] Chowdhury, S., Nag, S., Manocha, D.: Apollo: Unified adapter and prompt learning for vision language models. In: The 2023 Conference on Empirical Methods in Natural Language Processing (2023)
  • [23] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
  • [24] Cramer, A.L., Wu, H.H., Salamon, J., Bello, J.P.: Look, listen, and learn more: Design choices for deep audio embeddings. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3852–3856. IEEE (2019)
  • [25] Dou, Z.Y., Kamath, A., Gan, Z., Zhang, P., Wang, J., Li, L., Liu, Z., Liu, C., LeCun, Y., Peng, N., et al.: Coarse-to-fine vision-language pre-training with fusion in the backbone. Advances in neural information processing systems 35, 32942–32956 (2022)
  • [26] Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: Clap learning audio concepts from natural language supervision. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)
  • [27] Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International journal of computer vision 111, 98–136 (2015)
  • [28] Fabbrizzi, S., Papadopoulos, S., Ntoutsi, E., Kompatsiaris, I.: A survey on bias in visual datasets. Computer Vision and Image Understanding 223, 103552 (2022)
  • [29] Fedorishin, D., Mohan, D.D., Jawade, B., Setlur, S., Govindaraju, V.: Hear the flow: Optical flow-based self-supervised visual sound source localization. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2278–2287 (2023)
  • [30] Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 776–780. IEEE (2017)
  • [31] Georgescu, M.I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audiovisual masked autoencoders. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16144–16154 (2023)
  • [32] Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15180–15190 (2023)
  • [33] Gong, Y., Luo, H., Liu, A.H., Karlinsky, L., Glass, J.: Listen, think, and understand. arXiv preprint arXiv:2305.10790 (2023)
  • [34] Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of machine learning research 13(2) (2012)
  • [35] Honovich, O., Scialom, T., Levy, O., Schick, T.: Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689 (2022)
  • [36] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • [37] Huang, C., Tian, Y., Kumar, A., Xu, C.: Egocentric audio-visual object localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22910–22921 (2023)
  • [38] Huang, S., Qin, L., Wang, B., Tu, G., Xu, R.: Sdif-da: A shallow-to-deep interaction framework with data augmentation for multi-modal intent detection. arXiv preprint arXiv:2401.00424 (2023)
  • [39] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision 128(7), 1956–1981 (2020)
  • [40] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)
  • [41] Li, B., Zhang, Y., Chen, L., Wang, J., Yang, J., Liu, Z.: Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726 (2023)
  • [42] Li, G., Wei, Y., Tian, Y., Xu, C., Wen, J.R., Hu, D.: Learning to answer questions in dynamic audio-visual scenarios. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19108–19118 (2022)
  • [43] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning. pp. 12888–12900. PMLR (2022)
  • [44] Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)
  • [45] Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023)
  • [46] Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10965–10975 (2022)
  • [47] Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al.: Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022)
  • [48] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)
  • [49] Lin, Y.B., Li, Y.J., Wang, Y.C.F.: Dual-modality seq2seq network for audio-visual event localization. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2002–2006. IEEE (2019)
  • [50] Lin, Y.B., Sung, Y.L., Lei, J., Bansal, M., Bertasius, G.: Vision transformers are parameter-efficient audio-visual learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2299–2309 (2023)
  • [51] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36 (2024)
  • [52] Liu, J., Ju, C., Xie, W., Zhang, Y.: Exploiting transformation invariance and equivariance for self-supervised sound localisation. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 3742–3753 (2022)
  • [53] Liu, J., Wang, Y., Ju, C., Ma, C., Zhang, Y., Xie, W.: Annotation-free audio-visual segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5604–5614 (2024)
  • [54] Liu, X., Dong, Z., Zhang, P.: Tackling data bias in music-avqa: Crafting a balanced dataset for unbiased question-answering. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 4478–4487 (2024)
  • [55] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
  • [56] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  • [57] Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  • [58] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems 35, 2507–2521 (2022)
  • [59] Luo, R., Zhao, Z., Yang, M., Dong, J., Qiu, M., Lu, P., Wang, T., Wei, Z.: Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207 (2023)
  • [60] Lyu, C., Wu, M., Wang, L., Huang, X., Liu, B., Du, Z., Shi, S., Tu, Z.: Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. arXiv preprint arXiv:2306.09093 (2023)
  • [61] Maaz, M., Rasheed, H., Khan, S., Khan, F.S.: Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424 (2023)
  • [62] Majumder, S., Grauman, K.: Active audio-visual separation of dynamic sound sources. In: European Conference on Computer Vision. pp. 551–569. Springer (2022)
  • [63] Mao, Y., Zhang, J., Xiang, M., Zhong, Y., Dai, Y.: Multimodal variational auto-encoder based audio-visual segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 954–965 (2023)
  • [64] Mo, S., Morgado, P.: A closer look at weakly-supervised audio-visual source localization. Advances in Neural Information Processing Systems 35, 37524–37536 (2022)
  • [65] Mo, S., Morgado, P.: Localizing visual sounds the easy way. In: European Conference on Computer Vision. pp. 218–234. Springer (2022)
  • [66] Mo, S., Tian, Y.: Audio-visual grouping network for sound localization from mixtures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10565–10574 (2023)
  • [67] Nadeem, A., Hilton, A., Dawes, R., Thomas, G., Mustafa, A.: Cad-contextual multi-modal alignment for dynamic avqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 7251–7263 (2024)
  • [68] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  • [69] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
  • [70] Panagopoulou, A., Xue, L., Yu, N., Li, J., Li, D., Joty, S., Xu, R., Savarese, S., Xiong, C., Niebles, J.C.: X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning. arXiv preprint arXiv:2311.18799 (2023)
  • [71] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
  • [72] Park, S., Senocak, A., Chung, J.S.: Marginnce: Robust sound localization with a negative margin. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)
  • [73] Peng, B., Li, C., He, P., Galley, M., Gao, J.: Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277 (2023)
  • [74] Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)
  • [75] Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision. pp. 2641–2649 (2015)
  • [76] Pramanick, S., Han, G., Hou, R., Nag, S., Lim, S.N., Ballas, N., Wang, Q., Chellappa, R., Almahairi, A.: Jack of all tasks, master of many: Designing general-purpose coarse-to-fine vision-language model. arXiv preprint arXiv:2312.12423 (2023)
  • [77] Pramanick, S., Jing, L., Nag, S., Zhu, J., Shah, H.J., LeCun, Y., Chellappa, R.: Volta: Vision-language transformer with weakly-supervised local-feature alignment. Transactions on Machine Learning Research (2023)
  • [78] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [79] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning. pp. 28492–28518. PMLR (2023)
  • [80] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 5485–5551 (2020)
  • [81] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)
  • [82] Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 3505–3506 (2020)
  • [83] Ren, S., Yao, L., Li, S., Sun, X., Hou, L.: Timechat: A time-sensitive multimodal large language model for long video understanding. arXiv preprint arXiv:2312.02051 (2023)
  • [84] Schwartz, I., Schwing, A.G., Hazan, T.: A simple baseline for audio-visual scene-aware dialog. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12548–12558 (2019)
  • [85] Senocak, A., Oh, T.H., Kim, J., Yang, M.H., Kweon, I.S.: Learning to localize sound source in visual scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4358–4366 (2018)
  • [86] Senocak, A., Ryu, H., Kim, J., Oh, T.H., Pfister, H., Chung, J.S.: Sound source localization is all about cross-modal alignment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7777–7787 (2023)
  • [87] Shu, F., Zhang, L., Jiang, H., Xie, C.: Audio-visual llm for video understanding. arXiv preprint arXiv:2312.06720 (2023)
  • [88] Song, Z., Wang, Y., Fan, J., Tan, T., Zhang, Z.: Self-supervised predictive learning: A negative-free method for sound source localization in visual scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3222–3231 (2022)
  • [89] Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355 (2023)
  • [90] Sun, W., Zhang, J., Wang, J., Liu, Z., Zhong, Y., Feng, T., Guo, Y., Zhang, Y., Barnes, N.: Learning audio-visual source localization via false negative aware contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6420–6429 (2023)
  • [91] Tan, R., Ray, A., Burns, A., Plummer, B.A., Salamon, J., Nieto, O., Russell, B., Saenko, K.: Language-guided audio-visual source separation via trimodal consistency. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10575–10584 (2023)
  • [92] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.: Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca (2023)
  • [93] Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., Stojnic, R.: Galactica: A large language model for science. arXiv preprint arXiv:2211.09085 (2022)
  • [94] Tian, Y., Guan, C., Goodman, J., Moore, M., Xu, C.: An attempt towards interpretable audio-visual video captioning. arXiv preprint arXiv:1812.02872 (2018)
  • [95] Tian, Y., Li, D., Xu, C.: Unified multisensory perception: Weakly-supervised audio-visual video parsing. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. pp. 436–454. Springer (2020)
  • [96] Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C.: Audio-visual event localization in unconstrained videos. In: Proceedings of the European conference on computer vision (ECCV). pp. 247–263 (2018)
  • [97] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  • [98] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  • [99] Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4566–4575 (2015)
  • [100] Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., Yang, H.: Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: International Conference on Machine Learning. pp. 23318–23340. PMLR (2022)
  • [101] Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al.: Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)
  • [102] Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al.: Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems 36 (2024)
  • [103] Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., Le, Q.V.: Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021)
  • [104] Workshop, B., Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A.S., Yvon, F., et al.: Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022)
  • [105] Wu, H.H., Seetharaman, P., Kumar, K., Bello, J.P.: Wav2clip: Learning robust audio representations from clip. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 4563–4567. IEEE (2022)
  • [106] Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., Zhu, W.: Avqa: A dataset for audio-visual question answering on videos. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 3480–3491 (2022)
  • [107] Yang, Z., Gan, Z., Wang, J., Hu, X., Ahmed, F., Liu, Z., Lu, Y., Wang, L.: Unitab: Unifying text and box outputs for grounded vision-language modeling. In: European Conference on Computer Vision. pp. 521–539. Springer (2022)
  • [108] Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
  • [109] You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S.F., Yang, Y.: Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704 (2023)
  • [110] Yun, H., Yu, Y., Yang, W., Lee, K., Kim, G.: Pano-avqa: Grounded audio-visual question answering on 360deg videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2031–2041 (2021)
  • [111] Zhang, C., Cai, Y., Lin, G., Shen, C.: Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12203–12213 (2020)
  • [112] Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)
  • [113] Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., Qiao, Y.: Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023)
  • [114] Zhang, S., Sun, P., Chen, S., Xiao, M., Shao, W., Zhang, W., Chen, K., Luo, P.: Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601 (2023)
  • [115] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V., et al.: Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)
  • [116] Zhao, Y., Lin, Z., Zhou, D., Huang, Z., Feng, J., Kang, B.: Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581 (2023)
  • [117] Zhou, J., Wang, J., Zhang, J., Sun, W., Zhang, J., Birchfield, S., Guo, D., Kong, L., Wang, M., Zhong, Y.: Audio–visual segmentation. In: European Conference on Computer Vision. pp. 386–403. Springer (2022)
  • [118] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Meerkat: Audio-Visual Large Language Model
for Grounding in Space and Time
Appendix

In this appendix we provide additional details about:
7 Data preparation strategy (referenced in Sec. 4.2 of main paper)
8 Dataset instruction templates (referenced in Sec. 3.4 and Sec. 5.2)
9 Dataset statistics and analysis
10 More qualitative results
11 More ablations (referenced in Sec. 5.3)
12 Comparison with contrastive loss (referenced in Sec. 3.2)
13 Comparison with two-stage training (referenced in Sec. 3.2)
14 Role of audio in AVQA task
15 More on optimal transport (referenced in Sec. 3.2)
16 AVSBench data collection
17 Comparison against ImageBind
18 Other Quantitative metrics on AVQA task (referenced in Sec. 5.2)
19 Evaluation metrics
20 Failure cases
21 Ethics statement

7 Data Preparation Strategy

7.1 Adaptation of Public Datasets.

To collect image-audio pairs from video-based datasets and adapt them to our setup, we carefully choose one representative image from each video. To this end, we design a semi-automated strategy, explained below in each task section. Task-wise dataset details are shown in Fig. 5.

7.2 Fine-grained Data Preparation

Audio Referred Image Grounding (ARIG). For this task, the dataset collection consists of image-audio pairs from Openimages-AudioSet, Openimages-VGGSound, AVSBench, VGGSS, PASCAL Sound, and Flickr-Soundnet. For Openimages-AudioSet, Openimages-VGGSound, and VGGSS, we first obtain the top 3 image frames with the highest image-text CLIP similarity scores [78] and then select the most suitable frame by manual inspection to form the image-audio pair. The frames are extracted from the video segment of interest (as specified in the dataset annotations). Please refer to Tab. 9 for the class-wise associations between Openimages and AudioSet / VGGSound. We use this look-up table when matching the corresponding classes.
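The automated half of the frame selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the frame and class-label embeddings have already been computed by a CLIP model, and the names top_k_frames and cosine are hypothetical. The final choice among the returned candidates remains a manual step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_frames(frame_embeddings, text_embedding, k=3):
    """Return indices of the k frames most similar to the text embedding.

    Assumption: embeddings are precomputed (e.g. by CLIP image and text
    encoders); ties are broken in favour of the earlier frame.
    """
    scored = [(cosine(f, text_embedding), i) for i, f in enumerate(frame_embeddings)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


# Toy 2-D embeddings standing in for real CLIP features.
frames = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
label = [1.0, 0.0]
candidates = top_k_frames(frames, label, k=3)
print(candidates)  # candidate frame indices handed over for manual inspection
```

Returning indices rather than frames keeps the sketch independent of any particular video-decoding library.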

Figure 5: Task-wise dataset distribution. The bi-coloured cells denote collections of paired image-audio samples from public datasets following our data curation strategy, while single-coloured cells signify direct adaptation. Datasets with dashed outlines are used only during model training, while the ones marked with an icon are reserved for zero-shot evaluations. Other datasets have a defined train/test split. Numbers in the bottom right represent the total number of samples present in each task.
  • Openimages-AudioSet: For every sample, we obtain the [start, end] time interval of the audio event of interest from the AudioSet dataset. Each sample is associated with an audio category, and we use this class label when calculating the CLIP score with the image frames. We zero-pad each audio clip to a length of 30s.

  • Openimages-VGGSound: We obtain the onset (start) of an event from the VGGSound dataset annotation and extract the audio snippet spanning from start to min(start + 30, len(audio)) seconds. If len(audio) is less than 30s, we zero-pad to maintain the audio sequence length.

  • AVSBench: AVSBench comes with 5 to 6 frames along with the audio snippet. We manually choose the frame that most closely relates to the audio event under consideration.

  • VGGSS: We follow a similar strategy as that of VGGSound.

  • PASCAL Sound: We choose 566 image samples from the PASCAL dataset [27], spanning 12 sounding classes, and carefully pair them with AudioSet samples using the same protocol as for Openimages-AudioSet.

  • Flickr-Soundnet [85]: Here we directly use the image-audio pairs as released by the authors.

For all these cases we augment the image-audio pairs with our instruction tuning templates (refer to Section 8).
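The snippet extraction and zero-padding described for the AudioSet- and VGGSound-based pairs can be sketched as below. This is an illustrative sketch under stated assumptions, not the released pipeline: audio is taken to be a plain list of float samples, the function name extract_padded_snippet is hypothetical, and padding is applied whenever the extracted snippet is shorter than the 30s target.

```python
TARGET_SECONDS = 30  # fixed clip length stated in the text


def extract_padded_snippet(samples, sample_rate, start_s=0.0, target_s=TARGET_SECONDS):
    """Cut [start, min(start + target, len(audio))] and zero-pad to target length.

    Assumption: `samples` is a mono waveform given as a list of floats.
    """
    total_s = len(samples) / sample_rate
    end_s = min(start_s + target_s, total_s)
    begin = int(round(start_s * sample_rate))
    end = int(round(end_s * sample_rate))
    target_len = int(round(target_s * sample_rate))
    # Truncate defensively in case rounding yields one extra sample.
    snippet = list(samples[begin:end])[:target_len]
    # Pad with zeros (silence) so every clip has the same length.
    snippet.extend([0.0] * (target_len - len(snippet)))
    return snippet


# Toy example: a 10-second clip at 4 samples per second, event onset at 8s.
sample_rate = 4
audio = [1.0] * (10 * sample_rate)
clip = extract_padded_snippet(audio, sample_rate, start_s=8.0)
print(len(clip) / sample_rate)  # clip duration in seconds after padding
```

A fixed clip length keeps the audio sequence length constant across samples, which is the stated purpose of the padding.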

Openimages Label Name | Audioset Label Name | VGGSound Label Name
Aircraft | Aircraft | airplane
Alarm clock | Alarm clock | alarm clock ringing
Ambulance | Ambulance (siren) | ambulance siren
Bicycle | Bicycle, tricycle |
Bird | Bird | bird chirping, tweeting
Blender | Blender, food processor | electric blender running
Boat | Boat, Water vehicle | sailing
Bus | Bus | helicopter
Camera | Camera |
Cannon | | firing cannon
Car | Car | race car, auto racing
Cat | Cat | cat meowing
Cattle | | cattle mooing
Ceiling fan | Mechanical fan | running electric fan
Cello | | playing cello
Chainsaw | Chainsaw | chainsawing trees
Cheetah | Roaring cats (lions, tigers) | cheetah chirrup
Chicken | Fowl |
Chime | Chime | wind chime
Clock | Clock |
Computer keyboard | Computer keyboard | typing on computer keyboard
Computer mouse | Mouse |
Corded phone | Dial tone | cell phone buzzing
Cutting board | Chopping (food) | chopping food
Dagger | Knife |
Digital clock | | alarm clock ringing
Dog | Dog | dog baying
Door | Door |
Door handle | Doorbell | door slamming
Drill (Tool) | Drill |
Drum | | playing drum kit
Duck | Quack | duck quacking
Eagle | | eagle screaming
Elephant | | elephant trumpeting
Fireplace | Fire |
Fixed-wing aircraft | Fixed-wing aircraft, airplane |
Fountain | Waterfall |
Fox | Canidae, wild dogs, wolves | fox barking
French horn | | playing french horn
Frog | Frog | frog croaking
Girl | Female speech, woman speaking |
Glasses | Glass |
Goat | Goat | goat bleating
Golf cart | Cart |
Goose | Ducks, geese, waterfowl | goose honking
Grinder | | electric grinder grinding
Guitar | | playing acoustic guitar
Hair dryer | Hair dryer | hair dryer drying
Hammer | Hammer |
Hand dryer | Hair dryer |
Handgun | Gunshot, gunfire | machine gun shooting
Harmonica | | playing harmonica
Harp | | playing harp
Harpsichord | | playing harpsichord
Helicopter | Helicopter | helicopter
Horse | Horse | horse neighing
Human face | Female speech, woman speaking / Male speech, man speaking |
Infant bed | | baby crying
Ipod | Music |
Jaguar (Animal) | Roaring cats (lions, tigers) | cheetah chirrup
Jet ski | Jet engine | skiing
Kettle | Steam whistle |
Kitchen knife | Knife |
Kitchen utensil | Kitchen and dining room sounds |
Knife | Knife |
Land vehicle | Vehicle | car passing by
Laptop | Typing | typing on computer keyboard
Leopard | Roar |
Light switch | Clicking |
Limousine | Car |
Lion | Roar | lions roaring
Magpie | Bird | magpie calling
Mammal | Animal |
Man | Male speech, man speaking |
Mechanical fan | Mechanical fan | running electric fan
Microphone | Microphone |
Microwave oven | Microwave oven |
Missile | | missile launch
Mixer | Blender, food processor |
Mobile phone | Telephone | cell phone buzzing
Motorcycle | Motorcycle | driving motorcycle
Mouse | Mouse |
Musical instrument | Music | orchestra
Musical keyboard | | playing piano
Oboe | | playing oboe
Otter | | otter growling
Owl | Owl | owl hooting
Paper cutter | Scissors | ripping paper
Parrot | Bird | parrot talking
Person | Female speech, woman speaking / Male speech, man speaking |
Piano | | playing piano
Pig | Pig | pig oinking
Popcorn | Burst, pop | popping popcorn
Power plugs and sockets | Power tool |
Pressure cooker | Steam |
Printer | Printer | printer printing
Rabbit | Rodents, rats, mice |
Ratchet (Device) | Ratchet, pawl |
Raven | Crow | crow cawing
Reptile | Snake |
Rifle | Machine gun | machine gun shooting
Rocket | | missile launch
Saxophone | | playing saxophone
Sea lion | | sea lion barking
Segway | Non-motorized land vehicle |
Sewing machine | Sewing machine | using sewing machines
Sheep | Sheep |
Shotgun | Gunshot, gunfire |
Shower | Shower |
Skateboard | Skateboard | skateboarding
Ski | | skiing
Snail | | hail
Snake | Snake | snake rattling
Snowboard | Skateboard | skiing
Snowmobile | Motorcycle |
Snowplow | Lawnmower |
Spoon | Kitchen and dining room sounds |
Stationary bicycle | Bicycle, tricycle | driving motorcycle
Swan | Quack |
Swimming pool | Water |
Sword | Knife |
Table tennis racket | | playing table tennis
Tablet computer | Computer keyboard | typing on computer keyboard
Tap | Tap |
Taxi | Car | hail
Telephone | Telephone | telephone bell ringing
Television | Television |
Tiger | Roar |
Toilet | Toilet flush | toilet flushing
Train | Train |
Truck | Truck |
Trombone | | playing trombone
Trumpet | | playing trumpet
Turkey | Turkey |
Unicycle | Bicycle bell |
Van | Car |
Vehicle | Vehicle | vehicle horn, car horn, honking
Violin | | playing violin, fiddle
Wall clock | Clock | alarm clock ringing
Washing machine | Washing machine |
Watch | Clock |
Wine glass | Glass |
Whale | | whale calling
Woman | Female speech, woman speaking / Male speech, man speaking |
Woodpecker | Wood | woodpecker pecking tree
Table 9: Image audio class mapping. We associate the image and audio classes from the Openimages and the AudioSet / VGGSound datasets and prepare a lookup table through careful manual inspection.
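One way such a look-up table might be represented and queried is sketched below. The rows are a small excerpt of Table 9; the split of each row into AudioSet and VGGSound columns is an assumption (capitalized labels read as AudioSet, lowercase as VGGSound), and the names CLASS_MAP and audio_label_for are hypothetical rather than taken from the authors' code.

```python
# Excerpt of Table 9. None marks a class with no counterpart in that dataset.
# Assumption: the column split follows label capitalization.
CLASS_MAP = {
    "Aircraft": {"audioset": "Aircraft", "vggsound": "airplane"},
    "Bicycle": {"audioset": "Bicycle, tricycle", "vggsound": None},
    "Cannon": {"audioset": None, "vggsound": "firing cannon"},
    "Cat": {"audioset": "Cat", "vggsound": "cat meowing"},
    "Dog": {"audioset": "Dog", "vggsound": "dog baying"},
}


def audio_label_for(openimages_label, source):
    """Return the audio class paired with an Openimages class, or None.

    `source` is either "audioset" or "vggsound".
    """
    entry = CLASS_MAP.get(openimages_label)
    if entry is None:
        return None
    return entry.get(source)


def image_labels_for(audio_label, source):
    """Reverse lookup: all Openimages classes mapped to a given audio class."""
    return sorted(name for name, entry in CLASS_MAP.items()
                  if entry.get(source) == audio_label)


print(audio_label_for("Aircraft", "vggsound"))
print(image_labels_for("Cat", "audioset"))
```

Keeping missing associations explicit as None makes it easy to skip classes that cannot be paired in a given audio dataset.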

Image Guided Audio Temporal Localization (IGATL).

  • Openimages-AudioSet (Strong): While curating the image samples we follow a similar strategy as before. To ensure a fair assessment we choose audio snippets that are considerably longer than the event of interest (EoI). However, through manual inspections, we ensure that the EoI lies within the extracted audio piece.

    Task | Example Instruction
    ARIG
    • Given the audio and image pair, identify the object category of the audio. Now, provide a bounding box for that object in the image. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the object category. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
    • From the given audio and image pair first identify the object category of the audio. Then localize the corresponding object in the image by providing a bounding box. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the object category. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
    • Given the audio and image pair, identify the object category of the audio. Now, localize the object in the image by providing a bounding box. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the object category. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
    • Considering the audio and image pair, determine the object class of the audio. Next, localize the same object in the image by providing a bounding box. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the class of the object. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
    • Considering the audio and image pair, recognize the object category of the audio. Subsequently, draw a bounding box around that object shown in the image. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the category of the object. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
    • Considering the audio and image pair, recognize the object category of the audio. Next, draw a bounding box around that object in the image. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the category of the object. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. Ensure the bounding box is within the range 0 to 1.
    IGATL
    • Identify the object category from the image. Now, find the time duration in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.
    • Given the image, identify the object category. Next, output the time window in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.
    • Which object do you see in the image? Please find the time window in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.
    • Recognise the object category from the image. Now, indicate the time duration in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.
    • What is the category of the object that you see in the image? Now, indicate the temporal duration in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.

    • Does the object inside the bounding box <placeholder_bbox> of the image produce the same sound as in the given audio? Answer in True or False.
    • Given the image, does the object inside the bounding box <placeholder_bbox> produce the same sound as in the given audio? Answer in True or False.
    \bullet The object inside the bounding box <placeholder_bbox> of the image produces the same sound as in the given audio. True or False?
    \bullet From the audio-image pair, verify if the object inside the bounding box <placeholder_bbox> produces the same sound as present in the given audio. Answer in True or False.
    \bullet The object in the given audio between time duration <placeholder_time> is present in the image. True or False?
    \bullet Listen to the audio in the time window <placeholder_time>. Does this object exist in the image? Answer in True or False.
    \bullet Listen to the audio in the time window <placeholder_time>. Verify if the same object is present in the image. True or False?
    AVFact \bullet The time segment <placeholder_time> contains the object as present in the image. True or False?
    \bullet Listen to the audio in the time window <placeholder_time>. The same object is within the bounding box <placeholder_bbox> in the image. True or False?
    \bullet Does the object inside the bounding box <placeholder_bbox> of the image produce the same sound as within the time duration <placeholder_time> in the given audio? Answer in True or False.
    \bullet The object inside the bounding box <placeholder_bbox> of the image produces the same sound as in the time segment <placeholder_time> of the audio. True or False?
    \bullet The time segment <placeholder_time> contains the object in the bounding box <placeholder_bbox> of the image. True or False?
    \bullet Here is an audio-image pair. Does the given audio correspond to the object shown in the image? Answer in True or False.
    \bullet Does the given audio correspond to the object shown in the image? Answer in True or False.
    \bullet Does the given audio associate with the object shown in the image? Answer in True or False.
    \bullet Here is an audio-image pair. Does the given image associate with the object sounding in the audio? Answer in True or False.
    \bullet How many instruments are sounding in the image?
    \bullet Which is the musical instrument that sounds at the same time as the <Object>?
    \bullet Is the <Object> on the <LR> louder than the <Object> on the <LR>?
    \bullet Is there a voiceover?
    AVQA \bullet Is the <Object> playing longer than the <Object>?
    AVC \bullet Considering the audio input, generate a caption for the image.
    Table 10: Task wise instructions template.
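The answer formats requested by the ARIG and IGATL templates in Tab. 10 lend themselves to a simple validity check on model outputs. The following is a minimal sketch of such a check, not part of the released pipeline: the function names are hypothetical, and the requirement that the top-left corner lies strictly above and to the left of the bottom-right corner is an assumption inferred from the template wording.

```python
def parse_arig_answer(text):
    """Parse '[<obj>,xLeft,yTop,xRight,yBottom]' into (label, box).

    Hypothetical helper: assumes the label itself contains no comma.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected the form [obj,xLeft,yTop,xRight,yBottom]")
    parts = [p.strip() for p in text[1:-1].split(",")]
    if len(parts) != 5:
        raise ValueError("expected one label and four coordinates")
    label = parts[0]
    x_left, y_top, x_right, y_bottom = (float(p) for p in parts[1:])
    box = (x_left, y_top, x_right, y_bottom)
    # The templates ask for coordinates within the range 0 to 1.
    if not all(0.0 <= c <= 1.0 for c in box):
        raise ValueError("coordinates must lie in [0, 1]")
    # Assumption: a valid box has positive width and height.
    if not (x_left < x_right and y_top < y_bottom):
        raise ValueError("top-left corner must precede bottom-right corner")
    return label, box


def parse_igatl_answer(text, max_time=30.0):
    """Parse '(tStart,tEnd)' and check 0 <= tStart < tEnd <= max_time."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("expected the form (tStart,tEnd)")
    parts = text[1:-1].split(",")
    if len(parts) != 2:
        raise ValueError("expected exactly two time stamps")
    t_start, t_end = float(parts[0]), float(parts[1])
    if not (0.0 <= t_start < t_end <= max_time):
        raise ValueError("need 0 <= tStart < tEnd <= max_time")
    return t_start, t_end
```

For instance, `parse_arig_answer("[dog, 0.1, 0.2, 0.5, 0.9]")` yields the label `"dog"` together with the four coordinates, while an out-of-range coordinate raises a `ValueError`.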
  • LLP: The LLP dataset provides fine-grained temporal annotations of the audio events in the format [onset, offset]. One representative image is chosen from within this time segment. While preparing our test set, we restrict ourselves to one category per video and their corresponding onset and offset values to rule out overlapping events within the same time interval.

Audio-Visual Fact-checking (AVFact).

  • Openimages-AudioSet: For Type 1, we collect samples from the AudioSet split, while for Types 2, 3, and 4 we choose samples from the AudioSet Strong split, as it contains the time-sensitive grounding information used in these three query types. For the image collection, we follow the same strategy as before.

7.3 Coarse-grained Data Preparation

For the coarse-grained tasks, we resort to direct adaptations of publicly available datasets.

Audio-Visual Question Answering. In the absence of audio class labels, we manually inspect the video to obtain the most suitable frame for each sample.

  • AVQA: The AVQA dataset contains the start time stamp which denotes the onset of the event of interest. We follow the same train/test split as proposed by the authors [106].

  • MUSIC-AVQA: We crop a 30s segment, within which the event of interest lies, from the original 1-minute-long video sequence.

Audio-Visual Captioning (AVC).

  • VALOR-32K: Each sample in the VALOR dataset comprises an elaborate caption of the audio-visual scene. We leverage this caption to calculate the CLIP similarity score between the image-text pair and obtain the top 3 most relevant frames from within the 10s long annotation as provided by the authors. Finally, we choose one representative frame through manual inspection.
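The CLIP-based frame pre-selection described for VALOR-32K can be sketched as a top-k ranking by cosine similarity. This is an illustrative sketch only: it assumes that frame and caption embeddings have already been extracted with a CLIP image and text encoder, and the function name `top_k_frames` is hypothetical. The final manual inspection step is not modeled.

```python
import numpy as np


def top_k_frames(frame_embs, caption_emb, k=3):
    """Rank frames by cosine similarity to the caption and keep the top k.

    frame_embs:  (num_frames, dim) array of precomputed image embeddings.
    caption_emb: (dim,) array holding the precomputed caption embedding.
    Returns (indices, scores), both ordered from most to least similar.
    """
    frame_embs = np.asarray(frame_embs, dtype=float)
    caption_emb = np.asarray(caption_emb, dtype=float)
    # L2-normalize so that the dot product equals cosine similarity.
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    scores = frame_embs @ caption_emb
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()


# Toy embeddings standing in for CLIP features of four frames.
toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_caption = [1.0, 0.0]
best_indices, best_scores = top_k_frames(toy_frames, toy_caption, k=3)
```

In this toy case the candidates handed to the annotator would be frames 0, 2, and 1, in that order.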

8 Dataset Instruction Templates

We add task-wise sample instruction templates in Tab. 10. To make the instruction tuning robust and incorporate sufficient diversity, we manually write a few instructions and prompt GPT-3.5 [6] to generate different variants. We further refine the instruction templates using GPT-4 [1]. Note that for AVQA and AV captioning tasks, we restrict ourselves to the questions and captions provided by the authors.

9 Dataset Statistics and Analysis

(a) Image-audio similarity scores across samples in Openimages-AudioSet.
(b) Category-wise average duration of chosen audio samples from AudioSet dataset.
(c) AVFact - Type 1: Top-6 category-wise number of True and False samples.
(d) AVFact - Type 2: Category-wise distribution of chosen audio samples from AudioSet dataset.
(e) AVFact - Type 3: Total audio time duration per image category of chosen samples.
(f) AVFact - Type 4: Category-wise distribution of samples from the Openimages dataset.
Figure 6: AVFIT statistics and analysis.

Image Audio Similarity. To study the similarity between the image-audio pairs [21] from Openimages-AudioSet, we utilize the CLIP [78] and CLAP [26] scores by calculating $\mathcal{S}_{\text{CLIP}}\,\mathcal{S}_{\text{CLAP}}^{T}$, where $\mathcal{S}\in\mathbb{R}^{N\times N}$ denotes the pairwise cross-modal similarity scores for a batch of size $N$. The CLIP similarity is calculated between the chosen image and the audio class label. Similarly, the CLAP score is calculated between the audio class label and the audio snippet. The text modality acts as the bridging modality in this case. Fig. 6(a) reports the image-audio similarity scores over the 9 most frequent categories, while ‘others’ denotes the aggregation of all the remaining ones. Note that the scores are normalized to the range [0,1], with 0 being the lowest. The average score of image-audio pairs across all samples collected for the audio referred image grounding task is 0.77, supporting a strong association between the two modalities.
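The text-bridged similarity above can be illustrated with a short NumPy sketch. It assumes precomputed embeddings (CLIP image and text features, CLAP audio and text features) for a batch of N image-audio-label triplets. The function names are hypothetical, and the min-max rescaling to [0,1] is an assumption, since the exact normalization is not specified here.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def image_audio_similarity(image_embs, clip_text_embs, audio_embs, clap_text_embs):
    """Compute S_CLIP @ S_CLAP^T with the class-label text as the bridge.

    S_CLIP[i, j]: similarity of image i to label j (CLIP embedding space).
    S_CLAP[m, j]: similarity of audio m to label j (CLAP embedding space).
    The product therefore scores image i against audio m via shared labels.
    """
    s_clip = l2_normalize(image_embs) @ l2_normalize(clip_text_embs).T
    s_clap = l2_normalize(audio_embs) @ l2_normalize(clap_text_embs).T
    s = s_clip @ s_clap.T
    # Assumed normalization: min-max rescaling of the batch scores to [0, 1].
    # A batch with identical scores everywhere would need special handling.
    return (s - s.min()) / (s.max() - s.min())


# Toy batch of three perfectly separable classes.
eye = np.eye(3)
toy_similarity = image_audio_similarity(eye, eye, eye, eye)
```

With perfectly separable toy embeddings, matched pairs on the diagonal receive a score of 1 and mismatched pairs a score of 0.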

Audio Duration. We report the category-wise mean duration (in sec.) of the audio samples from the AudioSet dataset in Fig. 6(b) for image-guided audio temporal localization task. The ‘Train’ class has the overall highest value with an average duration of 9.83 sec while the ‘Clicking’ category has the lowest average duration at 0.32 sec.

Class wise Robustness. We report class-wise (top 6 classes based on occurrence) True/False sample count from the Openimages dataset for AVFact - Type 1 set in Fig. 6(c). We maintain a good balance of matched and mismatched pairs to ensure our model is robust to deceptive queries.

AudioSet Distribution. Fig. 6(d) reports the class-wise distribution of samples present in the AVFact Type-2 set as collected from the AudioSet dataset.

Audio Duration Per Image Class. In Fig. 6(e) we present the duration of audio samples across various image classes from Openimages in the AVFact Type-3 split. This demonstrates the overall balanced mix of image-audio distributions across different pairings.

Category Wise Distribution. Fig. 6(f) presents image category-wise distributions of samples from the Openimages dataset for AVFact Type-4 samples.

(a) Qualitative samples on image grounding task.
(b) Qualitative results on audio temporal localization.
(c) More results on AVFact task.
Figure 7: Qualitative results of Meerkat on different fine-grained downstream tasks from the MeerkatBench.

10 More Qualitative Results

We provide additional qualitative results from Meerkat in Fig. 7 and Fig. 8. In Fig. 7(a) we show the image grounding capabilities of our model when queried with audio inputs. We observe that even for small objects or visual scenes with complex associations among different components, Meerkat can correctly identify the referred object. This underlines the fine-grained comprehension capabilities acquired by Meerkat during its training phase. Meerkat also exhibits strong audio temporal localization when prompted with an image. As evident from Fig. 7(b), our model is capable of precisely understanding audio samples and accurately identifying the temporal onset of an event and its duration, even in the presence of distractors and ambient sound. Fig. 7(c) depicts the fine-grained audio-visual comprehension capabilities of our method. Even when Meerkat is presented with noisy audio-visual samples and scenarios that demand detailed AV association understanding, our model can produce correct results with high accuracy. Our method is also adept at coarse-grained tasks like AVQA and AV captioning, as demonstrated in Fig. 8(a) and 8(b).

(a) Meerkat performance on AVQA task.
(b) Meerkat performance on AV Captioning task.
Figure 8: Qualitative results of Meerkat on different coarse-grained downstream tasks from the MeerkatBench.

11 More ablations

11.1 Other Image Encoders

We compare the performance of our model when employing different image encoders, as shown in Tab. 11. We observe the best performance with CLIP-ViT-B/16 [78] and use it as our preferred image encoder due to its compatibility with the instruction-guided image tokenizer module in our system.

| Image Encoder | VGG-SS (cIOU ↑) | LLP (F1-score ↑) | AVFact (Avg F1-score ↑) | AVQA (Avg Acc. ↑) | VALOR (CIDEr ↑) |
| --- | --- | --- | --- | --- | --- |
| CLIP-ViT-B/32 [78] | 42.56 | 49.64 | 0.78 | 84.04 | 73.39 |
| BLIP-ViT-B/16 [43] | 47.22 | 52.83 | 0.82 | 85.79 | 75.13 |
| CLIP-ViT-B/16 (Ours) | 48.51 | 54.96 | 0.85 | 87.14 | 76.84 |

Table 11: Meerkat performance with different image encoders

11.2 Other Audio Encoders

We carry out experiments with various audio encoders in Tab. 12, such as Open L3 [24, 3], WAV2CLIP [105], and Wav2Vec2 [4], and obtain the best performance with the CLAP [26] encoder. We attribute this performance boost to its Swin Transformer [55] based backbone, which extracts audio features from a log-Mel spectrogram. Owing to large-scale contrastive language-audio pretraining, which learns audio representations by pairing audio data with natural language descriptions, CLAP encoders are shown to perform exceptionally well on open-domain audio compared with speech-based encoders like Whisper [79].

| Audio Encoder | VGG-SS (cIOU ↑) | LLP (F1-score ↑) | AVFact (Avg F1-score ↑) | AVQA (Avg Acc. ↑) | VALOR (CIDEr ↑) |
| --- | --- | --- | --- | --- | --- |
| Open L3 [24, 3] | 44.52 | 51.28 | 0.76 | 83.29 | 72.38 |
| WAV2CLIP [105] | 45.34 | 51.94 | 0.78 | 84.46 | 73.77 |
| Wav2Vec2 [4] | 46.91 | 53.07 | 0.81 | 85.88 | 75.80 |
| CLAP audio encoder | 48.51 | 54.96 | 0.85 | 87.14 | 76.84 |

Table 12: Meerkat performance with different audio encoders

11.3 With Different LLM

We ablate our model and replace the LLM with other recent language models such as T5 [81], Vicuna [19], and Alpaca [92]. We observe a noticeable drop in performance when the LLM is not instruction-tuned compared to its instruction-tuned counterpart. This demonstrates the importance of leveraging instruction-tuned LLMs under a multi-modal instruction comprehension setup. We note instruction tuning allows equipping the LLM with a customized instructions template which results in improved performance under a multi-task setting, as demonstrated in Tab. 13.

| Model | VGG-SS (cIOU ↑) | LLP (F1-score ↑) | AVFact (Avg F1-score ↑) | AVQA (Avg Acc. ↑) | VALOR (CIDEr ↑) |
| --- | --- | --- | --- | --- | --- |
| T5 | 41.49 | 48.50 | 0.78 | 82.49 | 72.56 |
| Alpaca | 42.74 | 49.98 | 0.80 | 83.75 | 74.84 |
| Vicuna | 47.06 | 53.68 | 0.83 | 86.38 | 75.88 |
| Llama-2 | 48.51 | 54.96 | 0.85 | 87.14 | 76.84 |

Table 13: Ablative study under various LLMs.

11.4 Effect of $\lambda_{OT}$ and $\lambda_{AC}$

We ablate the loss hyperparameters $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$ and compare the performance of Meerkat on the ARIG and IGATL tasks in Fig. 9(a) and Fig. 9(b), respectively. Experimental results suggest that the best metrics are obtained with $\lambda_{\text{AC}} = 0.35$ and $\lambda_{\text{OT}} = 0.75$.

[Plot: cIoU versus $\lambda$, with one curve each for $\lambda_{\text{OT}}$ and $\lambda_{\text{AVACE}}$]
(a) cIoU upper bound on VGG-SS with varying weightage of $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$.
[Plot: AUC versus $\lambda$, with one curve each for $\lambda_{\text{OT}}$ and $\lambda_{\text{AVACE}}$]
(b) AUC upper bound on LLP with varying weightage of $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$.
Figure 9: Ablative experiments on (a) spatial and (b) temporal localization tasks. In (a) and (b) we keep $\lambda_{\text{AC}}$ and $\lambda_{\text{OT}}$ fixed at 0.35 and 0.75 respectively while varying the other $\lambda$.

12 Comparison with Contrastive Loss

We compare the optimal transport [111] based objective ($\mathcal{L}_{\text{OT}}$) with the contrastive loss-based approach [34, 68, 78] to facilitate weak alignment in Meerkat. Contrastive approaches operate on the level of global features and therefore only capture class-level information. Although such an alignment strategy may be beneficial in coarse-grained tasks, it is not suitable for tasks that require fine-grained understanding. Conversely, as employed in AVOpT, OT-based alignment operates on the level of patches in a weakly-supervised manner. Such a form of guidance is interpretable, since the optimized transport plan dictates the relationships between the cross-modal patch embeddings, and is therefore more suitable for fine-grained downstream tasks. Even though OT-based alignment strategies have been employed earlier for word-region level alignment [14, 77, 18], we are the first to introduce them in an audio-visual setting. We empirically find that using $\mathcal{L}_{\text{OT}}$ is superior in all the downstream tasks (refer to Tab. 14). Note that in both cases $\mathcal{L}_{\text{AC}}$ is employed to add strong supervision. Based on our results, we hypothesize that initial patch-level alignment with AVOpT yields high-quality representations which substantially assist AVACE in attending to the regions of interest, thereby improving localization performance, as opposed to using contrastive loss with AVACE.

| Loss | VGG-SS (cIOU ↑) | LLP (F1-score ↑) | AVFact (Avg F1-score ↑) | AVQA (Avg Acc. ↑) | VALOR (CIDEr ↑) |
| --- | --- | --- | --- | --- | --- |
| $\mathcal{L}_{\text{Contrastive}}$ | 46.95 | 52.28 | 0.81 | 86.31 | 74.98 |
| $\mathcal{L}_{\text{OT}}$ | 48.51 | 54.96 | 0.85 | 87.14 | 76.84 |

Table 14: Comparison against contrastive loss based weak alignment strategy

13 Comparison with Two-stage Training

We systematically study the effect of the two-stage vs. single-stage training paradigm. Inspired by recent works [25, 46] on fine-grained image understanding tasks, we design a two-stage experimental set-up. In stage I, we perform modality alignment between the image and audio encoders through weak supervision, by employing the AVOpT module. We do not use the LLM in stage I, and therefore the only objective we optimize is $\mathcal{L}_{\text{OT}}$. Stage I training is followed by stage II training involving the AVACE module to provide strong supervision. In stage II, we fine-tune the LLM using LoRA. Experimental results show comparable performance in both cases, as depicted in Tab. 15. We opt for single-stage training because it is not only marginally better in terms of performance (see Tab. 15), but also computationally efficient and less resource intensive.

| Model | VGG-SS (cIOU ↑) | LLP (F1-score ↑) | AVFact (Avg F1-score ↑) | AVQA (Avg Acc. ↑) | VALOR (CIDEr ↑) |
| --- | --- | --- | --- | --- | --- |
| Two-stage | 48.43 | 54.81 | 0.85 | 87.11 | 76.59 |
| Single-stage | 48.51 | 54.96 | 0.85 | 87.14 | 76.84 |

Table 15: Comparison against two stage training

14 Role of Audio in AVQA Task

To study the role of the audio modality and how effectively our model can encode audio information, we perform an ablation study by removing the audio information altogether and performing visual-only question answering. We note that the performance of our method drops significantly when only the visual modality is used to answer the same set of questions, underlining the role of the audio modality. Tab. 16 reports the quantitative results.

| Model | Exist ↑ | Localis ↑ | Count ↑ | World K ↑ | Temp ↑ | Avg ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Without audio | 83.62 | 79.28 | 80.46 | 78.49 | 69.26 | 78.22 |
| With audio | 88.24 | 86.65 | 84.60 | 87.05 | 86.55 | 86.61 |

Table 16: Role of audio modality. Quantitative results on AVQA dataset when the model is presented with and without audio.

15 More on Optimal Transport

Our AVOpT is responsible for cross-modal alignment of image and audio feature representations in a weakly-supervised manner. This is enabled by minimizing the Wasserstein distance ($\mathcal{D}_{\text{Wasserstein}}$) between the image and audio (spectrogram) patches and subsequently learning an optimal transport plan $\mathbf{\Omega}$. The detailed steps of Optimal Transport-based Wasserstein Distance ($\mathcal{L}_{\text{Wasserstein}}$) computation are outlined in Algorithm 2.

Algorithm 2 Meerkat: Wasserstein Distance Computation in AVOpT
1: Images: $\{I_i\}_{i=1}^{k}$; Audios: $\{A_j\}_{j=1}^{k}$; Total Optimal Transport steps: $\mathcal{S}_{\mathbf{\Omega}}$; Initial scaled unity matrix: $\bm{\sigma}=\frac{1}{k}\mathbf{1_k}$; Initial Transport Plan: $\mathbf{\Omega}^{(1)}=\mathbf{1}\mathbf{1}^{\top}$; Cosine similarity matrix: $\mathbf{C}_{ij}=c(I_i,A_j)$; similarity matrix decay factor: $\beta$; Scaled similarity matrix: $\mathbf{\Upsilon}_{ij}=\mathrm{e}^{-\frac{\mathbf{C}_{ij}}{\beta}}$.
2: Learned Optimal Transport Plan: $\mathbf{\Omega}$; Wasserstein Distance: $\mathcal{D}_{\text{Wasserstein}}$.
3: for $t\in\{1,2,3,\cdots,\mathcal{S}_{\mathbf{\Omega}}\}$ do
4:     $\mathbf{Q}\leftarrow\mathbf{\Upsilon}\odot\mathbf{\Omega}^{(t)}$ ▷ $\odot$ is the Hadamard product
5:     for $l\in\{1,2,3,\cdots,L\}$ do
6:         $\bm{\delta}\leftarrow\frac{1}{k\mathbf{Q}\bm{\sigma}}$, $\bm{\sigma}\leftarrow\frac{1}{k\mathbf{Q}^{\top}\bm{\delta}}$
7:     $\mathbf{\Omega}^{(t+1)}\leftarrow\text{diag}(\bm{\delta})\,\mathbf{Q}\,\text{diag}(\bm{\sigma})$
8: $\mathcal{D}_{\text{Wasserstein}}\leftarrow\langle\mathbf{C}^{\top},\mathbf{\Omega}\rangle$ ▷ $\langle\cdot,\cdot\rangle$ is the Frobenius dot-product
9: return $\mathbf{\Omega}$, $\mathcal{D}_{\text{Wasserstein}}$
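The steps of Algorithm 2 translate almost line by line into NumPy. The sketch below is an illustration under stated assumptions rather than the released implementation: the function names, the default values of `beta`, `n_steps`, and `n_inner`, and the toy inputs are hypothetical. The matrix `C` is treated as a cost matrix (the usage example builds it as one minus cosine similarity, which is an assumption), and $\langle\mathbf{C}^{\top},\mathbf{\Omega}\rangle$ is read as $\mathrm{trace}(\mathbf{C}^{\top}\mathbf{\Omega})=\sum_{ij}\mathbf{C}_{ij}\mathbf{\Omega}_{ij}$.

```python
import numpy as np


def wasserstein_distance(C, beta=0.5, n_steps=50, n_inner=1):
    """Iterative transport-plan update following Algorithm 2.

    C: (k, k) cost matrix between k image and k audio embeddings.
    Returns the transport plan Omega and the resulting distance.
    """
    C = np.asarray(C, dtype=float)
    k = C.shape[0]
    sigma = np.full(k, 1.0 / k)           # sigma = (1/k) * 1_k
    omega = np.ones((k, k))               # Omega^(1) = 1 1^T
    upsilon = np.exp(-C / beta)           # Upsilon_ij = exp(-C_ij / beta)
    for _ in range(n_steps):
        Q = upsilon * omega               # Hadamard product
        for _ in range(n_inner):
            delta = 1.0 / (k * (Q @ sigma))
            sigma = 1.0 / (k * (Q.T @ delta))
        omega = np.diag(delta) @ Q @ np.diag(sigma)
    # Assumed reading of <C^T, Omega>: trace(C^T Omega) = sum_ij C_ij * Omega_ij.
    distance = float(np.sum(C * omega))
    return omega, distance


def cosine_cost(x, y):
    """Assumed cost: one minus the cosine similarity between rows of x and y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


# Toy example: random stand-ins for image and audio patch embeddings.
rng = np.random.default_rng(0)
image_patches = rng.normal(size=(4, 8))
audio_patches = rng.normal(size=(4, 8))
plan, dist = wasserstein_distance(cosine_cost(image_patches, audio_patches))
```

Because $\bm{\sigma}$ is updated last in each inner step, every column of the returned plan sums to $1/k$. The row sums approach $1/k$ as the iterations converge.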

16 AVSBench Data Collection

Given the segmentation mask of an object, we consider the top-most, left-most, bottom-most and right-most points and draw horizontal and vertical projection lines as shown in Fig. 10. These lines intersect each other at four points which, when connected, give us the desired bounding box that completely encloses the object of interest. For each such bounding box we consider the coordinates $(x_{\text{Left}}, y_{\text{Top}})$ and $(x_{\text{Right}}, y_{\text{Bottom}})$ as GT labels.

Figure 10: Transformation from segmentation mask to bounding box.
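The mask-to-box transformation described above amounts to taking the extreme row and column indices of the foreground pixels. A minimal NumPy sketch follows. The function name is hypothetical, and the optional rescaling to [0,1] (using pixel edges, hence the `+ 1` on the far side) is an assumption chosen to match the coordinate range requested in the instruction templates.

```python
import numpy as np


def mask_to_bbox(mask, normalize=True):
    """Return (xLeft, yTop, xRight, yBottom) enclosing all nonzero mask pixels.

    mask: 2-D array of shape (height, width); nonzero entries mark the object.
    With normalize=False the result holds inclusive pixel indices.
    """
    mask = np.asarray(mask)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no foreground pixels")
    x_left, x_right = int(xs.min()), int(xs.max())
    y_top, y_bottom = int(ys.min()), int(ys.max())
    if not normalize:
        return (x_left, y_top, x_right, y_bottom)
    height, width = mask.shape
    # Assumed convention: the box spans pixel edges, so the far side is index + 1.
    return (x_left / width, y_top / height,
            (x_right + 1) / width, (y_bottom + 1) / height)


# Toy mask: a 2 x 4 rectangle inside a 4 x 8 image.
toy_mask = np.zeros((4, 8), dtype=np.uint8)
toy_mask[1:3, 2:6] = 1
pixel_box = mask_to_bbox(toy_mask, normalize=False)
unit_box = mask_to_bbox(toy_mask)
```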

17 Comparison against ImageBind

We employ modality-specific encoders from ImageBind [32] and compare them with the CLIP-CLAP combination used in Meerkat (Tab. 17). Empirical results suggest that our encoder combination performs slightly better than ImageBind. A theoretical analysis of this difference is beyond the scope of the current study and is left for future work.

| Image and Audio Encoders | VGG-SS (cIOU ↑) | LLP (F1-score ↑) | AVFact (Avg F1-score ↑) | AVQA (Avg Acc. ↑) | VALOR (CIDEr ↑) |
| --- | --- | --- | --- | --- | --- |
| ImageBind Encoders | 47.71 | 54.03 | 0.84 | 86.30 | 75.58 |
| CLIP-CLAP (ours) | 48.51 | 54.96 | 0.85 | 87.14 | 76.84 |

Table 17: Comparison against ImageBind.

18 Other Quantitative Metrics on AVQA task

We evaluate the performance of our method on two additional metrics from the AVQA task namely Counting (Count) and Comparative (Comp) and report the performance in Tab. 18. These two metrics along with the three other metrics (Existential, Localization, and Temporal reported in the main paper) complete the evaluation suite for the AVQA and MUSIC-AVQA tasks. We observe an overall steady performance of our method across these categorizations, by virtue of the excellent generalizability of Meerkat to coarse-grained tasks.

| Model | Generalist? | AVQA: Count ↑ | AVQA: World K ↑ | MUSIC-AVQA: Count ↑ | MUSIC-AVQA: Comp ↑ |
| --- | --- | --- | --- | --- | --- |
| AVSD [84] | | 63.89 | 61.52 | - | - |
| PanoAVQA [110] | | 64.91 | 64.22 | - | - |
| ST-AVQA [42] | | 70.80 | 66.01 | - | - |
| CAD [67] | | 76.37 | 74.88 | - | - |
| AVST [42] | | - | - | 68.22 | 63.31 |
| LAVISH [50] | | - | - | 73.28 | 63.49 |
| LAST [54] | | - | - | 75.23 | 65.60 |
| Macaw-LLM [60] | | 78.16 | 77.54 | 76.61 | 67.77 |
| PandaGPT [89] | | 78.92 | 78.02 | 79.06 | 70.58 |
| VideoLlama [112] | | 79.90 | 77.26 | 82.90 | 72.32 |
| X-InstructBLIP [70] | | 81.14 | 82.29 | 83.89 | 73.43 |
| Meerkat (ours) | | 84.60 | 87.05 | 85.70 | 75.98 |
| $\Delta_{\text{Meerkat}-\text{X-InstructBLIP}}$ | | +4.26% | +5.78% | +2.16% | +3.47% |

Table 18: Quantitative results on AVQA task. The reported numbers on AVQA dataset [106] are on the val split. For the MUSIC-AVQA dataset [42], results are reported on the balanced test set. Here, Count: Counting, Comp: Comparative.
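The $\Delta$ row of Tab. 18 is consistent with a relative gain over X-InstructBLIP, i.e. $100\cdot(\text{ours}-\text{baseline})/\text{baseline}$, rather than an absolute difference in points. The short check below recomputes the row from the two table rows above it. The helper name `relative_gain` is illustrative.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Columns: AVQA Count, AVQA World K, MUSIC-AVQA Count, MUSIC-AVQA Comp.
meerkat_scores = [84.60, 87.05, 85.70, 75.98]
x_instructblip_scores = [81.14, 82.29, 83.89, 73.43]

gains = [round(relative_gain(m, b), 2)
         for m, b in zip(meerkat_scores, x_instructblip_scores)]
print(gains)  # [4.26, 5.78, 2.16, 3.47]
```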

19 Evaluation Metrics

For the visual grounding task, we evaluate our model against other baselines on two key metrics to assess visual grounding effectiveness: Intersection over Union (IoU) and Area Under Curve (AUC). These metrics provide a comprehensive measure of our model’s ability to accurately localize visual elements in correlation with auditory cues. For the image-guided audio temporal localization task, we report the segment-level F-score. For the Audio-Visual Fact-checking (AVFact) task, we split the evaluation tasks into four different categories, each with its unique dimension of assessment. We report the Precision and Recall scores for each category. We report the performance of the audio-visual captioning task on several established metrics, including BLEU@4 [71], METEOR [5], ROUGE [48] and CIDEr [99]. Lastly, for audio-visual question answering, we follow [54, 67] and report 5 different types of audio-visual relationships, including Existential, Location, Counting, Temporal, and Comparative.
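As an illustration of the overlap measure underlying the grounding metrics, the sketch below computes the plain IoU between two axis-aligned boxes given as (xLeft, yTop, xRight, yBottom). It is a generic definition with a hypothetical function name, not the exact benchmark protocol: dataset-specific variants such as the cIoU reported on VGG-SS may differ in how ground truth is aggregated and thresholded.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two boxes (xLeft, yTop, xRight, yBottom)."""
    # Corners of the intersection rectangle.
    ix_left = max(box_a[0], box_b[0])
    iy_top = max(box_a[1], box_b[1])
    ix_right = min(box_a[2], box_b[2])
    iy_bottom = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix_right - ix_left) * max(0.0, iy_bottom - iy_top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 0.5 x 0.5 boxes offset by 0.25 share a 0.25 x 0.25 region.
overlap = box_iou((0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75))
```

In this example the intersection area is 0.0625 and the union is 0.4375, so the IoU equals 1/7.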

20 Failure Cases

Although Meerkat demonstrates impressive reasoning and grounding capabilities under various audio-visual settings, there are still some cases where the model fails to comprehend complex and obscured references, especially in cluttered environments or audio with multiple overlapping sounds. Fig. 11 demonstrates a few cases where our method produces suboptimal or sometimes incorrect inference results. In Fig. 11(a), due to the lack of visibility of the object of interest (Chainsaw), our model could not correctly identify the spatial region pertaining to it. Similarly, as the facial region of the speaker is not evident, Meerkat fails to correctly locate the active speaker. In Fig. 11(b), due to the overlapping audio of multiple instruments and the presence of ambient sound, our method could only partially capture the duration through which the guitar makes sound (refer to supplementary video). The same happens with the other temporal audio localization example, where the audio starts with a loud baby laughter sound which gradually fades with the adult person’s voice taking over. Our model could only identify the initial part of the baby’s sound. For the AVFact task (Fig. 11(c)), in the first example, due to the occluded facial region, our model produces the wrong output, whereas in the second example, due to the indistinguishable, cluttered and blurry background, Meerkat fails to correctly identify the flying bird.

(a) Failure cases on image grounding task.
(b) Failure cases on audio temporal localization.
(c) Failure cases on AVFact task.
Figure 11: Failure cases of Meerkat on different fine-grained tasks.

21 Ethics Statement

In this paper, we propose a novel framework for multi-modal LLMs that combines the audio and visual modalities. For all the tasks we leverage publicly available datasets and do not engage in collecting any private data. However, we acknowledge that the public datasets may have implicit bias [28]. While LLMs pre-trained on web-scale data inherently contain extensive knowledge about the real world, we recognize their potential learning bias as well. Moreover, these methods are prone to mistakes and might generate wrong or misleading results. The existing tools to measure various aspects of the LLM-generated outputs (e.g., toxicity [47]) are predominantly restricted to the language modality and are not applicable across other modalities.

It’s important for the users to recognize these limitations and proceed with caution, especially in scenarios where the precision and neutrality of results hold significant importance. Users are encouraged to thoroughly scrutinize and validate the outputs of the model to avoid the possibility of disseminating inaccurate information. We will publicly release the codebase and curated datasets to ensure reproducibility and encourage future research. Finally, during our data preparation stage, we don’t collect or use any personal/human subject data.