¹ University of Maryland, College Park; ² University of Toronto; ³ Mila and Université de Montréal; ⁴ King Abdullah University of Science and Technology (KAUST)
Project page – https://github.com/schowdhury671/meerkat
Email: {sanjoyc,rhgao,dmanocha}@umd.edu, sayan.nag@mail.utoronto.ca, subhrajyoti.dasgupta@umontreal.ca, {jun.chen,mohamed.elhoseiny}@kaust.edu.sa

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

Sanjoy Chowdhury*¹, Sayan Nag², Subhrajyoti Dasgupta³, Jun Chen⁴, Mohamed Elhoseiny†⁴, Ruohan Gao†¹, Dinesh Manocha†¹

Abstract

Leveraging Large Language Models’ remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench that unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.

Keywords: Audio-Visual LLM · AV Localization · AVFIT Dataset

Figure 1: We present Meerkat, an audio-visual LLM that can effectively ground both spatially and temporally in image and audio. Our model is adept in tasks that require fine-grained understanding such as audio referred image grounding, image guided audio temporal localization & audio-visual fact-checking. It can also be extended to perform coarse-grained tasks like audio-visual question answering & audio-visual captioning.

*Equal contribution. †Equal advising.

1 Introduction

Large Language Models (LLMs) [6, 97, 19, 20, 80] have demonstrated remarkable performance in various natural language processing tasks, achieving human-level accuracies in comprehension and reasoning abilities. Furthermore, powered by the emergent instruction fine-tuning paradigm [69, 23, 73], these language models can be equipped to follow open-ended natural language instructions, or even combined with other modalities, especially vision [2, 118, 7, 112, 51, 41, 113, 89, 61, 59, 60, 33]. Audio, though often complementary to the associated visual scene, remains largely under-explored in the context of LLMs. Building Multi-modal LLMs (MLLMs) that can listen may enable new applications in multimedia content analysis, multi-modal virtual assistants, education and training, etc.

Limited prior works (refer to Tab. 1) have incorporated audio in MLLMs [33, 70, 87]. However, they mostly focus on coarse-grained tasks such as captioning and question-answering, which are comparatively straightforward to subsume into an LLM interface [89, 60, 87, 112]. Although there have been some recent advancements in leveraging MLLMs for grounding [102, 116, 101, 109, 12, 13, 76], they either focus only on the visual modality [109, 12, 13, 76, 40] or struggle to capture fine-grained details occurring within audio-visual events due to insufficient joint modeling of the two modalities [60, 89, 112].

Our goal is to harness the power of LLMs for fine-grained audio-visual understanding. This is challenging mainly because: (i) there is a disparity of input and output formats across different tasks (e.g., image grounding from an audio query, image-guided audio temporal localization), (ii) no large-scale datasets exist for training audio-visual LLMs with grounding capabilities. Existing audio-visual LLMs [60, 89, 87] are restricted to coarse-grained tasks and do not incorporate cross-modality fusion, which is a crucial component for achieving fine-grained understanding and reasoning capabilities, as shown in [25, 46]. Although there exist individual models capable of handling image grounding (BuboGPT [116]) and temporal localization (TimeChat [83]) separately, they are either not suitable for open-domain audio (TimeChat) or are not trained in an end-to-end fashion (BuboGPT) (refer to Tab. 1).

In light of these challenges, we present Meerkat (see Fig. 1), named after meerkats, which are known for their strong spotting and listening abilities. Meerkat is the first unified audio-visual LLM framework that can effectively ground both spatially and temporally in image and audio, respectively. It has two crucial modules that are key to its strong capability in fine-grained understanding: a modality alignment module that learns the cross-modal alignment between image and audio patches in a weakly-supervised manner based on optimal transport, and a cross-modal attention module that is capable of enforcing consistency in the cross-attention heatmaps. Together, these two modules enable learning better joint audio-visual representations that subsequently enhance downstream tasks.

	Audio Types					Data Features
Model	Speech	Open-domain	Output Image Grounding	Output Audio Grounding	End-to-end	Convention	GPT-Prompted	Robustness
VideoLlama [112]	✓	✓	✗	✗	✓	✓	✗	✗
Macaw-LLM [60]	✓	✗	✗	✗	✓	✓	✓	✗
PandaGPT [89]	✓	✓	✗	✗	✓	✓	✗	✗
AV LLM [87]	✓	✓	✗	✗	✓	✓	✓	✗
X-InstructBLIP [70]	✗	✓	✓	✗	✓	✓	✗	✗
TimeChat [83]	✓	✗	✗	✓	✓	✓	✓	✗
BuboGPT [116]	✗	✓	✓	✗	✗	✓	✗	✗
Meerkat (ours)	✓	✓	✓	✓	✓	✓	✓	✓

Table 1: Comparison of Meerkat with recent Audio-Visual LLMs. ‘Convention’ refers to a collection of publicly available data that has been transformed using templates, ‘GPT-Prompted’ signifies if the generated instructions are obtained/refined employing GPT, and ‘Robustness’ is the model’s ability to tackle negative samples. We compare our method against these approaches in Sec. 5.

To support Meerkat, we further introduce MeerkatBench, which unifies five different audio-visual tasks (shown in Tab. 2): audio referred image grounding, image-guided audio temporal localization, audio-visual fact checking, audio-visual question answering, and audio-visual captioning (see Fig. 1 for examples). To enable training on these five tasks, we also curate a large dataset, AVFIT, which contains 3M instruction tuning samples of varying difficulty for learning fine-grained audio-visual semantics. Extensive experiments on these tasks demonstrate the effectiveness of our proposed model.

In summary, we make the following main contributions:

•

We present Meerkat, the first audio-visual LLM equipped with fine-grained spatio-temporal understanding that can ground in image and audio.
•

We introduce MeerkatBench that unifies five audio-visual learning tasks, and a new large instruction-tuning dataset AVFIT to enable learning fine-grained audio-visual semantics.
•

Evaluating on these five benchmark tasks, we set new state-of-the-art results on all of them with a relative improvement up to 37.12%.

2 Related Works

Multi-modal Large Language Models. Inspired by the success of instruction following capabilities of large language models [69, 19, 92], the community has recently started to leverage LLMs for understanding multi-modal contents. Powered by high-quality multi-modal instructional data, recent methods [118, 51, 41, 89, 7, 74, 13, 2] extend LLMs for multi-modal learning. Some approaches, such as MiniGPT4 [118], X-LLM [7], and Video-ChatGPT [61], perform latent alignment between the pre-trained LLM and other modalities via a learned visual encoder, while other methods, like Otter [41] and LLaMA-Adapter [113], learn cross-attention layers in the LLM to infuse multi-modal information. Prior works in the realm of LLMs predominantly focus on either visual-only inputs [41, 51, 118, 108] or tackle coarse-grained tasks [45, 61], leaving room for fine-grained audio-visual understanding. Unlike prior approaches, in this work, we focus on equipping LLMs with strong audio-visual comprehension abilities.

Task Granularity	Task Name	Dataset	Train	Test	Spatial Bounding Box	Time Interval	# Samples (Train / Test)	Metrics
Fine	Audio Referred Image Grounding	Openimages-AudioSet	✓	✗	✓	✗	1.07M / –	–
Fine	Audio Referred Image Grounding	Openimages-VGGSound	✓	✗	✓	✗	180K / –	–
Fine	Audio Referred Image Grounding	AVSBench^†	✓	✓	✓	✗	2.30K / 0.49K	cIOU, AUC
Fine	Audio Referred Image Grounding	VGGSS	✗	✓	✓	✗	– / 4.38K	cIOU, AUC
Fine	Audio Referred Image Grounding	PASCAL Sound	✗	✓	✓	✗	– / 0.56K	cIOU, AUC
Fine	Audio Referred Image Grounding	Flickr-Soundnet	✗	✓	✓	✗	– / 2.78K	cIOU, AUC
Fine	Image Guided Audio Temporal Localization	Openimages-AudioSet Strong	✓	✓	✗	✓	96.5K / 24.1K	F1-score
Fine	Image Guided Audio Temporal Localization	LLP	✗	✓	✗	✓	– / 2.32K	F1-score
Fine	Audio-Visual Fact-checking	Openimages-AudioSet	✓	✓	✗	✗	1.18M / 321K	F1-score
Coarse	AV Question Answering	AVQA	✓	✓	✗	✗	40.4K / 16.9K	Accuracy
Coarse	AV Question Answering	Music AVQA	✓	✓	✗	✗	25.7K / 7.36K	Accuracy
Coarse	AV Captioning	VALOR	✓	✓	✗	✗	25.0K / 3.50K	B@4, M, R, C

Table 2: Task-wise dataset distribution, dataset details, and metrics. We collect AVFIT, which is a collection of 12 datasets. We denote dataset-wise train/test usage. The visual grounding datasets contain spatial bounding box annotations while the audio temporal localization datasets contain time-interval annotations. We consider audio-visual fact-checking as a fine-grained task as it requires an understanding of spatio-temporal grounding information (refer to Sec. 3 for more details). Here B@4: BLEU@4, M: METEOR, R: ROUGE, C: CIDEr. For all our experiments we consider F1@0.5. ^† We obtain the bounding box from the segmentation maps.

Fine-grained Multi-modal Understanding. Of late, general-purpose multi-modal large language models have demonstrated their effectiveness in unifying a versatile array of vision-language or video-understanding tasks. These models, powered by LLMs [97, 98, 103, 104, 115, 93, 20] have superior reasoning and understanding capabilities. As a natural extension, MLLMs have been leveraged to unify region-based grounding tasks [74, 13, 12, 109, 101, 116, 40, 114, 102]. Despite significant strides, these models are still limited to fine-grained comprehension within a single modality. In this work, we propose Meerkat to precisely address this research gap under in-the-wild audio-visual event settings. To this end, we present a novel audio-visual task unification framework which promotes strong multi-modal reasoning and understanding capabilities.

LLM guided Task Unification. LLMs as an interface for task unification have seen massive advancements in recent times. Fuelled by the success of language models [107, 100, 57], the community has started to explore ways to unify generative and reasoning tasks under language models, leveraging their ease of accessibility. Various approaches [109, 12, 70, 45] present alternative ways to integrate new tasks within the scope of LLMs. Inspired by the success of these approaches, we present, to the best of our knowledge, the first approach to unifying fine-grained audio-visual tasks.

Audio-Visual Learning. Benefiting from the natural synchrony between the visual and the auditory modalities, audio-visual learning has opened up abundant applications including audio-visual sound source localization [64, 66, 90, 37], audio-visual sound separation [11, 91, 62], audio-visual segmentation [63, 117, 53], audio-visual question answering [106, 110, 42], audio-visual captioning [16, 15, 94]. Different from these lines of work that focus on a single task, we aim to harness the power of LLM to propose a multi-task learning setting by unifying five different audio-visual tasks with the LLM serving as a common interface.

3 Methodology

In this section, we introduce Meerkat. Fig. 2 provides an overview of our approach. We first discuss the multi-modal feature extraction, then introduce our novel audio-visual feature alignment modules, present the overall training objective, and finally elaborate on the numerical representations of the visual bounding boxes and time intervals.

Image Encoder. Given a batch of $k$ input images $\mathbf{I}=\{I_{i}\}^{k}_{i=1}$ with $I_{i}\in\mathbb{R}^{H\times W\times C}$, where $H$, $W$, $C$ represent the height, width and number of channels respectively, we employ a pretrained CLIP-ViT-B/16 [78] encoder $\mathcal{E}^{I}(\cdot)$ to extract the image embeddings. The $i^{\text{th}}$ image embedding is represented as $z_{I}\in\mathbb{R}^{\mathcal{S}_{I}\times\mathcal{D}_{I}}$, where $\mathcal{S}_{I}$ and $\mathcal{D}_{I}$ denote the number of image tokens and the hidden dimension respectively.

Audio Encoder. The audio encoder transforms the raw audio input into an audio embedding. We use the audio transformer backbone from CLAP [26] as our audio encoder due to its success in diverse audio tasks owing to its superior multi-modal alignment. We leverage this powerful pre-trained encoder ($\mathcal{E}^{A}(\cdot)$) to extract meaningful audio representations. We consider a batch of $k$ processed audio inputs $\textbf{A}=\{A_{i}\}^{k}_{i=1}$ with $A_{i}\in\mathbb{R}^{F\times T}$, where $F$ is the number of spectral components (e.g. Mel bins) and $T$ is the number of time bins. The $i^{\text{th}}$ audio embedding is denoted as $z_{A}\in\mathbb{R}^{\mathcal{S}_{A}\times\mathcal{D}_{A}}$, where $\mathcal{S}_{A}$ and $\mathcal{D}_{A}$ are the number of audio tokens and the hidden dimension respectively.

LLM. Meerkat adopts the open-sourced Llama 2-Chat (7B) [97] as the large language model backbone. The pre-trained LLM’s tokenizer projects the text sequence $T$ into embeddings $z_{T}\in\mathbb{R}^{\mathcal{S}_{T}\times\mathcal{D}_{T}}$, where $\mathcal{S}_{T}$ and $\mathcal{D}_{T}$ refer to the token length and hidden dimension respectively. Before the image and audio embeddings are passed into the LLM, they undergo transformations via additional linear layers to ensure that the embedding dimensions across different modalities remain consistent. Since the LLM serves as the unified interface for audio-visual inputs, we rely on the language tokens to carry out the individual tasks.
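The projection step described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the helper names (`linear`, `random_matrix`, `project_and_concat`), the toy dimensions, and the random weights are all assumptions, and the AVACE module that Algorithm 1 applies before concatenation is omitted here.

```python
import random


def linear(tokens, weight, bias):
    """Apply y = x W + b to every token (row) in `tokens`."""
    return [
        [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
         for j in range(len(bias))]
        for x in tokens
    ]


def random_matrix(rows, cols, rng):
    """Toy stand-in for encoder outputs or learned weights (hypothetical)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project_and_concat(z_img, z_aud, z_txt, img_proj, aud_proj):
    """Project image and audio tokens to the text hidden size, then
    concatenate image, audio and text tokens along the sequence axis."""
    w_img, b_img = img_proj
    w_aud, b_aud = aud_proj
    return linear(z_img, w_img, b_img) + linear(z_aud, w_aud, b_aud) + z_txt


rng = random.Random(0)
S_I, D_I = 4, 6  # image tokens, image hidden dim (toy values, assumed)
S_A, D_A = 3, 5  # audio tokens, audio hidden dim (toy values, assumed)
S_T, D_T = 2, 8  # text tokens, LLM hidden dim (toy values, assumed)

z_img = random_matrix(S_I, D_I, rng)
z_aud = random_matrix(S_A, D_A, rng)
z_txt = random_matrix(S_T, D_T, rng)
img_proj = (random_matrix(D_I, D_T, rng), [0.0] * D_T)
aud_proj = (random_matrix(D_A, D_T, rng), [0.0] * D_T)

z_avt = project_and_concat(z_img, z_aud, z_txt, img_proj, aud_proj)
print(len(z_avt), len(z_avt[0]))  # sequence length S_I + S_A + S_T, hidden size D_T
```

After projection, every token shares the hidden size $\mathcal{D}_{T}$, so the three modalities can be placed in one sequence for the LLM.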

Inspired by the success of recent pre-training frameworks in grounding tasks [25, 12, 46], we equip our model with two different levels of supervision: weak supervision through modality alignment module (AVOpT) and strong supervision through audio-visual consistency enforcement module (AVACE). We follow a single-stage training strategy and empirically show our method achieves similar performance compared to two-stage training (more details in the appendix).

Audio-Visual Optimal Transport Alignment Module (AVOpT). Weak supervision as a precursor to fine-grained supervision has been proven to be an effective training strategy in various tasks [25, 44]. Earth Mover Distance based algorithms [111] involving Optimal Transport (OT) methods [14] have been recently leveraged for patch-level alignment between the query and the support images in a siamese network [111]. Furthermore, in the context of vision-language models, OT-based algorithms have been employed for patch-word alignment [18]. As the image (CLIP) and audio (CLAP) encoders are trained separately, their learned embeddings lie in different semantic spaces. Our intuition is that such a patch-level alignment can improve vision and audio semantic consistency [31]. We experimentally demonstrate that this patch-level weak guidance is superior to contrastive loss-based [34, 68] global supervision (more details in appendix).

From a given image $I$ and audio $A$ pair, we obtain patch-level (local) feature embeddings $z_{I}$ and $z_{A}$ where, $z_{I}=\mathcal{E}^{I}(I);z_{A}=\mathcal{E}^{A}(A)$ . For modeling cross-modal relations by utilizing the inherent rich semantic structures in these feature representations, we generate two discrete distributions, represented by $\theta_{I}\in\mathbf{P}(\mathbb{Z}_{I})$ and $\theta_{A}\in\mathbf{P}(\mathbb{Z}_{A})$ , for image and audio respectively:

$$\theta_{I}=\sum_{k=1}^{M}u_{I}(k)\,\delta_{z_{I}}(k);\qquad\theta_{A}=\sum_{l=1}^{N}u_{A}(l)\,\delta_{z_{A}}(l)\tag{1}$$

where, $\sum_{k=1}^{M}u_{I}(k)=\sum_{l=1}^{N}u_{A}(l)=1$ , $u_{I}$ and $u_{A}$ being the respective weight vectors for the probability distributions $\theta_{I}$ and $\theta_{A}$ . $\delta_{z}$ is the Dirac delta function placed at support point $z$ in the embedding space [8]. The goal is to discern the optimal transport plan while matching these two distributions. Therefore, we compute the Wasserstein Distance (WD) between these probability distributions $\theta_{I}$ and $\theta_{A}$ while preserving the topological information during the cross-domain alignment process, mathematically given as follows:

$$\mathcal{L}_{\text{OT}}=\mathcal{D}_{\mathrm{Wasserstein}}(\theta_{I},\theta_{A})=\min_{\mathbf{\Omega}\in\mathrm{\Psi}(u_{I},u_{A})}\sum_{k}\sum_{l}\mathbf{\Omega}_{kl}\cdot\phi(z_{I}(k),z_{A}(l))\tag{2}$$

Here, $\mathrm{\Psi}(u_{I},u_{A})=\{\mathbf{\Omega}\in\mathbb{R}_{+}^{M\times N}\mid\mathbf{\Omega}\mathbf{1}_{N}=u_{I},\ \mathbf{\Omega}^{\top}\mathbf{1}_{M}=u_{A}\}$, $\phi(z_{I}(k),z_{A}(l))$ is the function computing the cosine distance between the cross-modal embedding pair, and $\mathbf{\Omega}$ is the transport plan, representing the amount of mass shifted from the distribution $\theta_{I}$ to the distribution $\theta_{A}$. An exact solution to the above expression leads to a sparse transport plan $\mathbf{\Omega}$ with at most $(2\cdot\max(M,N)-1)$ non-zero elements, yielding an explainable and robust cross-modal alignment. We defer additional details to the appendix.
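To make Eq. (2) concrete, the following sketch computes an approximate transport plan and the resulting cost with entropy-regularised Sinkhorn iterations. Note the assumptions: Sinkhorn is a common approximation that we substitute here for illustration (the solver actually used is described in the appendix, and the regularised plan is dense rather than sparse), the weights $u_{I}$ and $u_{A}$ are taken to be uniform, and the function names, `eps`, and `n_iters` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """phi in Eq. (2): 1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sinkhorn_ot(z_img, z_aud, eps=0.1, n_iters=200):
    """Approximate the OT plan between image and audio patch embeddings.

    Returns (plan, loss), where loss = sum_kl plan[k][l] * cost[k][l].
    Uniform marginals and entropic regularisation are assumptions.
    """
    M, N = len(z_img), len(z_aud)
    u_img = [1.0 / M] * M
    u_aud = [1.0 / N] * N
    cost = [[cosine_distance(zi, za) for za in z_aud] for zi in z_img]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]

    a = [1.0] * M  # row scaling factors
    b = [1.0] * N  # column scaling factors
    for _ in range(n_iters):
        a = [u_img[k] / sum(kernel[k][l] * b[l] for l in range(N)) for k in range(M)]
        b = [u_aud[l] / sum(kernel[k][l] * a[k] for k in range(M)) for l in range(N)]

    plan = [[a[k] * kernel[k][l] * b[l] for l in range(N)] for k in range(M)]
    loss = sum(plan[k][l] * cost[k][l] for k in range(M) for l in range(N))
    return plan, loss


# Toy example: three image patches, two audio patches (made-up embeddings).
z_img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_aud = [[1.0, 0.0], [0.0, 1.0]]
plan, loss = sinkhorn_ot(z_img, z_aud)
```

In this toy case most of the mass from the first image patch flows to the first audio patch, since their cosine distance is zero, which is the patch-level matching behaviour the alignment loss is meant to encourage.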

Audio-Visual Attention Consistency Enforcement Module (AVACE). Cross-modal interaction is essential for aligning the audio and visual modalities. Moreover, region-level supervision can encourage efficient localization. Inspired by the success of recent methods [25, 22, 86], we employ an adapter-based cross-attention strategy for efficient sound source localization. The modality-specific features in AVOpT lack awareness [38] of information from alternative modalities which can be infused through cross-modal attention. Therefore, to enable the audio-visual cross-modal reciprocity, we propose the AVACE module.

Although in a multi-modal context, feature fusion through a cross-attention scheme is effective in attending to relevant objects in the image, inconsistencies may arise, such as attended regions being dispersed throughout the image, including background objects. The reasons can be attributed to the quality of interplay between the feature embeddings. Consider the CLAP audio encoder, pre-trained with examples such as ‘a man playing the violin’ (refer Fig. 2) paired with audio of a violin: the cross-modal knowledge of the audio representations encourages it to focus on both the man and the violin in the image. Therefore, to ensure superior region-level alignment, we confine the cross-modality attention map ($\mathcal{A}^{c}$) within the boundaries of the object of interest, denoted by the ground-truth bounding box. Considering a bounding box represented as $[x_{\text{Left}},y_{\text{Top}},x_{\text{Right}},y_{\text{Bottom}}]$, we define a mask $\mathcal{M}$ such that $\mathcal{M}(y_{\text{Top}}:y_{\text{Bottom}},x_{\text{Left}}:x_{\text{Right}})=1$ and $0$ otherwise. Our goal is to maximize the attention within this bounding box and minimize it elsewhere. Therefore, we mathematically formulate the attention consistency objective $\mathcal{L}_{\text{AC}}$ as follows:

$$\mathcal{L}_{\text{AC}}=\lambda_{1}\left(1-\frac{\sum_{i,j}\mathcal{M}(i,j)\mathcal{A}^{c}(i,j)}{\sum_{i,j}\mathcal{M}(i,j)+\epsilon_{1}}\right)+\lambda_{2}\left(\frac{\sum_{i,j}\left(1-\mathcal{M}(i,j)\right)\mathcal{A}^{c}(i,j)}{\sum_{i,j}\left(1-\mathcal{M}(i,j)\right)+\epsilon_{2}}\right)\tag{3}$$

Here, $\mathcal{A}^{c}$ denotes the audio-visual cross-modality attention, $(i,j)$ represents the pixel location, $\lambda_{1}$ , $\lambda_{2}$ are the loss hyper-parameters (we keep $\lambda_{1}=\lambda_{2}=0.5$ ), and $\epsilon_{1}$ , $\epsilon_{2}$ are the stability factors respectively. In Sec. 5, we demonstrate that $\mathcal{L}_{\text{AC}}$ encourages efficient localization and audio-visual alignment of the cross-attention maps, eventually leading to improved fine-grained cross-modal representations for downstream tasks.
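Eq. (3) can be evaluated directly on a small attention map, as in the sketch below. The function names are hypothetical; we additionally assume attention values in $[0,1]$, integer pixel coordinates for the box, half-open index ranges for the mask, and $\epsilon_{1}=\epsilon_{2}=10^{-6}$, none of which is specified in the text above.

```python
def box_mask(height, width, box):
    """Binary mask M: 1 inside [x_left, y_top, x_right, y_bottom], else 0.

    Index ranges are half-open (y_top <= i < y_bottom), an assumption.
    """
    x_left, y_top, x_right, y_bottom = box
    return [
        [1.0 if (y_top <= i < y_bottom and x_left <= j < x_right) else 0.0
         for j in range(width)]
        for i in range(height)
    ]


def attention_consistency_loss(attn, mask, lam1=0.5, lam2=0.5,
                               eps1=1e-6, eps2=1e-6):
    """Eq. (3): reward attention inside the box, penalise it outside."""
    cells = [(m, a) for m_row, a_row in zip(mask, attn)
             for m, a in zip(m_row, a_row)]
    inside = sum(m * a for m, a in cells)
    inside_area = sum(m for m, _ in cells)
    outside = sum((1.0 - m) * a for m, a in cells)
    outside_area = sum(1.0 - m for m, _ in cells)
    return (lam1 * (1.0 - inside / (inside_area + eps1))
            + lam2 * (outside / (outside_area + eps2)))


mask = box_mask(4, 4, (1, 1, 3, 3))           # 2x2 box in a 4x4 map
perfect = [row[:] for row in mask]            # attention exactly on the box
inverted = [[1.0 - m for m in row] for row in mask]

print(attention_consistency_loss(perfect, mask))   # close to 0
print(attention_consistency_loss(inverted, mask))  # close to 1
```

With $\lambda_{1}=\lambda_{2}=0.5$, the loss ranges from roughly 0 (all attention inside the box) to roughly 1 (all attention outside it).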

Our overall training objective comprises a combination of three sub-objectives: cross-entropy loss ( $\mathcal{L}_{\text{CE}}$ ), weak AV alignment loss ( $\mathcal{L}_{\text{OT}}$ ), and attention consistency loss ( $\mathcal{L}_{\text{AC}}$ ). These losses are added together to obtain the final training loss for Meerkat given as:

$$\mathcal{L}_{\text{Meerkat}}=\mathcal{L}_{\text{CE}}+\lambda_{\text{OT}}\cdot\mathcal{L}_{\text{OT}}+\lambda_{\text{AC}}\cdot\mathcal{L}_{\text{AC}}\tag{4}$$

Here, $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$ are the loss weighting factors. We provide Algorithm 1 outlining the overall training procedure.

Algorithm 1 Meerkat: Training
1: Input: Image: $I$; Audio: $A$; Textual Instruction: $T$; Pre-trained LLM: $\mathcal{E}^{\text{LLM}}(\cdot)$; LLM Tokenizer: $\tau^{\text{LLM}}(\cdot)$; Pre-trained Image Encoder: $\mathcal{E}^{I}(\cdot)$; Pre-trained Audio Encoder: $\mathcal{E}^{A}(\cdot)$; AVACE Module: $\text{AVACE}(\cdot,\cdot)$; Masks from GT Bounding-Boxes: $\mathcal{M}$; Loss Hyperparameters: $\lambda_{\text{OT}},\lambda_{\text{AC}}$; GT Tokens: $\phi_{\text{GT}}$
2: Output: Fine-tuned LLM: $\mathcal{E}^{T}(\cdot)$; Trained AVACE Module: $\text{AVACE}(\cdot,\cdot)$; Predicted Tokens: $\phi_{\text{pred}}$
3: $z_{I}\leftarrow\mathcal{E}^{I}(I);\ z_{A}\leftarrow\mathcal{E}^{A}(A)$ $\triangleright$ Obtain Visual and Audio Embeddings.
4: $z_{T}\leftarrow\tau^{\text{LLM}}(T)$ $\triangleright$ Tokenize and Obtain Textual Encodings.
5: $\tilde{z}_{I},\tilde{z}_{A},\mathcal{A}^{c}\leftarrow\text{AVACE}(z_{I},z_{A})$ $\triangleright$ Obtain Audio-Visual Projections, Cross-Attn Map.
6: $z_{AVT}\leftarrow(\tilde{z}_{I}\parallel\tilde{z}_{A}\parallel z_{T})$ $\triangleright$ Concatenate Embeddings.
7: $\phi_{\text{pred}}\leftarrow\mathcal{E}^{\text{LLM}}(z_{AVT})$ $\triangleright$ LLM Output.
8: $\mathcal{L}_{\text{Meerkat}}\leftarrow\mathcal{L}_{\text{CE}}(\phi_{\text{pred}},\phi_{\text{GT}})+\lambda_{\text{OT}}\cdot\mathcal{L}_{\text{OT}}(z_{I},z_{A})+\lambda_{\text{AC}}\cdot\mathcal{L}_{\text{AC}}(\mathcal{A}^{c},\mathcal{M})$
9: Optimize model parameters to reduce $\mathcal{L}_{\text{Meerkat}}$ until convergence.
10: return $\mathcal{E}^{T}(\cdot)$, $\text{AVACE}(\cdot,\cdot)$, $\phi_{\text{pred}}$

Representation of Box Location. We embed the location of bounding boxes with numerical values in the natural language sequence. A box is represented intuitively by its top-left and bottom-right corners, i.e., [ $x_{\text{Left}}$ , $y_{\text{Top}}$ , $x_{\text{Right}}$ , $y_{\text{Bottom}}$ ]. Notably, these values are normalized by the size of the image to which the box belongs. These coordinates may appear in either the input or the output sequences depending on the task. For instance, in the Audio Referred Image Grounding task, Meerkat predicts the bounding box of the object of interest, whereas, for the Audio-Visual Fact-checking task, the text input to Meerkat might contain the box coordinates.
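As an illustration of this representation, the sketch below normalizes a pixel-space box by the image size, writes it into a text string, and recovers it from a model response. The exact textual format (two decimals, square brackets, comma separation) and the function names are assumptions made for this example; the actual instruction formats are given in the appendix.

```python
import re


def box_to_text(box, img_w, img_h, precision=2):
    """Normalize [x_left, y_top, x_right, y_bottom] by image size and format it."""
    x_left, y_top, x_right, y_bottom = box
    values = [x_left / img_w, y_top / img_h, x_right / img_w, y_bottom / img_h]
    return "[" + ", ".join(f"{v:.{precision}f}" for v in values) + "]"


def text_to_box(text, img_w, img_h):
    """Find the first [a, b, c, d] in `text` and map it back to pixels.

    Returns None when no well-formed box is present.
    """
    number = r"\s*(\d+(?:\.\d+)?)\s*"
    match = re.search(r"\[" + number + "," + number + "," + number + "," + number + r"\]", text)
    if match is None:
        return None
    x_left, y_top, x_right, y_bottom = (float(g) for g in match.groups())
    return (x_left * img_w, y_top * img_h, x_right * img_w, y_bottom * img_h)


encoded = box_to_text((64, 48, 320, 240), img_w=640, img_h=480)
print(encoded)  # [0.10, 0.10, 0.50, 0.50]
decoded = text_to_box(f"The sounding object is located at {encoded}.", 640, 480)
```

Because the coordinates are plain text, the same string can be placed in a prompt (fact-checking) or parsed from the output (grounding).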

Representation of Time Segment. We embed the time interval information using numerical figures in the natural language expression. A time segment is intuitively represented by its start and end times, i.e., [tStart, tEnd], designating the onset and end of an event or an activity. Similar to boxes, these representations may appear in either the input or the output sequences depending on the task. For instance, in the Image Guided Audio Temporal Localization task, the model predicts the time interval within which the queried event might have occurred, while for Audio-Visual Fact-checking, the input sequence might contain a reference time window. We add more details on the instruction preparation formats in the appendix.
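A predicted time segment can be read back from text and compared with a reference window, as sketched below. The bracketed format and the function names are assumptions, and temporal IoU is shown only as a generic overlap measure between two intervals; it is not claimed to be the metric reported in our tables.

```python
import re


def text_to_segment(text):
    """Extract the first [tStart, tEnd] pair (in seconds) from `text`.

    Returns None when no pair is found or when tStart >= tEnd.
    """
    match = re.search(r"\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]", text)
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    return (start, end) if start < end else None


def temporal_iou(seg_a, seg_b):
    """Intersection over union of two (start, end) intervals."""
    intersection = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - intersection
    return intersection / union if union > 0 else 0.0


predicted = text_to_segment("The dog barks during [2.0, 6.0].")
reference = (4.0, 8.0)
overlap = temporal_iou(predicted, reference)  # 2 s shared out of a 6 s union
```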

4 MeerkatBench: A Unified Benchmark Suite for Fine-grained Audio-Visual Understanding

Multi-modal conversation as an emergent ability is gaining prominence in the context of MLLMs. Although a line of research [109, 76, 12] addresses vision-language tasks, extension to other modalities such as audio is relatively underexplored. The task’s difficulty escalates further when an intricate understanding of the modality-specific information is required. Moreover, no publicly available dataset specifically facilitates such tasks. One of our primary contributions is to introduce a novel audio-visual fine-grained task unification benchmark. To this end, we present MeerkatBench comprising three fine-grained tasks: (i) audio referred image grounding, (ii) image guided audio temporal localization, (iii) audio-visual fact-checking, and two coarse-grained tasks: (iv) audio-visual question answering, (v) audio-visual captioning.

In this section, we present AVFIT, an AV instruction tuning dataset comprising 3M multi-modal dialogues for model training. AVFIT consists of samples collected in the following ways: (i) suitable adaptation of public datasets and (ii) instruction-tuning data generation via prompting GPT-3.5 [6]. Next, we discuss the data curation procedure:

Adaptation of Public Datasets. Depending on the task and availability of datasets, we either collect the image-audio pairs directly from the publicly available datasets (VGG-SS [9], AVSBench [117], Flickr-SoundNet [85], LLP [95], AVQA [106], MUSIC-AVQA [42], VALOR [15]) or follow a semi-automated strategy to prepare the pairs by forming matching image-audio pairs from large-scale datasets having visual grounding annotation such as Openimages [39], PASCAL [27] and audio event datasets like AudioSet/AudioSet Strong [30], VGG-Sound [10]. We retain the original category labels (‘Existential’, ‘Temporal’, etc.) from MUSIC-AVQA. To get similar insights in the AVQA dataset, we categorise every sample into one of the ‘Existential’, ‘Temporal’, ‘Localisation’, ‘Count’ and ‘World Knowledge’ categories. During the direct collection of pairs, we augment the audio snippet with a carefully chosen representative frame from the associated video. On the other hand, while forming pairs ourselves, we refer to a lookup table which we prepare beforehand by matching the corresponding class labels from the image and the audio datasets (more details in the appendix). We associate each image sample with its counterpart from the audio dataset. Finally, we supplement the image-audio pairs with the generated instructions as explained next. Task-wise dataset details can be found in Tab. 2.
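The lookup-table pairing can be sketched as follows. The table entries, sample identifiers, function name, and the random choice among candidate audio clips are all hypothetical; the actual label matching is described in the appendix.

```python
import random
from collections import defaultdict

# Hypothetical lookup table: image-dataset class label -> audio-dataset class label.
LABEL_LOOKUP = {
    "Dog": "Bark",
    "Violin": "Violin, fiddle",
}


def pair_by_label(image_items, audio_items, lookup, seed=0):
    """Pair each (image_id, label) with an audio clip of the matching class.

    Images whose label has no entry in `lookup`, or no matching audio
    clip, are skipped.
    """
    rng = random.Random(seed)
    audio_by_label = defaultdict(list)
    for audio_id, audio_label in audio_items:
        audio_by_label[audio_label].append(audio_id)

    pairs = []
    for image_id, image_label in image_items:
        candidates = audio_by_label.get(lookup.get(image_label), [])
        if candidates:
            pairs.append((image_id, rng.choice(candidates)))
    return pairs


images = [("img_001", "Dog"), ("img_002", "Violin"), ("img_003", "Zebra")]
audios = [("aud_101", "Bark"), ("aud_102", "Violin, fiddle")]
pairs = pair_by_label(images, audios, LABEL_LOOKUP)
print(pairs)  # [('img_001', 'aud_101'), ('img_002', 'aud_102')]
```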

GPT-Assisted Instruction Generation. Instruction tuning datasets [51, 58, 75, 35] have primarily focused on coarse-grained details like global image descriptions in the form of captioning or question answering without explicitly capturing fine-grained details. In this work, we aim to bridge this gap by introducing AVFIT, which promotes region-level and time-sensitive understanding in the following ways: (i) AVFIT includes spatial coordinates of objects of interest (bounding boxes) along with corresponding audio snippets, which leverages the synergy between audio-visual data. (ii) The designed dialogues include audio time intervals in the input, the output, or both. (iii) To generate high-quality instructions, we manually write a few example descriptions of each task and resort to GPT-3.5 [6] to create different variations. For further refinement of the generated dialogues we re-prompt GPT-4 [1] to ensure quality by reducing its context size. During training, we randomly pick one instruction for each sample. Fig. 2 illustrates a sample instruction from MeerkatBench. We use special tokens <image>, <audio>, <obj> which we later replace with instruction-guided image, audio and object categories respectively to generate prefix-based prompting.
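The placeholder substitution can be illustrated with a short sketch. The template wordings, the stand-in strings used for the image and audio slots, and the function name are assumptions for this example only; in the actual model the image and audio slots carry embeddings rather than text.

```python
import random

# Hypothetical instruction templates for one task; real templates are GPT-generated.
TEMPLATES = [
    "<image> <audio> Locate the <obj> that is producing the sound.",
    "<image> <audio> Where in the image is the <obj> heard in the audio?",
]


def build_prompt(templates, image_slot, audio_slot, obj_name, rng):
    """Pick one template at random and fill in the three special tokens."""
    template = rng.choice(templates)
    return (template
            .replace("<image>", image_slot)
            .replace("<audio>", audio_slot)
            .replace("<obj>", obj_name))


rng = random.Random(0)
prompt = build_prompt(TEMPLATES, "[IMAGE_FEATURES]", "[AUDIO_FEATURES]", "violin", rng)
print(prompt)
```

Sampling a different template per training sample exposes the model to varied phrasings of the same task.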

5 Experiments and Results

To the best of our knowledge, Meerkat is the first MLLM that unifies audio-visual spatial and temporal grounding, alongside possessing strong reasoning capabilities. We carefully choose the closest baseline for each task and suitably adapt them for fair comparisons. Owing to BuboGPT’s [116] spatial localization ability, we select it as our baseline for the audio referred image grounding task. Most similar in spirit to our image guided audio temporal localization task is TimeChat [83], which leverages the pre-trained VideoLlama model and instruction-tunes it to tackle temporal grounding tasks. Due to their audio-visual comprehension abilities, we use X-InstructBLIP [70], Macaw-LLM [60], PandaGPT [89], and VideoLlama [112] as baselines for the audio-visual fact-checking, AV question answering, and AV captioning tasks. Please refer to Tab. 1 for an overview of the characteristics of the generalist baselines. For specialist baselines, refer to the corresponding task tables. We finetune all baselines on our datasets, excluding the Openimages-AudioSet and Openimages-VGGSound train splits of the audio referred image grounding task.

Audio Referred Image Grounding (ARIG). This task involves visual grounding by predicting the coordinates of a bounding box around the object of interest guided by the input audio. We prepare 1.2M image-audio-instruction pairs using the steps explained in Sec. 4. We add details of the input instruction format and model output in the appendix. Meerkat achieves superior performance on the sounding object localization task, setting a new benchmark as shown in Tab. 3.

		VGG-SS		Flickr-SoundNet		PascalSound		AVSBench
Models	Generalist?	cIoU $\uparrow$	AUC $\uparrow$	cIoU $\uparrow$	AUC $\uparrow$	cIoU $\uparrow$	AUC $\uparrow$	cIoU $\uparrow$	AUC $\uparrow$
SSPL [88]	✗	33.90	38.00	76.70	60.50	51.72	39.79	61.32	48.44
EZ-VSL [65]	✗	38.85	39.54	83.94	63.60	51.90	40.25	60.06	49.64
SSL-TIE [52]	✗	38.63	39.65	79.50	61.20	52.14	40.44	62.88	51.28
SLAVC [64]	✗	39.80	–	86.00	–	52.29	42.19	63.39	51.07
MarginNCE [72]	✗	39.78	40.01	85.14	64.55	53.61	45.52	65.85	52.92
HearTheFlow [29]	✗	39.40	40.00	84.80	64.00	55.48	47.40	67.49	54.39
FNAC [90]	✗	41.85	40.80	85.14	64.30	57.38	48.03	68.78	56.19
Alignment [86]	✗	42.64	41.48	82.40	64.60	58.34	49.86	71.57	57.52
BuboGPT [116]	✓	40.31	39.68	81.17	62.29	58.52	51.63	74.33	59.49
Meerkat (ours)	✓	48.51	45.62	88.35	67.88	65.23	56.10	79.82	65.35
$\Delta_{\text{Meerkat}-\text{BuboGPT}}$	✓	+20.34%	+14.97%	+8.85%	+8.97%	+11.47%	+8.66%	+7.39%	+9.85%

Table 3: Audio referred image grounding results. For AVSBench we follow the same train/test splits for all methods. We use the VGG-SS, Flickr-SoundNet, and PascalSound datasets only for evaluation.
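The $\Delta$ row in Table 3 is consistent with a relative (not absolute) improvement over the generalist baseline. The short sketch below, with a hypothetical function name, reproduces two of its entries from the table values.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# VGG-SS cIoU: Meerkat 48.51 vs. BuboGPT 40.31
print(round(relative_improvement(48.51, 40.31), 2))  # 20.34
# Flickr-SoundNet cIoU: Meerkat 88.35 vs. BuboGPT 81.17
print(round(relative_improvement(88.35, 81.17), 2))  # 8.85
```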

		LLP	AudioSet Strong
Models	Generalist?	F1-score $\uparrow$	F1-score $\uparrow$
AVE [96]	✗	35.47	37.42
AVSDN [49]	✗	37.15	41.48
AVVP [95]	✗	48.93	49.20
TimeChat [83]	✓	51.28	54.66
Meerkat (ours)	✓	54.96	56.85
$\Delta_{\text{Meerkat}-\text{TimeChat}}$	✓	+7.18%	+4.01%

Table 4: Image guided audio temporal localization results. We report the segment level F1-scores and attribute our performance gain over specialist models to our multi-task learning strategy.

	Type 1	Type 2	Type 3	Type 4
Model	F1-score $\uparrow$	F1-score $\uparrow$	F1-score $\uparrow$	F1-score $\uparrow$
Macaw-LLM [60]	0.65	0.70	0.56	0.77
PandaGPT [89]	0.67	0.70	0.66	0.70
VideoLlama [112]	0.71	0.72	0.72	0.78
BuboGPT [116]	0.72	0.66	0.67	0.70
X-InstructBLIP [70]	0.73	0.72	0.72	0.80
TimeChat [83]	0.74	0.76	0.74	0.82
Meerkat (ours)	0.85	0.83	0.84	0.88
$\Delta_{\text{Meerkat}-\text{TimeChat}}$	+14.86%	+9.21%	+13.51%	+7.32%

Table 5: Audio-Visual fact-checking requires powerful reasoning capabilities across audio-visual modalities.

Image Guided Audio Temporal Localization (IGATL). When prompted to indicate the time interval within which a certain audio event occurs, Meerkat produces accurate time bounds in the form [tStart, tEnd], where tStart and tEnd are the start and end times, respectively. For all our experiments, we fix the audio duration at 30s. Different from prior visual grounding-based approaches [109, 76, 12], we present a new audio event localization task and set a new baseline for it. We attribute the superior performance of our method on the fine-grained audio temporal localization task to our specially designed AVOpT and AVACE modules, which ensure superior modality-specific guidance. Fig. 3 demonstrates that our model can locate a precise time interval associated with an audio event. Tab. 4 reports the quantitative comparison of our method against other baselines.
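
To make the segment-level F1-score concrete, the sketch below scores one predicted [tStart, tEnd] interval against a ground-truth interval. It is an illustration under stated assumptions, not the evaluation code used for the reported numbers: the 1-second segment length, the overlap rule, and the function names `interval_to_segments` and `segment_f1` are all hypothetical. Only the 30s clip duration comes from the text.

```python
def interval_to_segments(t_start: float, t_end: float,
                         duration: float = 30.0,
                         seg_len: float = 1.0) -> list[bool]:
    """Mark every fixed-length segment that overlaps [t_start, t_end].

    seg_len = 1.0 s is an assumption made for this sketch.
    """
    n_segments = int(round(duration / seg_len))
    return [t_start < (i + 1) * seg_len and t_end > i * seg_len
            for i in range(n_segments)]


def segment_f1(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """F1 over per-segment activity labels for one predicted interval."""
    p = interval_to_segments(*pred)
    g = interval_to_segments(*gt)
    tp = sum(a and b for a, b in zip(p, g))
    fp = sum(a and not b for a, b in zip(p, g))
    fn = sum(b and not a for a, b in zip(p, g))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Prediction [5, 12] vs. ground truth [6, 14]:
# 6 shared segments, 1 false positive, 2 false negatives.
print(segment_f1((5.0, 12.0), (6.0, 14.0)))  # prints 0.8
```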

Audio-Visual Fact-checking (AVFact). In this section we introduce a new suite of tasks that involves a strong comprehension of the audio-visual semantic information. These tasks broadly require the model to analyze and verify whether a given statement about an audio-visual scenario holds or not. Although we do not use GT spatio-temporal annotations to train the model, we classify this task under the fine-grained category as the task requires the model to attend to a specific region/time interval as passed in the query. To alleviate inconsistencies in evaluation, we restrict the model’s response to binary True/False only. We divide these tasks into the following 4 categories:

Type 1: Given an audio-image pair, verify if the object within the bounding box produces sound that corresponds to the input audio.

Type 2: Given an audio snippet, verify whether its visual counterpart is present in the image or not.

Type 3: Given an audio-image pair, verify if the object present within the provided bounding box produces sound that corresponds to the audio within a given time segment.

Type 4: Given an audio-image pair, verify if the supplied audio is related to the object within the provided bounding box.

In Tab. 5 we contrast the performance of Meerkat against other baselines on all four types of AVFact tasks.
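
Because responses are restricted to True/False, scoring reduces to a binary F1-score. The following sketch shows one way this could be done; the parsing rule (any reply not starting with "true" counts as False), the choice of True as the positive class, and the names `parse_response` and `binary_f1` are assumptions made for illustration. The toy responses are invented, not model outputs.

```python
def parse_response(text: str) -> bool:
    """Map a reply to a boolean; anything not starting with 'true' is False."""
    return text.strip().lower().startswith("true")


def binary_f1(preds: list[bool], labels: list[bool]) -> float:
    """F1-score with True treated as the positive class (an assumption)."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum(y and not p for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: one hit, one false positive, one miss.
responses = ["True", "false", "True.", " FALSE "]
labels = [True, False, False, True]
preds = [parse_response(r) for r in responses]
print(binary_f1(preds, labels))  # prints 0.5
```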

Model	Generalist?	AVQA			MUSIC AVQA			VALOR-32K
Model	Generalist?	Exist $\uparrow$	Localis $\uparrow$	Temp $\uparrow$	Exist $\uparrow$	Localis $\uparrow$	Temp $\uparrow$	BLEU@4 $\uparrow$	METEOR $\uparrow$	ROUGE $\uparrow$	CIDEr $\uparrow$
AVSD [84]	✗	81.61	58.79	61.41	-	-	-	-	-	-	-
PanoAVQA [110]	✗	81.21	59.33	63.23	-	-	-	-	-	-	-
ST-AVQA [42]	✗	81.81	64.51	63.23	-	-	-	-	-	-	-
CAD [67]	✗	83.42	73.97	76.16	-	-	-	-	-	-	-
AVST [42]	✗	-	-	-	72.44	65.54	59.36	-	-	-	-
LAVISH [50]	✗	-	-	-	73.83	65.00	60.81	-	-	-	-
LAST [54]	✗	-	-	-	76.21	68.91	60.60	-	-	-	-
SMPFF [17]	✗	-	-	-	-	-	-	7.59	12.64	28.69	37.18
VALOR [15]	✗	-	-	-	-	-	-	8.97	14.88	30.86	55.73
Macaw-LLM [60]	✓	82.19	74.86	78.98	72.99	71.28	59.36	9.36	15.28	33.31	58.98
PandaGPT [89]	✓	83.38	76.81	79.11	78.48	73.12	65.85	10.35	16.92	34.88	61.22
VideoLlama [112]	✓	84.48	77.06	81.36	81.21	76.10	67.52	11.45	17.39	35.14	63.63
X-InstructBLIP [70]	✓	85.53	80.09	83.91	80.28	77.45	68.83	12.31	18.82	37.93	65.73
Meerkat (ours)	✓	88.24	86.65	86.55	83.62	80.51	73.33	16.88	23.18	45.67	76.84
$\Delta_{\text{Meerkat}-\text{X-InstructBLIP}}$	✓	+3.17%	+8.19%	+3.15%	+4.16%	+3.95%	+6.54%	+37.12%	+23.17%	+20.41%	+16.90%

Table 6: Quantitative results on AVQA and AV captioning tasks. The reported numbers on AVQA dataset [106] are on the val split. For the MUSIC-AVQA dataset [42], results are reported on the balanced test set. Here, Exist: Existential, Localis: Localisation, Temp: Temporal. Evaluation for AV captioning is done on VALOR-32K [15] val set. Meerkat demonstrates strong coarse-grained understanding abilities.

Audio-Visual Question Answering (AVQA). Audio-visual question answering aims to answer questions encompassing both audio and visual modalities. We collect question-answer pairs from the AVQA [106] and MUSIC-AVQA [42] datasets and augment them with instruction tuning templates (details in appendix) to prepare the data samples. We contrast our method against SoTA generalist and specialist models on the AVQA task in Tab. 6. We report the evaluation results on other metrics, such as Count and Comp, in the appendix.

Audio-Visual Captioning (AVC). This task requires generating text tokens conditioned on audio-visual inputs. In contrast to image/audio-only captioning methods, this requires strong multi-modal understanding and reasoning capabilities. We note that Meerkat outperforms existing specialist and generalist models by a considerable margin and sets a new baseline on the recent benchmark dataset VALOR [15], as shown in Tab. 6.

We argue that the seamless extension of Meerkat to coarse-grained tasks is facilitated by the strong semantic understanding acquired by our model during training. This comprehension ability enables our model to effectively navigate and interpret the complexities inherent in coarse-grained tasks, showcasing the versatility and easy extensibility of our approach.

Weak vs. Strong Alignment. We ablate the quantitative effectiveness of our proposed weak and strong alignment modules in Tab. 7. Without the AVACE module, the method’s performance on the visual grounding task is considerably worse. For a similar reason, ablating this module also degrades performance on AVFact (Type 3), which requires region-level visual understanding. For coarse-grained tasks (AV Captioning, AVQA), introducing $\mathcal{L}_{\text{OT}}$ boosts performance compared to the baseline. Overall, the best performance is achieved when the two alignment objectives work in tandem with suitably chosen weight factors.

Training Objective			VGGSS	LLP	AVFact(T3)	AVQA	VALOR
$\mathcal{L}_{\text{CE}}$	$\mathcal{L}_{\text{OT}}$	$\mathcal{L}_{\text{AC}}$	cIOU $\uparrow$	F1-score $\uparrow$	F1-score $\uparrow$	Avg $\uparrow$	CIDEr $\uparrow$
✓	✗	✗	42.93	52.13	0.76	84.00	71.52
✓	✓	✗	43.75	53.41	0.78	85.91	73.49
✓	✗	✓	46.83	52.57	0.81	85.82	73.14
✓	✓	✓	48.51	54.96	0.84	87.14	76.84

Table 7: Ablation on different combinations of $\mathcal{L}_{\text{OT}}$ and $\mathcal{L}_{\text{AC}}$. Meerkat achieves optimal performance with a weighted linear combination of the 3 objective functions on all tasks. AVQA avg is calculated over Exist, Localis, and Temp.
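
As a reading aid, a weighted linear combination of the three objectives can be written as below. This is a sketch under assumptions: the weight symbols $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$ are placeholders introduced here, their values are not stated in this section, and leaving $\mathcal{L}_{\text{CE}}$ unweighted is an assumption.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{CE}}
  \;+\; \lambda_{\text{OT}}\,\mathcal{L}_{\text{OT}}
  \;+\; \lambda_{\text{AC}}\,\mathcal{L}_{\text{AC}}
```

Under this form, the first three rows of Tab. 7 correspond to setting one or both of the placeholder weights to zero, and the last row to both being non-zero.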

Evaluation on Pre-training Tasks. To study the effect of unified pre-training, we evaluate our model under single-task vs. multi-task learning settings. We gradually add datasets for each task and assess the model’s performance. Quantitatively, the tasks benefit from one another in the multi-task setting, yielding superior performance, as shown in Tab. 8. While the model trained only on fine-grained tasks performs well on the coarse-grained tasks, adding the coarse-grained tasks to the training set has little impact on ARIG, IGATL, and AVFact, underlining the importance of our collected fine-grained datasets.

Full vs. LoRA Finetuning. We conduct experiments on different modes of LLM fine-tuning. As shown in Fig. 4, LoRA [36] based fine-tuning with r=32 achieves optimal performance. Lower values of r (4, 16) perform worse than r=32, and we empirically find that full fine-tuning performs slightly worse than LoRA (r=32). We add more ablation results in the appendix.
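
For intuition on what the rank r controls, recall that LoRA [36] replaces the update to a weight matrix of shape d_out x d_in with two low-rank factors, B (d_out x r) and A (r x d_in), so the trainable parameter count per adapted matrix is r(d_out + d_in). The sketch below tabulates this for the ranks compared above; the hidden size of 4096 and the function name `lora_trainable_params` are assumptions for illustration, not values taken from this paper.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


d = 4096  # hypothetical hidden size, chosen only for illustration
full = d * d  # parameters updated by full fine-tuning of one square matrix

for r in (4, 16, 32):
    n = lora_trainable_params(d, d, r)
    print(f"r={r:>2}: {n:>7} trainable vs. {full} full ({100 * n / full:.2f}%)")
```

Even at r=32, the adapter trains under 2% of the parameters of the matrix it modifies in this example.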

Fig. 3 illustrates the comparison of Meerkat with its closest baseline on all downstream tasks. We observe that our model, powered by the combination of AVOpT and AVACE, has finer region-level understanding than BuboGPT [116]. Similarly, on image-guided audio temporal localization, our method outperforms TimeChat [83]. We attribute the strong performance of Meerkat to the AV association learning backed by the instruction tuning data and multi-task learning set-up. For the AVQA task, the recently proposed X-InstructBLIP [70] achieves comparable results. We argue that, fuelled by the fine-grained understanding acquired through the pre-training stages, Meerkat can extract additional contextual information from the visual modality. Our training paradigm emphasizes both audio and visual modalities, facilitating more precise audio understanding than Video-LLaMA [112]. Finally, on the AVFact tasks, our approach achieves superior performance due to its better multi-modal comprehension skills.

Pre-training Task					VGG-SS	LLP	AVFact	AVQA	VALOR
ARIG	IGATL	AVFC	AVQA	AVC	cIOU $\uparrow$	F1-score $\uparrow$	Avg F1-score $\uparrow$	Avg Acc. $\uparrow$	CIDEr $\uparrow$
✓	✗	✗	✗	✗	47.53	18.73	0.71	77.22	67.82
✓	✓	✗	✗	✗	47.75	54.26	0.74	79.74	70.19
✓	✓	✓	✗	✗	48.17	54.65	0.83	81.11	72.13
✓	✓	✓	✓	✗	48.29	54.82	0.83	86.68	74.14
✓	✓	✓	✓	✓	48.51	54.96	0.85	87.14	76.84

Table 8: We systematically analyze the effect of multi-task learning. Here ARIG: audio referred image grounding, IGATL: image guided audio temporal localization, AVFC: audio-visual fact-checking, AVQA: audio-visual question answering, and AVC: audio-visual captioning. AVQA avg accuracy calculated over Exist, Localis, and Temp.

Figure 4: cIoU upper bound on VGG-SS for Full vs. LoRA based finetuning.

We train the model for 5 epochs and report results using the checkpoint with the best validation loss. We use 8 A100 GPUs for training, with validation at the end of every epoch. Inspired by the recent success of Low-Rank Adaptation (LoRA) [36], we use it to finetune the LLM. Meerkat is trained using the AdamW optimizer [56] with a gradient accumulation step of 3. Training takes around 52 hours for 5 epochs. We utilize DeepSpeed [82] for optimization during the training process. The model is trained with a learning rate of $3\times 10^{-5}$, a warmup ratio of 0.03, and a cosine learning rate scheduler. We use FP16 precision for both training and inference.
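
The learning-rate settings above can be illustrated with a small, framework-free schedule. This is a sketch under assumptions rather than the training code: linear warmup, decay to zero, the total step count of 1000, and the function name `lr_at` are all hypothetical. Only the peak rate of 3e-5, the warmup ratio of 0.03, and the cosine shape come from the text.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 3e-5,
          warmup_ratio: float = 0.03) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step (assumed).
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 29, 30, 515, 1000):
    print(s, f"{lr_at(s, total):.2e}")
```

With 1000 total steps, warmup covers the first 30 steps, the rate peaks at 3e-5, and it falls to half the peak midway through the decay phase (step 515).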

6 Conclusions and Future Works

We presented Meerkat, a powerful multi-modal large language model adept at processing audio-visual inputs to comprehend fine-grained spatio-temporal information. Our novel audio-visual alignment strategy, powered by the AVOpT and AVACE modules, instils strong compositional understanding into Meerkat, thereby making it suitable for challenging tasks like audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. To pave the way for future research in this direction, we collect AVFIT, comprising 3M instruction tuning samples, and introduce MeerkatBench, which unifies five challenging audio-visual learning tasks. Extensive experiments demonstrate the effectiveness of our approach on a wide range of downstream tasks, consistently achieving state-of-the-art performance.

In future work, we plan to equip our model to address more challenging tasks like AV segmentation. We also plan to extend the model’s capability to operate on videos and handle associated tasks such as video temporal grounding and video summarization. Future work can also focus on collecting video-centric multi-modal training data and reasoning benchmarks for evaluation at scale. Finally, our work opens up avenues to study the robustness and compositional understanding of AV LLMs with fine-grained comprehension abilities.

References

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[2] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
[3] Arandjelovic, R., Zisserman, A.: Look, listen and learn. In: Proceedings of the IEEE international conference on computer vision. pp. 609–617 (2017)
[4] Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33, 12449–12460 (2020)
[5] Banerjee, S., Lavie, A.: Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. pp. 65–72 (2005)
[6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
[7] Chen, F., Han, M., Zhao, H., Zhang, Q., Shi, J., Xu, S., Xu, B.: X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160 (2023)
[8] Chen, G., et al.: Plot: Prompt learning with optimal transport for vision-language models. In: ICLR (2023)
[9] Chen, H., Xie, W., Afouras, T., Nagrani, A., Vedaldi, A., Zisserman, A.: Localizing visual sounds the hard way. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16867–16876 (2021)
[10] Chen, H., Xie, W., Vedaldi, A., Zisserman, A.: Vggsound: A large-scale audio-visual dataset. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 721–725. IEEE (2020)
[11] Chen, J., Zhang, R., Lian, D., Yang, J., Zeng, Z., Shi, J.: iquery: Instruments as queries for audio-visual sound separation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14675–14686 (2023)
[12] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
[13] Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)
[14] Chen, L., Gan, Z., Cheng, Y., Li, L., Carin, L., Liu, J.: Graph optimal transport for cross-domain alignment. In: International Conference on Machine Learning. pp. 1542–1553. PMLR (2020)
[15] Chen, S., He, X., Guo, L., Zhu, X., Wang, W., Tang, J., Liu, J.: Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345 (2023)
[16] Chen, S., Li, H., Wang, Q., Zhao, Z., Sun, M., Zhu, X., Liu, J.: Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset. Advances in Neural Information Processing Systems 36 (2024)
[17] Chen, S., Zhu, X., Hao, D., Liu, W., Liu, J., Zhao, Z., Guo, L., Liu, J.: Mm21 pre-training for video understanding challenge: Video captioning with pretraining techniques. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 4853–4857 (2021)
[18] Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: European conference on computer vision. pp. 104–120. Springer (2020)
[19] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality (March 2023), https://lmsys.org/blog/2023-03-30-vicuna/
[20] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24(240), 1–113 (2023)
[21] Chowdhury, S., Nag, S., Joseph, K., Srinivasan, B.V., Manocha, D.: Melfusion: Synthesizing music from image and language cues using diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26826–26835 (2024)
[22] Chowdhury, S., Nag, S., Manocha, D.: Apollo: Unified adapter and prompt learning for vision language models. In: The 2023 Conference on Empirical Methods in Natural Language Processing (2023)
[23] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
[24] Cramer, A.L., Wu, H.H., Salamon, J., Bello, J.P.: Look, listen, and learn more: Design choices for deep audio embeddings. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3852–3856. IEEE (2019)
[25] Dou, Z.Y., Kamath, A., Gan, Z., Zhang, P., Wang, J., Li, L., Liu, Z., Liu, C., LeCun, Y., Peng, N., et al.: Coarse-to-fine vision-language pre-training with fusion in the backbone. Advances in neural information processing systems 35, 32942–32956 (2022)
[26] Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H.: Clap learning audio concepts from natural language supervision. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)
[27] Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International journal of computer vision 111, 98–136 (2015)
[28] Fabbrizzi, S., Papadopoulos, S., Ntoutsi, E., Kompatsiaris, I.: A survey on bias in visual datasets. Computer Vision and Image Understanding 223, 103552 (2022)
[29] Fedorishin, D., Mohan, D.D., Jawade, B., Setlur, S., Govindaraju, V.: Hear the flow: Optical flow-based self-supervised visual sound source localization. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2278–2287 (2023)
[30] Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., Ritter, M.: Audio set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 776–780. IEEE (2017)
[31] Georgescu, M.I., Fonseca, E., Ionescu, R.T., Lucic, M., Schmid, C., Arnab, A.: Audiovisual masked autoencoders. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 16144–16154 (2023)
[32] Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: Imagebind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15180–15190 (2023)
[33] Gong, Y., Luo, H., Liu, A.H., Karlinsky, L., Glass, J.: Listen, think, and understand. arXiv preprint arXiv:2305.10790 (2023)
[34] Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of machine learning research 13(2) (2012)
[35] Honovich, O., Scialom, T., Levy, O., Schick, T.: Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689 (2022)
[36] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
[37] Huang, C., Tian, Y., Kumar, A., Xu, C.: Egocentric audio-visual object localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22910–22921 (2023)
[38] Huang, S., Qin, L., Wang, B., Tu, G., Xu, R.: Sdif-da: A shallow-to-deep interaction framework with data augmentation for multi-modal intent detection. arXiv preprint arXiv:2401.00424 (2023)
[39] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision 128(7), 1956–1981 (2020)
[40] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)
[41] Li, B., Zhang, Y., Chen, L., Wang, J., Yang, J., Liu, Z.: Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726 (2023)
[42] Li, G., Wei, Y., Tian, Y., Xu, C., Wen, J.R., Hu, D.: Learning to answer questions in dynamic audio-visual scenarios. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19108–19118 (2022)
[43] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning. pp. 12888–12900. PMLR (2022)
[44] Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)
[45] Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023)
[46] Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10965–10975 (2022)
[47] Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al.: Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022)
[48] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)
[49] Lin, Y.B., Li, Y.J., Wang, Y.C.F.: Dual-modality seq2seq network for audio-visual event localization. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2002–2006. IEEE (2019)
[50] Lin, Y.B., Sung, Y.L., Lei, J., Bansal, M., Bertasius, G.: Vision transformers are parameter-efficient audio-visual learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2299–2309 (2023)
[51] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36 (2024)
[52] Liu, J., Ju, C., Xie, W., Zhang, Y.: Exploiting transformation invariance and equivariance for self-supervised sound localisation. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 3742–3753 (2022)
[53] Liu, J., Wang, Y., Ju, C., Ma, C., Zhang, Y., Xie, W.: Annotation-free audio-visual segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5604–5614 (2024)
[54] Liu, X., Dong, Z., Zhang, P.: Tackling data bias in music-avqa: Crafting a balanced dataset for unbiased question-answering. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 4478–4487 (2024)
[55] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
[56] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
[57] Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
[58] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems 35, 2507–2521 (2022)
[59] Luo, R., Zhao, Z., Yang, M., Dong, J., Qiu, M., Lu, P., Wang, T., Wei, Z.: Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207 (2023)
[60] Lyu, C., Wu, M., Wang, L., Huang, X., Liu, B., Du, Z., Shi, S., Tu, Z.: Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. arXiv preprint arXiv:2306.09093 (2023)
[61] Maaz, M., Rasheed, H., Khan, S., Khan, F.S.: Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424 (2023)
[62] Majumder, S., Grauman, K.: Active audio-visual separation of dynamic sound sources. In: European Conference on Computer Vision. pp. 551–569. Springer (2022)
[63] Mao, Y., Zhang, J., Xiang, M., Zhong, Y., Dai, Y.: Multimodal variational auto-encoder based audio-visual segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 954–965 (2023)
[64] Mo, S., Morgado, P.: A closer look at weakly-supervised audio-visual source localization. Advances in Neural Information Processing Systems 35, 37524–37536 (2022)
[65] Mo, S., Morgado, P.: Localizing visual sounds the easy way. In: European Conference on Computer Vision. pp. 218–234. Springer (2022)
[66] Mo, S., Tian, Y.: Audio-visual grouping network for sound localization from mixtures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10565–10574 (2023)
[67] Nadeem, A., Hilton, A., Dawes, R., Thomas, G., Mustafa, A.: Cad-contextual multi-modal alignment for dynamic avqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 7251–7263 (2024)
[68] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
[69] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
[70] Panagopoulou, A., Xue, L., Yu, N., Li, J., Li, D., Joty, S., Xu, R., Savarese, S., Xiong, C., Niebles, J.C.: X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning. arXiv preprint arXiv:2311.18799 (2023)
[71] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
[72] Park, S., Senocak, A., Chung, J.S.: Marginnce: Robust sound localization with a negative margin. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)
[73] Peng, B., Li, C., He, P., Galley, M., Gao, J.: Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277 (2023)
[74] Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)
[75] Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision. pp. 2641–2649 (2015)
[76] Pramanick, S., Han, G., Hou, R., Nag, S., Lim, S.N., Ballas, N., Wang, Q., Chellappa, R., Almahairi, A.: Jack of all tasks, master of many: Designing general-purpose coarse-to-fine vision-language model. arXiv preprint arXiv:2312.12423 (2023)
[77] Pramanick, S., Jing, L., Nag, S., Zhu, J., Shah, H.J., LeCun, Y., Chellappa, R.: Volta: Vision-language transformer with weakly-supervised local-feature alignment. Transactions on Machine Learning Research (2023)
[78] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[79] Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I.: Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning. pp. 28492–28518. PMLR (2023)
[80] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 5485–5551 (2020)
[81] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)
[82] Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 3505–3506 (2020)
[83] Ren, S., Yao, L., Li, S., Sun, X., Hou, L.: Timechat: A time-sensitive multimodal large language model for long video understanding. arXiv preprint arXiv:2312.02051 (2023)
[84] Schwartz, I., Schwing, A.G., Hazan, T.: A simple baseline for audio-visual scene-aware dialog. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12548–12558 (2019)
[85] Senocak, A., Oh, T.H., Kim, J., Yang, M.H., Kweon, I.S.: Learning to localize sound source in visual scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4358–4366 (2018)
[86] Senocak, A., Ryu, H., Kim, J., Oh, T.H., Pfister, H., Chung, J.S.: Sound source localization is all about cross-modal alignment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7777–7787 (2023)
[87] Shu, F., Zhang, L., Jiang, H., Xie, C.: Audio-visual llm for video understanding. arXiv preprint arXiv:2312.06720 (2023)
[88] Song, Z., Wang, Y., Fan, J., Tan, T., Zhang, Z.: Self-supervised predictive learning: A negative-free method for sound source localization in visual scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3222–3231 (2022)
[89] Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355 (2023)
[90] Sun, W., Zhang, J., Wang, J., Liu, Z., Zhong, Y., Feng, T., Guo, Y., Zhang, Y., Barnes, N.: Learning audio-visual source localization via false negative aware contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6420–6429 (2023)
[91] Tan, R., Ray, A., Burns, A., Plummer, B.A., Salamon, J., Nieto, O., Russell, B., Saenko, K.: Language-guided audio-visual source separation via trimodal consistency. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10575–10584 (2023)
[92] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.: Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca (2023)
[93] Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., Stojnic, R.: Galactica: A large language model for science. arXiv preprint arXiv:2211.09085 (2022)
[94] Tian, Y., Guan, C., Goodman, J., Moore, M., Xu, C.: An attempt towards interpretable audio-visual video captioning. arXiv preprint arXiv:1812.02872 (2018)
[95] Tian, Y., Li, D., Xu, C.: Unified multisensory perception: Weakly-supervised audio-visual video parsing. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. pp. 436–454. Springer (2020)
[96] Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C.: Audio-visual event localization in unconstrained videos. In: Proceedings of the European conference on computer vision (ECCV). pp. 247–263 (2018)
[97] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
[98] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
[99] Vedantam, R., Lawrence Zitnick, C., Parikh, D.: Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4566–4575 (2015)
[100] Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., Yang, H.: Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: International Conference on Machine Learning. pp. 23318–23340. PMLR (2022)
[101] Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al.: Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)
[102] Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., et al.: Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems 36 (2024)
[103] Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., Le, Q.V.: Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021)
[104] Workshop, B., Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A.S., Yvon, F., et al.: Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022)
[105] Wu, H.H., Seetharaman, P., Kumar, K., Bello, J.P.: Wav2clip: Learning robust audio representations from clip. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 4563–4567. IEEE (2022)
[106] Yang, P., Wang, X., Duan, X., Chen, H., Hou, R., Jin, C., Zhu, W.: Avqa: A dataset for audio-visual question answering on videos. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 3480–3491 (2022)
[107] Yang, Z., Gan, Z., Wang, J., Hu, X., Ahmed, F., Liu, Z., Lu, Y., Wang, L.: Unitab: Unifying text and box outputs for grounded vision-language modeling. In: European Conference on Computer Vision. pp. 521–539. Springer (2022)
[108] Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
[109] You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S.F., Yang, Y.: Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704 (2023)
[110] Yun, H., Yu, Y., Yang, W., Lee, K., Kim, G.: Pano-avqa: Grounded audio-visual question answering on 360° videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2031–2041 (2021)
[111] Zhang, C., Cai, Y., Lin, G., Shen, C.: Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12203–12213 (2020)
[112] Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)
[113] Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., Qiao, Y.: Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023)
[114] Zhang, S., Sun, P., Chen, S., Xiao, M., Shao, W., Zhang, W., Chen, K., Luo, P.: Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601 (2023)
[115] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V., et al.: Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)
[116] Zhao, Y., Lin, Z., Zhou, D., Huang, Z., Feng, J., Kang, B.: Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581 (2023)
[117] Zhou, J., Wang, J., Zhang, J., Sun, W., Zhang, J., Birchfield, S., Guo, D., Kong, L., Wang, M., Zhong, Y.: Audio–visual segmentation. In: European Conference on Computer Vision. pp. 386–403. Springer (2022)
[118] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

Meerkat: Audio-Visual Large Language Model
for Grounding in Space and Time
Appendix

In this appendix we provide additional details about:
7 Data preparation strategy (referenced in Sec. 4.2 of main paper)
8 Dataset instruction templates (referenced in Sec. 3.4 and Sec. 5.2)
9 Dataset statistics and analysis
10 More qualitative results
11 More ablations (referenced in Sec. 5.3)
12 Comparison with contrastive loss (referenced in Sec. 3.2)
13 Comparison with two-stage training (referenced in Sec. 3.2)
14 Role of audio in AVQA task
15 More on optimal transport (referenced in Sec. 3.2)
16 AVSBench data collection
17 Comparison against ImageBind
18 Other Quantitative metrics on AVQA task (referenced in Sec. 5.2)
19 Evaluation metrics
20 Failure cases
21 Ethics statement

7 Data Preparation Strategy

7.1 Adaptation of Public Datasets

To collect image-audio pairs from video-based datasets and adapt them to our setup, we carefully choose one representative image from each video. To this end, we design a semi-automated strategy, explained below for each task. Task-wise dataset details are given in Fig. 5.

7.2 Fine-grained Data Preparation

Audio Referred Image Grounding (ARIG). For this task, the dataset comprises image-audio pairs from Openimages-AudioSet, Openimages-VGGSound, AVSBench, VGGSS, PASCAL Sound, and Flickr-Soundnet. Among these, for Openimages-AudioSet, Openimages-VGGSound, and VGGSS, we first obtain the top 3 image frames with the highest image-text CLIP similarity scores [78] and then select the most suitable frame by manual inspection to form the image-audio pair. The frames are extracted from the video segment of interest (as denoted in the dataset annotation). Tab. 9 lists the class-wise associations between Openimages and AudioSet / VGGSound; we use this look-up table to match the corresponding classes.

• Openimages-AudioSet: For every sample, we obtain the [start, end] time interval of the audio event of interest from the AudioSet dataset. Each sample is associated with an audio category, and we use this class label when calculating the CLIP score with the image frames. We zero-pad each audio piece to a length of 30s.
• Openimages-VGGSound: We obtain the onset (start) of an event from the VGGSound dataset annotation and extract the snippet ending at min(start + 30, len(audio)) seconds. If the audio is shorter than 30s, we zero-pad it to maintain the audio sequence length.
• AVSBench: AVSBench provides 5 to 6 frames along with each audio snippet. We manually choose the frame that most closely relates to the audio event under consideration.
• VGGSS: We follow a strategy similar to that of VGGSound.
• PASCAL Sound: We choose 566 image samples from the PASCAL dataset [27] spanning 12 sounding classes and carefully pair them with AudioSet samples using the same protocol as for Openimages-AudioSet.
• Flickr-Soundnet [85]: Here we directly use the image-audio pairs as released by the authors.

For all these cases we augment the image-audio pairs with our instruction tuning templates (refer to Section 8).
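
The cropping and zero-padding described for Openimages-AudioSet and Openimages-VGGSound can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the helper name `crop_and_pad`, the mono list-of-floats waveform, and the choice to pad on the right are all assumptions.

```python
def crop_and_pad(samples, sample_rate, start_sec=0.0, target_sec=30.0):
    """Return exactly target_sec seconds of audio starting at start_sec.

    samples: mono waveform as a list of floats (hypothetical input format).
    """
    target_len = int(round(target_sec * sample_rate))
    start = int(round(start_sec * sample_rate))
    # The snippet ends at min(start + 30 s, len(audio)).
    end = min(start + target_len, len(samples))
    snippet = list(samples[start:end])
    # Assumption: zeros are appended on the right so all snippets share one length.
    snippet.extend([0.0] * (target_len - len(snippet)))
    return snippet


# Toy example at 4 Hz: a 10 s clip with the event starting at 5 s.
short_clip = [1.0] * 40
padded = crop_and_pad(short_clip, sample_rate=4, start_sec=5.0)
print(len(padded), sum(padded))
```

Keeping a fixed 30s length is what allows the temporal localization templates to state a constant upper bound on tEnd.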

Openimages Label Name	AudioSet Label Name	VGGSound Label Name
Aircraft	Aircraft	airplane
Alarm clock	Alarm clock	alarm clock ringing
Ambulance	Ambulance (siren)	ambulance siren
Bicycle	Bicycle, tricycle	–
Bird	Bird	bird chirping, tweeting
Blender	Blender, food processor	electric blender running
Boat	Boat, Water vehicle	sailing
Bus	Bus	helicopter
Camera	Camera	–
Cannon	–	firing cannon
Car	Car	race car, auto racing
Cat	Cat	cat meowing
Cattle	–	cattle mooing
Ceiling fan	Mechanical fan	running electric fan
Cello	–	playing cello
Chainsaw	Chainsaw	chainsawing trees
Cheetah	Roaring cats (lions, tigers)	cheetah chirrup
Chicken	Fowl	–
Chime	Chime	wind chime
Clock	Clock	–
Computer keyboard	Computer keyboard	typing on computer keyboard
Computer mouse	Mouse	–
Corded phone	Dial tone	cell phone buzzing
Cutting board	Chopping (food)	chopping food
Dagger	Knife	–
Digital clock	–	alarm clock ringing
Dog	Dog	dog baying
Door	Door	–
Door handle	Doorbell	door slamming
Drill (Tool)	Drill	–
Drum	–	playing drum kit
Duck	Quack	duck quacking
Eagle	–	eagle screaming
Elephant	–	elephant trumpeting
Fireplace	Fire	–
Fixed-wing aircraft	Fixed-wing aircraft, airplane	–
Fountain	Waterfall	–
Fox	Canidae, wild dogs, wolves	fox barking
French horn	–	playing french horn
Frog	Frog	frog croaking
Girl	Female speech, woman speaking	–
Glasses	Glass	–
Goat	Goat	goat bleating
Golf cart	Cart	–
Goose	Ducks, geese, waterfowl	goose honking
Grinder	–	electric grinder grinding
Guitar	–	playing acoustic guitar
Hair dryer	Hair dryer	hair dryer drying
Hammer	Hammer	–
Hand dryer	Hair dryer	–
Handgun	Gunshot, gunfire	machine gun shooting
Harmonica	–	playing harmonica
Harp	–	playing harp
Harpsichord	–	playing harpsichord
Helicopter	Helicopter	helicopter
Horse	Horse	horse neighing
Human face	Female speech, woman speaking\|Male speech, man speaking	–
Infant bed	–	baby crying
Ipod	Music	–
Jaguar (Animal)	Roaring cats (lions, tigers)	cheetah chirrup
Jet ski	Jet engine	skiing
Kettle	Steam whistle	–
Kitchen knife	Knife	–
Kitchen utensil	Kitchen and dining room sounds	–
Knife	Knife	–
Land vehicle	Vehicle	car passing by
Laptop	Typing	typing on computer keyboard
Leopard	Roar	–
Light switch	Clicking	–
Limousine	Car	–
Lion	Roar	lions roaring
Magpie	Bird	magpie calling
Mammal	Animal	–
Man	Male speech, man speaking	–
Mechanical fan	Mechanical fan	running electric fan
Microphone	Microphone	–
Microwave oven	Microwave oven	–
Missile	–	missile launch
Mixer	Blender, food processor	–
Mobile phone	Telephone	cell phone buzzing
Motorcycle	Motorcycle	driving motorcycle
Mouse	Mouse	–
Musical instrument	Music	orchestra
Musical keyboard	–	playing piano
Oboe	–	playing oboe
Otter	–	otter growling
Owl	Owl	owl hooting
Paper cutter	Scissors	ripping paper
Parrot	Bird	parrot talking
Person	Female speech, woman speaking\|Male speech, man speaking	–
Piano	–	playing piano
Pig	Pig	pig oinking
Popcorn	Burst, pop	popping popcorn
Power plugs and sockets	Power tool	–
Pressure cooker	Steam	–
Printer	Printer	printer printing
Rabbit	Rodents, rats, mice	–
Ratchet (Device)	Ratchet, pawl	–
Raven	Crow	crow cawing
Reptile	Snake	–
Rifle	Machine gun	machine gun shooting
Rocket	–	missile launch
Saxophone	–	playing saxophone
Sea lion	–	sea lion barking
Segway	Non-motorized land vehicle	–
Sewing machine	Sewing machine	using sewing machines
Sheep	Sheep	–
Shotgun	Gunshot, gunfire	–
Shower	Shower	–
Skateboard	Skateboard	skateboarding
Ski	–	skiing
Snail	–	hail
Snake	Snake	snake rattling
Snowboard	Skateboard	skiing
Snowmobile	Motorcycle	–
Snowplow	Lawnmower	–
Spoon	Kitchen and dining room sounds	–
Stationary bicycle	Bicycle, tricycle	driving motorcycle
Swan	Quack	–
Swimming pool	Water	–
Sword	Knife	–
Table tennis racket	–	playing table tennis
Tablet computer	Computer keyboard	typing on computer keyboard
Tap	Tap	–
Taxi	Car	hail
Telephone	Telephone	telephone bell ringing
Television	Television	–
Tiger	Roar	–
Toilet	Toilet flush	toilet flushing
Train	Train	–
Truck	Truck	–
Trombone	–	playing trombone
Trumpet	–	playing trumpet
Turkey	Turkey	–
Unicycle	Bicycle bell	–
Van	Car	–
Vehicle	Vehicle	vehicle horn, car horn, honking
Violin	–	playing violin, fiddle
Wall clock	Clock	alarm clock ringing
Washing machine	Washing machine	–
Watch	Clock	–
Wine glass	Glass	–
Whale	–	whale calling
Woman	Female speech, woman speaking\|Male speech, man speaking	–
Woodpecker	Wood	woodpecker pecking tree

Table 9: Image-audio class mapping. We associate the image classes from Openimages with the audio classes from the AudioSet / VGGSound datasets and prepare a lookup table through careful manual inspection.

Image Guided Audio Temporal Localization (IGATL).

• Openimages-AudioSet (Strong): To curate the image samples, we follow a strategy similar to the one described above for ARIG. To ensure a fair assessment, we choose audio snippets that are considerably longer than the event of interest (EoI), while verifying through manual inspection that the EoI lies within the extracted audio piece.

Task	Example Instruction
ARIG	$\bullet$ Given the audio and image pair, identify the object category of the audio. Now, provide a bounding box for that object in the image. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the object category. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
	$\bullet$ From the given audio and image pair first identify the object category of the audio. Then localize the corresponding object in the image by providing a bounding box. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the object category. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
	$\bullet$ Given the audio and image pair, identify the object category of the audio. Now, localize the object in the image by providing a bounding box. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the object category. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
	$\bullet$ Considering the audio and image pair, determine the object class of the audio. Next, localize the same object in the image by providing a bounding box. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the class of the object. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
	$\bullet$ Considering the audio and image pair, recognize the object category of the audio. Subsequently, draw a bounding box around that object shown in the image. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the category of the object. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. The coordinates should be within the range 0 to 1.
	$\bullet$ Considering the audio and image pair, recognize the object category of the audio. Next, draw a bounding box around that object in the image. The answer should be in the form [<obj>,xLeft,yTop,xRight,yBottom]. <obj> represents the category of the object. xLeft,yTop are coordinates of the top-left corner and xRight,yBottom are coordinates of the bottom-right corner of the bounding box. Ensure the bounding box is within the range 0 to 1.
IGATL	$\bullet$ Identify the object category from the image. Now, find the time duration in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.
	$\bullet$ Given the image, identify the object category. Next, output the time window in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.
	$\bullet$ Which object do you see in the image? Please find the time window in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.
	$\bullet$ Recognise the object category from the image. Now, indicate the time duration in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.
	$\bullet$ What is the category of the object that you see in the image? Now, indicate the temporal duration in the audio where that object is making the sound. The output should be in the form (tStart,tEnd) where tStart and tEnd are the start and end times respectively. tStart is less than tEnd. The minimum value of tStart is 0. The maximum value of tEnd is 30.
AVFact	$\bullet$ Does the object inside the bounding box <placeholder_bbox> of the image produce the same sound as in the given audio? Answer in True or False.
	$\bullet$ Given the image, does the object inside the bounding box <placeholder_bbox> produce the same sound as in the given audio? Answer in True or False.
	$\bullet$ The object inside the bounding box <placeholder_bbox> of the image produces the same sound as in the given audio. True or False?
	$\bullet$ From the audio-image pair, verify if the object inside the bounding box <placeholder_bbox> produces the same sound as present in the given audio. Answer in True or False.
	$\bullet$ The object in the given audio between time duration <placeholder_time> is present in the image. True or False?
	$\bullet$ Listen to the audio in the time window <placeholder_time>. Does this object exist in the image? Answer in True or False.
	$\bullet$ Listen to the audio in the time window <placeholder_time>. Verify if the same object is present in the image. True or False?
	$\bullet$ The time segment <placeholder_time> contains the object as present in the image. True or False?
	$\bullet$ Listen to the audio in the time window <placeholder_time>. The same object is within the bounding box <placeholder_bbox> in the image. True or False?
	$\bullet$ Does the object inside the bounding box <placeholder_bbox> of the image produce the same sound as within the time duration <placeholder_time> in the given audio? Answer in True or False.
	$\bullet$ The object inside the bounding box <placeholder_bbox> of the image produces the same sound as in the time segment <placeholder_time> of the audio. True or False?
	$\bullet$ The time segment <placeholder_time> contains the object in the bounding box <placeholder_bbox> of the image. True or False?
	$\bullet$ Here is an audio-image pair. Does the given audio correspond to the object shown in the image? Answer in True or False.
	$\bullet$ Does the given audio correspond to the object shown in the image? Answer in True or False.
	$\bullet$ Does the given audio associate with the object shown in the image? Answer in True or False.
	$\bullet$ Here is an audio-image pair. Does the given image associate with the object sounding in the audio? Answer in True or False.
AVQA	$\bullet$ How many instruments are sounding in the image?
	$\bullet$ Which is the musical instrument that sounds at the same time as the <Object>?
	$\bullet$ Is the <Object> on the <LR> louder than the <Object> on the <LR>?
	$\bullet$ Is there a voiceover?
	$\bullet$ Is the <Object> playing longer than the <Object>?
AVC	$\bullet$ Considering the audio input, generate a caption for the image.

Table 10: Task-wise instruction templates.

• LLP: The LLP dataset provides fine-grained temporal annotations of the audio events in the format [onset, offset]. We choose one representative image from within this time segment. While preparing our test set, we restrict ourselves to one category per video, with its corresponding onset and offset values, to rule out overlapping events within the same time interval.

Audio-Visual Fact-checking (AVFact).

• Openimages-AudioSet: For Type 1, we collect samples from the AudioSet split, while for Type 2, Type 3, and Type 4, we choose samples from the AudioSet Strong split, as it provides the time-sensitive grounding information used in these three types of queries. For the image collection, we follow the same strategy as before.

7.3 Coarse-grained Data Preparation

For the coarse-grained tasks, we resort to direct adaptations of publicly available datasets.

Audio-Visual Question Answering. In the absence of audio class labels, we manually inspect the video to obtain the most suitable frame for each sample.

• AVQA: The AVQA dataset provides a start timestamp that denotes the onset of the event of interest. We follow the same train/test split as proposed by the authors [106].
• MUSIC-AVQA: From the original 1-minute-long video sequence, we crop the 30s segment within which the event of interest lies.

Audio-Visual Captioning (AVC).

• VALOR-32K: Each sample in the VALOR dataset comes with an elaborate caption of the audio-visual scene. We use this caption to calculate the CLIP similarity score for each image-text pair and obtain the top 3 most relevant frames from within the 10s-long annotation provided by the authors. Finally, we choose one representative frame through manual inspection.
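
The automated half of the frame selection (ranking candidate frames against a text query and keeping the top 3 for manual inspection) can be sketched as below. This is an illustration under assumptions: the frame and text embeddings are taken as precomputed vectors from some CLIP-like encoder, and the helper names `cosine` and `top_k_frames` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def top_k_frames(frame_embs, text_emb, k=3):
    """Return (frame_index, score) pairs for the k frames most similar to the text."""
    scored = [(i, cosine(e, text_emb)) for i, e in enumerate(frame_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D embeddings standing in for real image/text features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [2.0, 0.1]]
caption = [1.0, 0.0]
candidates = top_k_frames(frames, caption)
print([index for index, _ in candidates])
```

The returned candidates would then go to a human annotator, who picks the single representative frame.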

8 Dataset Instruction Templates

We add task-wise sample instruction templates in Tab. 10. To make the instruction tuning robust and incorporate sufficient diversity, we manually write a few instructions and prompt GPT-3.5 [6] to generate different variants. We further refine the instruction templates using GPT-4 [1]. Note that for AVQA and AV captioning tasks, we restrict ourselves to the questions and captions provided by the authors.
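
The answer formats requested in Tab. 10, namely [<obj>,xLeft,yTop,xRight,yBottom] for ARIG and (tStart,tEnd) for IGATL, lend themselves to simple validation of model outputs. The parser below is a hedged sketch rather than part of the released code: the function names are hypothetical, and requiring strictly ordered box corners is an assumption beyond what the templates state.

```python
import re

BBOX_RE = re.compile(
    r"\[\s*([^,\[\]]+?)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]"
)
TIME_RE = re.compile(r"\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)")


def parse_bbox(answer):
    """Parse '[<obj>,xLeft,yTop,xRight,yBottom]'; return None if malformed."""
    match = BBOX_RE.search(answer)
    if match is None:
        return None
    try:
        x1, y1, x2, y2 = (float(g) for g in match.groups()[1:])
    except ValueError:  # e.g. a token such as '1.2.3'
        return None
    # Coordinates must lie in [0, 1]; strict corner ordering is an assumption.
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        return None
    return match.group(1), (x1, y1, x2, y2)


def parse_time(answer, max_sec=30.0):
    """Parse '(tStart,tEnd)' with 0 <= tStart < tEnd <= max_sec; else None."""
    match = TIME_RE.search(answer)
    if match is None:
        return None
    try:
        t_start, t_end = float(match.group(1)), float(match.group(2))
    except ValueError:
        return None
    if not (0.0 <= t_start < t_end <= max_sec):
        return None
    return t_start, t_end


print(parse_bbox("[dog,0.10,0.20,0.55,0.90]"))
print(parse_time("The sound occurs in (3.5,12.0)."))
```

Outputs that fail these checks can be counted as incorrect predictions or logged for inspection.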

9 Dataset Statistics and Analysis

Image Audio Similarity. To study the similarity between the image-audio pairs [21] from Openimages-AudioSet, we utilize the CLIP [78] and CLAP [26] scores by calculating $\mathcal{S}_{\text{CLIP}}\;\mathcal{S}_{\text{CLAP}}^{T}$, where each $\mathcal{S}\in\mathbb{R}^{N\times N}$ denotes the pairwise cross-modal similarity scores for a batch of size $N$. The CLIP similarity is calculated between the chosen image and the audio class label; similarly, the CLAP score is calculated between the audio class label and the audio snippet. The text modality thus acts as the bridging modality. Fig. 6(a) reports the image-audio similarity scores for the 9 most frequent categories, while ‘others’ denotes the aggregation of all remaining ones. Note that the scores are normalized to the range [0,1], with 0 being the lowest. The average score of image-audio pairs across all samples collected for the audio referred image grounding task is 0.77, supporting a strong association between the two modalities.
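
To make the text-bridged score concrete, the sketch below multiplies an image-to-text matrix by the transpose of an audio-to-text matrix, giving one image-audio score per pair. The toy numbers are made up, the function names are hypothetical, and the min-max rescaling to [0,1] is an assumption, since the exact normalization is not specified here.

```python
def bridged_similarity(s_clip, s_clap):
    """Compute S_CLIP @ S_CLAP^T for two N x N score matrices (lists of rows).

    s_clip[i][t]: similarity of image i and class label t.
    s_clap[a][t]: similarity of audio a and class label t.
    """
    n = len(s_clip)
    return [
        [sum(s_clip[i][t] * s_clap[a][t] for t in range(n)) for a in range(n)]
        for i in range(n)
    ]


def min_max_normalize(matrix):
    """Rescale all entries to [0, 1] (assumed normalization scheme)."""
    flat = [value for row in matrix for value in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0:
        return [[0.0 for _ in row] for row in matrix]
    return [[(value - low) / span for value in row] for row in matrix]


# Toy batch of N = 2 pairs; matched pairs sit on the diagonal.
s_clip = [[0.9, 0.1], [0.2, 0.8]]
s_clap = [[0.7, 0.3], [0.1, 0.9]]
scores = min_max_normalize(bridged_similarity(s_clip, s_clap))
print(scores)
```

In this toy batch the diagonal entries, which correspond to matched image-audio pairs, come out higher than the off-diagonal ones.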

Audio Duration. We report the category-wise mean duration (in sec.) of the audio samples from the AudioSet dataset for the image-guided audio temporal localization task in Fig. 6(b). The ‘Train’ class has the highest average duration at 9.83 sec, while the ‘Clicking’ category has the lowest at 0.32 sec.

Class wise Robustness. We report class-wise (top 6 classes based on occurrence) True/False sample count from the Openimages dataset for AVFact - Type 1 set in Fig. 6(c). We maintain a good balance of matched and mismatched pairs to ensure our model is robust to deceptive queries.

AudioSet Distribution. Fig. 6(d) reports the class-wise distribution of samples present in the AVFact Type-2 set as collected from the AudioSet dataset.

Audio Duration Per Image Class. In Fig. 6(e) we present the duration of audio samples across various image classes from Openimages in AVFact Type-4 split. This demonstrates the overall balanced mix of image-audio distributions across different pairings.

Category Wise Distribution. Fig. 6(f) presents image category-wise distributions of samples from the Openimages dataset for AVFact Type-4 samples.

10 More Qualitative Results

We provide additional qualitative results from Meerkat in Fig. 7 and Fig. 8. Fig. 7(a) shows the image grounding capabilities of our model when queried with audio inputs. Even for small objects or visual scenes with complex associations among different components, Meerkat correctly identifies the referred object. This underlines the fine-grained comprehension capabilities acquired by Meerkat during training. Meerkat also exhibits strong audio temporal localization when prompted with an image. As evident from Fig. 7(b), our model can identify the temporal onset of an event and its duration, even in the presence of distractors and ambient sound. Fig. 7(c) depicts the fine-grained audio-visual comprehension capabilities of our method: even when presented with noisy audio-visual samples and scenarios that demand detailed AV association understanding, Meerkat can produce correct results. Our method is also adept at coarse-grained tasks like AVQA and AV captioning, as demonstrated in Fig. 8(a) and 8(b).

11 More Ablations

11.1 Other Image Encoders

We compare the performance of our model with different image encoders in Tab. 11. We observe the best performance with CLIP-ViT-B/16 [78] and adopt it as our image encoder, given its compatibility with the instruction-guided image tokenizer module in our system.

Image Encoder	VGG-SS (cIoU $\uparrow$)	LLP (F1-score $\uparrow$)	AVFact (Avg F1-score $\uparrow$)	AVQA (Avg Acc. $\uparrow$)	VALOR (CIDEr $\uparrow$)
CLIP-ViT-B/32 [78]	42.56	49.64	0.78	84.04	73.39
BLIP-ViT-B/16 [43]	47.22	52.83	0.82	85.79	75.13
CLIP-ViT-B/16 (Ours)	48.51	54.96	0.85	87.14	76.84

Table 11: Meerkat performance with different image encoders

11.2 Other Audio Encoders

We carry out experiments with various audio encoders, namely Open L3 [24, 3], WAV2CLIP [105], and Wav2Vec2 [4], in Tab. 12, and obtain the best performance with the CLAP [26] encoder. We attribute this performance boost to its Swin Transformer [55] based backbone, which extracts audio features from a log-Mel spectrogram. Owing to large-scale contrastive language-audio pretraining, which learns audio representations by combining audio data with natural language descriptions, CLAP encoders have been shown to perform exceptionally well on open-domain audio compared to speech-based encoders like Whisper [79].

Audio Encoder	VGG-SS (cIoU $\uparrow$)	LLP (F1-score $\uparrow$)	AVFact (Avg F1-score $\uparrow$)	AVQA (Avg Acc. $\uparrow$)	VALOR (CIDEr $\uparrow$)
Open L3 [24, 3]	44.52	51.28	0.76	83.29	72.38
WAV2CLIP [105]	45.34	51.94	0.78	84.46	73.77
Wav2Vec2 [4]	46.91	53.07	0.81	85.88	75.80
CLAP audio encoder	48.51	54.96	0.85	87.14	76.84

Table 12: Meerkat performance with different audio encoders

11.3 With Different LLMs

We ablate our model by replacing the LLM with other recent language models such as T5 [81], Vicuna [19], and Alpaca [92]. We observe a noticeable drop in performance when the LLM is not instruction-tuned compared to its instruction-tuned counterpart. This demonstrates the importance of leveraging instruction-tuned LLMs in a multi-modal instruction comprehension setup. Instruction tuning equips the LLM to follow customized instruction templates, which results in improved performance in a multi-task setting, as demonstrated in Tab. 13.

Model	VGG-SS (cIoU $\uparrow$)	LLP (F1-score $\uparrow$)	AVFact (Avg F1-score $\uparrow$)	AVQA (Avg Acc. $\uparrow$)	VALOR (CIDEr $\uparrow$)
T5	41.49	48.50	0.78	82.49	72.56
Alpaca	42.74	49.98	0.80	83.75	74.84
Vicuna	47.06	53.68	0.83	86.38	75.88
Llama-2	48.51	54.96	0.85	87.14	76.84

Table 13: Ablative study under various LLMs.

11.4 Effect of $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$

We ablate the loss hyperparameters $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$ and compare the performance of Meerkat on the ARIG and IGATL tasks in Fig. 9(a) and Fig. 9(b), respectively. Experimental results suggest that the best metrics are obtained with $\lambda_{\text{AC}} = 0.35$ and $\lambda_{\text{OT}} = 0.75$.

(a) cIoU upper bound on VGG-SS with varying weightage of $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$

(b) AUC upper bound on LLP with varying weightage of $\lambda_{\text{OT}}$ and $\lambda_{\text{AC}}$

Figure 9: Ablative experiments on (a) spatial and (b) temporal localization tasks. In (a) and (b) we keep $\lambda_{\text{AC}}$ and $\lambda_{\text{OT}}$ fixed at 0.35 and 0.75 respectively while varying the other $\lambda$.

12 Comparison with Contrastive Loss

We compare the optimal transport [111] based objective ($\mathcal{L}_{\text{OT}}$) with a contrastive loss-based approach [34, 68, 78] for weak alignment in Meerkat. Contrastive approaches operate on global features and therefore capture only class-level information. Although such an alignment strategy may be beneficial in coarse-grained tasks, it is not suitable for tasks that require fine-grained understanding. Conversely, OT-based alignment, as employed in AVOpT, operates at the level of patches in a weakly-supervised manner. This form of guidance is interpretable, since the optimized transport plan dictates the relationships between the cross-modal patch embeddings, and is therefore more suitable for fine-grained downstream tasks. Even though OT-based alignment strategies have been employed earlier for word-region level alignment [14, 77, 18], we are the first to introduce them in the audio-visual setting. We empirically find that $\mathcal{L}_{\text{OT}}$ is superior on all the downstream tasks (refer to Tab. 14). Note that in both cases $\mathcal{L}_{\text{AC}}$ is employed to add strong supervision. Based on our results, we hypothesize that the initial patch-level alignment with AVOpT yields high-quality representations that substantially help AVACE attend to the regions of interest, thereby improving localization performance compared to using a contrastive loss with AVACE.

Loss	VGG-SS (cIoU $\uparrow$)	LLP (F1-score $\uparrow$)	AVFact (Avg F1-score $\uparrow$)	AVQA (Avg Acc. $\uparrow$)	VALOR (CIDEr $\uparrow$)
$\mathcal{L}_{\text{Contrastive}}$	46.95	52.28	0.81	86.31	74.98
$\mathcal{L}_{\text{OT}}$	48.51	54.96	0.85	87.14	76.84

Table 14: Comparison against contrastive loss based weak alignment strategy
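
For readers unfamiliar with the contrastive baseline, the sketch below shows a generic symmetric InfoNCE loss computed on one global feature vector per image and per audio clip. It illustrates why such a loss only sees sample-level (not patch-level) structure. The exact contrastive formulation used for Tab. 14 is not given in this appendix, so the temperature value and the function names are assumptions.

```python
import math


def _normalize(vec):
    """L2-normalize a vector; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(image_feats, audio_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired global features."""
    img = [_normalize(v) for v in image_feats]
    aud = [_normalize(v) for v in audio_feats]
    logits = [
        [sum(a * b for a, b in zip(i_vec, a_vec)) / temperature for a_vec in aud]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-image direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Each sample contributes a single vector, so no term in this loss can indicate which image region corresponds to which part of the spectrogram.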

13 Comparison with Two-stage Training

We systematically study the effect of two-stage vs. single-stage training. Inspired by recent works [25, 46] on fine-grained image understanding tasks, we design a two-stage experimental setup. In stage I, we align the image and audio encoders through weak supervision by employing the AVOpT module. We do not use the LLM in stage I, so the only objective we optimize is $\mathcal{L}_{\text{OT}}$. Stage I is followed by stage II, which involves the AVACE module to provide strong supervision. In stage II, we fine-tune the LLM using LoRA. Experimental results show comparable performance in both cases, as depicted in Tab. 15. We opt for single-stage training because it is not only marginally better in terms of performance (see Tab. 15) but also more computationally efficient and less resource intensive.

Model	VGG-SS (cIoU $\uparrow$)	LLP (F1-score $\uparrow$)	AVFact (Avg F1-score $\uparrow$)	AVQA (Avg Acc. $\uparrow$)	VALOR (CIDEr $\uparrow$)
Two-stage	48.43	54.81	0.85	87.11	76.59
Single-stage	48.51	54.96	0.85	87.14	76.84

Table 15: Comparison against two stage training

14 Role of Audio in AVQA Task

To study the role of the audio modality and how effectively our model can encode audio information, we perform an ablation study in which we remove the audio information altogether and perform visual-only question answering. The performance of our method drops significantly when only the visual modality is used to answer the same set of questions, underscoring the role of the audio modality. Tab. 16 presents the quantitative results.

Model	Exist $\uparrow$	Localis $\uparrow$	Count $\uparrow$	World K $\uparrow$	Temp $\uparrow$	Avg $\uparrow$
Without audio	83.62	79.28	80.46	78.49	69.26	78.22
With audio	88.24	86.65	84.60	87.05	86.55	86.61

Table 16: Role of audio modality. Quantitative results on AVQA dataset when the model is presented with and without audio.

15 More on Optimal Transport

Our AVOpT module is responsible for cross-modal alignment of image and audio feature representations in a weakly-supervised manner. This is achieved by minimizing the Wasserstein distance ( $\mathcal{D}_{\text{Wasserstein}}$ ) between the image and audio (spectrogram) patches and thereby learning an optimal transport plan $\mathbf{\Omega}$ . The detailed steps of the optimal transport-based Wasserstein distance ( $\mathcal{D}_{\text{Wasserstein}}$ ) computation are outlined in Algorithm 2.

Algorithm 2 Meerkat: Wasserstein Distance Computation in AVOpT

1: Input: Images: $\{I_{i}\}_{i=1}^{k}$ ; Audios: $\{A_{j}\}_{j=1}^{k}$ ; Total Optimal Transport steps: $\mathcal{S}_{\mathbf{\Omega}}$ ; Initial scaled unity matrix: $\bm{\sigma}=\frac{1}{k}\mathbf{1}_{k}$ ; Initial Transport Plan: $\mathbf{\Omega}^{(1)}=\mathbf{1}\mathbf{1}^{\top}$ ; Cosine similarity matrix: $\mathbf{C}_{ij}=c(I_{i},A_{j})$ ; Similarity matrix decay factor: $\beta$ ; Scaled similarity matrix: $\mathbf{\Upsilon}_{ij}={\rm e}^{-\frac{\mathbf{C}_{ij}}{\beta}}$
2: Output: Learned Optimal Transport Plan: $\mathbf{\Omega}$ ; Wasserstein Distance: $\mathcal{D}_{\text{Wasserstein}}$
3: for $t\in\{1,2,3,\cdots,\mathcal{S}_{\mathbf{\Omega}}\}$ do
4:     $\mathbf{Q}\leftarrow\mathbf{\Upsilon}\odot\mathbf{\Omega}^{(t)}$     $\triangleright$ $\odot$ is the Hadamard product
5:     for $l\in\{1,2,3,\cdots,L\}$ do
6:         $\bm{\delta}\leftarrow\frac{1}{k\mathbf{Q}\bm{\sigma}}$ , $\bm{\sigma}\leftarrow\frac{1}{k\mathbf{Q}^{\top}\bm{\delta}}$
7:     $\mathbf{\Omega}^{(t+1)}\leftarrow\text{diag}(\bm{\delta})\mathbf{Q}\text{diag}(\bm{\sigma})$
8: $\mathcal{D}_{\text{Wasserstein}}\leftarrow\langle\mathbf{C}^{\top},\mathbf{\Omega}\rangle$     $\triangleright$ $\langle\cdot,\cdot\rangle$ is the Frobenius dot-product
9: return $\mathbf{\Omega}$ , $\mathcal{D}_{\text{Wasserstein}}$
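The iteration in Algorithm 2 can be sketched in a few lines of plain Python. This is an illustrative reading of the algorithm under stated assumptions, not the released implementation: the function name `ipot_wasserstein`, the default values of `beta`, `steps`, and `inner`, and the interpretation of $\langle\mathbf{C}^{\top},\mathbf{\Omega}\rangle$ as $\sum_{ij}\mathbf{C}_{ij}\mathbf{\Omega}_{ij}$ are our assumptions. The matrix `C` is treated as a $k\times k$ cost matrix supplied by the caller.

```python
import math

def ipot_wasserstein(C, beta=0.5, steps=50, inner=1):
    """Sketch of Algorithm 2 for a k x k cost matrix C (list of lists).

    beta, steps (S_Omega) and inner (L) are hypothetical defaults.
    Returns the transport plan Omega and the resulting distance.
    """
    k = len(C)
    sigma = [1.0 / k] * k                      # sigma = (1/k) * 1_k
    omega = [[1.0] * k for _ in range(k)]      # Omega^(1) = 1 1^T
    upsilon = [[math.exp(-C[i][j] / beta) for j in range(k)] for i in range(k)]
    delta = [1.0 / k] * k                      # placeholder, overwritten in the loop
    for _ in range(steps):
        # Q = Upsilon (Hadamard product) Omega^(t)
        Q = [[upsilon[i][j] * omega[i][j] for j in range(k)] for i in range(k)]
        for _ in range(inner):
            # delta = 1 / (k * Q sigma), then sigma = 1 / (k * Q^T delta)
            delta = [1.0 / (k * sum(Q[i][j] * sigma[j] for j in range(k)))
                     for i in range(k)]
            sigma = [1.0 / (k * sum(Q[i][j] * delta[i] for i in range(k)))
                     for j in range(k)]
        # Omega^(t+1) = diag(delta) Q diag(sigma)
        omega = [[delta[i] * Q[i][j] * sigma[j] for j in range(k)] for i in range(k)]
    # Frobenius inner product, read here as the elementwise sum of C * Omega
    dist = sum(C[i][j] * omega[i][j] for i in range(k) for j in range(k))
    return omega, dist

# Toy cost matrix: matched pairs (the diagonal) are cheap, mismatched pairs are costly.
C = [[0.0, 1.0], [1.0, 0.0]]
omega, dist = ipot_wasserstein(C, beta=0.5, steps=20, inner=1)
```

Because $\bm{\sigma}$ is the last scaling vector updated in each step, every column of the returned plan sums to $1/k$. In the toy example the mass concentrates on the low-cost diagonal, so the distance approaches zero.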

16 AVSBench Data Collection

Given the segmentation mask of an object, we consider its top-most, left-most, bottom-most and right-most points and draw horizontal and vertical projection lines through them, as shown in Fig. 10. These lines intersect at four points which, when connected, give the desired bounding box that completely encloses the object of interest. For each such bounding box, we use the coordinates $(x_{\text{Left}},y_{\text{Top}})$ and $(x_{\text{Right}},y_{\text{Bottom}})$ as GT labels.
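The mask-to-box conversion described above can be sketched as follows. The function name `mask_to_bbox`, the representation of the mask as a list of 0/1 rows, and the use of raw pixel indices (with $y$ increasing downward) are assumptions made for illustration. Any coordinate normalization applied in the actual pipeline is not shown.

```python
def mask_to_bbox(mask):
    """Return ((x_left, y_top), (x_right, y_bottom)) for a binary mask.

    mask is a list of rows containing 0/1 values; returns None if the mask is empty.
    Coordinates are pixel indices, with y growing downward (assumed convention).
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return (min(xs), min(ys)), (max(xs), max(ys))

# Toy 4x4 mask whose foreground occupies columns 1-2 and rows 1-2.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
bbox = mask_to_bbox(mask)
```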

17 Comparison against ImageBind

We employ modality-specific encoders from ImageBind [32] and compare them with the CLIP-CLAP combination used in Meerkat (Tab. 17). Empirical results suggest that our encoder combination performs slightly better than ImageBind. A theoretical analysis of this difference is beyond the scope of the current study and is left for future work.

Image and Audio Encoders	VGG-SS (cIOU $\uparrow$)	LLP (F1-score $\uparrow$)	AVFact (Avg F1-score $\uparrow$)	AVQA (Avg Acc. $\uparrow$)	VALOR (CIDEr $\uparrow$)
ImageBind Encoders	47.71	54.03	0.84	86.30	75.58
CLIP-CLAP (ours)	48.51	54.96	0.85	87.14	76.84

Table 17: Comparison against ImageBind.

18 Other Quantitative Metrics on AVQA task

We evaluate the performance of our method on two additional metrics from the AVQA task, namely Counting (Count) and Comparative (Comp), and report the results in Tab. 18. These two metrics, along with the three other metrics (Existential, Localization, and Temporal) reported in the main paper, complete the evaluation suite for the AVQA and MUSIC-AVQA tasks. We observe steady performance of our method across these categories, which we attribute to the generalizability of Meerkat to coarse-grained tasks.

Model	Generalist?	AVQA: Count $\uparrow$	AVQA: World K $\uparrow$	MUSIC-AVQA: Count $\uparrow$	MUSIC-AVQA: Comp $\uparrow$
AVSD [84]	✗	63.89	61.52	-	-
PanoAVQA [110]	✗	64.91	64.22	-	-
ST-AVQA [42]	✗	70.80	66.01	-	-
CAD [67]	✗	76.37	74.88	-	-
AVST [42]	✗	-	-	68.22	63.31
LAVISH [50]	✗	-	-	73.28	63.49
LAST [54]	✗	-	-	75.23	65.60
Macaw-LLM [60]	✓	78.16	77.54	76.61	67.77
PandaGPT [89]	✓	78.92	78.02	79.06	70.58
VideoLlama [112]	✓	79.90	77.26	82.90	72.32
X-InstructBLIP [70]	✓	81.14	82.29	83.89	73.43
Meerkat (ours)	✓	84.60	87.05	85.70	75.98
$\Delta_{\text{Meerkat}-\text{X-InstructBLIP}}$	✓	+4.26%	+5.78%	+2.16%	+3.47%

Table 18: Quantitative results on AVQA task. The reported numbers on AVQA dataset [106] are on the val split. For the MUSIC-AVQA dataset [42], results are reported on the balanced test set. Here, Count: Counting, Comp: Comparative.
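The $\Delta$ row in Tab. 18 is consistent with a relative gain over X-InstructBLIP, i.e. $100\times(\text{ours}-\text{baseline})/\text{baseline}$. The short sketch below reproduces it from the tabulated scores. The helper name `relative_gain` is ours, and the formula is inferred from the reported numbers rather than stated in the text.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

# (Meerkat, X-InstructBLIP) scores taken from Tab. 18.
pairs = {
    "AVQA Count": (84.60, 81.14),
    "AVQA World K": (87.05, 82.29),
    "MUSIC-AVQA Count": (85.70, 83.89),
    "MUSIC-AVQA Comp": (75.98, 73.43),
}
gains = {name: round(relative_gain(o, b), 2) for name, (o, b) in pairs.items()}
```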

19 Evaluation Metrics

For the visual grounding task, we evaluate our model against other baselines using two key metrics: Intersection over Union (IoU) and Area Under Curve (AUC). These metrics measure our model's ability to accurately localize visual elements in correlation with auditory cues. For the image-guided audio temporal localization task, we report the segment-level F-score. For the Audio-Visual Fact-checking (AVFact) task, we split the evaluation into four different categories, each with its own dimension of assessment, and report the Precision and Recall scores for each category. We report the performance on the audio-visual captioning task using several established metrics, including BLEU@4 [71], METEOR [5], ROUGE [48] and CIDEr [99]. Lastly, for audio-visual question answering, we follow [54, 67] and report results on 5 different types of audio-visual relationships: Existential, Location, Counting, Temporal, and Comparative.
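For reference, two of the metrics above can be sketched in plain Python. These are simplified illustrations rather than the benchmark evaluation code: `box_iou` assumes axis-aligned boxes given as `(x1, y1, x2, y2)` in continuous coordinates, and `segment_f1` assumes a single event class with one binary label per segment. Both names and both input conventions are our assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def segment_f1(pred, gt):
    """F-score over per-segment binary predictions for one event class."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))    # overlap area 1, union area 7
f1 = segment_f1([1, 1, 0, 0], [1, 0, 1, 0])  # one TP, one FP, one FN
```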

20 Failure Cases

Although Meerkat demonstrates strong reasoning and grounding capabilities under various audio-visual settings, there are still cases where the model fails to comprehend complex and obscured references, especially in cluttered environments or in audio with multiple overlapping sounds. Fig. 11 shows a few cases where our method produces suboptimal or incorrect results. In Fig. 11(a), due to the poor visibility of the object of interest (Chainsaw), our model could not correctly identify the corresponding spatial region. Similarly, as the facial region of the speaker is not evident, Meerkat fails to correctly locate the active speaker. In Fig. 11(b), due to the overlapping audio of multiple instruments and the presence of ambient sound, our method could only partially capture the duration over which the guitar makes sound (refer to the supplementary video). The same happens in the other temporal audio localization example, where the audio starts with loud baby laughter that gradually fades as an adult's voice takes over. Our model could only identify the initial part of the baby's sound. For the AVFact task (Fig. 11(c)), in the first example, our model produces the wrong output due to the occluded facial region, whereas in the second example, Meerkat fails to correctly identify the flying bird due to the indistinguishable, cluttered and blurry background.

21 Ethics Statement

In this paper, we propose a novel framework for multi-modal LLMs that combines the audio and visual modalities. For all the tasks, we leverage publicly available datasets and do not collect any private data. However, we acknowledge that public datasets may carry implicit bias [28]. While LLMs pre-trained on web-scale data inherently contain extensive knowledge about the real world, we recognize that they may also learn biases from this data. Moreover, these methods are prone to mistakes and might generate wrong or misleading results. Existing tools for measuring various aspects of LLM-generated outputs (e.g., toxicity [47]) are predominantly restricted to the language modality and are not applicable to other modalities.

It is important for users to recognize these limitations and proceed with caution, especially in scenarios where the precision and neutrality of results are critical. Users are encouraged to thoroughly scrutinize and validate the outputs of the model to avoid disseminating inaccurate information. We will publicly release the codebase and curated datasets to ensure reproducibility and encourage future research. Finally, during our data preparation stage, we do not collect or use any personal or human-subject data.
