
Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering

Published: 11 December 2023
Abstract

    Video question answering (VideoQA) is challenging as it requires reasoning about natural language and multimodal interactive relations. Most existing methods apply attention mechanisms to extract interactions between the question and the video or to extract effective spatio-temporal relational representations. However, these methods neglect the implication of relations between intra- and inter-modal interactions for multimodal learning, and they fail to fully exploit the synergistic effect of multiscale semantics in answer reasoning. In this article, we propose a novel hierarchical synergy-enhanced multimodal relational network (HMRNet) to address these issues. Specifically, we devise (i) a compact and unified relation-oriented interaction module that explores the relation between intra- and inter-modal interactions to enable effective multimodal learning; and (ii) a hierarchical synergistic memory unit that leverages a memory-based interaction scheme to complement and fuse multimodal semantics at multiple scales to achieve synergistic enhancement of answer reasoning. With careful design of each component, our HMRNet has fewer parameters and is computationally efficient. Extensive experiments and qualitative analyses demonstrate that the HMRNet is superior to previous state-of-the-art methods on eight benchmark datasets. We also demonstrate the effectiveness of the different components of our method.

    1 Introduction

    Understanding multimodal information in the real world is a significant manifestation of a machine’s progress toward cognitive intelligence. Thanks to the great advances made in computer vision [15] and natural language processing [46], researchers are paying more attention to vision–language tasks, e.g., image/video captioning [8, 9], language video localization [53, 70], and visual question answering [2, 18]. Among these, a particularly challenging task is video question answering (VideoQA). Compared with image question answering (ImageQA) [2, 32, 42, 71], VideoQA is more difficult because it not only needs to model the semantic connection between the question and each frame but also needs to extract complex interactive relations between the question and the temporal content of the video.
    Many existing VideoQA methods [5, 13, 19, 25, 28, 34, 39, 47, 69, 72] adopted recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to encode linguistic embeddings and video features, respectively. Several works [36, 58, 63] designed attention schemes to extract the semantics of important words and question-related spatio-temporal representations. To obtain more expressive features, some methods proposed the use of global self-attention mechanisms [11, 20, 30] or designed extra memory modules [24, 62] to augment interactive feature encoding capabilities. Recently, with the great success of large-scale pre-training models [23, 46] seen in natural language processing, some works [27, 45, 57, 59, 64, 67] used BERT-style [23] pre-training methods for VideoQA.
    While previous studies have yielded promising results, the majority of them focused on learning interactions between the question and the video. For example, question-guided attention methods [18, 19, 55] focused on extracting video features that were related to words or sentences. Co-attention [7, 30] or multistep attention [5, 13] methods extracted the relation between words and frames as well as that between sentences and clips. Furthermore, memory-augmented networks [6, 24, 62] have learned the long-term dependencies between the question and the video. Going in another direction, several other approaches have focused on modeling spatio-temporal relations of video more effectively. For example, Seo et al. [44] explored the high-level relation between appearance and motion information. Some works [4, 34, 68] leveraged object detection across video frames to acquire fine-grained interactive representations. Moreover, works [21, 49, 60] employed large-scale language models [23, 46] to extract deep spatial-temporal contextual semantics. However, these methods have failed to effectively explore the impact of the relationship between intra- and inter-modal interactions on multimodal learning. While recent BERT-style approaches [1, 27, 57, 59, 64, 67] have achieved promising inferential performance by developing comparable independent transformer encoding structures to extract intra- and inter-modal interactions, they still give inadequate consideration to the relationship between these two types of interaction during multimodal learning. Moreover, these approaches rely heavily on additional large-scale pre-training, and the resulting models are excessively large, making training challenging. As shown in Figure 1(a), to answer the question, we need to establish the inter-modal connections and also capture the contextual clues within each modality. Moreover, it is essential to establish the relationship between the inter-modal and intra-modal interactions to reason about the answer “Guitar”.
    Fig. 1. An example of VideoQA. (a) Our model leverages inter-modal interactions (marked with colored and dashed lines) and intra-modal interactions (marked with black curves), along with their semantic dependency, to infer answer clues such as “what is”, “the man”, and “playing” when provided with multiple consecutive video clips and a question. (b) Our model explores the synergistic effects of different scales of video clip inputs on answer inference, and determines that coarser scales offer more advantageous reasoning for this example (indicated by the importance values in the pink box during the answer inference process).
    The majority of existing methods infer video information at a single temporal scale by dense sampling and fail to exploit the synergistic complementarity of the multiscale interactive semantics of questions and videos for answer reasoning. Given an arbitrary question, one may better infer the correct answer from visual contents at different temporal scales. As shown in Figure 1(b), the information presented at a coarser scale (e.g., scale 1 or 2, which involves fewer video frames from the beginning to the end) offers a better understanding of the multimodal interaction. Nevertheless, for a different type of task, e.g., counting the number of repeated actions, visual information at a finer scale could also be helpful. Although a few studies [7, 12, 19, 31, 34, 35] have investigated the usefulness of multiscale information for answer reasoning, they all constructed multiscale information from a specific scale and did not sufficiently explore the synergistic effect of multiscale semantics. One of our recent works [40] used questions to recursively extract relationships with multiscale visual content. However, this work ignored the multiscale semantics of the question and failed to consider the synergistic reasoning of multiscale interactive relations.
    To solve the above problems, we propose a novel hierarchical synergy-enhanced multimodal relational network (HMRNet) for VideoQA. Our HMRNet comprises two main components: (i) a relation-oriented interaction module for multimodal learning by exploring the relation between intra- and inter-modal interactions in a compact and unified framework; and (ii) a hierarchical synergistic memory (HSM) unit that can enhance the answer reasoning by exploiting the synergy effect of multiscale semantics. The former module redesigns the transformation of the attention head and builds adaptive self-attentive learning to implement cross-modal interaction and intra-modal reasoning, respectively. This module uses parameter sharing to parallelize the extraction of multimodal interactions at multiple temporal scales, preserving efficiency and compactness. The latter component develops a bottom-up and top-down memory-based interaction scheme to complement and fuse multimodal semantics at different hierarchies to achieve multiscale synergistic reasoning.
    Furthermore, our proposed HMRNet has fewer parameters and higher computational efficiency than existing methods. With these advantages, HMRNet can be efficiently extended to different types of VideoQA tasks such as open-ended classification, open-ended generation, repetition counting, and multi-choice QA. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness of our method. Our technical contributions are summarized below.
    We propose a novel HMRNet for VideoQA. With careful design, the HMRNet requires few parameters, is computationally efficient and robust, and can be reused when changing feature encoders.
    A relation-oriented interaction module is proposed to extract both intra- and inter-modal interactions within a compact and unified framework, and we find that intra-modal interactions can build on inter-modal interactions to further improve multimodal learning.
    An HSM unit is proposed to complement and fuse multiscale semantics in a bottom-up and top-down interactive approach to achieve synergistic enhancement of answer reasoning.
    We conduct extensive evaluations on eight benchmark datasets, namely, MSVD-QA, MSRVTT-QA, ActivityNet-QA, TGIF-QA, Youtube2Text-QA, SUTD-TrafficQA, Social-IQ, and NExT-QA. When comparing our HMRNet to other methods, we observe significant improvements in performance on nearly all datasets.

    2 Related Work

    Popular VideoQA methods usually extract question and video features using off-the-shelf models. They also often design various attentive interaction structures to extract multimodal representations for answering.
    Attention mechanisms. Attention mechanisms are widely used in several areas, e.g., improving the accuracy of video classification [37], and enhancing sequential modeling with a transformer architecture [46]. Some VideoQA studies designed various attention schemes to investigate the inter-modal interactions between the question and the video. For example, Xu et al. [55] proposed a method called “gradually refined attention” that uses the question as guidance for the extraction of appearance and motion features. Some works [30, 69] proposed a co-attention model to extract question-relevant video features as well as video-relevant question semantics. Some other works [5, 24, 62] proposed memory-augmented networks to handle the semantic interaction between questions and videos across the long temporal dimension. Zhao et al. [73] extracted interactions between the video and the question from the frame level to the segment level. Jin et al. [22] proposed a question-knowledge-guided spatial-temporal attention model to learn the video representation. Xiao et al. [50] proposed modeling video as a conditional hierarchical structure, where a multi-granular visual representation for language concept alignment is obtained under the guidance of textual clues.
    Several other methods instead focused on extracting more efficient spatio-temporal representations from videos using attention mechanisms. With the inherent structural properties between frames and clips in the video, works [25, 26] extracted the near-term and far-term relations of spatio-temporal representation from the clip level and video level. Park et al. [39] proposed bridged visual-to-visual interaction to incorporate two complementary visual pieces of information, appearance and motion, using the question graph as an intermediate bridge. Some works [4, 44, 68] approached the VideoQA task by leveraging a Fast-RCNN [43] to detect objects-of-interest and acquire fine-grained spatio-temporal relations. Huang et al. [16] proposed introducing the positional information of objects in the video to establish a position-aware graph model, while Wang et al. [48] proposed another dual-visual graph reasoning unit to infer answers in the video. In addition, Zhang et al. [72] proposed an action-centric relation transformer network to emphasize dynamic temporal attributes. Liu et al. [36] proposed a dynamic self-attention model to select important tokens for a more efficient extraction of the interaction between appearance and motion.
    However, the aforementioned approaches have given limited consideration to both inter- and intra-modal interactions, failing to effectively explore the impact of their relationships on multimodal learning. Hence, in this work, we introduce a compact and unified relation-oriented interaction module to address this issue.
    Multiscale methods. The purpose of multiscale methods is usually to allow models to extract features at multiple scales, thereby improving performance on the target task. Multiscale methods are used in a variety of fields, such as salient object detection [43], video action classification [37], and natural language video localization [70]. For VideoQA, Jiang et al. [19] designed a multiscale temporal contextual attention block to extract the multimodal interactions by setting a 1D temporal convolution kernel with different dilation rates. Similarly, Liu et al. [35] proposed to use temporal convolution to down-sample feature maps along the temporal dimension and perform weighted aggregation for features at different scales. Li et al. [31] designed a multiscale relation unit, which captures temporal information by modeling different distances between motions. Liu et al. [34] suggested to use temporal average pooling with different kernel sizes to capture multiscale temporal information, and then aggregate the output of each scale feature with attention guided by the question. Guo et al. [12] proposed building graph networks at different scales and progressively achieved graph fusion in a bottom-up and top-down way to acquire question-relevant visual features. Lastly, Gao et al. [7] proposed a generalized pyramid co-attention structure to extract rich video contextual semantic information.
    However, all of these methods extracted multiscale interactions from a specific scale and failed to exploit the complementarity of multiscale information in the answer reasoning. Therefore, in the present work, we build on multiscale sampling and develop an HSM unit to complement and fuse multimodal interactions at different scales to achieve synergistic reasoning.

    3 Methods

    An overview of the proposed HMRNet is shown in Figure 2. It is worth mentioning that instead of focusing on the exploitation of large-scale pairwise video–text data for pre-training as well as building huge models to enhance the answer reasoning ability (e.g., [57] was pre-trained on an additional corpus with 100 million videos and was required to train more than 200 million parameters), we focus on designing compact and efficient components. We believe that our method is equally valuable and also a way to promote broader research, especially while powerful computational resources are still expensive to access.
    Fig. 2. Overview of the proposed HMRNet for VideoQA. Given input of visual features \(\mathbf {X}^n\) ( \(n\in [1,N]\) ) and question embeddings \(\mathbf {Q}\) , the relation-oriented interaction module considers both intra- and inter-modal interactions and their relations to provide multimodal semantics \(\mathbf {U}^n\) . The HSM unit performs complementarity and fusion at different semantic levels in a bottom-up and top-down memory-based interactive approach to obtain the multiscale synergistic output \(\mathbf {O}\) for answer decoding.
    Specifically, as in previous works, we first extract essential features from the video and the question. We utilize CNNs to extract multiscale appearance-motion features and a GRU network to obtain question embeddings. Then, we design a relation-oriented interaction module to extract the intra- and inter-modal interactions for the question and the video at multiple scales. An HSM unit is built to incorporate the multiscale interactive relations and produce the final representation for answering.

    3.1 Feature Encoders

    Similar to previous studies [7, 8, 10, 29, 36, 50, 62, 68, 69, 72], we first encode essential features for videos and questions. However, different from previous methods that constructed multiscale interactive features from a single densely sampled input, we perform sampling at multiple scales to extract the appearance-motion features and their interactions with the question. For the same number of sampled video clips, our sampling strategy captures both local and global aspects of the events in a video. This rich visual information enables us to better conduct synergistic reasoning about the answers.
    Specifically, for input video \(\mathcal {V}\) , we apply a uniform sampling along the temporal dimension to acquire a group of frames \(\lbrace I_i^n\rbrace _{i=1}^{\tau {T}}\) , where \(n\in [1,N]\) denotes the sampling scale and \(\tau = 2^{n-1}\) is the sampling rate. We adopt 2D and 3D ResNets [14, 15] to encode the frame-wise appearance feature \(\mathbf {V}^n\) and clip-wise motion feature \(\mathbf {M}^n\) , respectively.
    \begin{equation} {\mathbf {V}}^n = \left\lbrace v_i^n | v_i^n\in \mathbb {R}^{2048} \right\rbrace _{i=1}^{\tau {T}}, \end{equation}
    (1)
    \begin{equation} {\mathbf {M}}^n = \left\lbrace m_t^n | m_t^n\in \mathbb {R}^{2048} \right\rbrace _{t=1}^{\tau }, \end{equation}
    (2)
    Each feature vector in \(\mathbf {V}^n\) and \(\mathbf {M}^n\) is fed into linear transformation layers to map the feature into a \(d\)-dimensional space. We concatenate these features along the temporal dimension and add a learnable temporal positional embedding \(\mathbf {P}\in \mathbb {R}^{(\tau T + \tau) \times d}\) [46] to acquire the appearance-motion feature of length \(L_X^n = \tau T + \tau\) as follows:
    \begin{equation} {\mathbf {X}}^n = \left\lbrace x_j^n | x_j^n\in \mathbb {R}^d \right\rbrace _{j=1}^{L_X^n}, \end{equation}
    (3)
    For question and answer candidates, we extract token-wise features with a pre-trained GloVe word model [41]. We map such features into the \(d\) -dimension space and extract sequential features with a bidirectional GRU network. The acquired question embedding \(\mathbf {Q}\) and answer candidate feature \(\mathbf {A}^k\) can be represented as
    \begin{equation} \mathbf {Q}=\lbrace q_j | q_j\in \mathbb {R}^d\rbrace _{j=1}^{L_Q}, \end{equation}
    (4)
    \begin{equation} {\mathbf {A}}^k = \left\lbrace a_j^k | a_j^k\in \mathbb {R}^d \right\rbrace _{j=1}^{L_A^k}, \end{equation}
    (5)
    where \(L_Q\) and \(L_A^k\) denote the number of tokens in the question and the \(k\)-th answer candidate, respectively.
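    To make the multiscale encoding concrete, the following PyTorch sketch illustrates one possible implementation of Equations (1)–(3). It is only a sketch under stated assumptions: the appearance and motion features are assumed to be pre-extracted 2048-dimensional vectors, T (frames per clip) is set to a typical value, and all class and variable names (e.g., MultiScaleEncoder) are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the multiscale appearance-motion encoding (Eqs. (1)-(3)).
# Assumptions: appearance/motion features are pre-extracted 2048-d vectors
# (e.g., from 2D/3D ResNets); scale n contributes tau = 2^(n-1) clips of T frames.
class MultiScaleEncoder(nn.Module):
    def __init__(self, d=512, T=16, N=3, feat_dim=2048):
        super().__init__()
        self.N = N
        self.app_proj = nn.Linear(feat_dim, d)   # maps V^n into the d-dimensional space
        self.mot_proj = nn.Linear(feat_dim, d)   # maps M^n into the d-dimensional space
        # one learnable positional embedding per scale, of size (tau*T + tau) x d
        self.pos = nn.ParameterList([
            nn.Parameter(torch.zeros((2 ** (n - 1)) * T + 2 ** (n - 1), d))
            for n in range(1, N + 1)
        ])

    def forward(self, app_feats, mot_feats):
        # app_feats[n-1]: (B, tau*T, 2048) frame features at scale n
        # mot_feats[n-1]: (B, tau, 2048) clip features at scale n
        X = []
        for n in range(1, self.N + 1):
            v = self.app_proj(app_feats[n - 1])
            m = self.mot_proj(mot_feats[n - 1])
            x = torch.cat([v, m], dim=1) + self.pos[n - 1]   # X^n with positions, Eq. (3)
            X.append(x)
        return X                                             # list of X^1, ..., X^N
```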

    3.2 Relation-Oriented Interaction

    Effective handling of intra- and inter-modal interactions is important for extracting multimodal interactive semantics in VideoQA. Existing approaches seldom focus on both types of interaction and neglect the role of their relations in multimodal learning. To address this issue, we propose a relation-oriented interaction module that considers both intra- and inter-modal interactions and their relations in a compact and unified framework. Specifically, we follow the principles of simplicity and efficiency to extract multimodal interactions. By default, we construct intra-modal interactions after the inter-modal interactions, which enables us to explore their relations. We assume that the intra-modal interactions can refine the inter-modal interactions and thus promote answer reasoning. Figure 3(a) illustrates this sequence, and we analyze the module in detail through ablation experiments (see Section 4.4.2).
    Fig. 3. Flow chart of our proposed (a) relation-oriented interaction module and (b) HSM unit.
    To extract inter-modal interactions, we construct a compact multi-headed cross-modal attention structure that integrates question-to-video and video-to-question processes. Given the appearance-motion feature \(\mathbf {X}^n\) and question feature \(\mathbf {Q}\) , the interactive output of the question-to-video process is computed as
    \begin{equation} {\tilde{\mathbf {X}}}^n={\mathbf {X}}^n+\mathrm{MCA}({\mathbf {X}}^n,\mathbf {Q}), \end{equation}
    (6)
    where \(\mathrm{MCA}(\cdot)\) denotes the operation of a multi-headed cross-modal attention layer. For a single attentional head, the output of \(\mathrm{MCA}_{X}^h\) is
    \begin{equation} \mathrm{MCA}_{X}^h=\mathrm{softmax}\left(\frac{{\mathop {F}}_{X}^h{\mathop {F}}_{{\tilde{Q}}}^{h\top }}{\sqrt {d}}\right){\mathop {F}}_{Q}^h, \end{equation}
    (7)
    with
    \begin{equation} \left\lbrace \begin{array}{l} \mathop {F}_{X}^h=\mathrm{LN}({\mathbf {X}}^n){\mathbf {W}}_{X}^h, \\ \mathop {F}_{{\tilde{Q}}}^h=\mathrm{LN}(\mathbf {Q}){\tilde{\mathbf {W}}}^h, \\ \mathop {F}_{Q}^h=\mathrm{LN}(\mathbf {Q}){\mathbf {W}}_{Q}^h, \\ \end{array} \right. \end{equation}
    (8)
    where \(\mathrm{LN}(\cdot)\) is the normalization layer. The learnable weight matrices are \(\mathbf {W}_{X}^h\) , \({\tilde{\mathbf {W}}}^h\) , \(\mathbf {W}_{Q}^h\) \(\in \mathbb {R}^{d\times d/H}\) . \(\mathrm{MCA}({\mathbf {X}}^n,\mathbf {Q})\) is obtained by concatenating all \(H\) attentional heads in the feature dimension. Similarly, the interactive output of the video-to-question process is computed as
    \begin{equation} \tilde{\mathbf {Q}}^n=\mathbf {Q}+\mathrm{MCA}(\mathbf {Q},{\mathbf {X}}^n), \end{equation}
    (9)
    Considering the symmetry of these two interactive processes, the attention head parameters of \(\mathrm{MCA}(\mathbf {Q},{\mathbf {X}}^n)\) in Equation (9) are shared with those in Equation (6) to maintain the semantic consistency of the feature space. We have
    \begin{equation} \mathrm{MCA}_Q^h=\mathrm{softmax}\left(\frac{\mathop {F}_Q^h\mathop {F}_{{{\tilde{X}}}}^{h\top }}{\sqrt {d}}\right){\mathop {F}}_{{X}}^h, \end{equation}
    (10)
    \begin{equation} {\mathop {F}}_{{{\tilde{X}}}}^{h}=\mathrm{LN}(\mathbf {X}^n){\tilde{\mathbf {W}}}^h, \end{equation}
    (11)
    We also concatenate the outputs of all attentional heads to obtain \(\mathrm{MCA}(\mathbf {Q},{\mathbf {X}}^n)\) . In short, the question-to-video and video-to-question processes are dominated by video features and question features, respectively. Both processes use cross-modal attention to generate modality-driven feature representations.
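    For illustration, the following is a minimal PyTorch sketch of this shared-parameter cross-modal attention (Equations (6)–(11)). The class name, the even split of \(d\) across \(H\) heads, and the absence of an output projection are assumptions made for readability, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the compact multi-headed cross-modal attention (Eqs. (6)-(11)).
# The three projections W_X, W~, and W_Q are shared between the question-to-video
# and video-to-question directions, as described in the text.
class SharedMCA(nn.Module):
    def __init__(self, d=512, H=8):
        super().__init__()
        self.H, self.dh, self.scale = H, d // H, d ** 0.5
        self.ln_x, self.ln_q = nn.LayerNorm(d), nn.LayerNorm(d)
        self.W_X = nn.Linear(d, d, bias=False)        # produces F_X (values for Eq. (10))
        self.W_tilde = nn.Linear(d, d, bias=False)    # produces F_~Q and F_~X (keys)
        self.W_Q = nn.Linear(d, d, bias=False)        # produces F_Q (values for Eq. (7))

    def _heads(self, t):   # (B, L, d) -> (B, H, L, d/H)
        B, L, _ = t.shape
        return t.view(B, L, self.H, self.dh).transpose(1, 2)

    def _merge(self, t):   # (B, H, L, d/H) -> (B, L, d), i.e., concatenate the heads
        B, H, L, dh = t.shape
        return t.transpose(1, 2).reshape(B, L, H * dh)

    def forward(self, X, Q):
        Fx = self._heads(self.W_X(self.ln_x(X)))
        Fq = self._heads(self.W_Q(self.ln_q(Q)))
        Fq_t = self._heads(self.W_tilde(self.ln_q(Q)))
        Fx_t = self._heads(self.W_tilde(self.ln_x(X)))
        # question-to-video (Eq. (7)): video-driven queries attend over the question
        A_x = F.softmax(Fx @ Fq_t.transpose(-1, -2) / self.scale, dim=-1)
        X_out = X + self._merge(A_x @ Fq)                   # residual as in Eq. (6)
        # video-to-question (Eq. (10)): question-driven queries attend over the video
        A_q = F.softmax(Fq @ Fx_t.transpose(-1, -2) / self.scale, dim=-1)
        Q_out = Q + self._merge(A_q @ Fx)                   # residual as in Eq. (9)
        return X_out, Q_out
```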
    To explore the relations between intra- and inter-modal interactions, we add to our relation-oriented interaction module an adaptive self-attention scheme for intra-modal learning using a unified framework. Given inter-modal interaction features \(\tilde{\mathbf {X}}^n\) and \(\tilde{\mathbf {Q}}^n\) , their intra-modal outputs are represented, respectively, as
    \begin{equation} \hat{\mathbf {X}}^n=\tilde{\mathbf {X}}^n+\sigma (\mathbf {R}_X\tilde{\mathbf {X}}^n{\hat{\mathbf {W}}}_{X}), \end{equation}
    (12)
    \begin{equation} \hat{\mathbf {Q}}^n=\tilde{\mathbf {Q}}^n+\sigma (\mathbf {R}_Q\tilde{\mathbf {Q}}^n{\hat{\mathbf {W}}}_{Q}), \end{equation}
    (13)
    where \(\sigma (\cdot)\) is the \(\mathrm{ELU}\) activation function, and \(\hat{\mathbf {W}}_{X}\) , \(\hat{\mathbf {W}}_{Q}\) \(\in \mathbb {R}^{d\times d}\) are learnable weight matrices. \(\mathbf {R}_X\) and \(\mathbf {R}_Q\) denote the self-attentive matrices of \(\tilde{\mathbf {X}}^n\) and \(\tilde{\mathbf {Q}}^n\) , respectively, and are obtained by applying linear transformation and dot-product to the modal features.
    \begin{equation} \mathbf {R}_X=\mathrm{softmax}(\sigma (\tilde{\mathbf {X}}^n{\mathbf {W}}_1)\sigma (\tilde{\mathbf {X}}^n{\mathbf {W}}_2)^\top), \end{equation}
    (14)
    \begin{equation} \mathbf {R}_Q=\mathrm{softmax}(\sigma (\tilde{\mathbf {Q}}^n{\mathbf {W}}_3)\sigma (\tilde{\mathbf {Q}}^n{\mathbf {W}}_4)^\top), \end{equation}
    (15)
    Here, \(\mathbf {W}_1\) , \(\mathbf {W}_2\) , \(\mathbf {W}_3\) , and \(\mathbf {W}_4\) are learnable weight matrices. The above self-attentive scheme can adaptively learn the weights of all features within the modality and perform interactive reasoning for each modality simultaneously.
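    A possible realization of this adaptive self-attention (Equations (12)–(15)) is sketched below; the same module would be instantiated once for the video branch and once for the question branch. The names are illustrative, and the sketch should be read as one plausible implementation rather than the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the adaptive self-attention used for intra-modal interaction
# (Eqs. (12)-(15)); Z stands for either ~X^n or ~Q^n.
class AdaptiveSelfAttention(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.W1 = nn.Linear(d, d, bias=False)     # W_1 (or W_3 for the question branch)
        self.W2 = nn.Linear(d, d, bias=False)     # W_2 (or W_4 for the question branch)
        self.W_hat = nn.Linear(d, d, bias=False)  # \hat{W}_X or \hat{W}_Q

    def forward(self, Z):                          # Z: (B, L, d)
        a = F.elu(self.W1(Z))
        b = F.elu(self.W2(Z))
        R = F.softmax(a @ b.transpose(-1, -2), dim=-1)   # self-attentive matrix, Eq. (14)/(15)
        return Z + F.elu(R @ self.W_hat(Z))              # residual refinement, Eq. (12)/(13)
```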
    We concatenate \(\hat{\mathbf {X}}^n\) and \(\hat{\mathbf {Q}}^n\) in the temporal dimension to obtain the multimodal representation \(\mathbf {U}^n\) . For other \(n\) values, we adopt the same parameters as in Equations (6)–(15) to acquire multiscale interactive semantics. This parameter sharing makes the module compact and ensures that the interactive feature space is compatible at all scales. In summary, our relation-oriented interaction module incorporates multi-headed cross-modal attention and adaptive self-attention to extract inter- and intra-modal interactions, respectively, while considering the relationship between these features to facilitate multimodal learning. Unlike previous methods that used cross-modal attention (e.g., STA [10], Bridge2Answer [39], and HQGA [50]), or co-attention mechanisms (e.g., LAD-Net [29], PSAC [30], and MGTA-Net [54]) to extract inter-modal interactions, we leverage the multi-headed attention mechanism and redesign the transformation on each head (see Equations (7) and (10)) to synchronously extract question-to-video interactions and video-to-question interactions. This approach ensures semantic consistency in the multimodal feature space while maintaining the simplicity of the module. The ablation experiments in Section 4.4.2 demonstrate the benefits of this redesigned attention head. In addition, unlike Bridge2Answer [39], which requires semantic dependency analysis on questions to construct a graph structure, or PSAC [30], which directly uses the self-attention structure of the transformer [46] to extract intra-modal interactions, our adaptive self-attention approach uses a simple linear transformation to learn the relationships between features within each modality and conducts interaction inference for each modality simultaneously. More importantly, our relation-oriented interaction module is designed to be compact and unified, considering both inter- and intra-modal interactions and effectively exploring the impact of their relationship on multimodal learning.
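    Putting the two sketches above together, the whole relation-oriented interaction module could be assembled roughly as follows. The single SharedMCA and AdaptiveSelfAttention instances (from the illustrative sketches above) are reused for every scale \(n\), which is how the parameter sharing described in this section would look in code; again, this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Composition of the two sketches above: inter-modal interaction first, then
# intra-modal refinement, with the same parameters reused at every scale.
class RelationOrientedInteraction(nn.Module):
    def __init__(self, d=512, H=8):
        super().__init__()
        self.mca = SharedMCA(d, H)                  # Eqs. (6)-(11)
        self.intra_x = AdaptiveSelfAttention(d)     # Eqs. (12) and (14)
        self.intra_q = AdaptiveSelfAttention(d)     # Eqs. (13) and (15)

    def forward(self, X_scales, Q):
        U = []
        for X in X_scales:                          # X_scales: [X^1, ..., X^N]
            X_t, Q_t = self.mca(X, Q)               # inter-modal interactions
            X_hat, Q_hat = self.intra_x(X_t), self.intra_q(Q_t)
            U.append(torch.cat([X_hat, Q_hat], dim=1))   # U^n: temporal concatenation
        return U
```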

    3.3 Hierarchical Synergistic Memory

    Given a question, a more reliable way of searching for answer clues is to incorporate visual information at different temporal scales. Most existing methods fail to exploit the complementarity between multiscale semantics in answer reasoning. In this section, we introduce the HSM unit (see Figure 3(b)), which is inspired by the GRU memory unit [3] to complement and fuse multimodal semantics at different scales using a bottom-up and top-down approach. This results in more confident answering.
    Specifically, the bottom-up and top-down operations provide a memory-based interaction to iteratively update the features at each scale, thus enabling synergistic reasoning of hierarchical semantics. For multimodal representation \({\mathbf {U}}^n, n\gt 1\) , the iterative output of the bottom-up operation is
    \begin{equation} {\mathbf {U}}_{\uparrow }^n=(1-{\mathbf {\alpha }}^n) \odot \mathbf {\gamma }^n{{\mathbf {U}}}_{\uparrow }^{n-1}+{\mathbf {\alpha }}^n \odot {\Delta {\mathbf {U}}^n}, \end{equation}
    (16)
    where \({\mathbf {U}}_{\uparrow }^1={\mathbf {U}}^1\) , and \(\odot\) denotes the element-wise product. \({\mathbf {\gamma }}^n\) represents the connection matrix, which maps the hidden information from the previous level to the current level, thereby enabling the transfer of semantic features of different sizes. In addition, the connection matrix \({\mathbf {\gamma }}^n\) enables interactions between different hierarchical levels. \({\mathbf {\alpha }}^n\) is the output of the update gate, which determines how much memory information is passed from the previous level to the current level. \(\Delta {\mathbf {U}}^n\) represents the candidate hidden information. We have
    \begin{equation} {\mathbf {\gamma }}^n=\mathbf {U}^n{\mathbf {W}}_{{\gamma }{1}} ({{\mathbf {U}}_{\uparrow }^{n-1}{\mathbf {W}}_{{\gamma }{2}}})^\top , \end{equation}
    (17)
    \begin{equation} {\mathbf {\alpha }}^n=\delta (\mathbf {U}^n{\mathbf {W}}_{\alpha {1}} + \mathbf {\gamma }^n{\mathbf {U}}_{\uparrow }^{n-1}{\mathbf {W}}_{\alpha {2}} + b_{\mathbf {\alpha }}), \end{equation}
    (18)
    \begin{equation} \Delta {\mathbf {U}}^n=\sigma (\mathbf {U}^n{\mathbf {W}}_{\Delta {1}} + \mathbf {\mu }^n \odot \mathbf {\gamma }^n{\mathbf {U}}_{\uparrow }^{n-1}{\mathbf {W}}_{\Delta {2}} + b_{\mathbf {\Delta }}), \end{equation}
    (19)
    with
    \begin{equation} {\mathbf {\mu }}^n=\delta (\mathbf {U}^n{\mathbf {W}}_{\mu {1}} + \mathbf {\gamma }^n{\mathbf {U}}_{\uparrow }^{n-1}{\mathbf {W}}_{\mu {2}} + b_{\mathbf {\mu }}), \end{equation}
    (20)
    where \(\delta (\cdot)\) is the sigmoid function. \({\mathbf {\mu }}^n\) is the output of the reset gate, which determines how to combine the new input information with the previous memory. The top-down operation is the reverse of the bottom-up operation: for \(n\) from \(N\) to 1, it updates the multimodal representations in a manner similar to that described in Equations (16)–(20).
    Notably, the HSM treats the multimodal feature matrix of varying sizes at different scales as sequential inputs. This is beyond the capability of the vanilla GRU unit. It then complements and fuses multimodal semantics from different scales in each iteration step, using a memory-based interaction to accomplish hierarchical synergistic reasoning. We note that the features obtained by the last iteration step in the bottom-up and top-down operations are \({\mathbf {U}}_{\uparrow }\) and \({\mathbf {U}}_{\downarrow }\) , respectively; we then add them together to obtain the final semantic representations \({\mathbf {O}}\) for answer decoding.
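    The bottom-up pass of the HSM could be sketched roughly as follows (Equations (16)–(20)); the top-down pass would apply the same kind of cell over the reversed scale order. The cell and function names, and the placement of the bias terms inside the second linear layer of each gate, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Rough sketch of one HSM step (Eqs. (16)-(20)). The connection matrix gamma^n
# maps the previous scale's memory (of a different temporal length) to the
# current scale before the GRU-style gating is applied.
class HSMCell(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.Wg1 = nn.Linear(d, d, bias=False)   # W_gamma1
        self.Wg2 = nn.Linear(d, d, bias=False)   # W_gamma2
        self.Wa1, self.Wa2 = nn.Linear(d, d, bias=False), nn.Linear(d, d)   # update gate
        self.Wm1, self.Wm2 = nn.Linear(d, d, bias=False), nn.Linear(d, d)   # reset gate
        self.Wd1, self.Wd2 = nn.Linear(d, d, bias=False), nn.Linear(d, d)   # candidate
        self.elu = nn.ELU()

    def forward(self, U_n, U_prev):
        # U_n: (B, L_n, d) current scale; U_prev: (B, L_prev, d) accumulated memory
        gamma = self.Wg1(U_n) @ self.Wg2(U_prev).transpose(-1, -2)   # Eq. (17)
        H_prev = gamma @ U_prev                   # previous memory mapped to the current size
        alpha = torch.sigmoid(self.Wa1(U_n) + self.Wa2(H_prev))      # Eq. (18)
        mu = torch.sigmoid(self.Wm1(U_n) + self.Wm2(H_prev))         # Eq. (20)
        dU = self.elu(self.Wd1(U_n) + mu * self.Wd2(H_prev))         # Eq. (19)
        return (1 - alpha) * H_prev + alpha * dU                      # Eq. (16)

def bottom_up(cell, U_scales):
    """Iterate over scales n = 1..N; the top-down pass simply reverses U_scales."""
    memory = U_scales[0]
    for U_n in U_scales[1:]:
        memory = cell(U_n, memory)
    return memory
```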

    3.4 Answer Decoder and Loss Function

    Following existing practice [4, 7, 11, 16, 19, 20, 21, 25, 27, 34, 39, 44, 51, 52, 54], we adopt different decoding functions for different types of questions. For the open-ended classification task, we use two fully connected layers and the vanilla cross-entropy function to compute the loss. To generate answers for the open-ended generation task, we use a single GRU layer with soft attention over the question, similar to [49]. For the repetition counting task, we adopt the mean squared error (MSE) as our loss function and apply a rounding function to the output to obtain an integer result. For the multi-choice task, we process each answer candidate in the same way as the question input and share the corresponding parameters. We use the hinge loss to compute the loss between each predicted answer and the correct answer.
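    As an illustration of how these task-specific heads and losses fit together, the sketch below shows hedged, simplified versions of the classification head, the counting loss, and the multi-choice hinge loss. The hidden sizes, answer-vocabulary size, and margin value are assumptions, and the generation decoder (GRU with soft attention) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, vocab_size = 512, 4000   # assumed sizes, not taken from the paper's decoder definition

# open-ended classification: two fully connected layers + cross-entropy
classifier = nn.Sequential(nn.Linear(d, d), nn.ELU(), nn.Linear(d, vocab_size))

def classification_loss(features, labels):
    return F.cross_entropy(classifier(features), labels)

# repetition counting: regress with MSE during training, round at inference time
def counting_loss(predictions, targets):
    return F.mse_loss(predictions.squeeze(-1), targets.float())

def counting_inference(predictions):
    return torch.round(predictions.squeeze(-1))

# multi-choice: hinge loss between the correct candidate's score and the others
def multichoice_hinge_loss(scores, correct_idx, margin=1.0):
    # scores: (B, K) similarity between the fused representation and each candidate
    pos = scores.gather(1, correct_idx.unsqueeze(1))            # (B, 1) correct-answer score
    hinge = torch.clamp(margin + scores - pos, min=0.0)         # (B, K) per-candidate hinge
    mask = torch.ones_like(scores)
    mask.scatter_(1, correct_idx.unsqueeze(1), 0.0)             # exclude the correct candidate
    return (hinge * mask).mean()
```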

    4 Experiments

    In this section, we first introduce the VideoQA datasets used for our experiments as well as implementation details. We then compare the proposed method with other state-of-the-art methods and conduct ablation studies to justify each component of our method.

    4.1 Datasets

    Eight VideoQA benchmark datasets, which contain videos of different lengths, scenarios, and question-answering types, are adopted for evaluation.
    MSVD-QA [55] contains 50,505 question-answer pairs and 1,970 video clips, with questions divided into what, who, how, when, and where, all of which are open-ended tasks.
    MSRVTT-QA [55] contains 243k question-answer pairs and 10k videos, with questions in the same form as MSVD-QA but with more complex visual scenarios and longer video lengths of around 10 to 30 seconds.
    ActivityNet-QA [65] contains 58k open-ended question-answer pairs and 5.8k videos. All videos are untrimmed web videos with an average length of 180 seconds.
    TGIF-QA [18] contains 165k question-answer pairs and 72k animated GIFs. This dataset covers four types of questions: Action, Trans, FrameQA, and Count. Action is a multi-choice task to identify the repetitive action; Trans is another multi-choice task to identify the temporal transition of an action; FrameQA is an open-ended task whose answers can be inferred from a single frame; Count is a repetition counting task to determine how many times an action is repeated.
    Youtube2Text-QA [61] contains 9.9k question-answer pairs, with video data from MSVD-QA. Questions are divided into what, who, and other. We select the multi-choice task in this dataset for evaluation.
    SUTD-TrafficQA [56] contains 62k question-answer pairs and 10k videos. This dataset includes six challenging traffic-related reasoning tasks, with each question providing four candidate answers, only one of which is correct. To solve these questions, the model needs strong common sense and logical reasoning abilities.
    Social-IQ [66] contains 7.5k question-answer pairs and 1.25k videos. All questions are multi-choice tasks and are specific to in-the-wild social situations, aiming to evaluate artificial social intelligence through question-answering. This dataset has a larger proportion of questions starting with why and how, which often require strong reasoning abilities.
    NExT-QA [49] contains 52k manually annotated question-answer pairs and 5.44k videos. This dataset contains three types of questions: Causal, Temporal, and Descriptive. The Causal questions are designed to explain actions or uncover the intentions of previously occurring actions. The Temporal questions are designed to assess the model’s ability to reason about temporal relationships between actions. The Descriptive questions focus on the description of the scene in the video. All questions are divided into multi-choice and open-ended tasks, where the latter requires the model to generate answers in short phrases.

    4.2 Implementation Details

    Experiment Settings. To split the datasets, we use the standard training, validation, and test sets provided in each dataset. Similar to existing works, we take the open-ended questions in MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA datasets as open-ended classification tasks. We treat the open-ended questions in the NExT-QA dataset as open-ended generation tasks, truncating each answer to a maximum length of 6 and considering each word in the training set as a member of the answer vocabulary.
    We use a ResNet with 152 layers to extract the appearance-motion features, and the maximum sampling scale \(N\) is set to 3 by default. The feature dimension \(d\) is set to 512. During network training, each VideoQA dataset has a pre-defined vocabulary composed of the top \(K\) most frequent words in the training set. We set the \(K\) value for the MSVD-QA dataset to 4,000 and that for other datasets to 8,000. We use the Adam optimizer to train the network, with the initial learning rate set to \(1\times 10^{-4}\) . When the loss has not decreased for 5 epochs, the learning rate is halved. The maximum number of epochs is set to 50, and the batch size is set to 32. We implement our model with the PyTorch deep learning library on a PC with a single GTX 1080 Ti GPU.
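    For reference, a minimal sketch of this training setup in PyTorch might look as follows; the model, data loader, and loss function are passed in as placeholders, and only the hyper-parameters stated above (Adam, initial learning rate 1e-4, halving after 5 stagnant epochs, at most 50 epochs, batch size 32) are taken from the text. Using a plateau scheduler on the epoch loss is one way to realize the described schedule.

```python
import torch

# Minimal training-loop sketch matching the schedule described above.
def train(model, train_loader, compute_loss, max_epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # halve the learning rate when the epoch loss has not decreased for 5 epochs
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for batch in train_loader:            # batches of size 32
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        scheduler.step(epoch_loss)            # plateau detection on the training loss
```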
    Evaluation Metrics. To evaluate the performance of a model, we use different metrics depending on the task. For open-ended classification and multi-choice tasks, we adopt accuracy (%). For repetition counting tasks, we use the MSE between the predicted answer and the ground-truth answer. For open-ended generation tasks, we compute the Wu-Palmer similarity (WUPS) score [49] to evaluate the quality of the generated answers.

    4.3 Comparison with State-of-the-art Methods

    In this section, we compare our HMRNet with the state-of-the-art methods on eight popular VideoQA datasets. Unless otherwise stated, the reported results are those obtained from experiments done on the testing set.
    Table 1 reports the comparison results on MSVD-QA, MSRVTT-QA, and ActivityNet-QA datasets. For videos with varying durations and complex scenes, our proposed HMRNet still achieves the best performance, i.e., 41.8% (+0.6%), 39.5% (+0.9%), and 40.6% (+3.3%) on MSVD-QA, MSRVTT-QA, and ActivityNet-QA datasets, respectively. The results of these experiments demonstrate that our method is well adapted to both short and long untrimmed video question answering.
    Table 1. The Comparison Experiments on the MSVD-QA, MSRVTT-QA, and ActivityNet-QA Datasets. The best and second-best results are bolded and underlined, respectively.

| Method | MSVD-QA | MSRVTT-QA | ActivityNet-QA |
| --- | --- | --- | --- |
| E-SA [65] | 27.6 | 29.3 | 31.8 |
| MAR-VQA [74] | - | - | 34.6 |
| CAN [63] | 32.4 | 33.2 | 35.4 |
| HGA [20] | 34.7 | 35.5 | - |
| GMN [11] | 35.4 | 36.1 | - |
| MHMAN [62] | 35.6 | 34.6 | 37.1 |
| Bridge2Answer [39] | 37.2 | 36.9 | - |
| HAIR [34] | 37.5 | 36.9 | - |
| HOSTR [4] | 39.4 | 35.9 | - |
| DSAVS [36] | 37.2 | 35.8 | - |
| ACRTransformer [72] | - | - | 37.3 |
| DualVGR [48] | 39.0 | 35.5 | - |
| MHN [40] | 40.4 | 38.6 | - |
| PKOL [68] | 41.1 | 36.9 | - |
| HQGA [50] | 41.2 | 38.6 | - |
| SSML [1] | 35.1 | 35.1 | - |
| ClipBERT [27] | - | 37.4 | - |
| HMRNet (ours) | 41.8 | 39.5 | 40.6 |
    Table 2 shows the comparison results on the TGIF-QA dataset. Our proposed HMRNet achieves the best performance on the Action (+2.9%), Trans (+1.7%), and Count ( \(-0.15\) ) tasks. The improvement is particularly noticeable for the Action and Trans tasks, which challenge the modeling of temporal relations between videos and questions. PKOL [68] obtained the best performance on the FrameQA task; in contrast, our method achieves improvements of 12.7 and 9.7 percentage points over PKOL on the Action and Trans tasks, respectively. The better performance of HMRNet can be attributed to its ability to extract multimodal interactive relations from questions and videos and to search for answer clues at different semantic levels.
    Table 2. The Comparison Experiment on the TGIF-QA Dataset. The best and second-best results are bolded and underlined, respectively.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) |
| --- | --- | --- | --- | --- |
| L-GCN [16] | 74.3 | 81.1 | 56.3 | 3.95 |
| QueST [19] | 75.9 | 81.0 | 59.7 | 4.19 |
| HCRN [25] | 75.0 | 81.4 | 55.9 | 3.82 |
| GMN [11] | 73.0 | 81.7 | 57.5 | 4.16 |
| ACRTransformer [72] | 75.8 | 81.6 | 57.7 | 4.08 |
| Bridge2Answer [39] | 75.9 | 82.6 | 57.5 | 3.71 |
| HAIR [34] | 77.8 | 82.3 | 60.2 | 3.88 |
| HOSTR [4] | 75.0 | 83.0 | 58.0 | 3.65 |
| MSPAN [12] | 78.4 | 83.3 | 59.7 | 3.57 |
| CPTC [7] | 78.1 | 84.0 | 57.4 | 3.98 |
| MASN [44] | 84.4 | 87.4 | 59.5 | 3.75 |
| PKOL [68] | 74.6 | 82.8 | 61.8 | 3.67 |
| HQGA [50] | 76.9 | 85.6 | 61.3 | - |
| MHN [40] | 83.5 | 90.8 | 58.1 | 3.58 |
| SiaSamRea [64] | 79.7 | 85.3 | 60.2 | 3.61 |
| ClipBERT [27] | 82.8 | 87.8 | 60.3 | - |
| HD-VILA [57] | 84.3 | 90.0 | 60.5 | - |
| HMRNet (ours) | 87.3 | 92.5 | 60.0 | 3.42 |
    We note here that, in addition to using the necessary encoders to extract video and question features, some of the compared methods apply other complex feature extraction procedures to promote the video–question interaction. For example, Bridge2Answer [39] additionally extracted compositional semantic relations between words using an NLP tool [38]. GMN [11], HAIR [34], HOSTR [4], MASN [44], PKOL [68], and HQGA [50] applied a Faster R-CNN [43] for object detection per frame to obtain fine-grained spatio-temporal features. Conversely, ACRTransformer [72] adopted a BSN [33] to specifically locate high-action frames in a given video, and MSPAN [12] extracted motion features in a frame-by-frame style despite the large parameter size of the motion network. Furthermore, PKOL [68] used additional prior knowledge from a massive corpus and thus achieved strong results on the FrameQA task. By contrast, our proposed relation-oriented interaction module can effectively extract multimodal interactive representations without these complex feature pre-processing steps. Moreover, most of the previous methods applied dense sampling to the input video. For example, HCRN [25], Bridge2Answer [39], and HQGA [50] sampled 8 clips with each comprising 16 frames; E-SA [65], MAR-VQA [74], DSAVS [36], and DualVGR [48] sampled 20 clips for long videos. However, with fewer clips sampled (only seven clips are sampled when \(N\) is set to 3 by default), our proposed HSM is able to search for answer cues across multiscale semantics to achieve the best performance.
    Tables 1 and 2 also provide comparison results with recent large-scale pre-training approaches. Specifically, SSML [1] was pre-trained on the HowTo100M video–text corpus. ClipBERT [27] proposed a sparse sampling strategy and used the COCO Captions and Visual Genome Captions datasets for pre-training. With the same pre-training approach as in [27], SiaSamRea [64] proposed a Siamese sampling mechanism to enhance answer reasoning. HD-VILA [57] built a large-scale, high-resolution video–text dataset to improve reasoning performance. Moreover, their models are huge, complex, and demand ample computing resources to implement training. As a comparison, our proposed HMRNet is compact and does not require any extra video–text corpus for pre-training to achieve superior performance.
    Nevertheless, we also recognize that large-scale pre-training indeed largely improves the performance of several methods [59, 67] on the above datasets. For example, [59] built a domain-specific dataset for pre-training and achieved a higher recognition rate, and [67] collected a very large video–text corpus called YT-Temporal-180M for pre-training and achieved a performance superior to that of our approach. However, we believe that our method, which focuses on designing compact and efficient components for the problems present in VideoQA, is equally valuable. Moreover, it provides a way to promote broader research, especially while powerful computational resources are still expensive to access.
    Tables 3–5 show evaluation results on the Youtube2Text-QA, SUTD-TrafficQA, and Social-IQ datasets, respectively. For the Youtube2Text-QA dataset, we achieve an improvement of 3% in overall performance compared to the previously reported best result. For the SUTD-TrafficQA dataset, we obtain the highest accuracy, i.e., 40.91%. We also show the performance of our HMRNet on the binary and four-class classification tasks of the Social-IQ dataset: here, HMRNet improves the recognition rate by 10.1% and 10.4%, respectively, compared with Tensor-MFN [66]. This outstanding performance suggests that our proposed HMRNet has the ability to perform causal reasoning in in-the-wild social situations and complex traffic scenarios.
    Table 3. Test Results on the Youtube2Text-QA Dataset

| Method | Youtube2Text-QA |
| --- | --- |
| EVQA [2] | 47.6 |
| r-ANL [61] | 52.0 |
| HME [5] | 80.8 |
| L-GCN [16] | 83.9 |
| HAIR [34] | 85.3 |
| DSAVS [36] | 85.0 |
| HMRNet (ours) | 88.3 |
    Table 4. Test Results on the SUTD-TrafficQA Dataset

| Method | SUTD-TrafficQA |
| --- | --- |
| VIS+LSTM [42] | 29.91 |
| BERT-VQA [60] | 33.68 |
| TVQA [28] | 35.16 |
| HCRN [25] | 36.49 |
| Eclips [56] | 37.05 |
| HMRNet (ours) | 40.91 |
    Table 5. Test Results on the Social-IQ Dataset

| Method | Binary | Four-class |
| --- | --- | --- |
| LMN [47] | 61.1 | 31.8 |
| FVTA [32] | 60.9 | 31.0 |
| MDAM [24] | 62.2 | 30.7 |
| TVQA [28] | 60.0 | 30.0 |
| Tensor-MFN [66] | 64.8 | 34.1 |
| HMRNet (ours) | 74.9 | 44.5 |
    Table 6 shows evaluation results for the open-ended generation task conducted on the NExT-QA dataset. Similar to [49], we compared experimental results across all question types. \(\mathrm{WUPS}_C\) , \(\mathrm{WUPS}_T\) , and \(\mathrm{WUPS}_D\) denote the generated answer scores for Causal, Temporal, and Descriptive questions, respectively. \(\mathrm{WUPS}\) denotes the overall score for all questions. The higher the score, the more accurate the generated answer. Our method achieves better overall performance compared with other state-of-the-art approaches. Table 7 shows evaluation results for the multi-choice task conducted using the NExT-QA dataset. We experiment with the multi-choice task on the validation set. \(\mathrm{Acc}_C\) , \(\mathrm{Acc}_T\) , and \(\mathrm{Acc}_D\) denote the recognition rates for Causal, Temporal, and Descriptive questions, respectively. \(\mathrm{Acc}\) denotes the overall accuracy of all questions. We denote the approach introduced in reference [52] as VTQG, which encompasses three graphical reasoning architectures: VQG, VTG, and TQG. The results demonstrate that our HMRNet outperforms the majority of existing methods across all question types and achieves comparable performance to VTQG.
    Table 6. Comparison of Methods on the NExT-QA Dataset for Open-ended Generation Tasks. The best results are bolded.

| Method | \(\mathrm{WUPS}_C\) | \(\mathrm{WUPS}_T\) | \(\mathrm{WUPS}_D\) | \(\mathrm{WUPS}\) |
| --- | --- | --- | --- | --- |
| STVQA [17] | 15.24 | 18.03 | 47.11 | 23.04 |
| HME [5] | 15.78 | 18.40 | 50.03 | 24.06 |
| HCRN [25] | 16.05 | 17.68 | 49.78 | 23.92 |
| UATT [58] | 16.73 | 18.68 | 48.42 | 24.25 |
| HGA [20] | 17.98 | 17.95 | 50.84 | 25.18 |
| HMRNet (ours) | 18.51 | 19.64 | 52.18 | 26.22 |
    Table 7. Comparison of Methods on the NExT-QA Dataset for Multi-choice Tasks. The best results are bolded.

| Method | Why | How | \(\mathrm{Acc}_C\) | Prev&Next | Present | \(\mathrm{Acc}_T\) | Count | Location | Other | \(\mathrm{Acc}_D\) | \(\mathrm{Acc}\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EVQA [2] | 28.38 | 29.58 | 28.69 | 29.82 | 33.33 | 31.27 | 43.50 | 43.39 | 38.36 | 41.44 | 31.51 |
| PSAC [30] | 35.81 | 29.58 | 34.18 | 28.56 | 35.75 | 31.51 | 39.55 | 67.90 | 35.41 | 48.65 | 35.57 |
| PSAC+ [30] | 35.03 | 29.87 | 33.68 | 30.77 | 35.44 | 32.69 | 38.42 | 71.53 | 38.03 | 50.84 | 36.03 |
| Co-Mem [6] | 36.12 | 32.21 | 35.10 | 34.04 | 41.93 | 37.28 | 39.55 | 67.12 | 40.66 | 50.45 | 38.19 |
| ST-TP [18] | 37.58 | 32.50 | 36.25 | 33.09 | 40.87 | 36.29 | 45.76 | 71.53 | 44.92 | 55.21 | 39.21 |
| HGA [20] | 36.38 | 33.82 | 35.71 | 35.83 | 42.08 | 38.40 | 46.33 | 70.51 | 46.56 | 55.60 | 39.67 |
| HME [5] | 39.14 | 34.70 | 37.97 | 34.35 | 40.57 | 36.91 | 41.81 | 71.86 | 38.36 | 51.87 | 39.79 |
| HCRN [25] | 39.86 | 36.90 | 39.09 | 37.30 | 43.89 | 40.01 | 42.37 | 62.03 | 40.66 | 49.16 | 40.95 |
| VTQG [52] | 43.14 | 39.82 | 42.27 | 40.25 | 47.21 | 43.11 | 46.89 | 74.58 | 52.46 | 59.59 | 45.24 |
| HMRNet (ours) | 42.05 | 39.97 | 41.50 | 38.46 | 47.51 | 42.18 | 46.33 | 77.63 | 54.75 | 61.52 | 44.84 |

    Why and How are sub-types of Causal questions; Prev&Next and Present are sub-types of Temporal questions; Count, Location, and Other are sub-types of Descriptive questions.
    Similarly, as we have analyzed in Tables 1 and 2, our method offers the advantage of not requiring complex feature pre-processing steps, such as those used in VTQG that rely on object detectors on each frame and seq-NMS to generate target trajectory features. Additionally, our method achieves powerful performance with fewer video clips, as compared to VTQG which samples a relatively large number of 16 clips. These benefits are due to our inter- and intra-modal relationship modeling and multiscale collaborative reasoning structure. However, it is important to note that object-level interactions, as explored in methods [52] and [51], offer advantages, particularly for questions that involve relational reasoning between objects in the NExT-QA dataset. Also, the approach in [51] of constructing graph interactions of objects and leveraging BERT fine-tuning to enhance performance is worth considering for our work.
    To further demonstrate the advantages of our proposed HMRNet, in Table 8, we compare it with several methods in terms of parameter size and computational efficiency. The last four methods in the table, i.e., ClipBERT [27], VQA-T [59], MERLOT [67], and HD-VILA [57], are recent large-scale pre-training methods. We use nParams and GFLOPs [15, 56] to denote the number of learnable parameters and the computational efficiency of the model, respectively. As shown in the table, while leveraging off-the-shelf features for answer reasoning, our proposed HMRNet has the lowest number of parameters. Furthermore, HMRNet requires significantly fewer parameters than HME and HD-VILA. In terms of computational efficiency, our method also has a greater advantage compared with other methods. The GFLOPs of our HMRNet are only 0.25, which is remarkably less than those of MERLOT and HD-VILA, and less than half that of VQA-T.
    Table 8. Comparison of the Number of Parameters Required and Computational Efficiency Across Methods. The best results are bolded.

| Method | PSAC | HME | HCRN | MHN | HQGA | ClipBERT | VQA-T | MERLOT | HD-VILA | HMRNet (ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nParams | 39.1M | 44.8M | 42.9M | 16.5M | 12.7M | 85.1M | 48.8M | 85.1M | 85.1M | 10.5M |
| GFLOPs | 0.48 | 0.63 | 0.74 | 0.78 | 1.43 | 3.48 | 0.68 | 134.54 | 17.84 | 0.25 |

    4.4 Ablation Study

    To justify our methodology, we run a series of ablation experiments on the TGIF-QA and NExT-QA datasets. With the default HMRNet (referred to here as the default model) used above as the baseline, we first analyze the impact of different components of our model and then analyze the impact of some important hyper-parameters on the experimental results.

    4.4.1 Impact of Feature Encoders.

    In Tables 1–8, we demonstrate the superiority of our proposed method and the simplicity of its design. Here, we analyze the effect of changing the feature encoders on the experimental results. Specifically, we replace the video encoder with a lightweight ResNet backbone with 18 layers and the question encoder with the BERT [23] model, respectively. Experimental results are shown in Table 9. It can be seen that when using a video encoder with much weaker encoding capability (Video Enc. \(^-\) ), our method is still able to achieve performance comparable to the default model. This indicates that our proposed method is robust to the choice of visual encoder. On the other hand, when replacing the question encoder with the BERT model (Question Enc. \(^+\) ), which has more learnable parameters and a relatively strong semantic encoding capability, our method shows improvements on all tasks. This result demonstrates that our method can be readily reused with different question encoders and has the potential to be combined with existing BERT-style large-scale pre-trained models.
    Table 9. Impact of Feature Encoders on Performance for Experiments Run on the TGIF-QA Dataset. The best results are bolded.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) |
| --- | --- | --- | --- | --- |
| Video Enc. \(^-\) | 87.1 | 92.4 | 58.5 | 3.42 |
| Question Enc. \(^+\) | 87.5 | 93.0 | 60.3 | 3.40 |
| HMRNet (default) | 87.3 | 92.5 | 60.0 | 3.42 |

    4.4.2 Analysis of the Relation-oriented Interaction Module.

    In our work, we propose a relation-oriented interaction module that considers both intra- and inter-modal interactions and their relations in a compact and unified framework. Here, we conduct a series of experiments to analyze the implications of the relation between intra- and inter-modal interactions for multimodal learning. Specifically, we first remove the relation-oriented interaction module to evaluate its impact. Then, we add only the intra-modal or only the inter-modal interactions. Finally, we analyze the impact of exchanging the order of intra- and inter-modal interactions in the relation-oriented interaction module. We also analyze the benefits of our redesigned attention heads in the inter-modal interactions, i.e., by comparing against a variant that does not share attention-head parameters between the question-to-video and video-to-question processes.
    The results are reported in Table 10. Without the relation-oriented interaction (without ROI) module, the performance drops significantly for all tasks. However, with the addition of intra-modal interactions (with intra), the performance on most tasks greatly improves; the performance increases across tasks are even more significant with the addition of just the inter-modal interactions (with inter). These results illustrate that extracting both intra-modal and inter-modal interactions is important for multimodal learning and that the inter-modal interaction is essential for solving VideoQA tasks. Additionally, when we change the order of inter- and intra-modal interactions (with exc.) in the default module, the performance on all tasks drops. This suggests that the intra-modal interactions can refine the inter-modal interactions to obtain more expressive multimodal interactive relations, whereas a pre-focus on intra-modal interactions is not beneficial. One possible explanation is that there are elements in both the video and the question that are not relevant to answer inference. If we focus on the intra-modal interactions first, this answer-irrelevant noise will inevitably be introduced and weaken the representation of each modality, which in turn affects the extraction of inter-modal interactions. Conversely, if the related content between modalities is extracted first, the subsequent intra-modal interactions operate on more distilled representations. The comparison with the non-shared variant (HMRNet*) demonstrates that our redesigned attention head achieves slightly superior overall performance while reducing the model’s parameters and complexity. This observation further underscores its capability to maintain the semantic consistency of modal features during inter-modal interactions.
    Table 10. Performance on the TGIF-QA and NExT-QA Datasets in Ablation Experiments Assessing the Relation-oriented Interaction Module. The best results are bolded. Action, Trans, FrameQA, and Count are TGIF-QA tasks; the WUPS and Acc columns report NExT-QA results.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) | \(\mathrm{WUPS}_C\) | \(\mathrm{WUPS}_T\) | \(\mathrm{WUPS}_D\) | \(\mathrm{Acc}_C\) | \(\mathrm{Acc}_T\) | \(\mathrm{Acc}_D\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| without ROI | 67.8 | 79.2 | 58.1 | 3.46 | 15.11 | 16.79 | 48.32 | 35.02 | 37.97 | 51.22 |
| with intra | 76.6 | 82.5 | 59.4 | 3.58 | 17.31 | 18.59 | 52.01 | 36.98 | 38.59 | 50.45 |
| with inter | 86.6 | 92.3 | 59.9 | 3.45 | 17.92 | 18.82 | 51.71 | 41.20 | 41.69 | 55.86 |
| with exc. | 84.9 | 92.0 | 59.8 | 3.53 | 17.12 | 19.30 | 51.02 | 38.20 | 41.38 | 50.58 |
| HMRNet* | 87.0 | 92.5 | 60.0 | 3.43 | 18.34 | 19.68 | 52.14 | 41.58 | 42.18 | 61.36 |
| HMRNet (default) | 87.3 | 92.5 | 60.0 | 3.42 | 18.51 | 19.64 | 52.18 | 41.50 | 42.18 | 61.52 |

    4.4.3 Analysis of the Hierarchical Synergistic Memory Unit.

    Given different question types, we build on multiscale sampling and propose an HSM unit to complement and fuse multiscale features in a bottom-up and top-down approach to obtain a more confident representation of answer clues. Here, we first remove the HSM to analyze its importance in improving answer reasoning. Then, we use a single-scale video feature as input to analyze its impact. We also analyze the necessity of bottom-up and top-down operations in the default model.
    As reported in Table 11, without the hierarchical synergistic memory unit (without HSM), performance decreases on all tasks, especially on the NExT-QA dataset, which requires stronger spatio-temporal inference than the other datasets. In addition, a single scale (with single scale \(n\) = 1, 2, or 3) of input also leads to decreased performance on all tasks. These experimental results illustrate the effectiveness of our proposed HSM in enhancing answer reasoning and its synergistic effect in complementing and fusing multiscale interactive semantics. Moreover, when only bottom-up (with bottom-up) or top-down (with top-down) operations are applied in the default model, nearly all tasks show a degradation in performance. This indicates the necessity of bottom-up and top-down operations to incorporate the semantics at all hierarchy levels.
    Table 11. Performance on the TGIF-QA and NExT-QA Datasets in Ablation Experiments Assessing the HSM Unit. The best results are bolded. Action, Trans, FrameQA, and Count are TGIF-QA tasks; the WUPS and Acc columns report NExT-QA results.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) | \(\mathrm{WUPS}_C\) | \(\mathrm{WUPS}_T\) | \(\mathrm{WUPS}_D\) | \(\mathrm{Acc}_C\) | \(\mathrm{Acc}_T\) | \(\mathrm{Acc}_D\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| without HSM | 85.8 | 91.8 | 59.0 | 3.52 | 15.72 | 18.20 | 50.07 | 35.98 | 35.11 | 50.32 |
| with single scale n=1 | 85.9 | 91.9 | 59.9 | 4.02 | 16.99 | 19.61 | 51.66 | 40.35 | 42.00 | 54.44 |
| with single scale n=2 | 86.9 | 92.2 | 59.2 | 3.92 | 16.94 | 19.39 | 51.49 | 40.62 | 41.00 | 53.28 |
| with single scale n=3 | 85.9 | 92.0 | 58.3 | 3.94 | 16.90 | 19.37 | 51.71 | 38.97 | 39.76 | 52.77 |
| with bottom-up | 86.2 | 91.9 | 59.5 | 3.50 | 17.83 | 18.75 | 53.12 | 41.39 | 42.06 | 55.98 |
| with top-down | 86.7 | 91.9 | 59.1 | 3.70 | 16.90 | 18.01 | 49.90 | 36.94 | 38.71 | 51.35 |
| HMRNet (default) | 87.3 | 92.5 | 60.0 | 3.42 | 18.51 | 19.64 | 52.18 | 41.50 | 42.18 | 61.52 |

    4.4.4 Impact of Tunable Hyper-parameters.

    To further analyze the stability of our method, we experiment with the maximum scale value \(N=\lbrace 1, 2, 3, 4\rbrace\) and the feature dimension \(d=\lbrace 128, 256, 512,\) \(1024\rbrace\) while using the default values of other hyper-parameters.
    The scale value \(N\) sets the scope of our multiscale sampling. It determines how much information is provided to the model and whether irrelevant noise is introduced. Table 12 shows the variation in performance with different maximum scale values \(N\) . For different tasks, \(N\) should be set to balance the amount of information provided against model performance. We reach such a balance at \(N = 3\) . Notably, even when \(N = 2\) , so that only three clips are sampled, our method outperforms most previous methods (see Tables 2, 6, and 7). The feature dimension \(d\) determines the representational ability and complexity of the model. Figure 4 shows the impact of different \(d\) values on performance. The performance on each task fluctuates relatively consistently as \(d\) changes, and we achieve the best overall performance at \(d = 512\) .
    Table 12. Performance on the TGIF-QA and NExT-QA Datasets with Different \(N\) Values. The best results are bolded. Action, Trans, FrameQA, and Count are TGIF-QA tasks; the WUPS and Acc columns report NExT-QA results.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) | \(\mathrm{WUPS}_C\) | \(\mathrm{WUPS}_T\) | \(\mathrm{WUPS}_D\) | \(\mathrm{Acc}_C\) | \(\mathrm{Acc}_T\) | \(\mathrm{Acc}_D\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N = 1 | 86.7 | 90.0 | 59.7 | 4.05 | 17.54 | 19.38 | 51.84 | 39.89 | 42.12 | 55.73 |
| N = 2 | 86.0 | 91.9 | 59.2 | 3.59 | 16.90 | 19.64 | 51.92 | 40.05 | 41.38 | 58.04 |
| N = 3 (default) | 87.3 | 92.5 | 60.0 | 3.42 | 18.51 | 19.64 | 52.18 | 41.50 | 42.18 | 61.52 |
| N = 4 | 85.8 | 90.5 | 59.0 | 3.45 | 17.48 | 19.83 | 51.90 | 41.31 | 43.00 | 59.07 |
    Fig. 4. Performance on the TGIF-QA and NExT-QA datasets with different \(d\) values.

    4.5 Qualitative Evaluation

    Three challenging VideoQA examples are shown in Figure 5. To answer question (a), the model needs to extract multimodal interactions and search for clues at different temporal scales. To answer question (b), the model requires spatio-temporal relational reasoning at a specific scale, and to answer question (c), the model needs to capture spatial details within temporal relations. Our proposed HMRNet infers the correct answer in all three cases, demonstrating its superiority.
    Fig. 5. Qualitative comparison of different methods on three examples: (a) a multi-choice task from the TGIF-QA dataset; (b) a multi-choice task and (c) an open-ended generation task, both from the NExT-QA dataset.
    Figure 6 visualizes the reasoning behaviors of the relation-oriented interaction and the HSM in HMRNet. In Figure 6(a), the large connection weight of “adjust sunglasses” between the question and the video indicates that our inter-modal interaction extracts semantically relevant features across modalities. The large connection weights between “what does the woman do” and “adjust sunglasses,” as well as between “after” and “adjust sunglasses,” suggest that the proposed intra-modal interaction further refines the inter-modal interactions and yields more expressive multimodal representations. In Figure 6(b), the fact that the semantics at every scale contribute to reasoning indicates the synergistic effect of our HSM and shows that finer scales are relatively more important for the action counting task.
    Fig. 6. A visual analysis of HMRNet. (a) The behavior of inter- and intra-modal interactions on the largest temporal scale. For illustrative purposes, we use the attention matrices in Equation (7) to compute the average weight along the temporal dimension between each clip and each word group. The two highest average weights of each word group are shown in the figure, with colored dashed lines connecting them to their respective video clips. For intra-modal interaction, we use the matrices \(\mathbf {R}_X\) and \(\mathbf {R}_Q\) to compute the two highest connection weights (marked with black curves) on clips and word groups, respectively. (b) The value in each box represents the amount of information passed to the next hierarchical level and hence the importance of that semantic level; these values are obtained by averaging over the bottom-up and top-down memory-based interaction processes in Equation (18).
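    For readers who wish to reproduce the weight computation behind Figure 6(a), the sketch below illustrates the averaging-and-top-2 step; the tensor layout (one attention matrix per clip, shaped clips × frames × word groups) is an assumption for illustration and may differ from the actual shape of the matrices in Equation (7).

```python
import torch


def top2_clip_links(attn: torch.Tensor) -> torch.Tensor:
    """Pick, for each word group, the two clips with the highest average
    inter-modal attention weight.

    `attn` is assumed to have shape (num_clips, frames_per_clip, num_word_groups);
    this only illustrates the averaging-and-top-2 selection used for Figure 6(a).
    """
    # Average over the temporal (frame) dimension inside each clip.
    avg = attn.mean(dim=1)                  # (num_clips, num_word_groups)
    # For each word group, take the indices of the two most attended clips.
    top2 = avg.topk(k=2, dim=0).indices     # (2, num_word_groups)
    return top2.t()                         # (num_word_groups, 2)


# Example with 4 clips, 8 frames per clip, and 3 word groups.
links = top2_clip_links(torch.rand(4, 8, 3))
print(links.shape)  # torch.Size([3, 2])
```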

    5 Conclusion

    This article proposes a novel HMRNet for VideoQA. We focus on the characteristics of VideoQA tasks and follow a simple yet effective design to address existing problems. Specifically, we propose a relation-oriented interaction module that explores both inter- and intra-modal interactions and the relation between them, thereby achieving effective multimodal learning. We further propose an HSM unit that complements and fuses interactive semantics at different hierarchical levels through a bottom-up and top-down memory-based interaction scheme, enabling synergy-enhanced reasoning. Moreover, our HMRNet has fewer parameters and higher computational efficiency than other methods, and it remains robust and reusable under variations of the feature encoders. On eight VideoQA datasets, our method outperforms existing state-of-the-art methods. In future research, we will consider extracting multimodal interaction relations at a finer granularity, e.g., by adaptively learning object regions or keyframe segments in videos, to further improve answer reasoning.

