
Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering

Published: 11 December 2023
Abstract

    Video question answering (VideoQA) is challenging as it requires reasoning about natural language and multimodal interactive relations. Most existing methods apply attention mechanisms to extract interactions between the question and the video or to extract effective spatio-temporal relational representations. However, these methods neglect the implication of relations between intra- and inter-modal interactions for multimodal learning, and they fail to fully exploit the synergistic effect of multiscale semantics in answer reasoning. In this article, we propose a novel hierarchical synergy-enhanced multimodal relational network (HMRNet) to address these issues. Specifically, we devise (i) a compact and unified relation-oriented interaction module that explores the relation between intra- and inter-modal interactions to enable effective multimodal learning; and (ii) a hierarchical synergistic memory unit that leverages a memory-based interaction scheme to complement and fuse multimodal semantics at multiple scales to achieve synergistic enhancement of answer reasoning. With careful design of each component, our HMRNet has fewer parameters and is computationally efficient. Extensive experiments and qualitative analyses demonstrate that the HMRNet is superior to previous state-of-the-art methods on eight benchmark datasets. We also demonstrate the effectiveness of the different components of our method.

    1 Introduction

    Understanding multimodal information in the real world is a significant manifestation of a machine’s progress toward cognitive intelligence. Thanks to the great advances made in computer vision [15] and natural language processing [46], researchers are paying more attention to vision–language tasks, e.g., image/video captioning [8, 9], language video localization [53, 70], and visual question answering [2, 18]. Among these, a particularly challenging task is video question answering (VideoQA). Compared with image question answering (ImageQA) [2, 32, 42, 71], VideoQA is more difficult because it not only needs to model the semantic connection between the question and each frame but also needs to extract complex interactive relations between the question and the temporal content of the video.
    Many existing VideoQA methods [5, 13, 19, 25, 28, 34, 39, 47, 69, 72] adopted recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to encode linguistic embeddings and video features, respectively. Several works [36, 58, 63] designed attention schemes to extract the semantics of important words and question-related spatio-temporal representations. To obtain more expressive features, some methods proposed the use of global self-attention mechanisms [11, 20, 30] or designed extra memory modules [24, 62] to augment interactive feature encoding capabilities. Recently, with the great success of large-scale pre-training models [23, 46] seen in natural language processing, some works [27, 45, 57, 59, 64, 67] used BERT-style [23] pre-training methods for VideoQA.
    While previous studies have yielded promising results, the majority of them focused on learning interactions between the question and the video. For example, question-guided attention methods [18, 19, 55] focused on extracting video features that were related to words or sentences. Co-attention [7, 30] or multistep attention [5, 13] methods extracted the relation between words and frames as well as that between sentences and clips. Furthermore, memory-augmented networks [6, 24, 62] have learned the long-term dependencies between the question and the video. Going in another direction, several other approaches have focused on modeling spatio-temporal relations of video more effectively. For example, Seo et al. [44] explored the high-level relation between appearance and motion information. Some works [4, 34, 68] leveraged object detection across video frames to acquire fine-grained interactive representations. Moreover, works [21, 49, 60] employed large-scale language models [23, 46] to extract deep spatial-temporal contextual semantics. However, these methods have failed to effectively explore the impact of the relationship between intra- and inter-modal interactions on multimodal learning. While recent BERT-style approaches [1, 27, 57, 59, 64, 67] have achieved promising inferential performance by developing comparable independent transformer encoding structures to extract intra- and inter-modal interactions, they still give inadequate consideration to the relationship between these two types of interaction during multimodal learning. Moreover, these approaches rely heavily on additional large-scale pre-training, and the resulting models are excessively large, making training challenging. As shown in Figure 1(a), to answer the question, we need to establish the inter-modal connections and also capture the contextual clues within each modality. Moreover, it is essential to establish the relationship between the inter-modal and intra-modal interactions to reason about the answer “Guitar”.
    Fig. 1. An example of VideoQA. (a) Our model leverages inter-modal interactions (marked with colored and dashed lines) and intra-modal interactions (marked with black curves), along with their semantic dependency, to infer answer clues such as “what is”, “the man”, and “playing” when provided with multiple consecutive video clips and a question. (b) Our model explores the synergistic effects of different scales of video clip inputs on answer inference, and determines that coarser scales offer more advantageous reasoning for this example (indicated by the importance values in the pink box during the answer inference process).
    The majority of existing methods infer video information at a single temporal scale by dense sampling and fail to exploit the synergistic complementarity of the multiscale interactive semantics of questions and videos for answer reasoning. Given an arbitrary question, one may better infer the correct answer from visual contents at different temporal scales. As shown in Figure 1(b), the information presented at a coarser scale (e.g., scale 1 or 2, which involves fewer video frames from the beginning to the end) offers a better understanding of the multimodal interaction. Nevertheless, for a different type of task, e.g., counting the number of repeated actions, visual information at a finer scale could also be helpful. Although a few studies [7, 12, 19, 31, 34, 35] have investigated the usefulness of multiscale information for answer reasoning, they all constructed multiscale information from a specific scale and did not sufficiently explore the synergistic effect of multiscale semantics. One of our recent works [40] used questions to recursively extract relationships with multiscale visual content. However, this work ignored the multiscale semantics of the question and failed to consider the synergistic reasoning of multiscale interactive relations.
    To solve the above problems, we propose a novel hierarchical synergy-enhanced multimodal relational network (HMRNet) for VideoQA. Our HMRNet comprises two main components: (i) a relation-oriented interaction module for multimodal learning by exploring the relation between intra- and inter-modal interactions in a compact and unified framework; and (ii) a hierarchical synergistic memory (HSM) unit that can enhance the answer reasoning by exploiting the synergy effect of multiscale semantics. The former module redesigns the transformation of the attention head and builds adaptive self-attentive learning to implement cross-modal interaction and intra-modal reasoning, respectively. This module uses parameter sharing to parallelize the extraction of multimodal interactions at multiple temporal scales, preserving efficiency and compactness. The latter component develops a bottom-up and top-down memory-based interaction scheme to complement and fuse multimodal semantics at different hierarchies to achieve multiscale synergistic reasoning.
    Furthermore, our proposed HMRNet has fewer parameters and higher computational efficiency than existing methods. With these advantages, HMRNet can be efficiently extended to different types of VideoQA tasks such as open-ended classification, open-ended generation, repetition counting, and multi-choice QA. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness of our method. Our technical contributions are summarized below.
    We propose a novel HMRNet for VideoQA. With careful design, the HMRNet requires few parameters, is computationally efficient and robust, and can be reused when changing feature encoders.
    A relation-oriented interaction module is proposed to extract both intra- and inter-modal interactions within a compact and unified framework, and we find that intra-modal interactions can build on inter-modal interactions to further improve multimodal learning.
    An HSM unit is proposed to complement and fuse multiscale semantics in a bottom-up and top-down interactive approach to achieve synergistic enhancement of answer reasoning.
    We conduct extensive evaluations on eight benchmark datasets, namely, MSVD-QA, MSRVTT-QA, ActivityNet-QA, TGIF-QA, Youtube2Text-QA, SUTD-TrafficQA, Social-IQ, and NExT-QA. When comparing our HMRNet to other methods, we observe significant improvements in performance on nearly all datasets.

    2 Related Work

    Popular VideoQA methods usually extract question and video features using off-the-shelf models. They also often design various attentive interaction structures to extract multimodal representations for answering.
    Attention mechanisms. Attention mechanisms are widely used in several areas, e.g., improving the accuracy of video classification [37], and enhancing sequential modeling with a transformer architecture [46]. Some VideoQA studies designed various attention schemes to investigate the inter-modal interactions between the question and the video. For example, Xu et al. [55] proposed a method called “gradually refined attention” that uses the question as guidance for the extraction of appearance and motion features. Some works [30, 69] proposed a co-attention model to extract question-relevant video features as well as video-relevant question semantics. Some other works [5, 24, 62] proposed memory-augmented networks to handle the semantic interaction between questions and videos across the long temporal dimension. Zhao et al. [73] extracted interactions between the video and the question from the frame level to the segment level. Jin et al. [22] proposed a question-knowledge-guided spatial-temporal attention model to learn the video representation. Xiao et al. [50] proposed modeling video as a conditional hierarchical structure, where a multi-granular visual representation for language concept alignment is obtained under the guidance of textual clues.
    Several other methods instead focused on extracting more efficient spatio-temporal representations from videos using attention mechanisms. With the inherent structural properties between frames and clips in the video, works [25, 26] extracted the near-term and far-term relations of spatio-temporal representation from the clip level and video level. Park et al. [39] proposed bridged visual-to-visual interaction to incorporate two complementary visual pieces of information, appearance and motion, using the question graph as an intermediate bridge. Some works [4, 44, 68] approached the VideoQA task by leveraging a Fast-RCNN [43] to detect objects-of-interest and acquire fine-grained spatio-temporal relations. Huang et al. [16] proposed introducing the positional information of objects in the video to establish a position-aware graph model, while Wang et al. [48] proposed another dual-visual graph reasoning unit to infer answers in the video. In addition, Zhang et al. [72] proposed an action-centric relation transformer network to emphasize dynamic temporal attributes. Liu et al. [36] proposed a dynamic self-attention model to select important tokens for a more efficient extraction of the interaction between appearance and motion.
    However, the aforementioned approaches have given limited consideration to both inter- and intra-modal interactions, failing to effectively explore the impact of their relationships on multimodal learning. Hence, in this work, we introduce a compact and unified relation-oriented interaction module to address this issue.
    Multiscale methods. The purpose of multiscale methods is usually to allow models to extract features at multiple scales, thereby improving performance on the target task. Multiscale methods are used in a variety of fields, such as salient object detection [43], video action classification [37], and natural language video localization [70]. For VideoQA, Jiang et al. [19] designed a multiscale temporal contextual attention block to extract the multimodal interactions by setting a 1D temporal convolution kernel with different dilation rates. Similarly, Liu et al. [35] proposed to use temporal convolution to down-sample feature maps along the temporal dimension and perform weighted aggregation for features at different scales. Li et al. [31] designed a multiscale relation unit, which captures temporal information by modeling different distances between motions. Liu et al. [34] suggested to use temporal average pooling with different kernel sizes to capture multiscale temporal information, and then aggregate the output of each scale feature with attention guided by the question. Guo et al. [12] proposed building graph networks at different scales and progressively achieved graph fusion in a bottom-up and top-down way to acquire question-relevant visual features. Lastly, Gao et al. [7] proposed a generalized pyramid co-attention structure to extract rich video contextual semantic information.
    However, all of these methods extracted multiscale interactions from a specific scale and failed to exploit the complementarity of multiscale information in the answer reasoning. Therefore, in the present work, we build on multiscale sampling and develop an HSM unit to complement and fuse multimodal interactions at different scales to achieve synergistic reasoning.

    3 Methods

    An overview of the proposed HMRNet is shown in Figure 2. It is worth mentioning that instead of focusing on the exploitation of large-scale pairwise video–text data for pre-training as well as building huge models to enhance the answer reasoning ability (e.g., [57] was pre-trained on an additional corpus with 100 million videos and was required to train more than 200 million parameters), we focus on designing compact and efficient components. We believe that our method is equally valuable and also a way to promote broader research, especially while powerful computational resources are still expensive to access.
    Fig. 2. Overview of the proposed HMRNet for VideoQA. Given input of visual features \(\mathbf {X}^n\) ( \(n\in [1,N]\) ) and question embeddings \(\mathbf {Q}\) , the relation-oriented interaction module considers both intra- and inter-modal interactions and their relations to provide multimodal semantics \(\mathbf {U}^n\) . The HSM unit performs complementarity and fusion at different semantic levels in a bottom-up and top-down memory-based interactive approach to obtain the multiscale synergistic output \(\mathbf {O}\) for answer decoding.
    Specifically, as in previous works, we first extract essential features from the video and the question. We utilize CNNs to extract multiscale appearance-motion features and a GRU network to obtain question embeddings. Then, we design a relation-oriented interaction module to extract the intra- and inter-modal interactions for the question and the video at multiple scales. An HSM unit is built to incorporate the multiscale interactive relations and produce the final representation for answering.

    3.1 Feature Encoders

    Similar to previous studies [7, 8, 10, 29, 36, 50, 62, 68, 69, 72], we first encode essential features for videos and questions. However, different from previous methods that constructed multiscale interactive features from a single densely sampled input, we perform sampling at multiple scales to extract the appearance-motion features and their interactions with the question. For the same number of sampled video clips, our sampling strategy captures both local and global aspects of the events in a video. This rich visual information enables us to better conduct synergistic reasoning about the answers.
    Specifically, for input video \(\mathcal {V}\) , we apply a uniform sampling along the temporal dimension to acquire a group of frames \(\lbrace I_i^n\rbrace _{i=1}^{\tau {T}}\) , where \(n\in [1,N]\) denotes the sampling scale and \(\tau = 2^{n-1}\) is the sampling rate. We adopt 2D and 3D ResNets [14, 15] to encode the frame-wise appearance feature \(\mathbf {V}^n\) and clip-wise motion feature \(\mathbf {M}^n\) , respectively.
    \begin{equation} {\mathbf {V}}^n = \left\lbrace v_i^n | v_i^n\in \mathbb {R}^{2048} \right\rbrace _{i=1}^{\tau {T}}, \end{equation}
    (1)
    \begin{equation} {\mathbf {M}}^n = \left\lbrace m_t^n | m_t^n\in \mathbb {R}^{2048} \right\rbrace _{t=1}^{\tau }, \end{equation}
    (2)
    Each feature vector in \(\mathbf {V}^n\) and \(\mathbf {M}^n\) is fed into linear transformation layers to map the feature into a \(d\)-dimensional space. We concatenate these features along the temporal dimension and add a learnable temporal positional embedding \(\mathbf {P}\in \mathbb {R}^{(\tau T + \tau) \times d}\) [46] to acquire the appearance-motion feature of length \(L_X^n = \tau T + \tau\) as follows:
    \begin{equation} {\mathbf {X}}^n = \left\lbrace x_j^n | x_j^n\in \mathbb {R}^d \right\rbrace _{j=1}^{L_X^n}, \end{equation}
    (3)
    For question and answer candidates, we extract token-wise features with a pre-trained GloVe word model [41]. We map such features into the \(d\) -dimension space and extract sequential features with a bidirectional GRU network. The acquired question embedding \(\mathbf {Q}\) and answer candidate feature \(\mathbf {A}^k\) can be represented as
    \begin{equation} \mathbf {Q}=\lbrace q_j | q_j\in \mathbb {R}^d\rbrace _{j=1}^{L_Q}, \end{equation}
    (4)
    \begin{equation} {\mathbf {A}}^k = \left\lbrace a_j^k | a_j^k\in \mathbb {R}^d \right\rbrace _{j=1}^{L_A^k}, \end{equation}
    (5)
    where \(L_Q\) and \(L_A^k\) denote the number of tokens in the question and the \(k\)-th answer candidate, respectively.
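    To make the multiscale encoding concrete, the following PyTorch sketch illustrates one possible implementation of Equations (1)–(3). It is only a sketch under stated assumptions: the appearance and motion features are assumed to be pre-extracted 2048-dimensional vectors, T (frames per clip) is set to a typical value, and all class and variable names (e.g., MultiScaleEncoder) are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the multiscale appearance-motion encoding (Eqs. (1)-(3)).
# Assumptions: appearance/motion features are pre-extracted 2048-d vectors
# (e.g., from 2D/3D ResNets); scale n contributes tau = 2^(n-1) clips of T frames.
class MultiScaleEncoder(nn.Module):
    def __init__(self, d=512, T=16, N=3, feat_dim=2048):
        super().__init__()
        self.N = N
        self.app_proj = nn.Linear(feat_dim, d)   # maps V^n into the d-dimensional space
        self.mot_proj = nn.Linear(feat_dim, d)   # maps M^n into the d-dimensional space
        # one learnable positional embedding per scale, of size (tau*T + tau) x d
        self.pos = nn.ParameterList([
            nn.Parameter(torch.zeros((2 ** (n - 1)) * T + 2 ** (n - 1), d))
            for n in range(1, N + 1)
        ])

    def forward(self, app_feats, mot_feats):
        # app_feats[n-1]: (B, tau*T, 2048) frame features at scale n
        # mot_feats[n-1]: (B, tau, 2048) clip features at scale n
        X = []
        for n in range(1, self.N + 1):
            v = self.app_proj(app_feats[n - 1])
            m = self.mot_proj(mot_feats[n - 1])
            x = torch.cat([v, m], dim=1) + self.pos[n - 1]   # X^n with positions, Eq. (3)
            X.append(x)
        return X                                             # list of X^1, ..., X^N
```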

    3.2 Relation-Oriented Interaction

    Effective handling of intra- and inter-modal interactions is important for extracting multimodal interactive semantics in VideoQA. Existing approaches seldom focus on both types of interaction and neglect the role of their relations in multimodal learning. To address this issue, we propose a relation-oriented interaction module that considers both intra- and inter-modal interactions and their relations in a compact and unified framework. Specifically, we follow the principles of simplicity and efficiency to extract multimodal interactions. By default, we construct intra-modal interactions after the inter-modal interactions, which enables us to explore their relations. We assume that the intra-modal interactions can refine the inter-modal interactions and thus promote answer reasoning. Figure 3(a) illustrates this sequence, and we analyze the module in detail through ablation experiments (see Section 4.4.2).
    Fig. 3. Flow chart of our proposed (a) relation-oriented interaction module and (b) HSM unit.
    To extract inter-modal interactions, we construct a compact multi-headed cross-modal attention structure that integrates question-to-video and video-to-question processes. Given the appearance-motion feature \(\mathbf {X}^n\) and question feature \(\mathbf {Q}\) , the interactive output of the question-to-video process is computed as
    \begin{equation} {\tilde{\mathbf {X}}}^n={\mathbf {X}}^n+\mathrm{MCA}({\mathbf {X}}^n,\mathbf {Q}), \end{equation}
    (6)
    where \(\mathrm{MCA}(\cdot)\) denotes the operation of a multi-headed cross-modal attention layer. For a single attentional head, the output of \(\mathrm{MCA}_{X}^h\) is
    \begin{equation} \mathrm{MCA}_{X}^h=\mathrm{softmax}\left(\frac{{\mathop {F}}_{X}^h{\mathop {F}}_{{\tilde{Q}}}^{h\top }}{\sqrt {d}}\right){\mathop {F}}_{Q}^h, \end{equation}
    (7)
    with
    \begin{equation} \left\lbrace \begin{array}{l} \mathop {F}_{X}^h=\mathrm{LN}({\mathbf {X}}^n){\mathbf {W}}_{X}^h, \\ \mathop {F}_{{\tilde{Q}}}^h=\mathrm{LN}(\mathbf {Q}){\tilde{\mathbf {W}}}^h, \\ \mathop {F}_{Q}^h=\mathrm{LN}(\mathbf {Q}){\mathbf {W}}_{Q}^h, \\ \end{array} \right. \end{equation}
    (8)
    where \(\mathrm{LN}(\cdot)\) is the normalization layer. The learnable weight matrices are \(\mathbf {W}_{X}^h\) , \({\tilde{\mathbf {W}}}^h\) , \(\mathbf {W}_{Q}^h\) \(\in \mathbb {R}^{d\times d/H}\) . \(\mathrm{MCA}({\mathbf {X}}^n,\mathbf {Q})\) is obtained by concatenating all \(H\) attentional heads in the feature dimension. Similarly, the interactive output of the video-to-question process is computed as
    \begin{equation} \tilde{\mathbf {Q}}^n=\mathbf {Q}+\mathrm{MCA}(\mathbf {Q},{\mathbf {X}}^n), \end{equation}
    (9)
    Considering the symmetry of these two interactive processes, the attention head parameters of \(\mathrm{MCA}(\mathbf {Q},{\mathbf {X}}^n)\) in Equation (9) are shared with those in Equation (6) to maintain the semantic consistency of the feature space. We have
    \begin{equation} \mathrm{MCA}_Q^h=\mathrm{softmax}\left(\frac{\mathop {F}_Q^h\mathop {F}_{{{\tilde{X}}}}^{h\top }}{\sqrt {d}}\right){\mathop {F}}_{{X}}^h, \end{equation}
    (10)
    \begin{equation} {\mathop {F}}_{{{\tilde{X}}}}^{h}=\mathrm{LN}(\mathbf {X}^n){\tilde{\mathbf {W}}}^h, \end{equation}
    (11)
    We also concatenate the outputs of all attentional heads to obtain \(\mathrm{MCA}(\mathbf {Q},{\mathbf {X}}^n)\) . In short, the question-to-video and video-to-question processes are dominated by video features and question features, respectively. Both processes use cross-modal attention to generate modality-driven feature representations.
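    For illustration, the following is a minimal PyTorch sketch of this shared-parameter cross-modal attention (Equations (6)–(11)). The class name, the even split of \(d\) across \(H\) heads, and the absence of an output projection are assumptions made for readability, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the compact multi-headed cross-modal attention (Eqs. (6)-(11)).
# The three projections W_X, W~, and W_Q are shared between the question-to-video
# and video-to-question directions, as described in the text.
class SharedMCA(nn.Module):
    def __init__(self, d=512, H=8):
        super().__init__()
        self.H, self.dh, self.scale = H, d // H, d ** 0.5
        self.ln_x, self.ln_q = nn.LayerNorm(d), nn.LayerNorm(d)
        self.W_X = nn.Linear(d, d, bias=False)        # produces F_X (values for Eq. (10))
        self.W_tilde = nn.Linear(d, d, bias=False)    # produces F_~Q and F_~X (keys)
        self.W_Q = nn.Linear(d, d, bias=False)        # produces F_Q (values for Eq. (7))

    def _heads(self, t):   # (B, L, d) -> (B, H, L, d/H)
        B, L, _ = t.shape
        return t.view(B, L, self.H, self.dh).transpose(1, 2)

    def _merge(self, t):   # (B, H, L, d/H) -> (B, L, d), i.e., concatenate the heads
        B, H, L, dh = t.shape
        return t.transpose(1, 2).reshape(B, L, H * dh)

    def forward(self, X, Q):
        Fx = self._heads(self.W_X(self.ln_x(X)))
        Fq = self._heads(self.W_Q(self.ln_q(Q)))
        Fq_t = self._heads(self.W_tilde(self.ln_q(Q)))
        Fx_t = self._heads(self.W_tilde(self.ln_x(X)))
        # question-to-video (Eq. (7)): video-driven queries attend over the question
        A_x = F.softmax(Fx @ Fq_t.transpose(-1, -2) / self.scale, dim=-1)
        X_out = X + self._merge(A_x @ Fq)                   # residual as in Eq. (6)
        # video-to-question (Eq. (10)): question-driven queries attend over the video
        A_q = F.softmax(Fq @ Fx_t.transpose(-1, -2) / self.scale, dim=-1)
        Q_out = Q + self._merge(A_q @ Fx)                   # residual as in Eq. (9)
        return X_out, Q_out
```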
    To explore the relations between intra- and inter-modal interactions, we add to our relation-oriented interaction module an adaptive self-attention scheme for intra-modal learning using a unified framework. Given inter-modal interaction features \(\tilde{\mathbf {X}}^n\) and \(\tilde{\mathbf {Q}}^n\) , their intra-modal outputs are represented, respectively, as
    \begin{equation} \hat{\mathbf {X}}^n=\tilde{\mathbf {X}}^n+\sigma (\mathbf {R}_X\tilde{\mathbf {X}}^n{\hat{\mathbf {W}}}_{X}), \end{equation}
    (12)
    \begin{equation} \hat{\mathbf {Q}}^n=\tilde{\mathbf {Q}}^n+\sigma (\mathbf {R}_Q\tilde{\mathbf {Q}}^n{\hat{\mathbf {W}}}_{Q}), \end{equation}
    (13)
    where \(\sigma (\cdot)\) is the \(\mathrm{ELU}\) activation function, and \(\hat{\mathbf {W}}_{X}\) , \(\hat{\mathbf {W}}_{Q}\) \(\in \mathbb {R}^{d\times d}\) are learnable weight matrices. \(\mathbf {R}_X\) and \(\mathbf {R}_Q\) denote the self-attentive matrices of \(\tilde{\mathbf {X}}^n\) and \(\tilde{\mathbf {Q}}^n\) , respectively, and are obtained by applying linear transformation and dot-product to the modal features.
    \begin{equation} \mathbf {R}_X=\mathrm{softmax}(\sigma (\tilde{\mathbf {X}}^n{\mathbf {W}}_1)\sigma (\tilde{\mathbf {X}}^n{\mathbf {W}}_2)^\top), \end{equation}
    (14)
    \begin{equation} \mathbf {R}_Q=\mathrm{softmax}(\sigma (\tilde{\mathbf {Q}}^n{\mathbf {W}}_3)\sigma (\tilde{\mathbf {Q}}^n{\mathbf {W}}_4)^\top), \end{equation}
    (15)
    Here, \(\mathbf {W}_1\) , \(\mathbf {W}_2\) , \(\mathbf {W}_3\) , and \(\mathbf {W}_4\) are learnable weight matrices. The above self-attentive scheme can adaptively learn the weights of all features within the modality and perform interactive reasoning for each modality simultaneously.
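    A possible realization of this adaptive self-attention (Equations (12)–(15)) is sketched below; the same module would be instantiated once for the video branch and once for the question branch. The names are illustrative, and the sketch should be read as one plausible implementation rather than the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the adaptive self-attention used for intra-modal interaction
# (Eqs. (12)-(15)); Z stands for either ~X^n or ~Q^n.
class AdaptiveSelfAttention(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.W1 = nn.Linear(d, d, bias=False)     # W_1 (or W_3 for the question branch)
        self.W2 = nn.Linear(d, d, bias=False)     # W_2 (or W_4 for the question branch)
        self.W_hat = nn.Linear(d, d, bias=False)  # \hat{W}_X or \hat{W}_Q

    def forward(self, Z):                          # Z: (B, L, d)
        a = F.elu(self.W1(Z))
        b = F.elu(self.W2(Z))
        R = F.softmax(a @ b.transpose(-1, -2), dim=-1)   # self-attentive matrix, Eq. (14)/(15)
        return Z + F.elu(R @ self.W_hat(Z))              # residual refinement, Eq. (12)/(13)
```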
    We concatenate \(\hat{\mathbf {X}}^n\) and \(\hat{\mathbf {Q}}^n\) in the temporal dimension to obtain the multimodal representation \(\mathbf {U}^n\) . For other \(n\) values, we adopt the same parameters as in Equations (6)–(15) to acquire multiscale interactive semantics. This parameter sharing makes the module compact and ensures that the interactive feature space is compatible at all scales. In summary, our relation-oriented interaction module incorporates multi-headed cross-modal attention and adaptive self-attention to extract inter- and intra-modal interactions, respectively, while considering the relationship between these features to facilitate multimodal learning. Unlike previous methods that used cross-modal attention (e.g., STA [10], Bridge2Answer [39], and HQGA [50]), or co-attention mechanisms (e.g., LAD-Net [29], PSAC [30], and MGTA-Net [54]) to extract inter-modal interactions, we leverage the multi-headed attention mechanism and redesign the transformation on each head (see Equations (7) and (10)) to synchronously extract question-to-video interactions and video-to-question interactions. This approach ensures semantic consistency in the multimodal feature space while maintaining the simplicity of the module. The ablation experiments in Section 4.4.2 demonstrate the benefits of this redesigned attention head. In addition, unlike Bridge2Answer [39], which requires semantic dependency analysis on questions to construct a graph structure, or PSAC [30], which directly uses the self-attention structure of the transformer [46] to extract intra-modal interactions, our adaptive self-attention approach uses a simple linear transformation to learn the relationships between features within each modality and conducts interaction inference for each modality simultaneously. More importantly, our relation-oriented interaction module is designed to be compact and unified, considering both inter- and intra-modal interactions and effectively exploring the impact of their relationship on multimodal learning.
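    Putting the two sketches above together, the whole relation-oriented interaction module could be assembled roughly as follows. The single SharedMCA and AdaptiveSelfAttention instances (from the illustrative sketches above) are reused for every scale \(n\), which is how the parameter sharing described in this section would look in code; again, this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Composition of the two sketches above: inter-modal interaction first, then
# intra-modal refinement, with the same parameters reused at every scale.
class RelationOrientedInteraction(nn.Module):
    def __init__(self, d=512, H=8):
        super().__init__()
        self.mca = SharedMCA(d, H)                  # Eqs. (6)-(11)
        self.intra_x = AdaptiveSelfAttention(d)     # Eqs. (12) and (14)
        self.intra_q = AdaptiveSelfAttention(d)     # Eqs. (13) and (15)

    def forward(self, X_scales, Q):
        U = []
        for X in X_scales:                          # X_scales: [X^1, ..., X^N]
            X_t, Q_t = self.mca(X, Q)               # inter-modal interactions
            X_hat, Q_hat = self.intra_x(X_t), self.intra_q(Q_t)
            U.append(torch.cat([X_hat, Q_hat], dim=1))   # U^n: temporal concatenation
        return U
```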

    3.3 Hierarchical Synergistic Memory

    Given a question, a more reliable way of searching for answer clues is to incorporate visual information at different temporal scales. Most existing methods fail to exploit the complementarity between multiscale semantics in answer reasoning. In this section, we introduce the HSM unit (see Figure 3(b)), which is inspired by the GRU memory unit [3] to complement and fuse multimodal semantics at different scales using a bottom-up and top-down approach. This results in more confident answering.
    Specifically, the bottom-up and top-down operations provide a memory-based interaction to iteratively update the features at each scale, thus enabling synergistic reasoning of hierarchical semantics. For multimodal representation \({\mathbf {U}}^n, n\gt 1\) , the iterative output of the bottom-up operation is
    \begin{equation} {\mathbf {U}}_{\uparrow }^n=(1-{\mathbf {\alpha }}^n) \odot \mathbf {\gamma }^n{{\mathbf {U}}}_{\uparrow }^{n-1}+{\mathbf {\alpha }}^n \odot {\Delta {\mathbf {U}}^n}, \end{equation}
    (16)
    where \({\mathbf {U}}_{\uparrow }^1={\mathbf {U}}^1\) , and \(\odot\) denotes the element-wise product. \({\mathbf {\gamma }}^n\) represents the connection matrix, which maps the hidden information from the previous level to the current level, thereby enabling the transfer of semantic features of different sizes. In addition, the connection matrix \({\mathbf {\gamma }}^n\) enables interactions between different hierarchical levels. \({\mathbf {\alpha }}^n\) is the output of the update gate, which determines how much memory information is passed from the previous level to the current level. \(\Delta {\mathbf {U}}^n\) represents the candidate hidden information. We have
    \begin{equation} {\mathbf {\gamma }}^n=\mathbf {U}^n{\mathbf {W}}_{{\gamma }{1}} ({{\mathbf {U}}_{\uparrow }^{n-1}{\mathbf {W}}_{{\gamma }{2}}})^\top , \end{equation}
    (17)
    \begin{equation} {\mathbf {\alpha }}^n=\delta (\mathbf {U}^n{\mathbf {W}}_{\alpha {1}} + \mathbf {\gamma }^n{\mathbf {U}}_{\uparrow }^{n-1}{\mathbf {W}}_{\alpha {2}} + b_{\mathbf {\alpha }}), \end{equation}
    (18)
    \begin{equation} \Delta {\mathbf {U}}^n=\sigma (\mathbf {U}^n{\mathbf {W}}_{\Delta {1}} + \mathbf {\mu }^n \odot \mathbf {\gamma }^n{\mathbf {U}}_{\uparrow }^{n-1}{\mathbf {W}}_{\Delta {2}} + b_{\mathbf {\Delta }}), \end{equation}
    (19)
    with
    \begin{equation} {\mathbf {\mu }}^n=\delta (\mathbf {U}^n{\mathbf {W}}_{\mu {1}} + \mathbf {\gamma }^n{\mathbf {U}}_{\uparrow }^{n-1}{\mathbf {W}}_{\mu {2}} + b_{\mathbf {\mu }}), \end{equation}
    (20)
    where \(\delta (\cdot)\) is the sigmoid function. \({\mathbf {\mu }}^n\) is the output of the reset gate, which determines how to combine the new input information with the previous memory. The top-down operation is the reverse of the bottom-up operation: for \(n\) from \(N\) to 1, it updates the multimodal representations in a manner similar to that described in Equations (16)–(20).
    Notably, the HSM treats the multimodal feature matrix of varying sizes at different scales as sequential inputs. This is beyond the capability of the vanilla GRU unit. It then complements and fuses multimodal semantics from different scales in each iteration step, using a memory-based interaction to accomplish hierarchical synergistic reasoning. We note that the features obtained by the last iteration step in the bottom-up and top-down operations are \({\mathbf {U}}_{\uparrow }\) and \({\mathbf {U}}_{\downarrow }\) , respectively; we then add them together to obtain the final semantic representations \({\mathbf {O}}\) for answer decoding.
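    The bottom-up pass of the HSM could be sketched roughly as follows (Equations (16)–(20)); the top-down pass would apply the same kind of cell over the reversed scale order. The cell and function names, and the placement of the bias terms inside the second linear layer of each gate, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Rough sketch of one HSM step (Eqs. (16)-(20)). The connection matrix gamma^n
# maps the previous scale's memory (of a different temporal length) to the
# current scale before the GRU-style gating is applied.
class HSMCell(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.Wg1 = nn.Linear(d, d, bias=False)   # W_gamma1
        self.Wg2 = nn.Linear(d, d, bias=False)   # W_gamma2
        self.Wa1, self.Wa2 = nn.Linear(d, d, bias=False), nn.Linear(d, d)   # update gate
        self.Wm1, self.Wm2 = nn.Linear(d, d, bias=False), nn.Linear(d, d)   # reset gate
        self.Wd1, self.Wd2 = nn.Linear(d, d, bias=False), nn.Linear(d, d)   # candidate
        self.elu = nn.ELU()

    def forward(self, U_n, U_prev):
        # U_n: (B, L_n, d) current scale; U_prev: (B, L_prev, d) accumulated memory
        gamma = self.Wg1(U_n) @ self.Wg2(U_prev).transpose(-1, -2)   # Eq. (17)
        H_prev = gamma @ U_prev                   # previous memory mapped to the current size
        alpha = torch.sigmoid(self.Wa1(U_n) + self.Wa2(H_prev))      # Eq. (18)
        mu = torch.sigmoid(self.Wm1(U_n) + self.Wm2(H_prev))         # Eq. (20)
        dU = self.elu(self.Wd1(U_n) + mu * self.Wd2(H_prev))         # Eq. (19)
        return (1 - alpha) * H_prev + alpha * dU                      # Eq. (16)

def bottom_up(cell, U_scales):
    """Iterate over scales n = 1..N; the top-down pass simply reverses U_scales."""
    memory = U_scales[0]
    for U_n in U_scales[1:]:
        memory = cell(U_n, memory)
    return memory
```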

    3.4 Answer Decoder and Loss Function

    Following existing practice [4, 7, 11, 16, 19, 20, 21, 25, 27, 34, 39, 44, 51, 52, 54], we adopt different decoding functions for different types of questions. For the open-ended classification task, we use two fully connected layers and the vanilla cross-entropy function to compute the loss. To generate answers for the open-ended generation task, we use a single GRU layer with soft attention over the question, similar to [49]. For the repetition counting task, we adopt the mean squared error (MSE) as our loss function and apply a rounding function to the output to obtain an integer result. For the multi-choice task, we process each answer candidate in the same way as the question input and share the corresponding parameters. We use the hinge loss to compute the loss between each predicted answer and the correct answer.
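    As an illustration of how these task-specific heads and losses fit together, the sketch below shows hedged, simplified versions of the classification head, the counting loss, and the multi-choice hinge loss. The hidden sizes, answer-vocabulary size, and margin value are assumptions, and the generation decoder (GRU with soft attention) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, vocab_size = 512, 4000   # assumed sizes, not taken from the paper's decoder definition

# open-ended classification: two fully connected layers + cross-entropy
classifier = nn.Sequential(nn.Linear(d, d), nn.ELU(), nn.Linear(d, vocab_size))

def classification_loss(features, labels):
    return F.cross_entropy(classifier(features), labels)

# repetition counting: regress with MSE during training, round at inference time
def counting_loss(predictions, targets):
    return F.mse_loss(predictions.squeeze(-1), targets.float())

def counting_inference(predictions):
    return torch.round(predictions.squeeze(-1))

# multi-choice: hinge loss between the correct candidate's score and the others
def multichoice_hinge_loss(scores, correct_idx, margin=1.0):
    # scores: (B, K) similarity between the fused representation and each candidate
    pos = scores.gather(1, correct_idx.unsqueeze(1))            # (B, 1) correct-answer score
    hinge = torch.clamp(margin + scores - pos, min=0.0)         # (B, K) per-candidate hinge
    mask = torch.ones_like(scores)
    mask.scatter_(1, correct_idx.unsqueeze(1), 0.0)             # exclude the correct candidate
    return (hinge * mask).mean()
```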

    4 Experiments

    In this section, we first introduce the VideoQA datasets used for our experiments as well as implementation details. We then compare the proposed method with other state-of-the-art methods and conduct ablation studies to justify each component of our method.

    4.1 Datasets

    Eight VideoQA benchmark datasets, which contain videos of different lengths, scenarios, and question-answering types, are adopted for evaluation.
    MSVD-QA [55] contains 50,505 question-answer pairs and 1,970 video clips, with questions divided into what, who, how, when, and where, all of which are open-ended tasks.
    MSRVTT-QA [55] contains 243k question-answer pairs and 10k videos, with questions in the same form as MSVD-QA but with more complex visual scenarios and longer video lengths of around 10 to 30 seconds.
    ActivityNet-QA [65] contains 58k open-ended question-answer pairs and 5.8k videos. All videos are untrimmed web videos with an average length of 180 seconds.
    TGIF-QA [18] contains 165k question-answer pairs and 72k animated GIFs. This dataset covers four types of questions: Action, Trans, FrameQA, and Count. Action is a multi-choice task to identify the repetitive action; Trans is another multi-choice task to identify the temporal transition of an action; FrameQA is an open-ended task whose answers can be inferred from a single frame; Count is a repetition counting task to determine how many times an action is repeated.
    Youtube2Text-QA [61] contains 9.9k question-answer pairs, with video data from MSVD-QA. Questions are divided into what, who, and other. We select the multi-choice task in this dataset for evaluation.
    SUTD-TrafficQA [56] contains 62k question-answer pairs and 10k videos. This dataset includes six challenging traffic-related reasoning tasks, with each question providing four candidate answers, only one of which is correct. To solve these questions, the model needs strong common sense and logical reasoning abilities.
    Social-IQ [66] contains 7.5k question-answer pairs and 1.25k videos. All questions are multi-choice tasks and are specific to in-the-wild social situations, aiming to evaluate artificial social intelligence through question-answering. This dataset has a larger proportion of questions starting with why and how, which often require strong reasoning abilities.
    NExT-QA [49] contains 52k manually annotated question-answer pairs and 5.44k videos. This dataset contains three types of questions: Causal, Temporal, and Descriptive. The Causal questions are designed to explain actions or uncover the intentions of previously occurring actions. The Temporal questions are designed to assess the model’s ability to reason about temporal relationships between actions. The Descriptive questions focus on the description of the scene in the video. All questions are divided into multi-choice and open-ended tasks, where the latter requires the model to generate answers in short phrases.

    4.2 Implementation Details

    Experiment Settings. To split the datasets, we use the standard training, validation, and test sets provided in each dataset. Similar to existing works, we take the open-ended questions in MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA datasets as open-ended classification tasks. We treat the open-ended questions in the NExT-QA dataset as open-ended generation tasks, truncating each answer to a maximum length of 6 and considering each word in the training set as a member of the answer vocabulary.
    We use a ResNet with 152 layers to extract the appearance-motion features, and the maximum sampling scale \(N\) is set to 3 by default. The feature dimension \(d\) is set to 512. During network training, each VideoQA dataset has a pre-defined vocabulary composed of the top \(K\) most frequent words in the training set. We set the \(K\) value for the MSVD-QA dataset to 4,000 and that for other datasets to 8,000. We use the Adam optimizer to train the network, with the initial learning rate set to \(1\times 10^{-4}\) . When the loss has not decreased for 5 epochs, the learning rate is halved. The maximum number of epochs is set to 50, and the batch size is set to 32. We implement our model with the PyTorch deep learning library on a PC with a single GTX 1080 Ti GPU.
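    For reference, a minimal sketch of this training setup in PyTorch might look as follows; the model, data loader, and loss function are passed in as placeholders, and only the hyper-parameters stated above (Adam, initial learning rate 1e-4, halving after 5 stagnant epochs, at most 50 epochs, batch size 32) are taken from the text. Using a plateau scheduler on the epoch loss is one way to realize the described schedule.

```python
import torch

# Minimal training-loop sketch matching the schedule described above.
def train(model, train_loader, compute_loss, max_epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # halve the learning rate when the epoch loss has not decreased for 5 epochs
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for batch in train_loader:            # batches of size 32
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        scheduler.step(epoch_loss)            # plateau detection on the training loss
```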
    Evaluation Metrics. To evaluate the performance of a model, we use different metrics depending on the task. For open-ended classification and multi-choice tasks, we adopt accuracy (%). For repetition counting tasks, we use the MSE between the predicted answer and the ground-truth answer. For open-ended generation tasks, we compute the Wu-Palmer similarity (WUPS) score [49] to evaluate the quality of the generated answers.

    4.3 Comparison with State-of-the-art Methods

    In this section, we compare our HMRNet with the state-of-the-art methods on eight popular VideoQA datasets. Unless otherwise stated, the reported results are those obtained from experiments done on the testing set.
    Table 1 reports the comparison results on MSVD-QA, MSRVTT-QA, and ActivityNet-QA datasets. For videos with varying durations and complex scenes, our proposed HMRNet still achieves the best performance, i.e., 41.8% (+0.6%), 39.5% (+0.9%), and 40.6% (+3.3%) on MSVD-QA, MSRVTT-QA, and ActivityNet-QA datasets, respectively. The results of these experiments demonstrate that our method is well adapted to both short and long untrimmed video question answering.
    Table 1. The Comparison Experiments on the MSVD-QA, MSRVTT-QA, and ActivityNet-QA Datasets. The best and second-best results are bolded and underlined, respectively.

| Method | MSVD-QA | MSRVTT-QA | ActivityNet-QA |
| --- | --- | --- | --- |
| E-SA [65] | 27.6 | 29.3 | 31.8 |
| MAR-VQA [74] | - | - | 34.6 |
| CAN [63] | 32.4 | 33.2 | 35.4 |
| HGA [20] | 34.7 | 35.5 | - |
| GMN [11] | 35.4 | 36.1 | - |
| MHMAN [62] | 35.6 | 34.6 | 37.1 |
| Bridge2Answer [39] | 37.2 | 36.9 | - |
| HAIR [34] | 37.5 | 36.9 | - |
| HOSTR [4] | 39.4 | 35.9 | - |
| DSAVS [36] | 37.2 | 35.8 | - |
| ACRTransformer [72] | - | - | 37.3 |
| DualVGR [48] | 39.0 | 35.5 | - |
| MHN [40] | 40.4 | 38.6 | - |
| PKOL [68] | 41.1 | 36.9 | - |
| HQGA [50] | 41.2 | 38.6 | - |
| SSML [1] | 35.1 | 35.1 | - |
| ClipBERT [27] | - | 37.4 | - |
| HMRNet (ours) | 41.8 | 39.5 | 40.6 |
    Table 2 shows the comparison results on the TGIF-QA dataset. Our proposed HMRNet achieves the best performance on the Action (+2.9%), Trans (+1.7%), and Count ( \(-0.15\) ) tasks. The improvement is particularly noticeable for the Action and Trans tasks, which challenge the modeling of temporal relations between videos and questions. PKOL [68] obtained the best performance on the FrameQA task; in contrast, our method achieves improvements of 12.7 and 9.7 percentage points over PKOL on the Action and Trans tasks, respectively. The better performance of HMRNet can be attributed to its ability to extract multimodal interactive relations from questions and videos and to search for answer clues at different semantic levels.
    Table 2. The Comparison Experiment on the TGIF-QA Dataset. The best and second-best results are bolded and underlined, respectively.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) |
| --- | --- | --- | --- | --- |
| L-GCN [16] | 74.3 | 81.1 | 56.3 | 3.95 |
| QueST [19] | 75.9 | 81.0 | 59.7 | 4.19 |
| HCRN [25] | 75.0 | 81.4 | 55.9 | 3.82 |
| GMN [11] | 73.0 | 81.7 | 57.5 | 4.16 |
| ACRTransformer [72] | 75.8 | 81.6 | 57.7 | 4.08 |
| Bridge2Answer [39] | 75.9 | 82.6 | 57.5 | 3.71 |
| HAIR [34] | 77.8 | 82.3 | 60.2 | 3.88 |
| HOSTR [4] | 75.0 | 83.0 | 58.0 | 3.65 |
| MSPAN [12] | 78.4 | 83.3 | 59.7 | 3.57 |
| CPTC [7] | 78.1 | 84.0 | 57.4 | 3.98 |
| MASN [44] | 84.4 | 87.4 | 59.5 | 3.75 |
| PKOL [68] | 74.6 | 82.8 | 61.8 | 3.67 |
| HQGA [50] | 76.9 | 85.6 | 61.3 | - |
| MHN [40] | 83.5 | 90.8 | 58.1 | 3.58 |
| SiaSamRea [64] | 79.7 | 85.3 | 60.2 | 3.61 |
| ClipBERT [27] | 82.8 | 87.8 | 60.3 | - |
| HD-VILA [57] | 84.3 | 90.0 | 60.5 | - |
| HMRNet (ours) | 87.3 | 92.5 | 60.0 | 3.42 |
    We note here that, in addition to using the necessary encoders to extract video and question features, some of the compared methods apply other complex feature extraction procedures to promote the video–question interaction. For example, Bridge2Answer [39] additionally extracted compositional semantic relations between words using an NLP tool [38]. GMN [11], HAIR [34], HOSTR [4], MASN [44], PKOL [68], and HQGA [50] applied a Faster R-CNN [43] for object detection per frame to obtain fine-grained spatio-temporal features. Conversely, ACRTransformer [72] adopted a BSN [33] to specifically locate high-action frames in a given video, and MSPAN [12] extracted motion features in a frame-by-frame style despite the large parameter size of the motion network. Furthermore, PKOL [68] used additional prior knowledge from a massive corpus and thus achieved strong results on the FrameQA task. By contrast, our proposed relation-oriented interaction module can effectively extract multimodal interactive representations without these complex feature pre-processing steps. Moreover, most of the previous methods applied dense sampling to the input video. For example, HCRN [25], Bridge2Answer [39], and HQGA [50] sampled 8 clips with each comprising 16 frames; E-SA [65], MAR-VQA [74], DSAVS [36], and DualVGR [48] sampled 20 clips for long videos. However, with fewer clips sampled (only seven clips are sampled when \(N\) is set to 3 by default), our proposed HSM is able to search for answer cues across multiscale semantics to achieve the best performance.
    Tables 1 and 2 also provide comparison results with recent large-scale pre-training approaches. Specifically, SSML [1] was pre-trained on the HowTo100M video–text corpus. ClipBERT [27] proposed a sparse sampling strategy and used the COCO Captions and Visual Genome Captions datasets for pre-training. With the same pre-training approach as in [27], SiaSamRea [64] proposed a Siamese sampling mechanism to enhance answer reasoning. HD-VILA [57] built a large-scale, high-resolution video–text dataset to improve reasoning performance. Moreover, their models are huge, complex, and demand ample computing resources to implement training. As a comparison, our proposed HMRNet is compact and does not require any extra video–text corpus for pre-training to achieve superior performance.
    Nevertheless, we also recognize that large-scale pre-training indeed largely improves the performance of several methods [59, 67] on the above datasets. For example, [59] built a domain-specific dataset for pre-training and achieved a higher recognition rate, and [67] collected a very large video–text corpus called YT-Temporal-180M for pre-training and achieved a performance superior to that of our approach. However, we believe that our method, which focuses on designing compact and efficient components for the problems present in VideoQA, is equally valuable. Moreover, it provides a way to promote broader research, especially while powerful computational resources are still expensive to access.
    Tables 3–5 show evaluation results on the Youtube2Text-QA, SUTD-TrafficQA, and Social-IQ datasets, respectively. For the Youtube2Text-QA dataset, we achieve an improvement of 3% in overall performance compared to the previously reported best result. For the SUTD-TrafficQA dataset, we obtain the highest accuracy, i.e., 40.91%. We also show the performance of our HMRNet on the binary and four-class classification tasks of the Social-IQ dataset: here, HMRNet improves the recognition rate by 10.1% and 10.4%, respectively, compared with Tensor-MFN [66]. This outstanding performance suggests that our proposed HMRNet has the ability to perform causal reasoning in in-the-wild social situations and complex traffic scenarios.
    Table 3. Test Results on the Youtube2Text-QA Dataset

| Method | Youtube2Text-QA |
| --- | --- |
| EVQA [2] | 47.6 |
| r-ANL [61] | 52.0 |
| HME [5] | 80.8 |
| L-GCN [16] | 83.9 |
| HAIR [34] | 85.3 |
| DSAVS [36] | 85.0 |
| HMRNet (ours) | 88.3 |
    Table 4. Test Results on the SUTD-TrafficQA Dataset

| Method | SUTD-TrafficQA |
| --- | --- |
| VIS+LSTM [42] | 29.91 |
| BERT-VQA [60] | 33.68 |
| TVQA [28] | 35.16 |
| HCRN [25] | 36.49 |
| Eclips [56] | 37.05 |
| HMRNet (ours) | 40.91 |
    Table 5. Test Results on the Social-IQ Dataset

| Method | Binary | Four-class |
| --- | --- | --- |
| LMN [47] | 61.1 | 31.8 |
| FVTA [32] | 60.9 | 31.0 |
| MDAM [24] | 62.2 | 30.7 |
| TVQA [28] | 60.0 | 30.0 |
| Tensor-MFN [66] | 64.8 | 34.1 |
| HMRNet (ours) | 74.9 | 44.5 |
    Table 6 shows evaluation results for the open-ended generation task conducted on the NExT-QA dataset. Similar to [49], we compared experimental results across all question types. \(\mathrm{WUPS}_C\) , \(\mathrm{WUPS}_T\) , and \(\mathrm{WUPS}_D\) denote the generated answer scores for Causal, Temporal, and Descriptive questions, respectively. \(\mathrm{WUPS}\) denotes the overall score for all questions. The higher the score, the more accurate the generated answer. Our method achieves better overall performance compared with other state-of-the-art approaches. Table 7 shows evaluation results for the multi-choice task conducted using the NExT-QA dataset. We experiment with the multi-choice task on the validation set. \(\mathrm{Acc}_C\) , \(\mathrm{Acc}_T\) , and \(\mathrm{Acc}_D\) denote the recognition rates for Causal, Temporal, and Descriptive questions, respectively. \(\mathrm{Acc}\) denotes the overall accuracy of all questions. We denote the approach introduced in reference [52] as VTQG, which encompasses three graphical reasoning architectures: VQG, VTG, and TQG. The results demonstrate that our HMRNet outperforms the majority of existing methods across all question types and achieves comparable performance to VTQG.
    Table 6. Comparison of Methods on the NExT-QA Dataset for Open-ended Generation Tasks. The best results are bolded.

| Method | \(\mathrm{WUPS}_C\) | \(\mathrm{WUPS}_T\) | \(\mathrm{WUPS}_D\) | \(\mathrm{WUPS}\) |
| --- | --- | --- | --- | --- |
| STVQA [17] | 15.24 | 18.03 | 47.11 | 23.04 |
| HME [5] | 15.78 | 18.40 | 50.03 | 24.06 |
| HCRN [25] | 16.05 | 17.68 | 49.78 | 23.92 |
| UATT [58] | 16.73 | 18.68 | 48.42 | 24.25 |
| HGA [20] | 17.98 | 17.95 | 50.84 | 25.18 |
| HMRNet (ours) | 18.51 | 19.64 | 52.18 | 26.22 |
    Table 7. Comparison of Methods on the NExT-QA Dataset for Multi-choice Tasks. The best results are bolded.

| Method | Why | How | \(\mathrm{Acc}_C\) | Prev&Next | Present | \(\mathrm{Acc}_T\) | Count | Location | Other | \(\mathrm{Acc}_D\) | \(\mathrm{Acc}\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EVQA [2] | 28.38 | 29.58 | 28.69 | 29.82 | 33.33 | 31.27 | 43.50 | 43.39 | 38.36 | 41.44 | 31.51 |
| PSAC [30] | 35.81 | 29.58 | 34.18 | 28.56 | 35.75 | 31.51 | 39.55 | 67.90 | 35.41 | 48.65 | 35.57 |
| PSAC+ [30] | 35.03 | 29.87 | 33.68 | 30.77 | 35.44 | 32.69 | 38.42 | 71.53 | 38.03 | 50.84 | 36.03 |
| Co-Mem [6] | 36.12 | 32.21 | 35.10 | 34.04 | 41.93 | 37.28 | 39.55 | 67.12 | 40.66 | 50.45 | 38.19 |
| ST-TP [18] | 37.58 | 32.50 | 36.25 | 33.09 | 40.87 | 36.29 | 45.76 | 71.53 | 44.92 | 55.21 | 39.21 |
| HGA [20] | 36.38 | 33.82 | 35.71 | 35.83 | 42.08 | 38.40 | 46.33 | 70.51 | 46.56 | 55.60 | 39.67 |
| HME [5] | 39.14 | 34.70 | 37.97 | 34.35 | 40.57 | 36.91 | 41.81 | 71.86 | 38.36 | 51.87 | 39.79 |
| HCRN [25] | 39.86 | 36.90 | 39.09 | 37.30 | 43.89 | 40.01 | 42.37 | 62.03 | 40.66 | 49.16 | 40.95 |
| VTQG [52] | 43.14 | 39.82 | 42.27 | 40.25 | 47.21 | 43.11 | 46.89 | 74.58 | 52.46 | 59.59 | 45.24 |
| HMRNet (ours) | 42.05 | 39.97 | 41.50 | 38.46 | 47.51 | 42.18 | 46.33 | 77.63 | 54.75 | 61.52 | 44.84 |

    Why and How are sub-types of Causal questions; Prev&Next and Present are sub-types of Temporal questions; Count, Location, and Other are sub-types of Descriptive questions.
    Similarly, as we have analyzed in Tables 1 and 2, our method offers the advantage of not requiring complex feature pre-processing steps, such as those used in VTQG that rely on object detectors on each frame and seq-NMS to generate target trajectory features. Additionally, our method achieves powerful performance with fewer video clips, as compared to VTQG which samples a relatively large number of 16 clips. These benefits are due to our inter- and intra-modal relationship modeling and multiscale collaborative reasoning structure. However, it is important to note that object-level interactions, as explored in methods [52] and [51], offer advantages, particularly for questions that involve relational reasoning between objects in the NExT-QA dataset. Also, the approach in [51] of constructing graph interactions of objects and leveraging BERT fine-tuning to enhance performance is worth considering for our work.
    To further demonstrate the advantages of our proposed HMRNet, in Table 8, we compare it with several methods in terms of parameter size and computational efficiency. The last four methods in the table, i.e., ClipBERT [27], VQA-T [59], MERLOT [67], and HD-VILA [57], are recent large-scale pre-training methods. We use nParams and GFLOPs [15, 56] to denote the number of learnable parameters and the computational efficiency of the model, respectively. As shown in the table, while leveraging off-the-shelf features for answer reasoning, our proposed HMRNet has the lowest number of parameters. Furthermore, HMRNet requires significantly fewer parameters than HME and HD-VILA. In terms of computational efficiency, our method also has a greater advantage compared with other methods. The GFLOPs of our HMRNet are only 0.25, which is remarkably less than those of MERLOT and HD-VILA, and less than half that of VQA-T.
    Table 8. Comparison of the Number of Parameters Required and Computational Efficiency Across Methods. The best results are bolded.

| Method | PSAC | HME | HCRN | MHN | HQGA | ClipBERT | VQA-T | MERLOT | HD-VILA | HMRNet (ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nParams | 39.1M | 44.8M | 42.9M | 16.5M | 12.7M | 85.1M | 48.8M | 85.1M | 85.1M | 10.5M |
| GFLOPs | 0.48 | 0.63 | 0.74 | 0.78 | 1.43 | 3.48 | 0.68 | 134.54 | 17.84 | 0.25 |

    4.4 Ablation Study

    To justify our methodology, we run a series of ablation experiments on the TGIF-QA and NExT-QA datasets. With the default HMRNet (referred to here as the default model) used above as the baseline, we first analyze the impact of different components of our model and then analyze the impact of some important hyper-parameters on the experimental results.

    4.4.1 Impact of Feature Encoders.

    In Tables 1–8, we demonstrate the superiority of our proposed method and the simplicity of its design. Here, we analyze the effect of changing the feature encoders on the experimental results. Specifically, we replace the video encoder with a lightweight ResNet backbone with 18 layers and the question encoder with the BERT [23] model, respectively. Experimental results are shown in Table 9. It can be seen that when using a video encoder with much weaker encoding capability (Video Enc. \(^-\) ), our method is still able to achieve performance comparable to the default model. This indicates that our proposed method is robust to the choice of visual encoder. On the other hand, when replacing the question encoder with the BERT model (Question Enc. \(^+\) ), which has more learnable parameters and a relatively strong semantic encoding capability, our method shows improvements on all tasks. This result demonstrates that our method can be readily reused with different question encoders and has the potential to be combined with existing BERT-style large-scale pre-trained models.
    Table 9. Impact of Feature Encoders on Performance for Experiments Run on the TGIF-QA Dataset. The best results are bolded.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) |
| --- | --- | --- | --- | --- |
| Video Enc. \(^-\) | 87.1 | 92.4 | 58.5 | 3.42 |
| Question Enc. \(^+\) | 87.5 | 93.0 | 60.3 | 3.40 |
| HMRNet (default) | 87.3 | 92.5 | 60.0 | 3.42 |

    4.4.2 Analysis of the Relation-oriented Interaction Module.

    In our work, we propose a relation-oriented interaction module that considers both intra- and inter-modal interactions and their relations in a compact and unified framework. Here, we conduct a series of experiments to analyze the implications of the relation between intra- and inter-modal interactions for multimodal learning. Specifically, we first remove the relation-oriented interaction module to evaluate its impact. Then, we add only the intra-modal or only the inter-modal interactions. Finally, we analyze the impact of exchanging the order of intra- and inter-modal interactions in the relation-oriented interaction module. We also analyze the benefits of our redesigned attention heads in the inter-modal interactions, i.e., by comparing against a variant that does not share attention-head parameters between the question-to-video and video-to-question processes.
    The results are reported in Table 10. Without the relation-oriented interaction (without ROI) module, the performance drops significantly for all tasks. However, with the addition of intra-modal interactions (with intra), the performance on most tasks greatly improves; the performance increases across tasks are even more significant with the addition of just the inter-modal interactions (with inter). These results illustrate that extracting both intra-modal and inter-modal interactions is important for multimodal learning and that the inter-modal interaction is essential for solving VideoQA tasks. Additionally, when we change the order of inter- and intra-modal interactions (with exc.) in the default module, the performance on all tasks drops. This suggests that the intra-modal interactions can refine the inter-modal interactions to obtain more expressive multimodal interactive relations, whereas a pre-focus on intra-modal interactions is not beneficial. One possible explanation is that there are elements in both the video and the question that are not relevant to answer inference. If we focus on the intra-modal interactions first, this answer-irrelevant noise will inevitably be introduced and weaken the representation of each modality, which in turn affects the extraction of inter-modal interactions. Conversely, if the related content between modalities is extracted first, the subsequent intra-modal interactions operate on more distilled representations. The comparison with the non-shared variant (HMRNet*) demonstrates that our redesigned attention head achieves slightly superior overall performance while reducing the model’s parameters and complexity. This observation further underscores its capability to maintain the semantic consistency of modal features during inter-modal interactions.
    Table 10. Performance on the TGIF-QA and NExT-QA Datasets in Ablation Experiments Assessing the Relation-oriented Interaction Module. The best results are bolded. Action, Trans, FrameQA, and Count are TGIF-QA tasks; the WUPS and Acc columns report NExT-QA results.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) | \(\mathrm{WUPS}_C\) | \(\mathrm{WUPS}_T\) | \(\mathrm{WUPS}_D\) | \(\mathrm{Acc}_C\) | \(\mathrm{Acc}_T\) | \(\mathrm{Acc}_D\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| without ROI | 67.8 | 79.2 | 58.1 | 3.46 | 15.11 | 16.79 | 48.32 | 35.02 | 37.97 | 51.22 |
| with intra | 76.6 | 82.5 | 59.4 | 3.58 | 17.31 | 18.59 | 52.01 | 36.98 | 38.59 | 50.45 |
| with inter | 86.6 | 92.3 | 59.9 | 3.45 | 17.92 | 18.82 | 51.71 | 41.20 | 41.69 | 55.86 |
| with exc. | 84.9 | 92.0 | 59.8 | 3.53 | 17.12 | 19.30 | 51.02 | 38.20 | 41.38 | 50.58 |
| HMRNet* | 87.0 | 92.5 | 60.0 | 3.43 | 18.34 | 19.68 | 52.14 | 41.58 | 42.18 | 61.36 |
| HMRNet (default) | 87.3 | 92.5 | 60.0 | 3.42 | 18.51 | 19.64 | 52.18 | 41.50 | 42.18 | 61.52 |

    4.4.3 Analysis of the Hierarchical Synergistic Memory Unit.

    Given different question types, we build on multiscale sampling and propose an HSM unit to complement and fuse multiscale features in a bottom-up and top-down approach to obtain a more confident representation of answer clues. Here, we first remove the HSM to analyze its importance in improving answer reasoning. Then, we use a single-scale video feature as input to analyze its impact. We also analyze the necessity of bottom-up and top-down operations in the default model.
    As reported in Table 11, without the hierarchical synergistic memory unit (without HSM), performance decreases on all tasks, especially on the NExT-QA dataset, which requires stronger spatio-temporal inference than the other datasets. In addition, a single scale (with single scale \(n\) = 1, 2, or 3) of input also leads to decreased performance on all tasks. These experimental results illustrate the effectiveness of our proposed HSM in enhancing answer reasoning and its synergistic effect in complementing and fusing multiscale interactive semantics. Moreover, when only bottom-up (with bottom-up) or top-down (with top-down) operations are applied in the default model, nearly all tasks show a degradation in performance. This indicates the necessity of bottom-up and top-down operations to incorporate the semantics at all hierarchy levels.
    Table 11. Performance on the TGIF-QA and NExT-QA Datasets in Ablation Experiments Assessing the HSM Unit. The best results are bolded. Action, Trans, FrameQA, and Count are TGIF-QA tasks; the WUPS and Acc columns report NExT-QA results.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) | \(\mathrm{WUPS}_C\) | \(\mathrm{WUPS}_T\) | \(\mathrm{WUPS}_D\) | \(\mathrm{Acc}_C\) | \(\mathrm{Acc}_T\) | \(\mathrm{Acc}_D\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| without HSM | 85.8 | 91.8 | 59.0 | 3.52 | 15.72 | 18.20 | 50.07 | 35.98 | 35.11 | 50.32 |
| with single scale n=1 | 85.9 | 91.9 | 59.9 | 4.02 | 16.99 | 19.61 | 51.66 | 40.35 | 42.00 | 54.44 |
| with single scale n=2 | 86.9 | 92.2 | 59.2 | 3.92 | 16.94 | 19.39 | 51.49 | 40.62 | 41.00 | 53.28 |
| with single scale n=3 | 85.9 | 92.0 | 58.3 | 3.94 | 16.90 | 19.37 | 51.71 | 38.97 | 39.76 | 52.77 |
| with bottom-up | 86.2 | 91.9 | 59.5 | 3.50 | 17.83 | 18.75 | 53.12 | 41.39 | 42.06 | 55.98 |
| with top-down | 86.7 | 91.9 | 59.1 | 3.70 | 16.90 | 18.01 | 49.90 | 36.94 | 38.71 | 51.35 |
| HMRNet (default) | 87.3 | 92.5 | 60.0 | 3.42 | 18.51 | 19.64 | 52.18 | 41.50 | 42.18 | 61.52 |

    4.4.4 Impact of Tunable Hyper-parameters.

    To further analyze the stability of our method, we experiment with the maximum scale value \(N=\lbrace 1, 2, 3, 4\rbrace\) and the feature dimension \(d=\lbrace 128, 256, 512,\) \(1024\rbrace\) while using the default values of other hyper-parameters.
    The scale value \(N\) sets the scope of our multiscale sampling. It determines how much information is provided to the model and whether irrelevant noise is introduced. Table 12 shows the variation in performance with different maximum scale values \(N\) . For different tasks, \(N\) should be set to balance the amount of information provided against model performance. We reach such a balance at \(N = 3\) . Notably, even when \(N = 2\) , so that only three clips are sampled, our method outperforms most previous methods (see Tables 2, 6, and 7). The feature dimension \(d\) determines the representational ability and complexity of the model. Figure 4 shows the impact of different \(d\) values on performance. The performance on each task fluctuates relatively consistently as \(d\) changes, and we achieve the best overall performance at \(d = 512\) .
    Table 12. Performance on the TGIF-QA and NExT-QA Datasets with Different \(N\) Values. The best results are bolded. Action, Trans, FrameQA, and Count are TGIF-QA tasks; the WUPS and Acc columns report NExT-QA results.

| Method | Action | Trans | FrameQA | Count \(\downarrow\) | \(\mathrm{WUPS}_C\) | \(\mathrm{WUPS}_T\) | \(\mathrm{WUPS}_D\) | \(\mathrm{Acc}_C\) | \(\mathrm{Acc}_T\) | \(\mathrm{Acc}_D\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N = 1 | 86.7 | 90.0 | 59.7 | 4.05 | 17.54 | 19.38 | 51.84 | 39.89 | 42.12 | 55.73 |
| N = 2 | 86.0 | 91.9 | 59.2 | 3.59 | 16.90 | 19.64 | 51.92 | 40.05 | 41.38 | 58.04 |
| N = 3 (default) | 87.3 | 92.5 | 60.0 | 3.42 | 18.51 | 19.64 | 52.18 | 41.50 | 42.18 | 61.52 |
| N = 4 | 85.8 | 90.5 | 59.0 | 3.45 | 17.48 | 19.83 | 51.90 | 41.31 | 43.00 | 59.07 |
    Fig. 4. Performance on the TGIF-QA and NExT-QA datasets with different \(d\) values.

    4.5 Qualitative Evaluation

    Three challenging VideoQA examples are shown in Figure 5. To answer question (a), the model needs to extract multimodal interactions and search for clues at different temporal scales. To answer question (b), the model requires spatio-temporal relational reasoning at a specific scale, and to answer question (c), the model needs to capture spatial details within temporal relations. Our proposed HMRNet infers the correct answer in all three cases, demonstrating its superiority.
    Fig. 5. Qualitative comparison of different methods on three examples: (a) a multi-choice task from the TGIF-QA dataset; (b) a multi-choice task and (c) an open-ended generation task, both from the NExT-QA dataset.
    Figure 6 visualizes the reasoning behaviors of the relation-oriented interaction and the HSM in HMRNet. In Figure 6(a), the large connection weight of “adjust sunglasses” between the question and the video indicates that our inter-modal interaction extracts semantically relevant features across modalities. The large connection weights between “what does the woman do” and “adjust sunglasses,” as well as between “after” and “adjust sunglasses,” suggest that the proposed intra-modal interaction further refines the inter-modal interactions and yields more expressive multimodal representations. In Figure 6(b), the fact that the semantics at every scale contribute to reasoning indicates the synergistic effect of our HSM and shows that finer scales are relatively more important for the action counting task.
    Fig. 6. A visual analysis of HMRNet. (a) The behavior of inter- and intra-modal interactions on the largest temporal scale. For illustrative purposes, we use the attention matrices in Equation (7) to compute the average weight along the temporal dimension between each clip and each word group. The two highest average weights of each word group are shown in the figure, with colored dashed lines connecting them to their respective video clips. For intra-modal interaction, we use the matrices \(\mathbf {R}_X\) and \(\mathbf {R}_Q\) to compute the two highest connection weights (marked with black curves) on clips and word groups, respectively. (b) The value in each box represents the amount of information passed to the next hierarchical level and hence the importance of that semantic level; these values are obtained by averaging over the bottom-up and top-down memory-based interaction processes in Equation (18).
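    For readers who wish to reproduce the weight computation behind Figure 6(a), the sketch below illustrates the averaging-and-top-2 step; the tensor layout (one attention matrix per clip, shaped clips × frames × word groups) is an assumption for illustration and may differ from the actual shape of the matrices in Equation (7).

```python
import torch


def top2_clip_links(attn: torch.Tensor) -> torch.Tensor:
    """Pick, for each word group, the two clips with the highest average
    inter-modal attention weight.

    `attn` is assumed to have shape (num_clips, frames_per_clip, num_word_groups);
    this only illustrates the averaging-and-top-2 selection used for Figure 6(a).
    """
    # Average over the temporal (frame) dimension inside each clip.
    avg = attn.mean(dim=1)                  # (num_clips, num_word_groups)
    # For each word group, take the indices of the two most attended clips.
    top2 = avg.topk(k=2, dim=0).indices     # (2, num_word_groups)
    return top2.t()                         # (num_word_groups, 2)


# Example with 4 clips, 8 frames per clip, and 3 word groups.
links = top2_clip_links(torch.rand(4, 8, 3))
print(links.shape)  # torch.Size([3, 2])
```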

    5 Conclusion

    This article proposes a novel HMRNet for VideoQA. We focus on the characteristics of VideoQA tasks and follow a simple yet effective design to address existing problems. Specifically, we propose a relation-oriented interaction module that explores both inter- and intra-modal interactions and the relation between them, thereby achieving effective multimodal learning. We further propose an HSM unit that complements and fuses interactive semantics at different hierarchical levels through a bottom-up and top-down memory-based interaction scheme, enabling synergy-enhanced reasoning. Moreover, our HMRNet has fewer parameters and higher computational efficiency than other methods, and it remains robust and reusable under variations of the feature encoders. On eight VideoQA datasets, our method outperforms existing state-of-the-art methods. In future research, we will consider extracting multimodal interaction relations at a finer granularity, e.g., by adaptively learning object regions or keyframe segments in videos, to further improve answer reasoning.

