To reduce the effects of moment annotation biases, we further propose a novel debiasing approach. The overall framework is shown in Figure
8. Basically, we add three key components to the base model to obtain unbiased moment predictions. In this section, we first define the TSGV problem and illustrate how the base model works. Afterwards, each of the key components is described in detail, along with the final learning objectives.
4.2 Base Model
Due to the superior performance of 2D-TAN [
57] among recently published models, we adopt it as the base model for our unbiased temporal sentence grounding. The core idea of 2D-TAN is to use a 2D feature map to represent candidate moments of various lengths and locations, where one dimension indexes the start of a moment and the other indexes the end.
More specifically, as shown in Figure
8(a), for the sentence query, 2D-TAN first embeds each word of the query
S via GloVe [
34] to obtain the corresponding word vectors, and then the word vectors are fed into a three-layer LSTM [
21], where the last hidden state denoted as
\(\mathbf {q}^s \in \mathbb {R}^{d^h}\) is used to encode the whole query. For the video, 2D-TAN first segments it into non-overlapping clips and then samples a fixed number of clips. The features of the sampled
\(N^v\) video clips are extracted by a pre-trained CNN model and projected to dimension
\(d^v\), which can be denoted as
\(\lbrace \mathbf {c}_1,\mathbf {c}_2,\dots ,\mathbf {c}_{N^v}\rbrace\). The moment feature
\(\mathbf {m}_{ij}\) (
\(1 \le i \le j \le N^v\)) out of the 2D feature map
\(\mathbf {M} \in \mathbb {R}^{N^v \times N^v \times d^v}\) is obtained by applying max pooling over the clip features
\(\lbrace \mathbf {c}_i,\mathbf {c}_{i+1},\dots ,\mathbf {c}_{j}\rbrace\). Afterwards, the 2D feature map
\(\mathbf {M}\) is fused with the query feature
\(\mathbf {q}^s\) and fed into a temporal adjacent network to model the temporal relations among moments. The output then passes through a fully connected layer and a Sigmoid function to generate the final 2D matching-score map.
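For illustration, the construction of the 2D moment feature map can be sketched in a few lines of PyTorch. This is a minimal sketch rather than the implementation of [57]: the function name, the explicit double loop, and the running-max trick are our own simplifications.

import torch

def build_2d_moment_map(clip_feats: torch.Tensor) -> torch.Tensor:
    """Build the 2D moment feature map M from clip features.

    clip_feats: (N_v, d_v) features of the sampled video clips.
    Returns M:  (N_v, N_v, d_v), where M[i, j] is the max-pooled feature of
    clips i..j for i <= j; entries with i > j are left as zeros (invalid moments).
    """
    n_v, d_v = clip_feats.shape
    moment_map = torch.zeros(n_v, n_v, d_v, device=clip_feats.device)
    for i in range(n_v):
        pooled = clip_feats[i]
        moment_map[i, i] = pooled
        for j in range(i + 1, n_v):
            # running max pooling over clips i..j
            pooled = torch.maximum(pooled, clip_feats[j])
            moment_map[i, j] = pooled
    return moment_map

In practice 2D-TAN also sparsely samples the candidate moments instead of enumerating every (i, j) pair, which this sketch omits for clarity.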
However, the inherent structure of 2D-TAN makes it particularly prone to exploiting the location bias of datasets, since the 2D feature map \(\mathbf {M}\) is indexed by moment locations. Therefore, we propose to improve this base model from two aspects. On the one hand, due to the difficulty of semantic alignment between the two modalities, the representation capability of each single modality should be enhanced. On the other hand, we attempt to debias the model from the perspective of causality, as causality-based methods have proven successful for debiasing in other fields.
4.4 Multi-branch Deconfounder
Analysis on Multiple Confounders. Inspired by the work [
50], we leverage the structural causal model to analyze the underlying relations among all variables of the TSGV problem. The causal graph, a
directed acyclic graph (DAG), is shown in Figure
10(a), where the nodes denote the variables and the directed edges denote the relations between nodes.
Q is the query variable,
V denotes the video moment, and
Y is the predicted matching score. Traditional TSGV models learn the probability
\(P(Y|Q,V)\) conditioned on
Q and
V. However, there may exist a confounder
C that is connected with both the multimodal inputs (i.e.,
V and
Q) and the output score
Y. Such a confounder is harmful since it induces spurious correlations between the inputs and the output.
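To see why, consider the back-door contrast for a single confounder C: conditioning mixes the confounder in through its posterior, \(P(Y|Q,V) = \sum _{c} P(Y|Q,V,c)\,P(c|Q,V)\), whereas intervening replaces that posterior with the prior, \(P(Y|do(Q,V)) = \sum _{c} P(Y|Q,V,c)\,P(c)\); only the former allows C to bias the prediction through its correlation with the inputs.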
We further investigate the characteristics of the TSGV task and find that there may exist multiple confounders. Some of them are observable, e.g., the location variable
L [
50]: the location information is naturally encoded in the moment representations, and the moment-location distribution priors shown in Figure
4 alone can be used to make moment predictions. Moreover, the action variable
A could also be a confounder: activity concepts are implicitly contained in both the video moments and the queries, and the model could predict the matching score from the action label alone. For example, it can localize a short moment at the beginning of the video when seeing the action “open”, based on the action-conditioned moment annotation distribution shown in Figure
6. Besides, some confounders (denoted as
U) are not observable; such unobserved confounders should also be taken into consideration. Therefore, the do-calculus operation that intervenes on multiple confounders should be
\(P(Y|do(Q,V)) = \sum _{l}\sum _{a}\sum _{u} P(Y|Q,V,l,a,u)\,P(l)\,P(a)\,P(u).\)
Here, we assume that all the confounder variables are independent of each other.
Implementation of Base Model. After obtaining the 2D temporal moment feature
\(\mathbf {M}\) and gated fine-grained query feature
\(\mathbf {u}\), the probabilities
\(P(Y|Q,V)\) without do-calculus can be learned by
\(P(Y|Q,V) = \sigma \big ({\bf W}^{T}\Phi _{conv}(\mathbf {M} \odot \mathbf {u})\big ).\)
Here, the moment features are fused with the broadcast query feature via the Hadamard product \(\odot\). The resulting multimodal representations are fed into the temporal convolutional network
\(\Phi _{conv}\), followed by a fully connected layer with learnable matrix
\({\bf W}^T\) and the Sigmoid function
\(\sigma (\cdot)\) to get the final 2D temporal matching scores.
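As a concrete reference, this scoring head can be sketched as follows in PyTorch; the module name, the number of convolution layers, and the kernel size are illustrative assumptions rather than the exact configuration.

import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Fuses the 2D moment map with the query feature and predicts a 2D map
    of matching scores, mirroring the description above."""

    def __init__(self, d_v: int, num_conv_layers: int = 2):
        super().__init__()
        # Phi_conv: stacked 2D convolutions over the (start, end) map.
        self.phi_conv = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(d_v, d_v, kernel_size=3, padding=1), nn.ReLU())
            for _ in range(num_conv_layers)
        ])
        self.fc = nn.Linear(d_v, 1)  # learnable matrix W

    def forward(self, moment_map: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
        # moment_map: (N_v, N_v, d_v); query_feat: (d_v,) gated query feature u
        fused = moment_map * query_feat                     # Hadamard product (broadcast)
        x = fused.permute(2, 0, 1).unsqueeze(0)             # (1, d_v, N_v, N_v)
        x = self.phi_conv(x).squeeze(0).permute(1, 2, 0)    # back to (N_v, N_v, d_v)
        return torch.sigmoid(self.fc(x)).squeeze(-1)        # (N_v, N_v) matching scores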
Implementation of Multi-branch Deconfounder. As shown in Figure
10(b), we intervene on the three confounders
L,
A and
U through a multi-branch deconfounder. Each confounder is represented by a dictionary of enumerable elements, and the intervention is implemented by adding, for each query-moment pair, a weighted embedding over all elements in the dictionary. More concretely, we fill the dictionary of location
L with the same 2D position encodings used in the position reconstruction module (Section
4.3.2), and we initialize the dictionary of action
A with the word embeddings of a limited set of top-frequency action labels. The unobserved confounder
U is represented by a learnable dictionary of embeddings with a fixed size. To intervene on all confounders at the same time, the weighted representations of the multiple confounders are subsequently fused by element-wise multiplication to achieve multi-branch de-confounding.
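A minimal sketch of how the three dictionaries could be instantiated is given below; the sinusoidal 2D position encoding, the dictionary size n_u, and the helper name are placeholders (the actual location encodings follow the position reconstruction module of Section 4.3.2, and the action embeddings come from the top-frequency action labels).

import torch
import torch.nn as nn

def build_confounder_dictionaries(n_v: int, d_v: int, action_embeds: torch.Tensor, n_u: int = 64):
    """Instantiate dictionaries for the location (L), action (A), and unobserved (U) confounders.

    action_embeds: (N_a, d_v) word embeddings of the top-frequency action labels.
    n_u:           assumed size of the learnable dictionary for U.
    """
    # L: one entry per valid moment (start i, end j), here a simple sinusoidal
    # 2D position encoding as a stand-in for the paper's encoding (assumes d_v % 4 == 0).
    positions = torch.tensor([[i / n_v, j / n_v]
                              for i in range(n_v) for j in range(i, n_v)])        # (N_l, 2)
    freqs = torch.arange(d_v // 4).float()
    angles = positions.unsqueeze(-1) / (10000 ** (freqs / (d_v // 4)))             # (N_l, 2, d_v/4)
    dict_l = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten(1)  # (N_l, d_v)

    dict_a = action_embeds                               # A: fixed action-label embeddings
    dict_u = nn.Parameter(torch.randn(n_u, d_v) * 0.02)  # U: learnable embeddings
    return dict_l, dict_a, dict_u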
\(P(Y|do(Q,V))\) is then approximated by integrating all the weighted 2D embeddings
\(\mathbf {M}_k \in \mathbb {R}^{N^v \times N^v \times d^v}, k \in \lbrace l,a,u\rbrace\), via element-wise multiplication and adding the integrated embedding to
\({\bf M}\) before computing the matching scores as above (c.f., Figure
10(b)). Each
\(\mathbf {M}_k\) with
\(k \in \lbrace l, a, u\rbrace\) denotes the effect of the corresponding confounder in
\(\lbrace L, A, U\rbrace\) and is computed as the weighted average over all elements within its dictionary, i.e.,
\(\mathbb {E}_k[h_{qv}(k)]\). This expectation can be computed with the multi-head attention module [
43] whose query is the fusion of
\({\bf M}\) and
\({\bf q}^u\). In other words, the attention weight of each element within the dictionary is determined by each query-moment pair. Specifically,
\(\mathbf {M}_k\) can be defined as
\(\mathbf {M}_k = \big [\mathrm {head}_1; \dots ; \mathrm {head}_H\big ], \; \mathrm {head}_i = \mathrm {softmax}\big (\mathbf {Q}_i {\bf K}_i^{T} / \sqrt{d^H}\big )\,{\bf V}_i,\)
where
H is the head number and
\(d^H=\tfrac{d^v}{H}\) is the dimension of each subspace.
\({\bf D} \in \mathbb {R}^{N^k \times d^v}\) represents the dictionary containing
\(N^k\) elements, and
\({\bf D}_k = {\bf D} {\bf W}_1\),
\({\bf D}_v = {\bf D} {\bf W}_2\) with learnable parameters
\({\bf W}_1\),
\({\bf W}_2 \in \mathbb {R}^{d^v \times d^v}\). The query for multi-head attention is
\({\bf Q} = \textrm {FC}_q(\textrm {FC}_u({\bf q}^u) + \textrm {FC}_m({\bf M}))\), where
\(\textrm {FC}_q\),
\(\textrm {FC}_u\),
\(\textrm {FC}_m\) are all fully connected layers with learnable parameters
\(\in \mathbb {R}^{d^v \times d^v}\). Note that
\({\bf Q}\) is flattened to
\(\mathbb {R}^{L^q \times d^v}\) for subsequent computation, where
\(L^q = N^v \times N^v\). Then
\({\bf D}_k\) is equally divided into
H parts
\(\lbrace {\bf K}_i\rbrace _1^H \in \mathbb {R}^{N^k \times d^H}\) along the feature dimension, and so are
\({\bf D}_v\) and
\({\bf Q}\), yielding the per-head values \(\lbrace {\bf V}_i\rbrace _1^H\) and queries \(\lbrace {\bf Q}_i\rbrace _1^H\), respectively.
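Putting the pieces together, one deconfounder branch and the final fusion can be sketched as below. The class name is ours, the head splitting follows the description above, and the way the integrated embedding re-enters the scoring head (addition to \({\bf M}\) before fusion with the query) is our reading of Figure 10(b).

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeconfounderBranch(nn.Module):
    """One branch: every query-moment pair attends over a confounder dictionary D
    to produce the weighted embedding M_k, i.e., E_k[h_qv(k)]."""

    def __init__(self, d_v: int, num_heads: int):
        super().__init__()
        assert d_v % num_heads == 0
        self.h, self.d_h = num_heads, d_v // num_heads
        self.w1 = nn.Linear(d_v, d_v, bias=False)   # D_k = D W_1 (keys)
        self.w2 = nn.Linear(d_v, d_v, bias=False)   # D_v = D W_2 (values)
        self.fc_u = nn.Linear(d_v, d_v)             # FC_u on the query feature
        self.fc_m = nn.Linear(d_v, d_v)             # FC_m on the moment map
        self.fc_q = nn.Linear(d_v, d_v)             # FC_q on their sum

    def forward(self, moment_map, q_u, dictionary):
        # moment_map: (N_v, N_v, d_v), q_u: (d_v,), dictionary: (N_k, d_v)
        n_v, _, d_v = moment_map.shape
        q = self.fc_q(self.fc_u(q_u) + self.fc_m(moment_map)).reshape(-1, d_v)  # (L_q, d_v)
        k, v = self.w1(dictionary), self.w2(dictionary)                          # (N_k, d_v)
        # Split queries, keys, and values into H heads along the feature dimension.
        q = q.view(-1, self.h, self.d_h).transpose(0, 1)    # (H, L_q, d_h)
        k = k.view(-1, self.h, self.d_h).transpose(0, 1)    # (H, N_k, d_h)
        v = v.view(-1, self.h, self.d_h).transpose(0, 1)    # (H, N_k, d_h)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_h ** 0.5, dim=-1)
        return (attn @ v).transpose(0, 1).reshape(n_v, n_v, d_v)                # M_k

# Assumed final fusion, following Figure 10(b) and the description above:
#   m_deconf = moment_map + m_l * m_a * m_u   # element-wise integration, added to M
#   scores   = matching_head(m_deconf, q_u)   # same scoring head as the base model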