For the i-th observed video clip, \((x)_i = \lbrace (x_1)_i, (x_2)_i, \ldots, (x_K)_i \rbrace\) denotes the sequence of K observed video frames, where frames are sampled at a uniform interval of \(\epsilon\) seconds. For brevity, we omit the instance index i unless otherwise specified. By applying the visual feature extractors \(\phi_v\) and \(\phi_n\) to \(\lbrace x_1, x_2, \ldots, x_K \rbrace\), we obtain the sequence of verb-aware visual embeddings \(\lbrace f_v^1, f_v^2, \ldots, f_v^K \rbrace\) and the sequence of noun-aware visual embeddings \(\lbrace f_n^1, f_n^2, \ldots, f_n^K \rbrace\), respectively. Similarly, applying the semantic feature extractors \(\psi_v\) and \(\psi_n\) to \(\lbrace x_1, x_2, \ldots, x_K \rbrace\) yields the sequences of verb-aware semantic embeddings \(\lbrace w_v^1, w_v^2, \ldots, w_v^K \rbrace\) and noun-aware semantic embeddings \(\lbrace w_n^1, w_n^2, \ldots, w_n^K \rbrace\).
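To make the notation concrete, the following minimal NumPy sketch treats the four extractors \(\phi_v\), \(\phi_n\), \(\psi_v\), and \(\psi_n\) as stand-in random linear maps over flattened frames; the frame size, embedding dimension, and extractor internals are illustrative assumptions rather than the networks actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, D = 64, 64, 3, 256          # frame size and embedding dim (assumed)
K = 8                                 # number of observed frames

def make_extractor():
    # Stand-in extractor: a random linear map from a flattened frame to R^D.
    Wm = rng.standard_normal((H * W * C, D)) * 0.01
    return lambda frame: frame.reshape(-1) @ Wm

# Hypothetical stand-ins for phi_v, phi_n, psi_v, psi_n.
phi_v, phi_n, psi_v, psi_n = (make_extractor() for _ in range(4))

# Observed clip: K frames sampled at a fixed interval of eps seconds.
frames = [rng.standard_normal((H, W, C)) for _ in range(K)]

# Per-frame verb-/noun-aware visual and semantic embedding sequences.
f_v = np.stack([phi_v(x) for x in frames])   # (K, D) verb-aware visual
f_n = np.stack([phi_n(x) for x in frames])   # (K, D) noun-aware visual
w_v = np.stack([psi_v(x) for x in frames])   # (K, D) verb-aware semantic
w_n = np.stack([psi_n(x) for x in frames])   # (K, D) noun-aware semantic
```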
Both the visual and semantic embedding sequences serve as inputs to encoder-decoder networks, which summarize the observed content and predict latent future representations; linear classifiers then produce the verb-aware and noun-aware prediction logits of the visual and semantic modalities, denoted as \(l_v^{vis}\), \(l_n^{vis}\), \(l_v^{sem}\), and \(l_n^{sem}\). We employ a knowledge-driven strategy to compose the verb-aware and noun-aware predictions into action predictions for both modalities as follows:
\[
l_a^{vis} = (M_{va})^{\top} l_v^{vis} + (M_{na})^{\top} l_n^{vis}, \qquad
l_a^{sem} = (M_{va})^{\top} l_v^{sem} + (M_{na})^{\top} l_n^{sem},
\]
where \(l_a^{vis}\) and \(l_a^{sem}\) denote the action prediction logits of the visual modality and the semantic modality, respectively, and both \(M_{va}\) and \(M_{na}\) are modality-agnostic prior knowledge matrices. Specifically, let \(M_{va}(\alpha, \gamma)\) denote the element in the \(\alpha\)-th row and \(\gamma\)-th column of \(M_{va}\); it represents the prior probability of predicting the \(\gamma\)-th action class given the \(\alpha\)-th verb class, and \(M_{na}(\beta, \gamma)\) carries the analogous meaning for the \(\beta\)-th noun class. These matrices can be computed as follows:
\[
M_{va}(\alpha, \gamma) = \frac{\sum_i \big[(y^v)_i = \alpha\big]\big[(y^a)_i = \gamma\big]}{\sum_i \big[(y^v)_i = \alpha\big]}, \qquad
M_{na}(\beta, \gamma) = \frac{\sum_i \big[(y^n)_i = \beta\big]\big[(y^a)_i = \gamma\big]}{\sum_i \big[(y^n)_i = \beta\big]},
\]
where \([\cdot]\) denotes the Iverson bracket [23], and \((y^a)_i\), \((y^v)_i\), and \((y^n)_i\) represent the action, verb, and noun category of the i-th instance, respectively.
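A minimal sketch of how \(M_{va}\) and \(M_{na}\) could be estimated from training labels, and of the composition reconstructed above, is given below; the label arrays, class counts, and the matrix-product form of the composition are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, A, V, Nn = 1000, 30, 10, 12        # instances and class counts (assumed)
y_a = rng.integers(0, A, N)           # hypothetical action labels (y^a)_i
y_v = rng.integers(0, V, N)           # hypothetical verb labels   (y^v)_i
y_n = rng.integers(0, Nn, N)          # hypothetical noun labels   (y^n)_i

# M_va[alpha, gamma] = P(action = gamma | verb = alpha): count verb-action
# co-occurrences (the Iverson-bracket sums) and normalize each row.
M_va = np.zeros((V, A))
np.add.at(M_va, (y_v, y_a), 1.0)                         # sum_i [y^v=a][y^a=g]
M_va /= np.maximum(M_va.sum(axis=1, keepdims=True), 1)   # / sum_i [y^v=a]

M_na = np.zeros((Nn, A))
np.add.at(M_na, (y_n, y_a), 1.0)
M_na /= np.maximum(M_na.sum(axis=1, keepdims=True), 1)   # guards empty rows

# Compose verb- and noun-aware logits into action logits for one modality.
l_v = rng.standard_normal(V)          # verb-aware prediction logits
l_n = rng.standard_normal(Nn)         # noun-aware prediction logits
l_a = M_va.T @ l_v + M_na.T @ l_n     # action prediction logits, shape (A,)
```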
In the factual scenario, the visual and semantic modalities are fused to produce multi-modal predictions with an attention mechanism. Specifically, the visual prediction logit vector \(l_a^{vis}\) and the semantic prediction logit vector \(l_a^{sem}\) are concatenated and passed through a Multi-Layer Perceptron (MLP) to obtain modality-specific attention scores:
\[
\big[s_a^{vis},\, s_a^{sem}\big] = \mathrm{MLP}\big(\big[l_a^{vis};\, l_a^{sem}\big]\big),
\]
where \(s_a^{vis}\) and \(s_a^{sem}\) are scalars that represent the attention scores of the visual modality and the semantic modality, respectively. The fusion weights are obtained by further normalizing the attention scores:
\[
\big[\mu_a^{vis},\, \mu_a^{sem}\big] = \operatorname{softmax}\big(\big[s_a^{vis},\, s_a^{sem}\big]\big),
\]
where \(\mu_a^{vis}\) and \(\mu_a^{sem}\) are the modality-specific fusion weights. Thus, the factual prediction logit vector \(l_a^{f}\) is obtained as follows:
\[
l_a^{f} = \mu_a^{vis}\, l_a^{vis} + \mu_a^{sem}\, l_a^{sem}.
\]
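The fusion step admits a compact sketch, assuming a one-hidden-layer ReLU MLP with untrained random weights and a softmax normalization; all dimensions and the MLP internals are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A, H = 30, 64                          # action classes, MLP hidden size
l_a_vis = rng.standard_normal(A)       # visual action logits  l_a^vis
l_a_sem = rng.standard_normal(A)       # semantic action logits l_a^sem

# MLP over the concatenated logit vectors yields two scalar attention scores.
W1 = rng.standard_normal((2 * A, H)) * 0.1
W2 = rng.standard_normal((H, 2)) * 0.1
h = np.maximum(np.concatenate([l_a_vis, l_a_sem]) @ W1, 0.0)  # ReLU layer
scores = h @ W2                        # [s_a^vis, s_a^sem]

# Normalize the scores into fusion weights (softmax) and fuse the logits.
mu = np.exp(scores - scores.max())
mu /= mu.sum()                         # [mu_a^vis, mu_a^sem]
mu_vis, mu_sem = mu
l_a_f = mu_vis * l_a_vis + mu_sem * l_a_sem   # factual prediction logits
```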