research-article

Open access

Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing

Authors:

Tianyu Zhang,

Weiqing Min,

Tao Liu,

Shuqiang Jiang,

Yong RuiAuthors Info & Claims

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 5

Article No.: 122, Pages 1 - 21

https://doi.org/10.1145/3633333

Published: 11 January 2024 Publication History

PDF eReader

Abstract

Predicting the unknown from the first-person perspective is expected as a necessary step toward machine intelligence, which is essential for practical applications including autonomous driving and robotics. As a human-level task, egocentric action anticipation aims at predicting an unknown action seconds before it is performed from the first-person viewpoint. Egocentric actions are usually provided as verb-noun pairs; however, predicting the unknown action may be trapped in insufficient training data for all possible combinations. Therefore, it is crucial for intelligent systems to use limited known verb-noun pairs to predict new combinations of actions that have never appeared, which is known as compositional generalization. In this article, we are the first to explore the egocentric compositional action anticipation problem, which is more in line with real-world settings but neglected by existing studies. Whereas prediction results are prone to suffer from semantic bias considering the distinct difference between training and test distributions, we further introduce a general and flexible adaptive semantic debiasing framework that is compatible with different deep neural networks. To capture and mitigate semantic bias, we can imagine one counterfactual situation where no visual representations have been observed and only semantic patterns of observation are used to predict the next action. Instead of the traditional counterfactual analysis scheme that reduces semantic bias in a mindless way, we devise a novel counterfactual analysis scheme to adaptively amplify or penalize the effect of semantic experience by considering the discrepancy both among categories and among examples. We also demonstrate that the traditional counterfactual analysis scheme is a special case of the devised adaptive counterfactual analysis scheme. We conduct experiments on three large-scale egocentric video datasets. Experimental results verify the superiority and effectiveness of our proposed solution.

1 Introduction

In recent years, wearable cameras have been widely used to collect egocentric (first-person) vision data from a human perspective [1]. This unique viewpoint enables the multimedia content analysis of egocentric videos, such as predicting the future actions of camera wearers. There are many real-world scenarios. For example, assistive robots need to anticipate human motion to promptly provide timely assistance. In addition, they need to identify dangerous patterns of executed actions in everyday routine and send a warning signal to prevent the camera wearer from performing unsafe activities [53]. This gives birth to the study of egocentric action anticipation, which aims at predicting an unobserved but upcoming action before it occurs [6].

Despite its promising practical value, egocentric action anticipation is quite challenging in nature. Due to the temporal misalignment between observed content and target action, action recognition models [58, 68] have been proved inapplicable [6], which implies that egocentric action anticipation cannot be simply treated as a classification task. Indeed, it is a high-level reasoning task that goes beyond classifying observed patterns into a single action category [14, 73]. Moreover, egocentric actions are usually provided as verb-noun pairs [5, 6, 27], such as “take knife” and “cut carrot,” which indicates that predicting the unknown action may be restricted by insufficient training data for all possible combinations. However, intelligent systems are expected to use limited known verb-noun pairs to infer new combinations of actions that have never appeared in real-world settings. In fact, as humans, we can learn from a limited set of known components and seamlessly generate new compositions even before experiencing them. As illustrated in Figure 1, we can understand and produce new composition “cut onion” even if we have never observed it before based on the familiarity with the movement “cut” and the object “onion.” This ability is known as compositional generalization, which is a hallmark of human intelligence [24, 25]. This is also considered as a long-term goal of machine intelligence. For example, the robot needs to understand new instructions based on finite colloquial communication. However, existing studies on egocentric action anticipation neglect the compositional setting, which restricts the generalization ability of intelligent systems in the out-of-distribution scenario.

Fig. 1.

In our work, to the best of our knowledge, we are the first to formalize and tackle the problem of egocentric compositional action anticipation. More formally, we denote the action categories in the training set as \(\mathbb {A}_{tr}\) . Current works address the egocentric action anticipation problem where the set of action categories at the test time \(\mathbb {A}_{te}\) is always a subset of \(\mathbb {A}_{tr}\) . We extend the problem to a higher level where no intersection exists between \(\mathbb {A}_{tr}\) and \(\mathbb {A}_{te}\) . Hence, the system should be able to anticipate actions of new combinations by learning from seen combinations. Note that combinations of verbs and nouns are disjoint between training and test data in the compositional setting, and this non-overlapping splitting leads to the distinct difference between training and test distributions. Therefore, it is crucial for models to produce unbiased predictions for egocentric compositional action anticipation. Unfortunately, the shortcut from the semantic pattern of past observation to the future prediction could mislead models [76]. As shown in Figure 2, for observed content in the two cases, their semantic patterns are quite close despite the variation in visual representations. In the second case, models still tend to make predictions as in the first case by memorizing the spurious correlation from semantic experience (“cut onion” and “peel onion” are commonly successive actions) while ignoring the visual representation (the light is so dim that “turn on light” should be executed). Since relying on the spurious correlation in semantic modality could produce biased predictions, it is necessary to mitigate the overdependence of semantic bias.

Fig. 2.

To this end, we propose an Adaptive Semantic Debiasing (ASD) framework for egocentric compositional action anticipation, leveraging the tool of counterfactual analysis. Counterfactual thinking gifts humans with the imagination ability to reason the outcome of an alternative operation that could have been performed [45, 46]. Specifically, we can imagine one counterfactual situation where no visual representations have been observed and only semantic patterns of the observed content are used to predict the next action. This counterfactual prediction captures the effect of semantic experience. To alleviate the influence of spurious correlations in semantic modality, counterfactual analysis is conducted to obtain the final prediction. The traditional counterfactual analysis scheme simply subtracts the counterfactual prediction from original factual prediction as the final prediction [3, 61, 69, 76]. We believe this is suboptimal because the semantic bias contains both positive and negative parts. In many cases we observe “cut onion,” and the semantic bias can have a positive impact on anticipation when the future action is “peel onion.” However, it could also mislead the predicted result when the future action can be “turn on light” or “throw onion” in other cases. Therefore, instead of reducing semantic bias in a mindless way, we need to amplify or penalize the effect of semantic experience according to different cases. We devise a novel adaptive counterfactual analysis scheme to adaptively recalibrate the semantic bias, which considers the discrepancy both among categories and among examples. We also demonstrate that the traditional counterfactual analysis scheme is a special case of the proposed adaptive counterfactual analysis scheme. The details are introduced in Section 3.

Our method is inspired by cognitive science in which the human brain’s thinking process consists of System 1 and System 2 [21]. The former is an implicit, unconscious, and intuitive process, which matches the data-driven learning process of deep neural networks. The latter is conducted based on the former and is responsible for explicit, conscious, and controllable reasoning. We build our framework on the prevailing encoder-decoder structure in action anticipation [16, 19], which is compatible with various deep neural networks such as LSTM [15] and Transformer [66] at the level of System 1. To overcome the limitations that the data-driven learning process is prone to memorize the spurious correlation in semantic modality, we further devise a novel adaptive counterfactual analysis scheme to help deep neural networks think out of the box at the level of System 2. Our insight is that human-level intelligence needs not only to deduce what action will happen after observation (System 1) but also to know why this action is going to happen rather than others via exploiting the effect of semantic experience in a delicate manner (System 2). We expect our work will encourage more research on developing deep learning methods inspired by cognitive studies.

The main contributions of our work are as follows. First, to our best knowledge, we are the first to explore egocentric action anticipation in the compositional generalization scenario, which is more in line with real-world settings. Second, we propose a general and flexible ASD framework to address egocentric compositional action anticipation, which is compatible with different neural networks and exploits the effect of semantic experience in a delicate manner. Third, we validate our method on three widely used egocentric datasets (EGTEA Gaze+ [27], EPIC-Kitchens-55 [6], and EPIC-Kitchens-100 [5]) by creating new splits for the compositional setting. Experimental results demonstrate the superiority and effectiveness of our proposed ASD. In addition, it achieves competitive performance in conventional egocentric action anticipation.

2 Related Work

In this section, we introduce the relevant research areas in the following three aspects: (1) egocentric action anticipation, (2) compositional generalization, and (3) counterfactual analysis.

2.1 Egocentric Action Anticipation

With the rapid development of wearable devices, egocentric videos offer a natural perspective of daily activities and have raised a range of challenging research topics in recent years [1, 53], including detecting gaze [26, 27], estimating hand-object interaction regions [8, 29, 30], identifying the camera wearer [64], learning scene affordance [38], and video captioning [39]. Compared with videos captured in the third-person perspective, egocentric videos tend to be more challenging because many actions may not be directly observable due to the limited field of view and severe occlusions caused by hands. Studies on egocentric activity analysis have also emerged recently, ranging from action recognition to early action prediction to action anticipation [2, 10, 11, 12, 18, 31, 43, 47, 57, 72, 75, 77]. Different from egocentric action recognition [17, 56, 59, 70] which focuses only on classifying the action from the observed video clip, egocentric action anticipation cannot be simply treated as a classification task because it requires understanding what has happened and thus predicting the upcoming action. Dedicated to the future action prediction, it is also more challenging than egocentric early action prediction [79] which focuses on predicting the category of an ongoing action based on partial observations.

With the advancement of deep neural networks, various approaches have been proposed for this task where encoder-decoder structure becomes the commonly used paradigm. Furnari and Farinella [10, 11] propose a classic encoder-decoder architecture consisting of two cascaded LSTMs. The LSTM encoder summarizes the observed video sequence, whereas the LSTM decoder makes predictions about the future based on the hidden vectors learned in observed content. Camporese et al. [2] further introduce the label smoothing technique to inject useful semantic information into the model to alleviate over-confident predictions. Zhang et al. [77] mitigate the visual gap problem by exploiting both visual features and sequential text instructions of observed content. Osman et al. [43] propose a multi-scale approach to fuse information extracted at different time scales. Considering that recurrent networks are prone to accumulate prediction errors, subsequent works [47, 72] integrate contrastive learning to recalibrate predicted future representations. Recently, Liu and Lam [31] devised a memory-augmented strategy to regulate the process of recurrent representation forecasting. Despite the dominance, recurrent networks are limited in modeling long-term temporal dependencies owing to the non-parallel nature. Girdhar and Grauman [14] introduce an attention-based architecture by processing the observed video sequence in parallel with long-range attention. Roy and Fernando [55] propose a multi-modal Transformer model that combines human-object, spatio-temporal, and motion representations to anticipate future actions, whereas audio features are also considered in one recent work [80]. Researchers’ default egocentric action anticipation has been evaluated in an offline fashion; however, several works explore the problem in streaming [12] or untrimmed [54] scenarios.

Our work differs significantly from the preceding works. We emphasize the importance of compositional generalization in egocentric action anticipation by explicitly defining the egocentric compositional action anticipation problem. To alleviate the influence of spurious correlations in semantic modality under such an out-of-distribution scenario, we further devise a flexible and general ASD framework, which is compatible with different deep neural networks including both LSTM and Transformer.

2.2 Compositional Generalization

Compositional generalization is a crucial ability of human cognition and significant embodiment of machine intelligence [24, 25]. It requires the system to use a limited set of known components to understand and produce new combinations that have never appeared before. In the field of video understanding, compositional action recognition has received increasing attention. Zhang et al. [78] propose a noun plus verb decomposition for egocentric actions along with a variant of fusion methods. Materzynska et al. [36] make the combination of objects and actions disjoint between training and test which requires models to recognize an action performed with unseen objects. Núñez-Marcos et al. [42] leverage external knowledge as action priors to improve the performance of compositional action recognition in egocentric videos. Luo et al. [33] extract disentangled features of verbs and nouns, and leverage object affordance priors from knowledge bases to obtain composed actions. Although there is no standard definition of compositionality of human action recognition, recent works [34, 49, 60, 74] focus on the compositional setting introduced in the work of Materzynska et al. [36] to enhance the generalization capability of action recognition models. Different from these works, we focus on inferring new combinations of known verbs and nouns in the context of egocentric action anticipation, which to our knowledge has not been studied before.

2.3 Counterfactual Analysis

Counterfactual thinking is a concept derived from psychology, which describes the human capacity to reason the outcome of an alternative operation that could have been performed [45, 46]. It has been widely studied in economics, politics, and epidemiology for years [4, 22, 52] as the tool to study the effect of certain policies or treatments. In recent years, counterfactual analysis has inspired several studies in computer vision [3, 50, 62], natural language understanding [48, 65, 71], robotics [69], and multimedia [32, 40, 41, 44, 61, 63]. These works are committed to removing spurious bias and improving the generalization capacity in domain-specific applications. For example, Niu et al. [40] conceive a novel counterfactual inference framework to capture and mitigate language bias in visual question answering. Chen et al. [3] alleviate the overdependence of environment bias and highlight the trajectory clues for human trajectory prediction. Sun et al. [60] remove the spurious correlation between action and instance appearance for compositional action recognition. Qian et al. [48] mitigate label bias and keyword bias for text classification. Wang et al. [69] weaken the interference of background features for the 6D pose estimation task in space. More related to our work is a prior work [76] where the semantic bias is formulated as the effect of semantic patterns of observed content on anticipated actions. Differently, we focus on egocentric compositional action anticipation with a novel ASD framework. Instead of reducing the influence of semantic bias in a mindless way, our method can adaptively amplify or penalize the effect of semantic experience by considering the discrepancy among categories and among examples.

3 Method

In this section, we first introduce the formal description of egocentric compositional action anticipation. Then we detail the proposed ASD framework, including multi-modal based prediction (factual), semantic bias capture (counterfactual), and semantic bias recalibration.

3.1 Problem Formulation

Given an action starting at time T, the goal of action anticipation is to predict the action by observing a video clip \(T_a\) seconds before it [6]. \(T_a\) is defined as the anticipation time, which denotes how many seconds in advance an action is predicted. The observed video clip is denoted as \(x = [T-T_a-T_o, T-T_a]\) , which starts at time \(T-T_a-T_o\) and ends at time \(T-T_a\) . \(T_o\) is defined as the observation time. We denote the set of action categories as \(\mathbb {A} = \mathbb {A}_{tr} \cup \mathbb {A}_{te}\) consisting of of training set \(\mathbb {A}_{tr}\) and test set \(\mathbb {A}_{te}\) . The goal of the model is to classify the target action into its category \(y^a \in \mathbb {A}\) based on x. As an egocentric action is represented as the combination of one verb and one noun, we also denote the set of verbs and nouns as \(\mathbb {V} = \mathbb {V}_{tr} \cup \mathbb {V}_{te}\) and \(\mathbb {N} = \mathbb {N}_{tr} \cup \mathbb {N}_{te}\) for the similar meaning. Even though \(\mathbb {A}_{te}\) is always a subset of \(\mathbb {A}_{tr}\) in the ordinary setting, we make the combinations of verbs and nouns disjoint under the egocentric compositional action anticipation scenario, which can be formalized as

\begin{equation} \left\lbrace \!\!\begin{array}{ll} \mathbb {A}_{te} \cap \mathbb {A}_{tr} = \varnothing \\ \mathbb {V}_{te} \subseteq \mathbb {V}_{tr}, \:\mathbb {N}_{te} \subseteq \mathbb {N}_{tr} \end{array}\right.\!\!\!. \end{equation}

(1)

3.2 The ASD Framework

3.2.1 Framework Overview.

The proposed ASD framework is illustrated in Figure 3. Given the observed content, both visual embeddings and semantic embeddings serve as inputs to the encoder-decoder networks to produce factual predictions. In the counterfactual scenario, the semantic bias is purely captured by anticipating future actions relying only on abstract semantic patterns of the observed content. To exploit the effect of semantic experience in a delicate manner, the semantic bias is further recalibrated to integrate with factual outcomes to generate the final prediction results. It is also worth mentioning that a knowledge-driven strategy is leveraged to compose verb-aware predictions and noun-aware predictions for both visual and semantic modalities.

Fig. 3.

3.2.2 Multi-Modal-Based Prediction.

For the i-th observed video clip, \((x)_i= \lbrace (x_1)_i, (x_2)_i,\ldots , (x_K)_i \rbrace\) denotes the sequence of K observed video frames where we sample frames with a unified timestep of \(\epsilon\) seconds. For brevity, we omit the notation i that indicates a single instance unless otherwise specified. By applying visual feature extractors \(\phi _v\) and \(\phi _n\) to \(\lbrace x_1, x_2,\ldots , x_K \rbrace\) , we obtain the sequence of verb-aware visual embeddings \(\lbrace f_v^1, f_v^2,\ldots , f_v^K \rbrace\) and noun-aware visual embeddings \(\lbrace f_n^1, f_n^2,\ldots , f_n^K \rbrace ,\) respectively. Similarly, we can also obtain the series of verb-aware semantic embeddings \(\lbrace w_v^1, w_v^2,\ldots , w_v^K \rbrace\) or noun-aware semantic embeddings \(\lbrace w_n^1, w_n^2,\ldots , w_n^K \rbrace\) by applying semantic feature extractor \(\psi _v\) or \(\psi _n\) to \(\lbrace x_1, x_2,\ldots , x_K \rbrace\) . Both visual and semantic embedding vectors serve as inputs to the encoder-decoder networks to summarize the observed content and predict latent future representations, followed by linear classifiers to obtain verb-aware and noun-aware prediction logits of visual and semantic modalities (denoted as \(l_v^{vis}\) , \(l_n^{vis}\) , \(l_v^{sem}\) , and \(l_n^{sem}\) ). We employ a knowledge-driven strategy to compose verb-aware predictions and noun-aware predictions for both visual and semantic modalities as follows:

\begin{equation} \begin{aligned}l_a^{vis} = (M_{va})^T l_v^{vis} + (M_{na})^T l_n^{vis} , \quad l_a^{sem} = (M_{va})^T l_v^{sem} + (M_{na})^T l_n^{sem}, \end{aligned} \end{equation}

(2)

where \(l_a^{vis}\) and \(l_a^{sem}\) denote action prediction logits for visual modality and semantic modality, respectively. Both \(M_{va}\) and \(M_{na}\) represent the modality-agnostic prior knowledge matrices. Specifically, supposing \(M_{va}(\alpha ,\gamma)\) denotes the \(\alpha\) -th row and \(\gamma\) -th column element of \(M_{va}\) , it represents the prior probability of predicting the \(\gamma\) -th action class given the \(\alpha\) -th verb class, and the similar meaning for \(M_{na}(\beta ,\gamma)\) , which can be computed as follows:

\begin{equation} \begin{aligned}M_{va}(\alpha ,\gamma) = \frac{\sum _i[(y^a)_i=\gamma ]}{\sum _i [(y^v)_i=\alpha ]}, \quad M_{va}(\beta ,\gamma) = \frac{\sum _i[(y^a)_i=\gamma ]}{\sum _i [(y^n)_i=\beta ]}, \end{aligned} \end{equation}

(3)

where \([\cdot ]\) denotes the Iverson bracket [23], and \((y^a)_i\) , \((y^v)_i,\) and \((y^n)_i\) represent the category of action, verb, and noun for the i-th instance. In the factual scenario, both visual modality and semantic modality are fused to produce multi-modal predictions with an attention mechanism. Specifically, visual prediction logit vector \(l_a^{vis}\) and semantic prediction logit vector \(l_a^{sem}\) are concatenated and run through a Multi-Layer Perceptron (MLP) to get modality-specific weights:

\begin{equation} \begin{aligned}\left\langle s_a^{vis}, s_a^{sem} \right\rangle =\operatorname{MLP} \left(\operatorname{concat} \left\langle l_a^{vis}, l_a^{sem} \right\rangle \right)\!, \end{aligned} \end{equation}

(4)

where \(s_a^{vis}\) and \(s_a^{sem}\) are scalars that represent the attention scores for visual modality and semantic modality, respectively. The fusion weights are obtained by further normalizing the attention scores:

\begin{equation} \begin{aligned}\mu _a^{vis}=\frac{\operatorname{exp} (s_a^{vis})}{\operatorname{exp}(s_a^{vis})+\operatorname{exp}(s_a^{sem})}, \quad \mu ^{sem}=\frac{\operatorname{exp}(s_a^{sem})}{\operatorname{exp}(s_a^{vis})+\operatorname{exp}(s_a^{sem})}, \end{aligned} \end{equation}

(5)

where \(\mu _a^{vis}\) and \(\mu _a^{sem}\) are modality-specific weights. Thus, the factual prediction logit vector \(l_a^{f}\) is obtained as follows:

\begin{equation} \begin{aligned}l_a^{f}=\mu _a^{vis} \cdot l_a^{vis}+\mu _a^{sem} \cdot l_a^{sem}. \end{aligned} \end{equation}

(6)

3.2.3 Semantic Bias Capture.

Even though the factual multi-modal prediction results subsume the semantic bias, it is necessary to purely capture the shortcut from abstract semantic patterns of the observed content to the anticipated future action. To this end, we depict the virtual scenario where only semantic embeddings serve as inputs to the encoder-decoder networks to anticipate future actions by blocking the direct path from visual information to target prediction results. We can obtain the counterfactual prediction logit vector \(l_a^{cf}\) following the pipeline of producing the semantic-based prediction result in the factual scenario (i.e., \(l_a^{sem}\) ). The difference between \(l_a^{cf}\) and \(l_a^{sem}\) lies in that the employed neural networks (i.e., encoder-decoder networks with linear classifiers) share the same structure but different learning parameters.

3.2.4 Semantic Bias Recalibration.

Based on both factual and counterfactual prediction results, we aim to perform semantic debiasing to exploit the effect of semantic experience in a delicate manner. Traditional counterfactual analysis debiasing methods tend to simply subtract counterfactual outcomes from factual outcomes to pursue unbiased outcomes, which can be formulated as follows in the context of our work:

\begin{equation} \begin{aligned}l_a = l_a^f-\lambda \cdot l_a^{cf}, \end{aligned} \end{equation}

(7)

where \(l_a\) is the final prediction result and \(\lambda\) is a constant coefficient that can either be equal to 1 [3, 40, 69, 76] or other constant values [48, 60, 71]. We believe adopting the traditional counterfactual analysis scheme is suboptimal because it reduces the influence of semantic bias in a mindless way. As shown in Figure 4(a), the subtraction weights are identical for each instance and each category, which fail to consider the discrepancy among instances and among categories. Therefore, we propose a novel adaptive counterfactual analysis scheme to recalibrate the semantic bias at instance level as well as category level. Specifically, the factual prediction logit vector \(l_a^f\) and counterfactual prediction logit vector \(l_a^{cf}\) are concatenated and run through a linear transformation layer \(P(\cdot)\) to yield a C-dimensional vector (C denotes the number of categories):

\begin{equation} \begin{aligned}l_a^\delta = P \left(\operatorname{concat} \left\langle l_a^f, l_a^{cf} \right\rangle \right)\!, \end{aligned} \end{equation}

(8)

where \(l_a^\delta\) , \(l_a^f\) , and \(l_a^{cf}\) are vectors of the same length. A nonlinear monotone function \(\sigma (\cdot)\) is further introduced to control the scale within fixed bounds:

\begin{equation} \sigma (x,b_1,b_2) = \operatorname{sigmoid} \left(\frac{x}{b_2-b_1} \right) \cdot (b_2-b_1) +b_1, \end{equation}

(9)

where x is the independent variable, and \(b_1\) and \(b_2\) ( \(b_1\lt 0\lt b_2\) ) are hyperparameters with respect to the normalized upper and lower thresholds. The final prediction result \(l_a\) is computed as follows:

\begin{equation} l_a = l_a^{f} + l_a^\Delta , \end{equation}

(10)

where \(l_a^\Delta = \sigma (l_a^\delta)\) represents the recalibrated semantic bias and the final outcome is obtained by incorporating the recalibrated semantic bias into the factual outcome. In our framework, we use the standard cross entropy as the training objective:

\begin{equation} \begin{aligned}&\mathcal {L}=- \sum _{i=1}^{N} (y^a)_i \log (\hat{y^a})_i, \\ \end{aligned} \end{equation}

(11)

where N is the number of examples, \((y^a)_i\) is the ground truth label, and \((\hat{y^a})_i =\operatorname{softmax}\left((l_a)_i \right)\) is the prediction label for the i-th example.

Fig. 4.

3.2.5 Remark.

We further explain how the devised adaptive counterfactual analysis scheme can exploit the effect of semantic experience in a delicate manner. First, it considers the discrepancy among categories by projecting the concatenation of factual prediction logit vector \(l_a^f\) and counterfactual prediction logit vector \(l_a^{cf}\) to obtain the C-dimension vector \(l_a^{\delta }\) of the same length. Specifically, the j-th element value represents the prediction probability of the j-th category ( \(1\le j \le C\) ). Each element value of \(l_a^{\delta }\) is obtained by assigning different weights for each element value of \(l_a^f\) and \(l_a^{cf}\) , which indicates that we treat each category individually. Second, the adaptive counterfactual analysis scheme also considers the discrepancy among instances by generating different \((l_a^{\delta })_i\) for different instances ( \(1\le i \le N\) ). Moreover, given the i-th instance, we obtain the final prediction as \((l_a)_i=(l_c^f)_i+\sigma ((l_a^\delta)_i)\) , which is equivalent to performing element-wise sum of factual prediction \((l_a^f)_i\) and adaptively weighted counterfactual prediction \((l_a^{cf})_i\) because each element value \(\lambda _{ij}\) ( \(1\le j \le C\) ) of the i-th weight vector can be either positive and negative after the operations of Equations (8) and (9). Meanwhile, the weight vectors \([\lambda _{i1},\ldots ,\lambda _{iC}]^T\) ( \(1\le i \le N\) ) are also diverse at instance level, as illustrated in Figure 4(b). Therefore, the traditional counterfactual analysis scheme can be regarded as a special case of our proposed adaptive counterfactual analysis scheme if and only if \(\lambda _{ji}\) is identically equal to – \(\lambda\) ( \(1\le j \le C, 1\le j \le N\) ). The underlying motivation of our design is that semantic experience may vary with each individual to perform actions because of personalized habits. For example, one tends to put meat into the pot first, whereas another individual tends to put vegetables into the pot first when preparing the same dish. Our method can better adapt to personalized habits by adaptively amplifying or penalizing the effect of semantic experience, which is crucial for enhancing the generalization ability in out-of-distribution scenarios.

4 Experiment

4.1 Datasets and Metrics

We evaluate our proposed method on three large-scale egocentric video datasets—EGTEA Gaze+ [27], EPIC-Kitchens-55 [6], and EPIC-Kitchens-100 [5]—by creating new splits of non-overlapping training and test sets for egocentric compositional action anticipation. Specifically, given a verb class in the test set, we should make sure that it has appeared in the training set, which indicates that any verb class should exist in at least two action classes. The same principle also applies to the noun class. Therefore, we first abandon action classes containing verb categories (or noun categories) that appear only once, then randomly take 20% of action classes to generate the test set. Since the annotations of held-out test data are unavailable for EPIC-Kitchens-55 and EPIC-Kitchens-100, we split their new training and test sets using the public training data. We summarize the dataset statistics in Table 1, including the number of action categories and samples for training and test as well as the total number of verb and noun categories.

Table 1.

Dataset	Action@tr	Action@te	Verb	Noun	Sample@tr	Sample@te
EGTEA Gaze+	65	16	9	29	5,827	1,758
EPIC-Kitchens-55	1,968	492	95	292	22,145	6,194
EPIC-Kitchens-100	3,040	760	96	288	63,326	13,541

Table 1. Datasets Used for Egocentric Compositional Action Anticipation

Action@tr and Action@te denote the number of action categories for training and test, and the similar meaning for Sample@tr and Sample@te.

We adopt both class-agnostic and class-aware evaluation metrics. For class-agnostic metrics, we use Top-k ( \(k=1,5\) ) accuracy where the prediction is deemed as correct if the ground-truth action falls in the Top-k predictions. For class-aware metrics, we use mean Top-k ( \(k=1,5\) ) recall as class-aware measures. Top-k recall for a given class c is divided as the number of instances where c is in the list of Top-k predictions divided by all class c instances. Mean Top-k recall is computed by averaging Top-k recall values over classes.

4.2 Implementation Details

We sample video frames with a unified timestep of \(\epsilon\) = 0.25 seconds. The anticipation time \(T_a\) is set to 1 second, and the observation time \(T_o\) is set to 3.75 seconds with K = 16 unless otherwise specified. On EPIC-Kitchens-55 and EPIC-Kitchens-100, \(\phi _v\) and \(\phi _n\) are TSM [28] and Faster R-CNN [51] to obtain 2048-dimensional and 352-dimensional feature vectors with respect to verb-aware and noun-aware visual embeddings. On EGTEA Gaze+, since no object annotations are given, both \(\phi _v\) and \(\phi _n\) are TSN [68] to obtain 1024-dimensional vectors for verb-aware and noun-aware visual embeddings. On all datasets, both \(\psi _n\) and \(\psi _v\) are instantiated with a linear classifier and word2vec [37] connected with respect to \(\phi _v\) and \(\phi _n\) to obtain 512-dimensional vectors for both verb-aware and noun-aware semantic embeddings. For the encoder-decoder, we consider both recurrent- and self-attention-based architectures. The former consists of two one-layer, 1024-dimensional LSTMs [15], whereas the latter contains two-layer, 512-dimensional Transformer [66] encoders and decoders with 8 multi-head attention. The MLP used for multi-modal fusion consists of three fully connected layers with respect to 512, 128, and 2 hidden units along with the ReLU activation function. The normalized upper bound \(b_1\) and lower bound \(b_2\) are set to –4 and 4, respectively. We use a stochastic gradient descent optimizer to train the framework with a learning rate of 0.01 and momentum of 0.9. The batch size is 128. To regularize the training and avoid overfitting, dropout with retain probability 0.8 is used. All of our experiments are trained for 200 epochs.

4.3 Performance Comparison

Since we are the first to explicitly define the egocentric compositional action anticipation problem, there are no direct comparisons to state-of-the-art methods. Apart from our proposed ASD framework, we experiment with the following baselines for performance comparison (Tables 2–4). Random is derived from Damen et al. [5] by randomly choosing one option from the candidate categories as the predicted result. MM-LSTM is derived from Furnari and Farinella [11] using the LSTM decoder-encoder to take both visual and semantic modalities as inputs without the participation of semantic debiasing (i.e., factual predictions). MM-Trans is derived from Roy and Fernando [55] using the Transformer encoder-decoder to take both visual and semantic modalities as inputs without the participation of semantic debiasing (i.e., factual predictions). CA-LTSM and CA-Trans are derived from prior work [76] by adopting the traditional counterfactual analysis scheme on top of LSTM and Transformer encoder-decoders, respectively. Variants with different subtraction coefficients \(\lambda\) in Equation (7) are also considered, and we choose \(\lambda \in \lbrace 0.5,1.0,1.5\rbrace\) for a comprehensive evaluation. We maintain the model architectures for all LSTM-based or Transformer-based methods for fair comparisons.

Table 2.

Method	Top-1 Acc.	Top-5 Acc.	Mean Top-1 Rec.	Mean Top-5 Rec.
Random	2.37	5.28	1.31	4.16
MM-LSTM	16.85	31.66	11.73	28.54
MM-Trans	17.18	33.28	12.08	28.07
CA-LSTM ( \(\lambda\) = 0.5)	16.41	33.13	12.17	28.22
CA-LSTM ( \(\lambda\) = 1.0)	16.54	32.87	12.05	28.34
CA-LSTM ( \(\lambda\) = 1.5)	16.04	31.77	12.15	26.53
CA-Trans ( \(\lambda\) = 0.5)	17.21	32.34	13.07	27.93
CA-Trans ( \(\lambda\) = 1.0)	17.33	33.40	13.26	28.23
CA-Trans ( \(\lambda\) = 1.5)	16.53	33.44	12.22	28.82
ASD-LSTM (Ours)	18.23	33.81	12.80	30.41
ASD-Trans (Ours)	18.78	34.77	13.96	30.12

Table 2. Performance Comparison among Different Methods on EGTEA Gaze+ (%)

Table 3.

Method	Top-1 Acc.	Top-5 Acc.	Mean Top-1 Rec.	Mean Top-5 Rec.
Random	0.17	0.49	0.03	0.25
MM-LSTM	2.24	11.46	0.89	1.77
MM-Trans	3.66	14.32	1.50	2.91
CA-LSTM ( \(\lambda\) = 0.5)	3.11	13.52	1.47	3.73
CA-LSTM ( \(\lambda\) = 1.0)	2.88	13.48	1.53	3.62
CA-LSTM ( \(\lambda\) = 1.5)	2.67	13.24	1.47	3.25
CA-Trans ( \(\lambda\) = 0.5)	3.20	15.85	1.45	3.66
CA-Trans ( \(\lambda\) = 1.0)	3.22	15.24	1.40	3.28
CA-Trans ( \(\lambda\) = 1.5)	3.31	15.12	1.29	3.24
ASD-LSTM (Ours)	3.25	15.78	1.76	3.60
ASD-Trans (Ours)	3.96	16.43	1.92	4.89

Table 3. Performance Comparison among Different Methods on EPIC-Kitchens-55 (%)

Table 4.

Method	Top-1 Acc.	Top-5 Acc.	Mean Top-1 Rec.	Mean Top-5 Rec.
Random	0.53	1.32	0.09	0.11
MM-LSTM	2.91	12.97	0.81	1.29
MM-Trans	3.82	14.60	1.64	1.69
CA-LSTM ( \(\lambda\) = 0.5)	2.87	12.83	1.06	1.29
CA-LSTM ( \(\lambda\) = 1.0)	3.28	13.11	1.19	1.35
CA-LSTM ( \(\lambda\) = 1.5)	3.24	12.95	1.22	1.42
CA-Trans ( \(\lambda\) = 0.5)	4.49	15.04	2.21	3.34
CA-Trans ( \(\lambda\) = 1.0)	4.42	15.35	2.77	3.54
CA-Trans ( \(\lambda\) = 1.5)	4.26	15.18	2.06	3.01
ASD-LSTM (Ours)	3.32	16.94	1.37	2.55
ASD-Trans (Ours)	4.87	17.30	2.87	3.68

Table 4. Performance Comparison among Different Methods on EPIC-Kitchens-100 (%)

Table 2–4 show the performance comparison results on three datasets. We can see that the Random baseline performs worst among all methods, revealing the inherent uncertainty of predicting unknown actions. Instead of mindlessly guessing, MM-LSTM and MM-Trans achieve consistent improvement compared to the Random baseline. It demonstrates the effectiveness of data-driven deep neural networks, which is similar to the role of System 1. Meanwhile, Transformer tends to outperform LSTM as encoder-decoder networks, which indicates the advantage of a self-attention mechanism over recurrent modeling in capturing long-term temporal dependencies. We observe that introducing traditional counterfactual analysis on top of deep neural networks improves the performance in general while degrading in several metrics. This suggests that reducing the influence of semantic bias in a mindless way is suboptimal. In addition, the selection of the optimal subtraction coefficient \(\lambda\) varies with datasets, showing the lack of generalization ability. Furthermore, we can see that our proposed ASD can consistently and more significantly improve the anticipation performance. This verifies the superiority of the adaptive counterfactual analysis scheme in exploiting the semantic experience delicately by considering the discrepancy both among categories and among examples, which is similar to the role of System 2.

4.4 Further Analysis

To further examine the effectiveness of our method, a set of ablative studies are conducted on EGTEA Gaze+ using LSTM encoder-decoder networks. Fixing the beginning of observation, we report prediction results at multiple anticipation times \(T_a\) (0.5s, 1s, 1.5s, and 2s) by adjusting the corresponding observation times \(T_o\) (4.25s, 3.75s, 3.25s, and 2.75s).

4.4.1 Effectiveness of Knowledge-Driven Strategy.

We assess the contribution of the knowledge-driven strategy used to compose verb logits and noun logits. For the baseline method, the knowledge-driven strategy is removed, which means no prior knowledge can be leveraged to compose verb logits and noun logits (i.e., \(M_{va}\) and \(M_{na}\) are identity matrices). As shown in Figure 5, adopting the knowledge-driven strategy brings significant and consistent performance improvement over the baseline at different anticipation times, which demonstrates the effectiveness of leveraging prior knowledge to reduce uncertainty for egocentric compositional action anticipation.

Fig. 5.

4.4.2 Effectiveness of Attention-Based Multi-Modal Fusion.

To assess the role of the attention-based multi-modal fusion, we compare it with the baseline of average weighted multi-modal fusion strategy (i.e., \(\mu _a^{vis}=\mu _a^{sem}=0.5\) ). We can see from Figure 6 that adopting the attention-based fusion strategy outperforms the average weighted multi-modal fusion strategy at different anticipation times. Considering the complementarity of visual and semantic modalities, it is beneficial to fuse their predictions by assigning them modality-specific weights to produce more reliable factual predictions, which paves the way for semantic bias recalibration.

Fig. 6.

4.4.3 Effectiveness of Normalization in Semantic Bias Recalibration.

As introduced in Section 3.2.4, we employ a normalization function \(\sigma (\cdot)\) in the process of semantic bias recalibration. To evaluate its contribution, we consider two baselines for comparison. One baseline is to remove the element-wise normalization function, which means \(l_a^\Delta =l_a^\delta\) . The other baseline is to use the normalization function without considering bounded parameters \(b_1\) and \(b_2\) , which is the original version of sigmoid function. As shown in Table 5, we find that the former baseline performs worst at all anticipation times, which indicates the necessity of introducing a normalization function to enhance the nonlinearity because there is a linear transformation (Equation (8)) before. In addition, using bounded parameters further brings consistent performance improvement. It demonstrates the effectiveness of controlling the scale within fixed bounds. Specifically, the negative value \(b_1\) and the positive value \(b_2\) ensure each element of the recalibrated semantic bias \(l_a^\Delta\) can be positive or negative, which enables us to adaptively amplify or penalize the effect of semantic experience.

Table 5.

Setting		Top-1 Acc. (%) at Different \(T_a\) (s)				Mean Top-1 Rec. (%) at Different \(T_a\) (s)
Norm	Bounds	2	1.5	1	0.5	2	1.5	1	0.5
–	–	14.92	15.64	16.96	17.85	10.61	10.82	11.33	12.49
\(\checkmark\)	–	15.43	15.79	17.39	18.63	11.32	11.73	12.29	13.53
\(\checkmark\)	\(\checkmark\)	16.08	16.97	18.23	19.46	11.65	12.38	12.80	14.41
Setting		Top-5 Acc. (%) at Different \(T_a\) (s)				Mean Top-5 Rec. (%) at Different \(T_a\) (s)
Norm	Bounds	2	1.5	1	0.5	2	1.5	1	0.5
–	–	28.23	30.78	32.26	34.75	25.67	27.47	28.39	30.75
\(\checkmark\)	–	28.57	31.62	32.41	34.88	26.93	28.18	29.85	32.24
\(\checkmark\)	\(\checkmark\)	30.90	32.45	33.81	35.63	27.38	28.67	30.41	33.16

Table 5. Ablation Experimental Results with Respect to the Normalization Function in Semantic Bias Recalibration at Different Anticipation Times

4.4.4 Qualitative Analysis.

As shown in Figure 7, we present some qualitative examples of prediction results obtained by MM-LSTM, CA-LSTM ( \(\lambda =1\) ), and our proposed ASD-LSTM. For the first four columns, we list Top-5 predictions at four anticipation times \(T_a \in \left\lbrace 2s, 1.5s, 1s, 0.5s \right\rbrace\) . Blue indicates the prediction matches the ground truth, which corresponds to the last column. We can observe that due to the uncertainty of the future, multiple predictions are plausible. For all methods, the anticipation results get more accurate when the anticipation time is shorter, which is consistent with our intuition. From these cases, we can see that the representative data-driven method MM-LSTM can partly reduce the uncertainty by making the ground-truth actions appear in its Top-5 predictions as the anticipation time gets shorter. Unfortunately, the data-driven learning process makes MM-LSTM prone to be misled by semantic bias, which leads to the incorrect prediction like “put sponge” after observing “take sponge” at \(T_a=0.5s\) as shown in the first case. CA-LSTM is used to help deep neural networks like MM-LSTM think out of the box by introducing counterfactual analysis. However, it seems unstable because the traditional counterfactual analysis scheme reduces the semantic bias in a mindless way. For example, it can promote the ranking of the ground-truth action “wash cup” at \(T_a=1s\) and \(0.5s\) compared to MM-LSTM in the first case. However, it performs even worse than MM-LSTM in the second case where the ground-truth action “take pan” in the Top-5 predictions of MM-LSTM disappears at \(T_a=1s\) and \(0.5s\) . Although it reduces the probability of “close drawer,” it also lowers the chance of ground-truth action “take pan” by treating them in the same way. The underlying reason for the unstable performance of CA-LSTM is that the traditional counterfactual analysis scheme does not consider the discrepancy both among categories and among instances. In other words, it could not ensure the effect of semantic experience is fully utilized. Our proposed ASD-LSTM can mitigate the limitation with the adaptive counterfactual analysis scheme. We can see from the two cases that the ground-truth actions rank first in the predictions of ASD-LSTM at \(T_a=1s\) and \(0.5s\) . We believe our method aims at knowing why this action is going to happen rather than others via making the most of semantic bias. Taking the second case as an example, the observation indicates that the drawer is gradually opened and the hands move closer toward the pan. By assigning different weights to various categories like “close drawer” and “take pan,” ASD-LSTM lowers the probability of “close drawer” and makes the ground-truth action “take pan” stand out. In general, our method can exploit the effect of semantic experience in a delicate manner.

Fig. 7.

4.5 Results on Conventional Egocentric Action Anticipation

Whereas the proposed ASD framework is tailored for the out-of-distribution scenario, we also conduct experiments on conventional egocentric action anticipation to see whether it maintains competitive performance in the in-distribution scenario. Table 6 presents the conventional egocentric action anticipation results on EGTEA Gaze+, where 8,299 action instances are used for training and 2,022 instances for test with 106 action categories. We report the average performance over the three official splits. Compared with state-of-the-art methods, our proposed ASD achieves competitive performance, which demonstrates the effectiveness of ASD in the in-distribution scenario. It is also worth mentioning that a line of existing conventional egocentric action anticipation methods [18, 31, 47, 72, 73] use future frames during training to distill future content into observations, which is orthogonal to our work. The main reason some methods [18, 31, 47] surpass us in terms of Top-5 accuracy is that they use actual future features during training, which provide more powerful supervisory signals. Instead, our method does not take future features as supervisory signals. It further indicates the superiority of our ASD idea that focuses on the observed content without the requirement of future frames. Meanwhile, to have a deeper understanding of our method, we also show some failure cases obtained by ASD-LSTM in Figure 8. In the first case, our method fails to predict the ground-truth action “cut cucumber” because it mistakenly identifies cucumber as courgette, as they are highly similar. In the second case, the ground-truth future action is “take cheese,” but cheese does not appear in observed frames. It is extremely difficult to accurately predict “take cheese” because the observed information is limited. It further indicates that egocentric action anticipation is a quite challenging task.

Table 6.

Method	Content	Top-5 Acc.	Mean Top-5 Rec.
DMR [67]	Obs	55.70	38.11
ATSN [6]	Obs	40.53	31.61
MCE [9]	Obs	56.29	43.75
ED [13]	Obs	60.18	54.61
FN [7]	Obs	60.12	49.82
RL [35]	Obs	62.74	52.17
EL [20]	Obs	63.76	55.11
RULSTM [11]	Obs	66.40	58.64
ImagineRNN [72]	Obs+Fut	66.71	–
KDLM [2]	Obs	68.74	–
SRL [47]	Obs+Fut	70.67	–
SF-RULSTM [43]	Obs	67.60	–
MGRKD [18]	Obs+Fut	70.86	–
HRO [73]	Obs+Fut	71.46	–
DCR [31]	Obs+Fut	67.90	61.10
ASD-LSTM (Ours)	Obs	68.25	62.21
ASD-Trans (Ours)	Obs	69.84	63.38

Table 6. Conventional Egocentric Action Anticipation Results on EGTEA Gaze+ (%)

Although ASD is tailored for the out-of-distribution scenario, it is also effective in the in-distribution scenario. Obs+Fut denotes that future frames are also used during training apart from observed frames.

Fig. 8.

5 Conclusion

In our work, we explored a novel and practical egocentric compositional action anticipation problem, which is neglected by existing works. This was paired with our ASD framework by combining deep neural networks with an adaptive counterfactual analysis scheme, which takes inspiration from the human brain’s cognitive process of coordinating System 1 and System 2. The former is good at memorizing what action will happen after observation, whereas the latter aims at knowing why this action is going to happen rather than others via exploiting the semantic experience in a delicate manner. Experimental results on three large-scale egocentric video datasets demonstrated that our method achieves state-of-the-art performance in the out-of-distribution scenario while maintaining competitive performance in the in-distribution scenario.

Our work can be extended in multiple directions. First, although the representation learning of visual and semantic modalities goes beyond the scope of our work and is an orthogonal line of development, stronger backbones can actually be introduced to enhance multi-modal representation learning. Second, a simple yet effective knowledge-driven strategy is used to compose verbs and nouns, which indicates that an external knowledge graph might be considered in future work. Furthermore, integrating egocentric compositional action anticipation into streaming or untrimmed scenario is also worth exploration, which moves closer toward practical applications.

References

[1]

Alejandro Betancourt, Pietro Morerio, Carlo S. Regazzoni, and Matthias Rauterberg. 2015. The evolution of first person vision methods: A survey. IEEE Transactions on Circuits and Systems for Video Technology 25, 5 (2015), 744–760.

Abstract

1 Introduction

2 Related Work

2.1 Egocentric Action Anticipation

2.2 Compositional Generalization

2.3 Counterfactual Analysis

3 Method

3.1 Problem Formulation

3.2 The ASD Framework

3.2.1 Framework Overview.

3.2.2 Multi-Modal-Based Prediction.

3.2.3 Semantic Bias Capture.

3.2.4 Semantic Bias Recalibration.

3.2.5 Remark.

4 Experiment

4.1 Datasets and Metrics

4.2 Implementation Details

4.3 Performance Comparison

4.4 Further Analysis

4.4.1 Effectiveness of Knowledge-Driven Strategy.

4.4.2 Effectiveness of Attention-Based Multi-Modal Fusion.

4.4.3 Effectiveness of Normalization in Semantic Bias Recalibration.

4.4.4 Qualitative Analysis.

4.5 Results on Conventional Egocentric Action Anticipation

5 Conclusion

References

Index Terms

Recommendations

Streaming egocentric action anticipation: An evaluation scheme and approach

Learning Causality Under Uncertainty for Egocentric Action Anticipation

Egocentric Early Action Prediction via Adversarial Knowledge Distillation

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

Get Access

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations