
Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing

Published: 11 January 2024

    Abstract

    Predicting the unknown from the first-person perspective is regarded as a necessary step toward machine intelligence, which is essential for practical applications including autonomous driving and robotics. As a human-level task, egocentric action anticipation aims at predicting an unknown action seconds before it is performed from the first-person viewpoint. Egocentric actions are usually provided as verb-noun pairs; however, predicting the unknown action may be hampered by insufficient training data for all possible combinations. Therefore, it is crucial for intelligent systems to use limited known verb-noun pairs to predict new combinations of actions that have never appeared, which is known as compositional generalization. In this article, we are the first to explore the egocentric compositional action anticipation problem, which is more in line with real-world settings but neglected by existing studies. Since prediction results are prone to suffer from semantic bias given the distinct difference between training and test distributions, we further introduce a general and flexible adaptive semantic debiasing framework that is compatible with different deep neural networks. To capture and mitigate semantic bias, we imagine one counterfactual situation where no visual representations have been observed and only semantic patterns of the observation are used to predict the next action. Instead of the traditional counterfactual analysis scheme that reduces semantic bias in a mindless way, we devise a novel counterfactual analysis scheme that adaptively amplifies or penalizes the effect of semantic experience by considering the discrepancy both among categories and among examples. We also demonstrate that the traditional counterfactual analysis scheme is a special case of the devised adaptive counterfactual analysis scheme. We conduct experiments on three large-scale egocentric video datasets. Experimental results verify the superiority and effectiveness of our proposed solution.

    1 Introduction

    In recent years, wearable cameras have been widely used to collect egocentric (first-person) vision data from a human perspective [1]. This unique viewpoint enables the multimedia content analysis of egocentric videos, such as predicting the future actions of camera wearers. This capability matters in many real-world scenarios. For example, assistive robots need to anticipate human motion to provide timely assistance. In addition, they need to identify dangerous patterns of executed actions in everyday routines and send a warning signal to prevent the camera wearer from performing unsafe activities [53]. This gives birth to the study of egocentric action anticipation, which aims at predicting an unobserved but upcoming action before it occurs [6].
    Despite its promising practical value, egocentric action anticipation is quite challenging in nature. Due to the temporal misalignment between observed content and target action, action recognition models [58, 68] have been shown to be inapplicable [6], which implies that egocentric action anticipation cannot be simply treated as a classification task. Indeed, it is a high-level reasoning task that goes beyond classifying observed patterns into a single action category [14, 73]. Moreover, egocentric actions are usually provided as verb-noun pairs [5, 6, 27], such as “take knife” and “cut carrot,” which indicates that predicting the unknown action may be restricted by insufficient training data for all possible combinations. However, intelligent systems are expected to use limited known verb-noun pairs to infer new combinations of actions that have never appeared in real-world settings. In fact, as humans, we can learn from a limited set of known components and seamlessly generate new compositions even before experiencing them. As illustrated in Figure 1, we can understand and produce the new composition “cut onion” even if we have never observed it before, based on familiarity with the movement “cut” and the object “onion.” This ability is known as compositional generalization, which is a hallmark of human intelligence [24, 25]. It is also considered a long-term goal of machine intelligence. For example, a robot needs to understand new instructions based on finite colloquial communication. However, existing studies on egocentric action anticipation neglect the compositional setting, which restricts the generalization ability of intelligent systems in the out-of-distribution scenario.
    Fig. 1. Examples of compositional generalization. Since predicting the unknown action may be restricted by insufficient training data for all possible combinations, it is necessary to enhance the generalization ability of egocentric action anticipation models in the out-of-distribution scenario.
    In our work, to the best of our knowledge, we are the first to formalize and tackle the problem of egocentric compositional action anticipation. More formally, we denote the action categories in the training set as \(\mathbb {A}_{tr}\). Current works address the egocentric action anticipation problem where the set of action categories at test time \(\mathbb {A}_{te}\) is always a subset of \(\mathbb {A}_{tr}\). We extend the problem to a higher level where no intersection exists between \(\mathbb {A}_{tr}\) and \(\mathbb {A}_{te}\). Hence, the system should be able to anticipate actions of new combinations by learning from seen combinations. Note that combinations of verbs and nouns are disjoint between training and test data in the compositional setting, and this non-overlapping splitting leads to a distinct difference between training and test distributions. Therefore, it is crucial for models to produce unbiased predictions for egocentric compositional action anticipation. Unfortunately, the shortcut from the semantic pattern of the past observation to the future prediction could mislead models [76]. As shown in Figure 2, for the observed content in the two cases, the semantic patterns are quite close despite the variation in visual representations. In the second case, models still tend to make predictions as in the first case by memorizing the spurious correlation from semantic experience (“cut onion” and “peel onion” are commonly successive actions) while ignoring the visual representation (the light is so dim that “turn on light” should be executed). Since relying on the spurious correlation in the semantic modality could produce biased predictions, it is necessary to mitigate the overdependence on semantic experience.
    Fig. 2. Illustration of semantic bias. The ground-truth future prediction results are shown in red. The shortcut from the semantic pattern of past observation to the future prediction leads to the incorrect result in the second case, exhibiting the spurious correlations in semantic modality.
    To this end, we propose an Adaptive Semantic Debiasing (ASD) framework for egocentric compositional action anticipation, leveraging the tool of counterfactual analysis. Counterfactual thinking gifts humans with the ability to imagine and reason about the outcome of an alternative operation that could have been performed [45, 46]. Specifically, we can imagine one counterfactual situation where no visual representations have been observed and only semantic patterns of the observed content are used to predict the next action. This counterfactual prediction captures the effect of semantic experience. To alleviate the influence of spurious correlations in the semantic modality, counterfactual analysis is conducted to obtain the final prediction. The traditional counterfactual analysis scheme simply subtracts the counterfactual prediction from the original factual prediction as the final prediction [3, 61, 69, 76]. We believe this is suboptimal because the semantic bias contains both positive and negative parts. In many cases we observe “cut onion,” and the semantic bias can have a positive impact on anticipation when the future action is “peel onion.” However, it could also mislead the predicted result when the future action is “turn on light” or “throw onion” in other cases. Therefore, instead of reducing semantic bias in a mindless way, we need to amplify or penalize the effect of semantic experience according to different cases. We devise a novel adaptive counterfactual analysis scheme to adaptively recalibrate the semantic bias, which considers the discrepancy both among categories and among examples. We also demonstrate that the traditional counterfactual analysis scheme is a special case of the proposed adaptive counterfactual analysis scheme. The details are introduced in Section 3.
    Our method is inspired by cognitive science, in which the human brain’s thinking process consists of System 1 and System 2 [21]. The former is an implicit, unconscious, and intuitive process, which matches the data-driven learning process of deep neural networks. The latter is conducted based on the former and is responsible for explicit, conscious, and controllable reasoning. We build our framework on the prevailing encoder-decoder structure in action anticipation [16, 19], which is compatible with various deep neural networks such as LSTM [15] and Transformer [66] at the level of System 1. To overcome the limitation that the data-driven learning process is prone to memorizing the spurious correlation in the semantic modality, we further devise a novel adaptive counterfactual analysis scheme to help deep neural networks think out of the box at the level of System 2. Our insight is that human-level intelligence needs not only to deduce what action will happen after observation (System 1) but also to know why this action is going to happen rather than others by exploiting the effect of semantic experience in a delicate manner (System 2). We expect our work will encourage more research on developing deep learning methods inspired by cognitive studies.
    The main contributions of our work are as follows. First, to the best of our knowledge, we are the first to explore egocentric action anticipation in the compositional generalization scenario, which is more in line with real-world settings. Second, we propose a general and flexible ASD framework to address egocentric compositional action anticipation, which is compatible with different neural networks and exploits the effect of semantic experience in a delicate manner. Third, we validate our method on three widely used egocentric datasets (EGTEA Gaze+ [27], EPIC-Kitchens-55 [6], and EPIC-Kitchens-100 [5]) by creating new splits for the compositional setting. Experimental results demonstrate the superiority and effectiveness of our proposed ASD. In addition, it achieves competitive performance in conventional egocentric action anticipation.

    2 Related Work

    In this section, we introduce the relevant research areas in the following three aspects: (1) egocentric action anticipation, (2) compositional generalization, and (3) counterfactual analysis.

    2.1 Egocentric Action Anticipation

    With the rapid development of wearable devices, egocentric videos offer a natural perspective of daily activities and have raised a range of challenging research topics in recent years [1, 53], including detecting gaze [26, 27], estimating hand-object interaction regions [8, 29, 30], identifying the camera wearer [64], learning scene affordance [38], and video captioning [39]. Compared with videos captured from the third-person perspective, egocentric videos tend to be more challenging because many actions may not be directly observable due to the limited field of view and severe occlusions caused by hands. Studies on egocentric activity analysis have also emerged recently, ranging from action recognition to early action prediction to action anticipation [2, 10, 11, 12, 18, 31, 43, 47, 57, 72, 75, 77]. Different from egocentric action recognition [17, 56, 59, 70], which focuses only on classifying the action from the observed video clip, egocentric action anticipation cannot be simply treated as a classification task because it requires understanding what has happened and thus predicting the upcoming action. Dedicated to future action prediction, it is also more challenging than egocentric early action prediction [79], which focuses on predicting the category of an ongoing action based on partial observations.
    With the advancement of deep neural networks, various approaches have been proposed for this task, where the encoder-decoder structure has become the commonly used paradigm. Furnari and Farinella [10, 11] propose a classic encoder-decoder architecture consisting of two cascaded LSTMs. The LSTM encoder summarizes the observed video sequence, whereas the LSTM decoder makes predictions about the future based on the hidden vectors learned from the observed content. Camporese et al. [2] further introduce the label smoothing technique to inject useful semantic information into the model to alleviate over-confident predictions. Zhang et al. [77] mitigate the visual gap problem by exploiting both visual features and sequential text instructions of the observed content. Osman et al. [43] propose a multi-scale approach to fuse information extracted at different time scales. Considering that recurrent networks are prone to accumulating prediction errors, subsequent works [47, 72] integrate contrastive learning to recalibrate predicted future representations. Recently, Liu and Lam [31] devised a memory-augmented strategy to regulate the process of recurrent representation forecasting. Despite their dominance, recurrent networks are limited in modeling long-term temporal dependencies owing to their non-parallel nature. Girdhar and Grauman [14] introduce an attention-based architecture that processes the observed video sequence in parallel with long-range attention. Roy and Fernando [55] propose a multi-modal Transformer model that combines human-object, spatio-temporal, and motion representations to anticipate future actions, whereas audio features are also considered in one recent work [80]. By default, researchers evaluate egocentric action anticipation in an offline fashion; however, several works explore the problem in streaming [12] or untrimmed [54] scenarios.
    Our work differs significantly from the preceding works. We emphasize the importance of compositional generalization in egocentric action anticipation by explicitly defining the egocentric compositional action anticipation problem. To alleviate the influence of spurious correlations in semantic modality under such an out-of-distribution scenario, we further devise a flexible and general ASD framework, which is compatible with different deep neural networks including both LSTM and Transformer.

    2.2 Compositional Generalization

    Compositional generalization is a crucial ability of human cognition and a significant embodiment of machine intelligence [24, 25]. It requires a system to use a limited set of known components to understand and produce new combinations that have never appeared before. In the field of video understanding, compositional action recognition has received increasing attention. Zhang et al. [78] propose a noun plus verb decomposition for egocentric actions along with a variant of fusion methods. Materzynska et al. [36] make the combinations of objects and actions disjoint between training and test sets, which requires models to recognize an action performed with unseen objects. Núñez-Marcos et al. [42] leverage external knowledge as action priors to improve the performance of compositional action recognition in egocentric videos. Luo et al. [33] extract disentangled features of verbs and nouns and leverage object affordance priors from knowledge bases to obtain composed actions. Although there is no standard definition of the compositionality of human action recognition, recent works [34, 49, 60, 74] focus on the compositional setting introduced in the work of Materzynska et al. [36] to enhance the generalization capability of action recognition models. Different from these works, we focus on inferring new combinations of known verbs and nouns in the context of egocentric action anticipation, which to our knowledge has not been studied before.

    2.3 Counterfactual Analysis

    Counterfactual thinking is a concept derived from psychology, which describes the human capacity to reason about the outcome of an alternative operation that could have been performed [45, 46]. It has been widely studied in economics, politics, and epidemiology for years [4, 22, 52] as a tool to study the effect of certain policies or treatments. In recent years, counterfactual analysis has inspired several studies in computer vision [3, 50, 62], natural language understanding [48, 65, 71], robotics [69], and multimedia [32, 40, 41, 44, 61, 63]. These works are committed to removing spurious bias and improving the generalization capacity in domain-specific applications. For example, Niu et al. [40] conceive a novel counterfactual inference framework to capture and mitigate language bias in visual question answering. Chen et al. [3] alleviate the overdependence on environment bias and highlight the trajectory clues for human trajectory prediction. Sun et al. [60] remove the spurious correlation between action and instance appearance for compositional action recognition. Qian et al. [48] mitigate label bias and keyword bias for text classification. Wang et al. [69] weaken the interference of background features for the 6D pose estimation task in space. More related to our work is a prior study [76] where the semantic bias is formulated as the effect of semantic patterns of observed content on anticipated actions. Differently, we focus on egocentric compositional action anticipation with a novel ASD framework. Instead of reducing the influence of semantic bias in a mindless way, our method can adaptively amplify or penalize the effect of semantic experience by considering the discrepancy among categories and among examples.

    3 Method

    In this section, we first introduce the formal description of egocentric compositional action anticipation. Then we detail the proposed ASD framework, including multi-modal based prediction (factual), semantic bias capture (counterfactual), and semantic bias recalibration.

    3.1 Problem Formulation

    Given an action starting at time T, the goal of action anticipation is to predict the action by observing a video clip \(T_a\) seconds before it [6]. \(T_a\) is defined as the anticipation time, which denotes how many seconds in advance an action is predicted. The observed video clip is denoted as \(x = [T-T_a-T_o, T-T_a]\), which starts at time \(T-T_a-T_o\) and ends at time \(T-T_a\). \(T_o\) is defined as the observation time. We denote the set of action categories as \(\mathbb {A} = \mathbb {A}_{tr} \cup \mathbb {A}_{te}\), consisting of the training set \(\mathbb {A}_{tr}\) and the test set \(\mathbb {A}_{te}\). The goal of the model is to classify the target action into its category \(y^a \in \mathbb {A}\) based on x. As an egocentric action is represented as the combination of one verb and one noun, we also denote the sets of verbs and nouns as \(\mathbb {V} = \mathbb {V}_{tr} \cup \mathbb {V}_{te}\) and \(\mathbb {N} = \mathbb {N}_{tr} \cup \mathbb {N}_{te}\), defined analogously. Even though \(\mathbb {A}_{te}\) is always a subset of \(\mathbb {A}_{tr}\) in the ordinary setting, we make the combinations of verbs and nouns disjoint under the egocentric compositional action anticipation scenario, which can be formalized as
    \(\begin{equation} \begin{cases} \mathbb {A}_{te} \cap \mathbb {A}_{tr} = \varnothing , \\ \mathbb {V}_{te} \subseteq \mathbb {V}_{tr}, \quad \mathbb {N}_{te} \subseteq \mathbb {N}_{tr} \end{cases} \end{equation}\)
    (1)
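    To make the constraint in Equation (1) concrete, the following minimal Python sketch checks whether a candidate split is compositional; actions are assumed to be represented as verb-noun pairs, and the helper name is ours for illustration rather than part of any released code.

```python
def is_compositional_split(train_actions, test_actions):
    """Check Equation (1): action sets are disjoint, while every test verb and
    noun has already been seen in training. Actions are (verb, noun) tuples."""
    train_actions, test_actions = set(train_actions), set(test_actions)
    train_verbs = {v for v, _ in train_actions}
    train_nouns = {n for _, n in train_actions}
    test_verbs = {v for v, _ in test_actions}
    test_nouns = {n for _, n in test_actions}
    disjoint_actions = train_actions.isdisjoint(test_actions)
    covered_components = test_verbs <= train_verbs and test_nouns <= train_nouns
    return disjoint_actions and covered_components


# "cut onion" is unseen as a pair, but "cut" and "onion" are both seen in training.
train = [("cut", "carrot"), ("peel", "onion"), ("take", "knife")]
test = [("cut", "onion")]
assert is_compositional_split(train, test)
```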

    3.2 The ASD Framework

    3.2.1 Framework Overview.

    The proposed ASD framework is illustrated in Figure 3. Given the observed content, both visual embeddings and semantic embeddings serve as inputs to the encoder-decoder networks to produce factual predictions. In the counterfactual scenario, the semantic bias is purely captured by anticipating future actions relying only on abstract semantic patterns of the observed content. To exploit the effect of semantic experience in a delicate manner, the semantic bias is further recalibrated and integrated with the factual outcome to generate the final prediction results. It is also worth mentioning that a knowledge-driven strategy is leveraged to compose verb-aware predictions and noun-aware predictions for both visual and semantic modalities.
    Fig. 3. The proposed ASD framework for egocentric compositional action anticipation. \(\lbrace {x_i}\rbrace _{i=1}^K\) are observed video frames. \(\phi _v\) and \(\phi _n\) are visual feature extractors, whereas \(\psi _v\) and \(\psi _n\) are semantic feature extractors. \(\lbrace {f_v^i}\rbrace _{i=1}^K\) and \(\lbrace {f_n^i}\rbrace _{i=1}^K\) are verb-aware and noun-aware visual embeddings, whereas \(\lbrace {w_v^i}\rbrace _{i=1}^K\) and \(\lbrace {w_n^i}\rbrace _{i=1}^K\) are verb-aware and noun-aware semantic embeddings. \(M_{va}\) and \(M_{na}\) are modality-agnostic prior knowledge matrices. \(l_v^{vis}\) , \(l_n^{vis}\) , \(l_v^{sem}\) , and \(l_n^{sem}\) are verb-aware and noun-aware prediction logits of visual and semantic modalities. \(l_a^{vis}\) and \(l_a^{sem}\) are action prediction logits of visual and semantic modalities. \(\mu _a^{vis}\) and \(\mu _a^{sem}\) are modality-specific weights. \(l_a^{f}\) and \(l_a^{cf}\) correspond to counterfactual and factual prediction logits. \(l_a^\delta\) is the recalibrated semantic bias, and \(l_a^\Delta\) is its normalized version. \(l_a\) is the final prediction result.

    3.2.2 Multi-Modal-Based Prediction.

    For the i-th observed video clip, \((x)_i= \lbrace (x_1)_i, (x_2)_i,\ldots , (x_K)_i \rbrace\) denotes the sequence of K observed video frames, where we sample frames with a unified timestep of \(\epsilon\) seconds. For brevity, we omit the notation i that indicates a single instance unless otherwise specified. By applying visual feature extractors \(\phi _v\) and \(\phi _n\) to \(\lbrace x_1, x_2,\ldots , x_K \rbrace\), we obtain the sequence of verb-aware visual embeddings \(\lbrace f_v^1, f_v^2,\ldots , f_v^K \rbrace\) and noun-aware visual embeddings \(\lbrace f_n^1, f_n^2,\ldots , f_n^K \rbrace\), respectively. Similarly, we obtain the sequences of verb-aware semantic embeddings \(\lbrace w_v^1, w_v^2,\ldots , w_v^K \rbrace\) and noun-aware semantic embeddings \(\lbrace w_n^1, w_n^2,\ldots , w_n^K \rbrace\) by applying the semantic feature extractors \(\psi _v\) and \(\psi _n\) to \(\lbrace x_1, x_2,\ldots , x_K \rbrace\). Both visual and semantic embedding vectors serve as inputs to the encoder-decoder networks to summarize the observed content and predict latent future representations, followed by linear classifiers to obtain verb-aware and noun-aware prediction logits of the visual and semantic modalities (denoted as \(l_v^{vis}\), \(l_n^{vis}\), \(l_v^{sem}\), and \(l_n^{sem}\)). We employ a knowledge-driven strategy to compose verb-aware predictions and noun-aware predictions for both visual and semantic modalities as follows:
    \(\begin{equation} \begin{aligned}l_a^{vis} = (M_{va})^T l_v^{vis} + (M_{na})^T l_n^{vis} , \quad l_a^{sem} = (M_{va})^T l_v^{sem} + (M_{na})^T l_n^{sem}, \end{aligned} \end{equation}\)
    (2)
    where \(l_a^{vis}\) and \(l_a^{sem}\) denote the action prediction logits for the visual modality and semantic modality, respectively. Both \(M_{va}\) and \(M_{na}\) represent modality-agnostic prior knowledge matrices. Specifically, supposing \(M_{va}(\alpha ,\gamma)\) denotes the \(\alpha\)-th row and \(\gamma\)-th column element of \(M_{va}\), it represents the prior probability of predicting the \(\gamma\)-th action class given the \(\alpha\)-th verb class; \(M_{na}(\beta ,\gamma)\) is defined analogously for the \(\beta\)-th noun class. These matrices can be computed as follows:
    \(\begin{equation} \begin{aligned}M_{va}(\alpha ,\gamma) = \frac{\sum _i[(y^v)_i=\alpha ][(y^a)_i=\gamma ]}{\sum _i [(y^v)_i=\alpha ]}, \quad M_{na}(\beta ,\gamma) = \frac{\sum _i[(y^n)_i=\beta ][(y^a)_i=\gamma ]}{\sum _i [(y^n)_i=\beta ]}, \end{aligned} \end{equation}\)
    (3)
    where \([\cdot ]\) denotes the Iverson bracket [23], and \((y^a)_i\) , \((y^v)_i,\) and \((y^n)_i\) represent the category of action, verb, and noun for the i-th instance. In the factual scenario, both visual modality and semantic modality are fused to produce multi-modal predictions with an attention mechanism. Specifically, visual prediction logit vector \(l_a^{vis}\) and semantic prediction logit vector \(l_a^{sem}\) are concatenated and run through a Multi-Layer Perceptron (MLP) to get modality-specific weights:
    \(\begin{equation} \begin{aligned}\left\langle s_a^{vis}, s_a^{sem} \right\rangle =\operatorname{MLP} \left(\operatorname{concat} \left\langle l_a^{vis}, l_a^{sem} \right\rangle \right)\!, \end{aligned} \end{equation}\)
    (4)
    where \(s_a^{vis}\) and \(s_a^{sem}\) are scalars that represent the attention scores for visual modality and semantic modality, respectively. The fusion weights are obtained by further normalizing the attention scores:
    \(\begin{equation} \begin{aligned}\mu _a^{vis}=\frac{\operatorname{exp} (s_a^{vis})}{\operatorname{exp}(s_a^{vis})+\operatorname{exp}(s_a^{sem})}, \quad \mu _a^{sem}=\frac{\operatorname{exp}(s_a^{sem})}{\operatorname{exp}(s_a^{vis})+\operatorname{exp}(s_a^{sem})}, \end{aligned} \end{equation}\)
    (5)
    where \(\mu _a^{vis}\) and \(\mu _a^{sem}\) are modality-specific weights. Thus, the factual prediction logit vector \(l_a^{f}\) is obtained as follows:
    \(\begin{equation} \begin{aligned}l_a^{f}=\mu _a^{vis} \cdot l_a^{vis}+\mu _a^{sem} \cdot l_a^{sem}. \end{aligned} \end{equation}\)
    (6)
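    To make the above pipeline concrete, the following PyTorch sketch condenses Equations (2) through (6): prior matrices estimated from training labels compose verb-aware and noun-aware logits into action logits, and an attention MLP fuses the visual and semantic action logits into the factual prediction. The hidden sizes follow the implementation details in Section 4.2; the module and function names are ours for illustration, not the authors' released code, and Equation (3) is read here as an empirical conditional frequency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_prior_matrix(component_labels, action_labels, n_comp, n_act):
    """Estimate a prior matrix as in Equation (3): entry (c, a) is the empirical
    frequency of action class a among training instances of verb (or noun) class c."""
    M = torch.zeros(n_comp, n_act)
    for c, a in zip(component_labels, action_labels):
        M[c, a] += 1.0
    return M / M.sum(dim=1, keepdim=True).clamp(min=1.0)


class MultiModalFusion(nn.Module):
    """Sketch of Equations (2)-(6): knowledge-driven composition of verb/noun
    logits into action logits, then attention-based fusion of the two modalities."""

    def __init__(self, M_va, M_na, n_act):
        super().__init__()
        self.register_buffer("M_va", M_va)   # (n_verb, n_act) prior matrix
        self.register_buffer("M_na", M_na)   # (n_noun, n_act) prior matrix
        self.mlp = nn.Sequential(            # attention MLP (Eq. (4)), sizes from Sec. 4.2
            nn.Linear(2 * n_act, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 2))

    def compose(self, l_v, l_n):
        # Equation (2): (M_va)^T l_v + (M_na)^T l_n, written as batched matrix products.
        return l_v @ self.M_va + l_n @ self.M_na

    def forward(self, l_v_vis, l_n_vis, l_v_sem, l_n_sem):
        l_a_vis = self.compose(l_v_vis, l_n_vis)                   # visual action logits
        l_a_sem = self.compose(l_v_sem, l_n_sem)                   # semantic action logits
        scores = self.mlp(torch.cat([l_a_vis, l_a_sem], dim=-1))   # Eq. (4)
        mu = F.softmax(scores, dim=-1)                             # Eq. (5)
        l_a_f = mu[:, :1] * l_a_vis + mu[:, 1:] * l_a_sem          # Eq. (6), factual logits
        return l_a_f, l_a_vis, l_a_sem
```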

    3.2.3 Semantic Bias Capture.

    Even though the factual multi-modal prediction results subsume the semantic bias, it is necessary to purely capture the shortcut from abstract semantic patterns of the observed content to the anticipated future action. To this end, we depict the counterfactual scenario where only semantic embeddings serve as inputs to the encoder-decoder networks to anticipate future actions, blocking the direct path from visual information to the target prediction results. We can obtain the counterfactual prediction logit vector \(l_a^{cf}\) following the pipeline of producing the semantic-based prediction result in the factual scenario (i.e., \(l_a^{sem}\)). The difference between \(l_a^{cf}\) and \(l_a^{sem}\) lies in that the employed neural networks (i.e., encoder-decoder networks with linear classifiers) share the same structure but have different learned parameters.
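    As a minimal PyTorch sketch of this counterfactual branch (the module name is hypothetical and not from the authors' code), the semantic-only predictor can simply clone the factual semantic branch's architecture and re-initialize it so that no parameters are shared:

```python
import copy

import torch.nn as nn


class CounterfactualSemanticBranch(nn.Module):
    """Sketch of Section 3.2.3: anticipate the next action from semantic embeddings
    alone. The branch mirrors the factual semantic branch's architecture but keeps
    its own, separately learned parameters."""

    def __init__(self, factual_semantic_branch: nn.Module):
        super().__init__()
        # Same structure, freshly re-initialized weights; nothing is shared.
        self.branch = copy.deepcopy(factual_semantic_branch)
        self.branch.apply(lambda m: m.reset_parameters()
                          if hasattr(m, "reset_parameters") else None)

    def forward(self, semantic_embeddings):        # (B, K, D) observed semantics
        return self.branch(semantic_embeddings)    # counterfactual logits l_a^cf
```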

    3.2.4 Semantic Bias Recalibration.

    Based on both factual and counterfactual prediction results, we aim to perform semantic debiasing to exploit the effect of semantic experience in a delicate manner. Traditional counterfactual analysis debiasing methods tend to simply subtract counterfactual outcomes from factual outcomes to pursue unbiased outcomes, which can be formulated as follows in the context of our work:
    \(\begin{equation} \begin{aligned}l_a = l_a^f-\lambda \cdot l_a^{cf}, \end{aligned} \end{equation}\)
    (7)
    where \(l_a\) is the final prediction result and \(\lambda\) is a constant coefficient that can either be equal to 1 [3, 40, 69, 76] or to other constant values [48, 60, 71]. We believe adopting the traditional counterfactual analysis scheme is suboptimal because it reduces the influence of semantic bias in a mindless way. As shown in Figure 4(a), the subtraction weights are identical for each instance and each category, which fails to consider the discrepancy among instances and among categories. Therefore, we propose a novel adaptive counterfactual analysis scheme to recalibrate the semantic bias at the instance level as well as the category level. Specifically, the factual prediction logit vector \(l_a^f\) and the counterfactual prediction logit vector \(l_a^{cf}\) are concatenated and run through a linear transformation layer \(P(\cdot)\) to yield a C-dimensional vector (C denotes the number of categories):
    \(\begin{equation} \begin{aligned}l_a^\delta = P \left(\operatorname{concat} \left\langle l_a^f, l_a^{cf} \right\rangle \right)\!, \end{aligned} \end{equation}\)
    (8)
    where \(l_a^\delta\) , \(l_a^f\) , and \(l_a^{cf}\) are vectors of the same length. A nonlinear monotone function \(\sigma (\cdot)\) is further introduced to control the scale within fixed bounds:
    \(\begin{equation} \sigma (x,b_1,b_2) = \operatorname{sigmoid} \left(\frac{x}{b_2-b_1} \right) \cdot (b_2-b_1) +b_1, \end{equation}\)
    (9)
    where x is the independent variable, and \(b_1\) and \(b_2\) (\(b_1\lt 0\lt b_2\)) are hyperparameters corresponding to the normalized lower and upper thresholds, respectively. The final prediction result \(l_a\) is computed as follows:
    \(\begin{equation} l_a = l_a^{f} + l_a^\Delta , \end{equation}\)
    (10)
    where \(l_a^\Delta = \sigma (l_a^\delta , b_1, b_2)\) represents the recalibrated semantic bias; the final outcome is obtained by incorporating the recalibrated semantic bias into the factual outcome. In our framework, we use the standard cross entropy as the training objective:
    \(\begin{equation} \begin{aligned}&\mathcal {L}=- \sum _{i=1}^{N} (y^a)_i \log (\hat{y^a})_i, \\ \end{aligned} \end{equation}\)
    (11)
    where N is the number of examples, \((y^a)_i\) is the ground-truth label, and \((\hat{y^a})_i =\operatorname{softmax}\left((l_a)_i \right)\) is the predicted probability vector for the i-th example.
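    The following PyTorch sketch contrasts the fixed-weight subtraction of Equation (7) with the adaptive recalibration of Equations (8) through (10); the default bounds mirror the values reported in Section 4.2, and all names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def bounded_sigmoid(x, b1=-4.0, b2=4.0):
    """Equation (9): squash x into the interval (b1, b2), with b1 < 0 < b2."""
    return torch.sigmoid(x / (b2 - b1)) * (b2 - b1) + b1


def traditional_counterfactual(l_a_f, l_a_cf, lam=1.0):
    """Equation (7): fixed-weight subtraction, identical for every instance and category."""
    return l_a_f - lam * l_a_cf


class AdaptiveRecalibration(nn.Module):
    """Sketch of Equations (8)-(10): recalibrate the semantic bias per instance and
    per category before adding it back to the factual logits."""

    def __init__(self, n_act, b1=-4.0, b2=4.0):
        super().__init__()
        self.proj = nn.Linear(2 * n_act, n_act)   # the linear transformation P(.)
        self.b1, self.b2 = b1, b2

    def forward(self, l_a_f, l_a_cf):
        l_delta = self.proj(torch.cat([l_a_f, l_a_cf], dim=-1))    # Eq. (8)
        l_big_delta = bounded_sigmoid(l_delta, self.b1, self.b2)   # Eq. (9)
        return l_a_f + l_big_delta                                 # Eq. (10), final logits


# Training objective (Equation (11)): standard cross entropy on the final logits, e.g.,
# loss = F.cross_entropy(final_logits, action_labels)
```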
    Fig. 4. The comparison between the traditional counterfactual analysis (a) and our proposed adaptive counterfactual analysis (b) schemes. Instead of simply subtracting counterfactual predictions from factual predictions with a unified weight \(\lambda\) for each category and each instance, as in the traditional scheme, the adaptive counterfactual analysis scheme considers discrepancy at both the category level and the instance level. The former scheme can be regarded as a special case of the latter if and only if \(\lambda _{ij}\) is identically equal to \(-\lambda\) (\(1\le j \le C\), \(1\le i \le N\)), where C and N represent the number of categories and instances.

    3.2.5 Remark.

    We further explain how the devised adaptive counterfactual analysis scheme can exploit the effect of semantic experience in a delicate manner. First, it considers the discrepancy among categories by projecting the concatenation of the factual prediction logit vector \(l_a^f\) and the counterfactual prediction logit vector \(l_a^{cf}\) to obtain the C-dimensional vector \(l_a^{\delta }\) of the same length. Specifically, the j-th element value represents the prediction score of the j-th category (\(1\le j \le C\)). Each element value of \(l_a^{\delta }\) is obtained by assigning different weights to each element value of \(l_a^f\) and \(l_a^{cf}\), which indicates that we treat each category individually. Second, the adaptive counterfactual analysis scheme also considers the discrepancy among instances by generating a different \((l_a^{\delta })_i\) for each instance (\(1\le i \le N\)). Moreover, given the i-th instance, we obtain the final prediction as \((l_a)_i=(l_a^f)_i+\sigma ((l_a^\delta)_i)\), which is equivalent to performing an element-wise sum of the factual prediction \((l_a^f)_i\) and the adaptively weighted counterfactual prediction \((l_a^{cf})_i\), because each element value \(\lambda _{ij}\) (\(1\le j \le C\)) of the i-th weight vector can be either positive or negative after the operations of Equations (8) and (9). Meanwhile, the weight vectors \([\lambda _{i1},\ldots ,\lambda _{iC}]^T\) (\(1\le i \le N\)) are also diverse at the instance level, as illustrated in Figure 4(b). Therefore, the traditional counterfactual analysis scheme can be regarded as a special case of our proposed adaptive counterfactual analysis scheme if and only if \(\lambda _{ij}\) is identically equal to \(-\lambda\) (\(1\le j \le C\), \(1\le i \le N\)). The underlying motivation of our design is that semantic experience may vary across individuals because of personalized habits in performing actions. For example, one person tends to put meat into the pot first, whereas another tends to put vegetables into the pot first when preparing the same dish. Our method can better adapt to personalized habits by adaptively amplifying or penalizing the effect of semantic experience, which is crucial for enhancing the generalization ability in out-of-distribution scenarios.
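    In equation form, this special-case claim can be restated compactly (using the per-instance weight vector notation of Figure 4, which we introduce here for illustration): if the recalibrated bias degenerates to \((l_a^\Delta)_i = [\lambda _{i1}(l_a^{cf})_{i1},\ldots ,\lambda _{iC}(l_a^{cf})_{iC}]^T\) with \(\lambda _{ij} \equiv -\lambda\) for all \(1\le i \le N\) and \(1\le j \le C\), then \((l_a)_i = (l_a^f)_i + (l_a^\Delta)_i = (l_a^f)_i - \lambda (l_a^{cf})_i\), which is exactly Equation (7).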

    4 Experiment

    4.1 Datasets and Metrics

    We evaluate our proposed method on three large-scale egocentric video datasets—EGTEA Gaze+ [27], EPIC-Kitchens-55 [6], and EPIC-Kitchens-100 [5]—by creating new splits of non-overlapping training and test sets for egocentric compositional action anticipation. Specifically, given a verb class in the test set, we should make sure that it has appeared in the training set, which indicates that any verb class should exist in at least two action classes. The same principle also applies to the noun class. Therefore, we first abandon action classes containing verb categories (or noun categories) that appear only once, then randomly take 20% of action classes to generate the test set. Since the annotations of held-out test data are unavailable for EPIC-Kitchens-55 and EPIC-Kitchens-100, we split their new training and test sets using the public training data. We summarize the dataset statistics in Table 1, including the number of action categories and samples for training and test as well as the total number of verb and noun categories.
    | Dataset           | Action@tr | Action@te | Verb | Noun | Sample@tr | Sample@te |
    |-------------------|-----------|-----------|------|------|-----------|-----------|
    | EGTEA Gaze+       | 65        | 16        | 9    | 29   | 5,827     | 1,758     |
    | EPIC-Kitchens-55  | 1,968     | 492       | 95   | 292  | 22,145    | 6,194     |
    | EPIC-Kitchens-100 | 3,040     | 760       | 96   | 288  | 63,326    | 13,541    |
    Table 1. Datasets Used for Egocentric Compositional Action Anticipation
    Action@tr and Action@te denote the number of action categories for training and test; Sample@tr and Sample@te are defined analogously for samples.
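    A possible realization of this split protocol is sketched below in Python; the input format and helper name are hypothetical, and in practice the random 20% hold-out may need to be resampled until the constraint of Equation (1) is satisfied.

```python
import random
from collections import Counter


def make_compositional_split(segments, test_ratio=0.2, seed=0):
    """Sketch of the split protocol above. `segments` is a list of
    (action_id, verb_id, noun_id) triplets, one per annotated video segment
    (a hypothetical input format)."""
    action_to_vn = {a: (v, n) for a, v, n in segments}
    verb_count = Counter(v for v, _ in action_to_vn.values())
    noun_count = Counter(n for _, n in action_to_vn.values())

    # Keep only action classes whose verb and noun occur in >= 2 action classes.
    kept = [a for a, (v, n) in action_to_vn.items()
            if verb_count[v] >= 2 and noun_count[n] >= 2]

    # Hold out a random 20% of the remaining action classes as the test set.
    random.Random(seed).shuffle(kept)
    n_test = int(round(test_ratio * len(kept)))
    test_actions, train_actions = set(kept[:n_test]), set(kept[n_test:])

    # In practice, resample the hold-out if some test verb or noun no longer
    # occurs in the training action classes, so that Equation (1) holds.
    train = [s for s in segments if s[0] in train_actions]
    test = [s for s in segments if s[0] in test_actions]
    return train, test
```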
    We adopt both class-agnostic and class-aware evaluation metrics. For class-agnostic metrics, we use Top-k (\(k=1,5\)) accuracy, where a prediction is deemed correct if the ground-truth action falls in the Top-k predictions. For class-aware metrics, we use mean Top-k (\(k=1,5\)) recall. Top-k recall for a given class c is defined as the number of class c instances for which c appears in the Top-k predictions, divided by the total number of class c instances. Mean Top-k recall is computed by averaging Top-k recall values over classes.
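    The two metric families can be computed as in the following NumPy sketch (function names are ours for illustration):

```python
import numpy as np


def topk_accuracy(scores, labels, k=5):
    """Class-agnostic Top-k accuracy: fraction of instances whose ground-truth
    class appears among the k highest-scoring predictions."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float((topk == labels[:, None]).any(axis=1).mean())


def mean_topk_recall(scores, labels, k=5):
    """Class-aware mean Top-k recall: Top-k recall averaged over the classes
    that occur in `labels`."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(np.mean([hits[labels == c].mean() for c in np.unique(labels)]))


# scores: (N, C) array of prediction scores; labels: (N,) ground-truth class indices.
```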

    4.2 Implementation Details

    We sample video frames with a unified timestep of \(\epsilon\) = 0.25 seconds. The anticipation time \(T_a\) is set to 1 second, and the observation time \(T_o\) is set to 3.75 seconds with K = 16 unless otherwise specified. On EPIC-Kitchens-55 and EPIC-Kitchens-100, \(\phi _v\) and \(\phi _n\) are TSM [28] and Faster R-CNN [51], which produce 2048-dimensional and 352-dimensional feature vectors as verb-aware and noun-aware visual embeddings, respectively. On EGTEA Gaze+, since no object annotations are given, both \(\phi _v\) and \(\phi _n\) are TSN [68], which produces 1024-dimensional vectors for verb-aware and noun-aware visual embeddings. On all datasets, \(\psi _v\) and \(\psi _n\) are instantiated as a linear classifier followed by word2vec [37], connected to \(\phi _v\) and \(\phi _n\), respectively, to obtain 512-dimensional vectors for both verb-aware and noun-aware semantic embeddings. For the encoder-decoder, we consider both recurrent and self-attention-based architectures. The former consists of two one-layer, 1024-dimensional LSTMs [15], whereas the latter contains two-layer, 512-dimensional Transformer [66] encoders and decoders with 8 attention heads. The MLP used for multi-modal fusion consists of three fully connected layers with 512, 128, and 2 hidden units, respectively, along with the ReLU activation function. The normalized lower bound \(b_1\) and upper bound \(b_2\) are set to –4 and 4, respectively. We use a stochastic gradient descent optimizer to train the framework with a learning rate of 0.01 and momentum of 0.9. The batch size is 128. To regularize the training and avoid overfitting, dropout with retain probability 0.8 is used. All experiments are trained for 200 epochs.
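    As a small worked example of this sampling setup (assuming frames are taken at both clip boundaries inclusive, which yields K = 16 for the default \(T_o\) = 3.75s and \(\epsilon\) = 0.25s):

```python
def observed_frame_timestamps(action_start, t_a=1.0, t_o=3.75, eps=0.25):
    """Timestamps of the observed frames for an action starting at `action_start`
    seconds: every `eps` seconds over [T - t_a - t_o, T - t_a], boundaries included
    (an assumption), which gives K = t_o / eps + 1 = 16 frames by default."""
    clip_start = action_start - t_a - t_o
    k = int(round(t_o / eps)) + 1
    return [clip_start + i * eps for i in range(k)]


# An action starting at T = 100.0 s is anticipated from frames sampled at
# 95.25 s, 95.50 s, ..., 99.00 s, i.e., up to 1 s before the action starts.
timestamps = observed_frame_timestamps(100.0)
assert len(timestamps) == 16 and timestamps[-1] == 99.0
```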

    4.3 Performance Comparison

    Since we are the first to explicitly define the egocentric compositional action anticipation problem, there are no direct comparisons to state-of-the-art methods. Apart from our proposed ASD framework, we experiment with the following baselines for performance comparison (Tables 2–4). Random is derived from Damen et al. [5] by randomly choosing one option from the candidate categories as the predicted result. MM-LSTM is derived from Furnari and Farinella [11], using the LSTM encoder-decoder to take both visual and semantic modalities as inputs without semantic debiasing (i.e., factual predictions). MM-Trans is derived from Roy and Fernando [55], using the Transformer encoder-decoder to take both visual and semantic modalities as inputs without semantic debiasing (i.e., factual predictions). CA-LSTM and CA-Trans are derived from prior work [76] by adopting the traditional counterfactual analysis scheme on top of the LSTM and Transformer encoder-decoders, respectively. Variants with different subtraction coefficients \(\lambda\) in Equation (7) are also considered, and we choose \(\lambda \in \lbrace 0.5,1.0,1.5\rbrace\) for a comprehensive evaluation. We keep the model architectures identical across all LSTM-based and Transformer-based methods for fair comparison.
    | Method                       | Top-1 Acc. | Top-5 Acc. | Mean Top-1 Rec. | Mean Top-5 Rec. |
    |------------------------------|------------|------------|-----------------|-----------------|
    | Random                       | 2.37       | 5.28       | 1.31            | 4.16            |
    | MM-LSTM                      | 16.85      | 31.66      | 11.73           | 28.54           |
    | MM-Trans                     | 17.18      | 33.28      | 12.08           | 28.07           |
    | CA-LSTM (\(\lambda\) = 0.5)  | 16.41      | 33.13      | 12.17           | 28.22           |
    | CA-LSTM (\(\lambda\) = 1.0)  | 16.54      | 32.87      | 12.05           | 28.34           |
    | CA-LSTM (\(\lambda\) = 1.5)  | 16.04      | 31.77      | 12.15           | 26.53           |
    | CA-Trans (\(\lambda\) = 0.5) | 17.21      | 32.34      | 13.07           | 27.93           |
    | CA-Trans (\(\lambda\) = 1.0) | 17.33      | 33.40      | 13.26           | 28.23           |
    | CA-Trans (\(\lambda\) = 1.5) | 16.53      | 33.44      | 12.22           | 28.82           |
    | ASD-LSTM (Ours)              | 18.23      | 33.81      | 12.80           | 30.41           |
    | ASD-Trans (Ours)             | 18.78      | 34.77      | 13.96           | 30.12           |
    Table 2. Performance Comparison among Different Methods on EGTEA Gaze+ (%)
    | Method                       | Top-1 Acc. | Top-5 Acc. | Mean Top-1 Rec. | Mean Top-5 Rec. |
    |------------------------------|------------|------------|-----------------|-----------------|
    | Random                       | 0.17       | 0.49       | 0.03            | 0.25            |
    | MM-LSTM                      | 2.24       | 11.46      | 0.89            | 1.77            |
    | MM-Trans                     | 3.66       | 14.32      | 1.50            | 2.91            |
    | CA-LSTM (\(\lambda\) = 0.5)  | 3.11       | 13.52      | 1.47            | 3.73            |
    | CA-LSTM (\(\lambda\) = 1.0)  | 2.88       | 13.48      | 1.53            | 3.62            |
    | CA-LSTM (\(\lambda\) = 1.5)  | 2.67       | 13.24      | 1.47            | 3.25            |
    | CA-Trans (\(\lambda\) = 0.5) | 3.20       | 15.85      | 1.45            | 3.66            |
    | CA-Trans (\(\lambda\) = 1.0) | 3.22       | 15.24      | 1.40            | 3.28            |
    | CA-Trans (\(\lambda\) = 1.5) | 3.31       | 15.12      | 1.29            | 3.24            |
    | ASD-LSTM (Ours)              | 3.25       | 15.78      | 1.76            | 3.60            |
    | ASD-Trans (Ours)             | 3.96       | 16.43      | 1.92            | 4.89            |
    Table 3. Performance Comparison among Different Methods on EPIC-Kitchens-55 (%)
    | Method                       | Top-1 Acc. | Top-5 Acc. | Mean Top-1 Rec. | Mean Top-5 Rec. |
    |------------------------------|------------|------------|-----------------|-----------------|
    | Random                       | 0.53       | 1.32       | 0.09            | 0.11            |
    | MM-LSTM                      | 2.91       | 12.97      | 0.81            | 1.29            |
    | MM-Trans                     | 3.82       | 14.60      | 1.64            | 1.69            |
    | CA-LSTM (\(\lambda\) = 0.5)  | 2.87       | 12.83      | 1.06            | 1.29            |
    | CA-LSTM (\(\lambda\) = 1.0)  | 3.28       | 13.11      | 1.19            | 1.35            |
    | CA-LSTM (\(\lambda\) = 1.5)  | 3.24       | 12.95      | 1.22            | 1.42            |
    | CA-Trans (\(\lambda\) = 0.5) | 4.49       | 15.04      | 2.21            | 3.34            |
    | CA-Trans (\(\lambda\) = 1.0) | 4.42       | 15.35      | 2.77            | 3.54            |
    | CA-Trans (\(\lambda\) = 1.5) | 4.26       | 15.18      | 2.06            | 3.01            |
    | ASD-LSTM (Ours)              | 3.32       | 16.94      | 1.37            | 2.55            |
    | ASD-Trans (Ours)             | 4.87       | 17.30      | 2.87            | 3.68            |
    Table 4. Performance Comparison among Different Methods on EPIC-Kitchens-100 (%)
    Tables 2–4 show the performance comparison results on the three datasets. We can see that the Random baseline performs worst among all methods, revealing the inherent uncertainty of predicting unknown actions. Instead of mindlessly guessing, MM-LSTM and MM-Trans achieve consistent improvement over the Random baseline. This demonstrates the effectiveness of data-driven deep neural networks, which play a role similar to System 1. Meanwhile, the Transformer tends to outperform the LSTM as the encoder-decoder network, which indicates the advantage of the self-attention mechanism over recurrent modeling in capturing long-term temporal dependencies. We observe that introducing traditional counterfactual analysis on top of deep neural networks improves performance in general while degrading on several metrics. This suggests that reducing the influence of semantic bias in a mindless way is suboptimal. In addition, the selection of the optimal subtraction coefficient \(\lambda\) varies with datasets, showing a lack of generalization ability. Furthermore, our proposed ASD consistently and more significantly improves the anticipation performance. This verifies the superiority of the adaptive counterfactual analysis scheme in exploiting semantic experience delicately by considering the discrepancy both among categories and among examples, which plays a role similar to System 2.

    4.4 Further Analysis

    To further examine the effectiveness of our method, a set of ablation studies is conducted on EGTEA Gaze+ using LSTM encoder-decoder networks. Fixing the beginning of the observation, we report prediction results at multiple anticipation times \(T_a\) (0.5s, 1s, 1.5s, and 2s) by adjusting the corresponding observation times \(T_o\) (4.25s, 3.75s, 3.25s, and 2.75s).

    4.4.1 Effectiveness of Knowledge-Driven Strategy.

    We assess the contribution of the knowledge-driven strategy used to compose verb logits and noun logits. For the baseline method, the knowledge-driven strategy is removed, which means no prior knowledge can be leveraged to compose verb logits and noun logits (i.e., \(M_{va}\) and \(M_{na}\) are identity matrices). As shown in Figure 5, adopting the knowledge-driven strategy brings significant and consistent performance improvement over the baseline at different anticipation times, which demonstrates the effectiveness of leveraging prior knowledge to reduce uncertainty for egocentric compositional action anticipation.
    Fig. 5. Ablation studies on the knowledge-driven strategy (KDS) at different anticipation times.

    4.4.2 Effectiveness of Attention-Based Multi-Modal Fusion.

    To assess the role of the attention-based multi-modal fusion, we compare it with the baseline of average weighted multi-modal fusion strategy (i.e., \(\mu _a^{vis}=\mu _a^{sem}=0.5\) ). We can see from Figure 6 that adopting the attention-based fusion strategy outperforms the average weighted multi-modal fusion strategy at different anticipation times. Considering the complementarity of visual and semantic modalities, it is beneficial to fuse their predictions by assigning them modality-specific weights to produce more reliable factual predictions, which paves the way for semantic bias recalibration.
    Fig. 6. Ablation studies on the attention-based multi-modal fusion strategy at different anticipation times.

    4.4.3 Effectiveness of Normalization in Semantic Bias Recalibration.

    As introduced in Section 3.2.4, we employ a normalization function \(\sigma (\cdot)\) in the process of semantic bias recalibration. To evaluate its contribution, we consider two baselines for comparison. One baseline removes the element-wise normalization function, which means \(l_a^\Delta =l_a^\delta\). The other baseline uses the normalization function without the bound parameters \(b_1\) and \(b_2\), that is, the original sigmoid function. As shown in Table 5, the former baseline performs worst at all anticipation times, which indicates the necessity of introducing a normalization function to enhance nonlinearity, since it follows a linear transformation (Equation (8)). In addition, using the bound parameters brings further consistent performance improvement, demonstrating the effectiveness of controlling the scale within fixed bounds. Specifically, the negative value \(b_1\) and the positive value \(b_2\) ensure that each element of the recalibrated semantic bias \(l_a^\Delta\) can be positive or negative, which enables us to adaptively amplify or penalize the effect of semantic experience.
    | Norm           | Bounds         | Top-1 Acc. (%): \(T_a\)=2s | 1.5s  | 1s    | 0.5s  | Mean Top-1 Rec. (%): \(T_a\)=2s | 1.5s  | 1s    | 0.5s  |
    |----------------|----------------|----------------------------|-------|-------|-------|---------------------------------|-------|-------|-------|
    |                |                | 14.92                      | 15.64 | 16.96 | 17.85 | 10.61                           | 10.82 | 11.33 | 12.49 |
    | \(\checkmark\) |                | 15.43                      | 15.79 | 17.39 | 18.63 | 11.32                           | 11.73 | 12.29 | 13.53 |
    | \(\checkmark\) | \(\checkmark\) | 16.08                      | 16.97 | 18.23 | 19.46 | 11.65                           | 12.38 | 12.80 | 14.41 |

    | Norm           | Bounds         | Top-5 Acc. (%): \(T_a\)=2s | 1.5s  | 1s    | 0.5s  | Mean Top-5 Rec. (%): \(T_a\)=2s | 1.5s  | 1s    | 0.5s  |
    |----------------|----------------|----------------------------|-------|-------|-------|---------------------------------|-------|-------|-------|
    |                |                | 28.23                      | 30.78 | 32.26 | 34.75 | 25.67                           | 27.47 | 28.39 | 30.75 |
    | \(\checkmark\) |                | 28.57                      | 31.62 | 32.41 | 34.88 | 26.93                           | 28.18 | 29.85 | 32.24 |
    | \(\checkmark\) | \(\checkmark\) | 30.90                      | 32.45 | 33.81 | 35.63 | 27.38                           | 28.67 | 30.41 | 33.16 |
    Table 5. Ablation Experimental Results with Respect to the Normalization Function in Semantic Bias Recalibration at Different Anticipation Times

    4.4.4 Qualitative Analysis.

    As shown in Figure 7, we present some qualitative examples of prediction results obtained by MM-LSTM, CA-LSTM (\(\lambda =1\)), and our proposed ASD-LSTM. For the first four columns, we list the Top-5 predictions at four anticipation times \(T_a \in \left\lbrace 2s, 1.5s, 1s, 0.5s \right\rbrace\). Blue indicates that the prediction matches the ground truth, which corresponds to the last column. We can observe that, due to the uncertainty of the future, multiple predictions are plausible. For all methods, the anticipation results become more accurate as the anticipation time gets shorter, which is consistent with our intuition. From these cases, we can see that the representative data-driven method MM-LSTM can partly reduce the uncertainty by making the ground-truth actions appear in its Top-5 predictions as the anticipation time gets shorter. Unfortunately, the data-driven learning process makes MM-LSTM prone to being misled by semantic bias, which leads to incorrect predictions such as “put sponge” after observing “take sponge” at \(T_a=0.5s\), as shown in the first case. CA-LSTM is intended to help deep neural networks like MM-LSTM think out of the box by introducing counterfactual analysis. However, it appears unstable because the traditional counterfactual analysis scheme reduces the semantic bias in a mindless way. For example, it promotes the ranking of the ground-truth action “wash cup” at \(T_a=1s\) and \(0.5s\) compared to MM-LSTM in the first case. However, it performs even worse than MM-LSTM in the second case, where the ground-truth action “take pan” in the Top-5 predictions of MM-LSTM disappears at \(T_a=1s\) and \(0.5s\). Although it reduces the probability of “close drawer,” it also lowers the chance of the ground-truth action “take pan” by treating them in the same way. The underlying reason for the unstable performance of CA-LSTM is that the traditional counterfactual analysis scheme does not consider the discrepancy among categories and among instances. In other words, it cannot ensure that the effect of semantic experience is fully utilized. Our proposed ASD-LSTM mitigates this limitation with the adaptive counterfactual analysis scheme. We can see from the two cases that the ground-truth actions rank first in the predictions of ASD-LSTM at \(T_a=1s\) and \(0.5s\). Our method aims at knowing why this action is going to happen rather than others by making the most of the semantic experience. Taking the second case as an example, the observation indicates that the drawer is gradually opened and the hands move closer toward the pan. By assigning different weights to categories such as “close drawer” and “take pan,” ASD-LSTM lowers the probability of “close drawer” and makes the ground-truth action “take pan” stand out. In general, our method can exploit the effect of semantic experience in a delicate manner.
    Fig. 7. Qualitative examples on EGTEA Gaze+ with anticipation time \(T_a = 2s, 1.5s, 1s, 0.5s\) . We list the Top-5 action predictions obtained by MM-LSTM, CA-LSTM ( \(\lambda =1\) ), and our proposed ASD-LSTM at each anticipation time. Blue means the prediction matches the ground truth.

    4.5 Results on Conventional Egocentric Action Anticipation

    Although the proposed ASD framework is tailored for the out-of-distribution scenario, we also conduct experiments on conventional egocentric action anticipation to see whether it maintains competitive performance in the in-distribution scenario. Table 6 presents the conventional egocentric action anticipation results on EGTEA Gaze+, where 8,299 action instances are used for training and 2,022 instances for test with 106 action categories. We report the average performance over the three official splits. Compared with state-of-the-art methods, our proposed ASD achieves competitive performance, which demonstrates its effectiveness in the in-distribution scenario. It is also worth mentioning that a line of existing conventional egocentric action anticipation methods [18, 31, 47, 72, 73] use future frames during training to distill future content into observations, which is orthogonal to our work. The main reason some methods [18, 31, 47] surpass us in terms of Top-5 accuracy is that they use actual future features during training, which provide more powerful supervisory signals. In contrast, our method does not take future features as supervisory signals, which further indicates the advantage of our ASD idea of focusing on the observed content without requiring future frames. Meanwhile, to gain a deeper understanding of our method, we also show some failure cases obtained by ASD-LSTM in Figure 8. In the first case, our method fails to predict the ground-truth action “cut cucumber” because it mistakenly identifies the cucumber as a courgette, as they are highly similar. In the second case, the ground-truth future action is “take cheese,” but cheese does not appear in the observed frames. It is extremely difficult to accurately predict “take cheese” because the observed information is limited. This further indicates that egocentric action anticipation is a quite challenging task.
    | Method           | Content | Top-5 Acc. | Mean Top-5 Rec. |
    |------------------|---------|------------|-----------------|
    | DMR [67]         | Obs     | 55.70      | 38.11           |
    | ATSN [6]         | Obs     | 40.53      | 31.61           |
    | MCE [9]          | Obs     | 56.29      | 43.75           |
    | ED [13]          | Obs     | 60.18      | 54.61           |
    | FN [7]           | Obs     | 60.12      | 49.82           |
    | RL [35]          | Obs     | 62.74      | 52.17           |
    | EL [20]          | Obs     | 63.76      | 55.11           |
    | RULSTM [11]      | Obs     | 66.40      | 58.64           |
    | ImagineRNN [72]  | Obs+Fut | 66.71      | —               |
    | KDLM [2]         | Obs     | 68.74      | —               |
    | SRL [47]         | Obs+Fut | 70.67      | —               |
    | SF-RULSTM [43]   | Obs     | 67.60      | —               |
    | MGRKD [18]       | Obs+Fut | 70.86      | —               |
    | HRO [73]         | Obs+Fut | 71.46      | —               |
    | DCR [31]         | Obs+Fut | 67.90      | 61.10           |
    | ASD-LSTM (Ours)  | Obs     | 68.25      | 62.21           |
    | ASD-Trans (Ours) | Obs     | 69.84      | 63.38           |
    Table 6. Conventional Egocentric Action Anticipation Results on EGTEA Gaze+ (%)
    Although ASD is tailored for the out-of-distribution scenario, it is also effective in the in-distribution scenario. Obs+Fut denotes that future frames are also used during training apart from observed frames.
    Fig. 8. Failure cases of our method on EGTEA Gaze+ with anticipation time \(T_a = 2s, 1.5s, 1s, 0.5s\) .

    5 Conclusion

    In our work, we explored a novel and practical egocentric compositional action anticipation problem, which is neglected by existing works. To address it, we proposed the ASD framework, which combines deep neural networks with an adaptive counterfactual analysis scheme and takes inspiration from the human brain’s cognitive process of coordinating System 1 and System 2. The former is good at memorizing what action will happen after observation, whereas the latter aims at knowing why this action is going to happen rather than others by exploiting the semantic experience in a delicate manner. Experimental results on three large-scale egocentric video datasets demonstrated that our method achieves state-of-the-art performance in the out-of-distribution scenario while maintaining competitive performance in the in-distribution scenario.
    Our work can be extended in multiple directions. First, although the representation learning of visual and semantic modalities goes beyond the scope of our work and is an orthogonal line of development, stronger backbones can be introduced to enhance multi-modal representation learning. Second, a simple yet effective knowledge-driven strategy is used to compose verbs and nouns, which suggests that an external knowledge graph might be considered in future work. Furthermore, integrating egocentric compositional action anticipation into streaming or untrimmed scenarios is also worth exploring, which moves closer toward practical applications.

    References

    [1]
    Alejandro Betancourt, Pietro Morerio, Carlo S. Regazzoni, and Matthias Rauterberg. 2015. The evolution of first person vision methods: A survey. IEEE Transactions on Circuits and Systems for Video Technology 25, 5 (2015), 744–760.
    [2]
    Guglielmo Camporese, Pasquale Coscia, Antonino Furnari, Giovanni Maria Farinella, and Lamberto Ballan. 2021. Knowledge distillation for action anticipation via label smoothing. In Proceedings of the International Conference on Pattern Recognition. 3312–3319.
    [3]
    Guangyi Chen, Junlong Li, Jiwen Lu, and Jie Zhou. 2021. Human trajectory prediction via counterfactual analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9824–9833.
    [4]
    Victor Chernozhukov, Iván Fernández-Val, and Blaise Melly. 2013. Inference on counterfactual distributions. Econometrica 81, 6 (2013), 2205–2268.
    [5]
    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. 2022. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision 130, 1 (2022), 33–55.
    [6]
    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. 2018. Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision. 720–736.
    [7]
    Roeland De Geest and Tinne Tuytelaars. 2018. Modeling temporal structure with LSTM for online action detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1549–1557.
    [8]
    Eadom Dessalene, Chinmaya Devaraj, Michael Maynord, Cornelia Fermuller, and Yiannis Aloimonos. 2021. Forecasting action through contact representations from first person video. IEEE Transactions on Pattern Analysis and Machine Intelligence. Early Access, January 28, 2021.
    [9]
    Antonino Furnari, Sebastiano Battiato, and Giovanni Maria Farinella. 2018. Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation. In Proceedings of the European Conference on Computer Vision Workshops. 1–17.
    [10]
    Antonino Furnari and Giovanni Maria Farinella. 2019. What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6252–6261.
    [11]
    Antonino Furnari and Giovanni Maria Farinella. 2021. Rolling-unrolling LSTMs for action anticipation from first-person video. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 11 (2021), 4021–4036.
    [12]
    Antonino Furnari and Giovanni Maria Farinella. 2022. Towards streaming egocentric action anticipation. In Proceedings of the International Conference on Pattern Recognition. 1250–1257.
    [13]
    Jiyang Gao, Zhenheng Yang, and Ram Nevatia. 2017. RED: Reinforced encoder-decoder networks for action anticipation. In Proceedings of the British Machine Vision Conference. 1–11.
    [14]
    Rohit Girdhar and Kristen Grauman. 2021. Anticipative video transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13505–13515.
    [15]
    Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
    [16]
    Xuejiao Hu, Jingzhao Dai, Ming Li, Chenglei Peng, Yang Li, and Sidan Du. 2022. Online human action detection and anticipation in videos: A survey. Neurocomputing 491 (2022), 395–413.
    [17]
    Yi Huang, Xiaoshan Yang, Junyu Gao, Jitao Sang, and Changsheng Xu. 2020. Knowledge-driven egocentric multimodal activity recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 4 (2020), 1–133.
    [18]
    Yi Huang, Xiaoshan Yang, and Changsheng Xu. 2021. Multimodal global relation knowledge distillation for egocentric action anticipation. In Proceedings of the ACM International Conference on Multimedia. 245–254.
    [19]
    Matthew S. Hutchinson and Vijay N. Gadepally. 2021. Video action understanding: A tutorial. IEEE Access 9 (2021), 134611–134637.
    [20]
    Ashesh Jain, Avi Singh, Hema S. Koppula, Shane Soh, and Ashutosh Saxena. 2016. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In Proceedings of the IEEE International Conference on Robotics and Automation. 3118–3125.
    [21]
    Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan.
    [22]
    Brayden G. King. 2008. A political mediation model of corporate response to social movement activism. Administrative Science Quarterly 53, 3 (2008), 395–421.
    [23]
    Donald E. Knuth. 1992. Two notes on notation. American Mathematical Monthly 99, 5 (1992), 403–422.
    [24]
    Brenden M. Lake. 2014. Towards More Human-Like Concept Learning in Machines: Compositionality, Causality, and Learning-to-Learn. Ph.D. Dissertation. Massachusetts Institute of Technology, Cambridge, MA.
    [25]
    Brenden M. Lake, Tomer Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. 2017. Building machines that learn and think like people. Behavioral and Brain Sciences 40 (2017), 1–101.
    [26]
    Yin Li, Alireza Fathi, and James M. Rehg. 2013. Learning to predict gaze in egocentric video. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3216–3223.
    [27]
    Yin Li, Miao Liu, and James M. Rehg. 2018. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision. 619–635.
    [28]
    Ji Lin, Chuang Gan, and Song Han. 2019. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7083–7093.
    [29]
    Miao Liu, Siyu Tang, Yin Li, and James M. Rehg. 2020. Forecasting human-object interaction: Joint prediction of motor attention and actions in first person video. In Proceedings of the European Conference on Computer Vision. 704–721.
    [30]
    Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. 2022. Joint hand motion and interaction hotspots prediction from egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3282–3292.
    [31]
    Tianshan Liu and Kin-Man Lam. 2022. A hybrid egocentric activity anticipation framework via memory-augmented recurrent and one-shot representation forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13904–13913.
    [32]
    Xiaohao Liu, Zhulin Tao, Jiahong Shao, Lifang Yang, and Xianglin Huang. 2022. EliMRec: Eliminating single-modal bias in multimedia recommendation. In Proceedings of the ACM International Conference on Multimedia. 687–695.
    [33]
    Zhekun Luo, Shalini Ghosh, Devin Guillory, Keizo Kato, Trevor Darrell, and Huijuan Xu. 2022. Disentangled action recognition with knowledge bases. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. 559–572.
    [34]
    Lei Ma, Yuhui Zheng, Zhao Zhang, Yazhou Yao, Xijian Fan, and Qiaolin Ye. 2022. Motion stimulation for compositional action recognition. IEEE Transactions on Circuits and Systems for Video Technology. Early Access, November 14, 2022.
    [35]
    Shugao Ma, Leonid Sigal, and Stan Sclaroff. 2016. Learning activity progression in LSTMs for activity detection and early detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1942–1950.
    [36]
    Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. 2020. Something-else: Compositional action recognition with spatial-temporal interaction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1049–1059.
    [37]
    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems. 1–9.
    [38]
    Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, and Kristen Grauman. 2020. Ego-TOPO: Environment affordances from egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 163–172.
    [39]
    Katsuyuki Nakamura, Hiroki Ohashi, and Mitsuhiro Okada. 2021. Sensor-augmented egocentric-video captioning with dynamic modal attention. In Proceedings of the ACM International Conference on Multimedia. 4220–4229.
    [40]
    Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. 2021. Counterfactual VQA: A cause-effect look at language bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12700–12710.
    [41]
    Yulei Niu and Hanwang Zhang. 2021. Introspective distillation for robust question answering. In Proceedings of the Advances in Neural Information Processing Systems. 16292–16304.
    [42]
    Adrián Núñez-Marcos, Gorka Azkune, Eneko Agirre, Diego López-de-Ipiña, and Ignacio Arganda-Carreras. 2020. Using external knowledge to improve zero-shot action recognition in egocentric videos. In Proceedings of the International Conference on Image Analysis and Recognition. 174–185.
    [43]
    Nada Osman, Guglielmo Camporese, Pasquale Coscia, and Lamberto Ballan. 2021. SlowFast rolling-unrolling LSTMs for action anticipation in egocentric videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 3437–3445.
    [44]
    Yonghua Pan, Zechao Li, Liyan Zhang, and Jinhui Tang. 2022. Causal inference with knowledge distilling and curriculum learning for unbiased VQA. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 3 (2022), 1–23.
    [45]
    Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. 2016. Causal Inference in Statistics: A Primer. John Wiley & Sons.
    [46]
    Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.
    [47]
    Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, Qingming Huang, and Qi Tian. 2021. Self-regulated learning for egocentric video activity anticipation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Early Access, February 17, 2021.
    [48]
    Chen Qian, Fuli Feng, Lijie Wen, Chunping Ma, and Pengjun Xie. 2021. Counterfactual inference for text classification debiasing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 5434–5445.
    [49]
    Gorjan Radevski, Marie-Francine Moens, and Tinne Tuytelaars. 2021. Revisiting spatio-temporal layouts for compositional action recognition. In Proceedings of the British Machine Vision Conference. 1–16.
    [50]
    Yongming Rao, Guangyi Chen, Jiwen Lu, and Jie Zhou. 2021. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1025–1034.
    [51]
    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems. 91–99.
    [52]
    Lorenzo Richiardi, Rino Bellocco, and Daniela Zugna. 2013. Mediation analysis in epidemiology: Methods, interpretation and bias. International Journal of Epidemiology 42, 5 (2013), 1511–1519.
    [53]
    Ivan Rodin, Antonino Furnari, Dimitrios Mavroeidis, and Giovanni Maria Farinella. 2021. Predicting the future from first person (egocentric) vision: A survey. Computer Vision and Image Understanding 211 (2021), 103252.
    [54]
    Ivan Rodin, Antonino Furnari, Dimitrios Mavroeidis, and Giovanni Maria Farinella. 2022. Untrimmed action anticipation. In Proceedings of the International Conference on Image Analysis and Processing. 337–348.
    [55]
    Debaditya Roy and Basura Fernando. 2021. Action anticipation using pairwise human-object interactions and transformers. IEEE Transactions on Image Processing 30 (2021), 8116–8129.
    [56]
    Abhimanyu Sahu and Ananda S. Chowdhury. 2021. Together recognizing, localizing and summarizing actions in egocentric videos. IEEE Transactions on Image Processing 30 (2021), 4330–4340.
    [57]
    Fadime Sener, Dipika Singhania, and Angela Yao. 2020. Temporal aggregate representations for long-range video understanding. In Proceedings of the European Conference on Computer Vision. 154–171.
    [58]
    Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems. 568–576.
    [59]
    Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. 2021. Learning to recognize actions on objects in egocentric video with attention dictionaries. IEEE Transactions on Pattern Analysis and Machine Intelligence. Early Access, February 11, 2021.
    [60]
    Pengzhan Sun, Bo Wu, Xunsong Li, Wen Li, Lixin Duan, and Chuang Gan. 2021. Counterfactual debiasing inference for compositional action recognition. In Proceedings of the ACM International Conference on Multimedia. 3220–3228.
    [61]
    Teng Sun, Wenjie Wang, Liqiang Jing, Yiran Cui, Xuemeng Song, and Liqiang Nie. 2022. Counterfactual reasoning for out-of-distribution multimodal sentiment analysis. In Proceedings of the ACM International Conference on Multimedia. 15–23.
    [62]
    Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. 2020. Long-tailed classification by keeping the good and removing the bad momentum causal effect. In Proceedings of the Advances in Neural Information Processing Systems. 1513–1524.
    [63]
    Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. 2020. Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3716–3725.
    [64]
    Daksh Thapar, Aditya Nigam, and Chetan Arora. 2020. Recognizing camera wearer from hand gestures in egocentric videos. In Proceedings of the ACM International Conference on Multimedia. 2095–2103.
    [65]
    Bing Tian, Yixin Cao, Yong Zhang, and Chunxiao Xing. 2022. Debiasing NLU models via causal intervention and counterfactual reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence. 11376–11384.
    [66]
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. 1–11.
    [67]
    Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 98–106.
    [68]
    Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision. 20–36.
    [69]
    Shunli Wang, Shuaibing Wang, Bo Jiao, Dingkang Yang, Liuzhen Su, Peng Zhai, Chixiao Chen, and Lihua Zhang. 2022. CA-SpaceNet: Counterfactual analysis for 6D pose estimation in space. In Proceedings of the International Conference on Intelligent Robots and Systems. 10627–10634.
    [70]
    Xiaohan Wang, Linchao Zhu, Yu Wu, and Yi Yang. 2020. Symbiotic attention for egocentric action recognition with object-centric alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence. Early Access, August 11, 2020.
    [71]
    Junfei Wu, Qiang Liu, Weizhi Xu, and Shu Wu. 2022. Bias mitigation for evidence-aware fake news detection by causal intervention. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. 2308–2313.
    [72]
    Yu Wu, Linchao Zhu, Xiaohan Wang, Yi Yang, and Fei Wu. 2020. Learning to anticipate egocentric actions by imagination. IEEE Transactions on Image Processing 30 (2020), 1143–1152.
    [73]
    Xinyu Xu, Yong-Lu Li, and Cewu Lu. 2022. Learning to anticipate future with dynamic context removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12734–12744.
    [74]
    Rui Yan, Peng Huang, Xiangbo Shu, Junhao Zhang, Yonghua Pan, and Jinhui Tang. 2022. Look less think more: Rethinking compositional action recognition. In Proceedings of the ACM International Conference on Multimedia. 3666–3675.
    [75]
    Olga Zatsarynna, Yazan Abu Farha, and Juergen Gall. 2021. Multi-modal temporal convolutional network for anticipating actions in egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2249–2258.
    [76]
    Tianyu Zhang, Weiqing Min, Jiahao Yang, Tao Liu, Shuqiang Jiang, and Yong Rui. 2021. What if we could not see? Counterfactual analysis for egocentric action anticipation. In Proceedings of the International Joint Conference on Artificial Intelligence. 1316–1322.
    [77]
    Tianyu Zhang, Weiqing Min, Ying Zhu, Yong Rui, and Shuqiang Jiang. 2020. An egocentric action anticipation framework via fusing intuition and analysis. In Proceedings of the ACM International Conference on Multimedia. 402–410.
    [78]
    Yun C. Zhang, Yin Li, and James M. Rehg. 2017. First-person action decomposition and zero-shot learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 121–129.
    [79]
    Na Zheng, Xuemeng Song, Tianyu Su, Weifeng Liu, Yan Yan, and Liqiang Nie. 2022. Egocentric early action prediction via adversarial knowledge distillation. ACM Transactions on Multimedia Computing, Communications, and Applications. Early Access, June 16, 2022.
    [80]
    Zeyun Zhong, David Schneider, Michael Voit, Rainer Stiefelhagen, and Jürgen Beyerer. 2023. Anticipative feature fusion transformer for multi-modal action anticipation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 6068–6077.

      Information

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 5
      May 2024, 650 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3613634
      Editor: Abdulmotaleb El Saddik
      This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 January 2024
      Online AM: 04 December 2023
      Accepted: 21 October 2023
      Revised: 16 August 2023
      Received: 19 January 2023
      Published in TOMM Volume 20, Issue 5

      Author Tags

      1. Egocentric video understanding
      2. compositional action anticipation
      3. semantic bias
      4. adaptive counterfactual analysis

      Qualifiers

      • Research-article

      Funding Sources

      • National Key Research and Development Project of New Generation Artificial Intelligence of China
