
Mixed Attention and Channel Shift Transformer for Efficient Action Recognition

Published: 10 March 2025

Abstract

The practical use of Transformer-based methods for video processing is constrained by their high computational complexity. Although previous approaches adopt a spatiotemporal decomposition of 3D attention to mitigate the issue, they neglect the majority of visual tokens. This article presents a novel mixed attention operation that subtly fuses the random, spatial, and temporal attention mechanisms. The proposed random attention stochastically samples video tokens in a simple yet effective way, complementing the other attention methods. Furthermore, since the attention operation concentrates on learning long-distance relationships, we employ the channel shift operation to encode short-term temporal characteristics. Our model provides more comprehensive motion representations thanks to the amalgamation of these techniques. Experimental results show that the proposed method produces competitive action recognition results with low computational overhead on both large-scale and small-scale public video datasets.

1 Introduction

With the growing amount of video data, action recognition is playing an increasingly important role in many applications, including sports [41], health care [40], and human–computer interaction [21]. With the emergence of deep learning techniques, numerous Convolutional Neural Network (CNN)-based algorithms have been presented to handle action videos. These techniques employ either 3D/(2 + 1)D convolution [7, 42, 43, 46] or 2D convolution with a light temporal modeling module [28, 32, 56] to learn spatiotemporal representations. To overcome CNNs' shortcoming of focusing mostly on local patterns, some attention-based approaches [4, 11, 15, 35, 39] have recently been put forward to encode global video features, but they also come with higher time complexity.
Decomposing 3D attention into spatial and temporal components is a commonly used method to reduce computational overhead. For instance, TimeSformer [4] and ViViT [1] utilize 2D attention and 1D attention, respectively, to learn appearance and motion features. X-ViT [5] combines spatial attention and temporal attention limited to a local time window. Longformer [3] and BIGBIRD [61] leverage multiple attention (e.g., random attention and sliding window attention) to facilitate the efficient processing of extended sequences in natural language processing tasks. Motivated by their success, we investigate the mixture of several attention mechanisms to establish the long-term dependencies between different regions of videos in a computationally inexpensive manner. As depicted in Figure 1, the blue, orange, and green cells, respectively, represent the random, spatial, and temporal tokens that are utilized to calculate the relations with the central token in the middle frame. Clearly, spatial attention focuses on the correlation between the current token and tokens within the same frame for appearance modeling, while temporal attention considers the correlation between the current token and tokens in the same position across different frames for motion modeling. Random attention explores the correlation between the current token and a set of randomly selected tokens, facilitating efficient learning of long-range dependencies. Hence, using only spatial and temporal attention overlooks a significant amount of relevant token information, while the adoption of random attention provides a convenient way to bring complementary information. We primarily explore how to combine the three types of attention in a suitable way to fully leverage their individual functionalities. Additionally, we also study the proportion of tokens selected in random attention to achieve a balance between performance and computational cost.
Fig. 1. Motivation of the proposed mixed attention operation. In random, spatial, and temporal attention, tokens related to the central token highlighted with a yellow border are represented in different colors. They contain visual semantics from diverse regions, thereby complementing each other to improve modeling effectiveness.
Considering that the adopted attention mechanisms capture long-term motion information in videos, we attempt to learn short-term motion features in an efficient way as a supplement. TSM [32] achieves local motion modeling by shifting partial channels of the feature maps along the temporal direction without adding any computational burden. TEA [28] initializes the 1D depthwise convolution with the temporal shift operation, enabling the extraction of short-term motion information. LAPS [62] incorporates the periodic shift operation into the attention mechanism to perform temporal shifting of features from each head. This article investigates which method is most effective for extracting short-term motion information when combined with the mixed attention and proposes 1D depthwise convolution initialized with the periodic shift operation.
Figure 2 presents an overview of the proposed Mixed Attention with Channel Shift (MACS) Encoder, which combines two techniques for temporal modeling, namely the mixed attention operation and the channel shift operation to extract local and global motion cues, respectively. Experimental results demonstrate that our MACS Transformer yields superior performance to the state-of-the-art methods on multiple datasets with reduced computational cost. We observe that these three attention mechanisms promote each other when connected appropriately, and the accuracy is further boosted by adding channel shift for learning local dynamics. We also compare different connectivity patterns of various attention in the ablation experiments and visualize the classification outputs and the attention weights of our model.
Fig. 2. Overview of the MACS encoder. The mixed attention and channel shift operations synergistically capture both long-term and short-term motion cues in videos. Q, K, and V represent the query, key, and value tensors, respectively. \(\bigoplus\) is the element-wise addition, and MLP indicates multi-layer perceptron.
In summary, the primary contributions of this article consist of the following three aspects:
We propose a novel mixed attention operation, combining the random, spatial, and temporal attention mechanisms for robust spatiotemporal modeling with low computational overhead.
We explore different methods of extracting local motion information and find that when integrated with our mixed attention, the 1D depthwise convolution initialized with periodic shift achieves the best performance.
Extensive experiments show that the proposed MACS operation is applicable to various backbone networks and obtains superior recognition results on several public video datasets.
The structure of the rest of the article is as follows: We review recent related approaches in Section 2 and provide a detailed introduction to the proposed MACS Transformer in Section 3. Following this, we report the experimental results on multiple public datasets in Section 4 and summarize this work in Section 5.

2 Related Work

The methods related to our work are action recognition approaches based on deep neural networks, mainly including the CNN-based and Transformer-based approaches.

2.1 CNNs

Motivated by the impressive performance of CNNs on image-related tasks [22, 24], many researchers have applied them to video modeling [7, 13, 14, 23, 46, 59, 64]. One straightforward approach is to utilize 3D convolution to learn the spatiotemporal representation of videos. C3D [46] treated the spatial and temporal information equally and leveraged 3D convolution with a kernel size of 3 \(\times\) 3 \(\times\) 3 to encode both the appearance and motion features simultaneously. In order to reduce computational overhead, P3D [42] and R(2 + 1)D [48] decomposed 3D convolution into a series of 2D convolution and 1D convolution to process the input videos. SlowFast [14] included two pathways to handle inputs with different frame rates, which learn spatial semantics and rapidly changing motion, respectively.
Another way is to understand videos with 2D convolution and a dedicated temporal modeling module, which works well on action-related datasets such as Something-Something [18]. TSM [32] proposed the channel shift operation to potentially characterize temporal information with spatial convolution. TEA [28] adopted the frame difference technique to estimate motion information and enhanced the motion-related channels of the video representation through feature weighting. TDN [53] improved upon 2D CNN by designing the S-TDM and L-TDM modules to capture both the short-term and long-term temporal structures. However, CNNs are not proficient in capturing long-distance dependencies, which limits the capabilities of these methods for video understanding.

2.2 Visual Transformers

Various attention mechanisms [3, 9, 19, 61] have been widely applied in natural language processing tasks. Inspired by them, researchers exploited the idea of self-attention to solve vision problems [4, 6, 10, 33, 35, 60]. ViT [10] made a pioneering attempt in image classification and took fixed-size image patches as tokens. Swin Transformer [33] constrained the calculation of self-attention in local windows and presented a shifted window partitioning technique to consider cross-window associations. Visformer [8] explored the transition of Transformer-based models into convolution-based models in a step-by-step manner and incorporated the advantages of both approaches. In addition, self-attention was also applied in many downstream image-related tasks, such as detection [6] and segmentation [45].
For video modeling, TimeSformer [4] and ViViT [1] used the combination of spatial and temporal attention to simulate 3D attention with decreased computational burden. MViT [11] adopted various resolutions and channel dimensions at different stages to learn multi-scale features hierarchically. X-ViT [5] restricted the computation of self-attention within a local temporal window. PST [57] proposed a temporal patch shift block, which moved partial patches in the time dimension, enabling spatial attention to potentially learn temporal dynamics. AcT [38] proposed a pose estimation network to generate 2D skeletal representations for action recognition. Video Swin [35] treated 3D patches as tokens and extended the Swin Transformer to the video domain. Uniformer [25] employed 3D convolution and spatiotemporal attention in shallow and deep layers, respectively, to learn local and global dependencies. LAPS [62] proposed leap attention and periodic shift to simultaneously capture long-term and short-term motion information in videos. There were also many works that employed the attention mechanism for handling low-level tasks [30, 31].
Although these methods exhibit better long-range modeling capabilities, they often have higher time complexity and require longer training time compared to CNN-based methods. Hence, this article aims to investigate the fusion of multiple attention mechanisms for optimized spatiotemporal representation learning while reducing the computational overhead.

3 Method

In this section, we present the proposed MACS Transformer. We start by introducing three attention mechanisms and discussing their mixtures. Next, we describe the channel shift operation in detail. Last, we introduce the proposed model instantiated with the ViT-B backbone.

3.1 Random, Spatial, and Temporal Attention Mechanisms

When dealing with videos, considering the dependencies between all spatiotemporal tokens can be time-consuming. However, a simple decomposition of 3D attention into spatial and temporal attention may result in the loss of information from most tokens. To tackle this challenge, we propose a novel mixed attention strategy by leveraging multiple types of attention mechanisms.
The adopted random, spatial, and temporal attention mechanisms are illustrated in Figure 3. Spatial and temporal attention entail the initial grouping of tokens in space and time, followed by the calculation of the attention matrices between the corresponding groups. Conversely, random attention processes the relationships between the query token and a set of randomly chosen key tokens. Therefore, random attention serves to compensate for visual information that would otherwise be lost by spatial and temporal attention, and the sampling of a small proportion of visual tokens mitigates the computational overhead of the model.
Fig. 3. Illustrations of three types of attention mechanisms. In (a) random attention, the key tensor K consists of randomly sampled tokens. (b) Spatial attention and (c) temporal attention are special cases of group attention.
We utilize the projection matrices \(\boldsymbol{W}_{Q}^{l},\boldsymbol{W}_{K}^{l},\boldsymbol{W}_{V}^{l}\in\mathbb{R}^{{D_{h}}\times{D}}\) to compute the query, key, and value vectors, respectively, from the token representation \(z_{s,t}^{l}\in\mathbb{R}^{D}\) at the spatiotemporal position \((s,t)\) and layer l:
\begin{align} q_{s,t}^{l},k_{s,t}^{l},v_{s,t}^{l}=[\boldsymbol{W}_{Q}^{l},\boldsymbol{W}_{K}^{l},\boldsymbol{W}_{V}^{l}]\cdot z_{s,t}^{l}\in\mathbb{R}^{D_{h}},\quad s=1,\cdots,S,\quad t=1,\cdots,T \tag{1} \end{align}
where \(D_{h}\) is the dimension for each attention head. For random attention, we randomly sample \(R=\frac{S\cdot{T}}{\lambda}\) tokens to form the key and the value tensors, where \(1/\lambda\) is the sampling rate. We then derive the self-attention weights for spatial, temporal, and random attention through the following calculations:
\begin{align} \omega_{s,t,r^{\prime}}^{l,random}=Softmax\Big(\frac{q_{s,t}^{l}}{\sqrt{D_{h}}}\cdot k_{r^{\prime}}^{l}\Big),\quad r^{\prime}=1,\cdots,R \tag{2} \end{align}
\begin{align} \omega_{s,t,s^{\prime}}^{l,space}=Softmax\Big(\frac{q_{s,t}^{l}}{\sqrt{D_{h}}}\cdot k_{s^{\prime},t}^{l}\Big),\quad s^{\prime}=1,\cdots,S \tag{3} \end{align}
\begin{align} \omega_{s,t,t^{\prime}}^{l,time}=Softmax\Big(\frac{q_{s,t}^{l}}{\sqrt{D_{h}}}\cdot k_{s,t^{\prime}}^{l}\Big),\quad t^{\prime}=1,\cdots,T,\quad s=1,\cdots,S,\quad t=1,\cdots,T \tag{4} \end{align}
where \(k_{r^{\prime}}^{l}\) denotes the representation of the \(r^{\prime}\)th token in the key tensor. The self-attention weights \(\omega_{s,t,r^{\prime}}^{l,random}\), \(\omega_{s,t,s^{\prime}}^{l,space}\), and \(\omega_{s,t,t^{\prime}}^{l,time}\) will be further employed to manipulate the corresponding value tensors.
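To make the weight computations in Equations (2)–(4) concrete, the following sketch implements them for a single attention head in PyTorch. It is an illustration under our own assumptions rather than the authors' released code: the function name attention_weights, the single random subset shared by all queries, and the default sampling rate of 1/4 are illustrative choices.

```python
import torch
import torch.nn.functional as F

def attention_weights(q, k, sample_rate=0.25):
    """q, k: (S, T, D_h) query/key tensors for one video and one attention head."""
    S, T, Dh = q.shape
    scale = Dh ** 0.5

    # Spatial attention, Eq. (3): each (s, t) attends to all s' within frame t.
    w_space = F.softmax(torch.einsum('std,utd->stu', q, k) / scale, dim=-1)     # (S, T, S)

    # Temporal attention, Eq. (4): each (s, t) attends to position s in every frame t'.
    w_time = F.softmax(torch.einsum('std,sud->stu', q, k) / scale, dim=-1)      # (S, T, T)

    # Random attention, Eq. (2): each (s, t) attends to R tokens sampled from all S*T
    # tokens; here a single subset is shared by all queries (one possible choice).
    R = max(1, int(S * T * sample_rate))
    idx = torch.randperm(S * T)[:R]
    k_rand = k.reshape(S * T, Dh)[idx]                                          # (R, D_h)
    w_rand = F.softmax(torch.einsum('std,rd->str', q, k_rand) / scale, dim=-1)  # (S, T, R)

    return w_rand, w_space, w_time, idx
```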

3.2 Mixed Attention Operation

Previous studies [1, 42] have confirmed that the way the spatial and temporal convolution or attention operations are connected is closely related to model performance. Motivated by them, we conduct a comparative analysis of diverse connection methodologies for multiple attention mechanisms, shown in the right part of Figure 4. Concretely, we compare the following structures:
Mix-A: Random, spatial, and temporal dependencies are independently calculated, and then, the resulting video features are fused together.
Mix-B: The sequences of spatial and temporal attention jointly extract static and dynamic cues, which are subsequently merged with the output of random attention.
Mix-C: Spatial, temporal, and random attention are employed to learn video features in series, hierarchically encoding different correlations.
Fig. 4. Overview of our MACS Transformer. The video patches are linearly projected, concatenated with the [class] token, and added with positional embeddings. They are subsequently passed through multiple layers of Transformer encoders to learn video features. Each encoder contains the proposed mixed attention operation and channel shift operation. Considering different ways of extracting visual information, we design three structures for the mixed attention operation as shown on the right.
Drawing inspiration from the factorization of 3D attention in [1, 4], Mix-B and Mix-C aim to establish a sequential linkage between spatial and temporal attention. However, the ablation experiment in Table 2 yields an interesting outcome: Mix-A with a parallel connection achieves superior performance, even though the serial configuration performs better when only spatial and temporal attention are combined.
The token feature \(y_{s,t}^{l}\) at the spatiotemporal position \((s,t)\) resulting from different connection methods can be calculated through the dot-product between the self-attention weights and the corresponding value tensors:
\begin{align} y_{s,t}^{l,MixA}=\alpha_{1}\sum\limits_{s^{\prime}=1}^{S}\omega_{s,t,s^{\prime}}^{l,space}v_{s^{\prime},t}^{l}+\alpha_{2}\sum\limits_{t^{\prime}=1}^{T}\omega_{s,t,t^{\prime}}^{l,time}v_{s,t^{\prime}}^{l}+\alpha_{3}\sum\limits_{r^{\prime}=1}^{R}\omega_{s,t,r^{\prime}}^{l,random}v_{r^{\prime}}^{l} \tag{5} \end{align}
\begin{align} y_{s,t}^{l,MixB}=\alpha_{1}\sum\limits_{t^{\prime}=1}^{T}\omega_{s,t,t^{\prime}}^{l,time}\Big(\sum\limits_{s^{\prime}=1}^{S}\omega_{s,t,s^{\prime}}^{l,space}v_{s^{\prime},t}^{l}\Big)_{t^{\prime}}+\alpha_{2}\sum\limits_{r^{\prime}=1}^{R}\omega_{s,t,r^{\prime}}^{l,random}v_{r^{\prime}}^{l} \tag{6} \end{align}
\begin{align} y_{s,t}^{l,MixC}=\sum\limits_{r^{\prime}=1}^{R}\omega_{s,t,r^{\prime}}^{l,random}\Big[\sum\limits_{t^{\prime}=1}^{T}\omega_{s,t,t^{\prime}}^{l,time}\Big(\sum\limits_{s^{\prime}=1}^{S}\omega_{s,t,s^{\prime}}^{l,space}v_{s^{\prime},t}^{l}\Big)_{t^{\prime}}\Big]_{r^{\prime}},\quad s=1,\cdots,S,\quad t=1,\cdots,T \tag{7} \end{align}
where \(v_{r^{\prime}}^{l}\) refers to the \(r^{\prime}\)th token representation in the value tensor, and \(\alpha_{i}\) denotes the hyperparameter used to weight the video features of the ith branch during feature fusion.
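As an illustration of the parallel Mix-A combination in Equation (5), the sketch below fuses the three branch outputs with the branch weights \(\alpha_{i}\). It assumes the attention weights were computed as in the previous sketch; the shapes, the function name, and the assignment of the 0.6:0.2:0.2 Mix-A\({}^{\ast}\) ratio to the spatial, temporal, and random branches are our reading of the text rather than the released implementation.

```python
import torch

def mix_a(w_rand, w_space, w_time, idx, v, alphas=(0.6, 0.2, 0.2)):
    """Parallel fusion of the three branches, following Eq. (5).

    w_space: (S, T, S), w_time: (S, T, T), w_rand: (S, T, R) attention weights;
    idx: flat indices of the R sampled tokens; v: (S, T, D_h) value tensor.
    """
    S, T, Dh = v.shape
    y_space = torch.einsum('stu,utd->std', w_space, v)     # spatial branch
    y_time = torch.einsum('stu,sud->std', w_time, v)       # temporal branch
    v_rand = v.reshape(S * T, Dh)[idx]                     # values of the sampled tokens
    y_rand = torch.einsum('str,rd->std', w_rand, v_rand)   # random branch
    a1, a2, a3 = alphas                                    # 0.6:0.2:0.2 is the Mix-A* ratio
    return a1 * y_space + a2 * y_time + a3 * y_rand
```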
Figure 5 displays tokens associated with the upper-left token under different attention mechanisms, providing an intuitive visualization. The random, spatial, and temporal tokens are depicted by the blue, orange, and green cells, respectively. Notably, the proportion of colored cells to all cells roughly reflects the ratio of the computational cost between our mixed attention and 3D attention (with some cells being recalculated in the mixed attention operation). The observation indicates that the proposed methodology effectively lessens the computational overhead to a significant extent.
Fig. 5. Visualization of the associations between the upper-left token marked with the red border and the colored tokens in various attention mechanisms. The proportion of colored cells roughly represents the ratio of computational cost between the proposed mixed attention and 3D attention.
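As a rough back-of-the-envelope check of this reduction (our own estimate, counting only the number of key tokens each query attends to, not end-to-end model FLOPs, which Table 2 reports), consider S = 196 spatial tokens per frame, T = 8 frames, and a sampling rate of 1/4:

```python
# Keys attended per query: full 3D attention vs. the mixed attention (illustrative).
S, T, lam = 196, 8, 4           # (224/16)^2 spatial tokens, 8 frames, sampling rate 1/4
full_3d = S * T                 # every query attends to all S*T tokens
mixed = S + T + (S * T) // lam  # spatial + temporal + random keys (a few may overlap)
print(mixed / full_3d)          # ~0.38, i.e., far fewer key tokens per query
```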

3.3 Channel Shift Operation

In contrast to the mixed attention operation that captures long-range dependencies in videos, the channel shift operation prioritizes encoding local motion patterns. Some prior works propose the temporal shift [32] and periodic shift [62] methods in CNNs and Transformers, respectively, to move specific channels of video features along the time direction, thereby achieving hard channel transfer. The 1D convolution [28] initialized with the temporal shift method provides a flexible and learnable way to manipulate channels in CNNs, which can be regarded as a soft channel shift mechanism. Considering the multi-head attention mechanism of the Transformer network, we employ the periodic shift method to initialize 1D depthwise convolution in order to equally transfer the channels of each head. As shown in Figure 6, we evaluate the performance of combining the following techniques used for learning short-term motion features with the mixed attention operation: (a) temporal shift; (b) periodic shift; (c) 1D depthwise convolution initialized with temporal shift; (d) our 1D depthwise convolution initialized with periodic shift. The temporal shift and periodic shift methods in Figure 6(a) and (b) can be regarded as convolution operations with special weights. Previous research [37] has shown that the effective receptive field of CNNs has a Gaussian distribution and only occupies a fraction of the full theoretical receptive field. Therefore, the different channel shift methods in Figure 6 are considered to focus more on local motion patterns between adjacent frames. This is particularly helpful for the classification of many fine-grained action categories, such as braiding and curling hair.
Fig. 6. Illustrations of various channel shift mechanisms. The feature channels belonging to each attention head are represented using different colors. The letters “d” and “t” indicate the channel and temporal dimensions.
As shown in Figure 2, we perform the channel shift operation after the mixed attention operation. Hence, the token representation at the spatiotemporal position \((s,t)\) after the channel shift operation can be derived as follows:
\begin{align} \hat{z}_{s,t}^{l}=Conv(y_{s,t}^{l}) \tag{8} \end{align}
where \(Conv(.)\) denotes 1D depthwise convolution, which is initialized using periodic shift and trained jointly with the remaining parameters of the network. In periodic shift, the token feature \(y_{s,t}^{l}\) from each head exchanges partial channels between adjacent frames as follows:
\begin{align} y_{s,t}^{l}[d]&=y_{s,t+1}^{l}[d],\quad 0 < d\leq\frac{D_{h}}{\rho},\quad 1\leq t < T \tag{9} \end{align}
\begin{align} y_{s,t}^{l}[d]&=y_{s,t-1}^{l}[d],\quad \frac{D_{h}}{\rho} < d\leq\frac{2D_{h}}{\rho},\quad 1 < t\leq T \tag{10} \end{align}
\begin{align} y_{s,t}^{l}[d]&=y_{s,t}^{l}[d],\quad \frac{2D_{h}}{\rho} < d\leq D_{h},\quad 1\leq t\leq T \tag{11} \end{align}
where \(1/\rho\) denotes the proportion of shifted channels.
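The sketch below shows one way to build such a convolution in PyTorch under our reading of Equations (9)–(11): a depthwise temporal Conv1d whose kernels are initialized so that, within each head, \(1/\rho\) of the channels read from frame t+1, the next \(1/\rho\) from frame t-1, and the remainder stay in place. The kernel is learnable afterwards ("soft" channel shift). The ViT-B-like dimensions and the zero-padded boundary behavior are assumptions; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn

def periodic_shift_conv(num_heads=12, head_dim=64, rho=8, kernel_size=3):
    dim = num_heads * head_dim
    conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2,
                     groups=dim, bias=False)                     # depthwise over time
    with torch.no_grad():
        conv.weight.zero_()
        fold = head_dim // rho                                   # channels shifted per direction
        for h in range(num_heads):
            base = h * head_dim
            conv.weight[base:base + fold, 0, 2] = 1.0                 # read from frame t+1
            conv.weight[base + fold:base + 2 * fold, 0, 0] = 1.0      # read from frame t-1
            conv.weight[base + 2 * fold:base + head_dim, 0, 1] = 1.0  # identity (no shift)
    return conv

# Usage sketch: tokens reshaped to (batch*tokens, channels, T) are convolved along time.
# conv = periodic_shift_conv(); y = conv(x)
```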

3.4 MACS Transformer

The proposed MACS operation applies to various architectures, and we introduce the model instantiated with the ViT [10] backbone, as illustrated in the left part of Figure 4. Following the previous works [4, 10], the initial step involves spatially partitioning the input video \(\boldsymbol{V}\in\mathbb{R}^{{T}\times{H}\times{W}\times{3}}\) into patches (i.e., tokens) of size \(P\times{P}\), where H, W, and T specify the video’s height, width, and length, respectively. We then flatten these patches into a vector \(\boldsymbol{X}\in\mathbb{R}^{{T}\times{N}\times{C}}\), with \(N=\frac{{H}\cdot{W}}{P^{2}}\) and \(C = 3P^{2}\) respectively denoting the number of patches per frame and the token dimension. After that, we linearly map the vector \(\boldsymbol{X}\) with the matrix \(\boldsymbol{E}\in\mathbb{R}^{{C}\times{D}}\) and obtain the embedding vector \(\boldsymbol{Z}^{0}\in\mathbb{R}^{{T}\times{(N + 1)}\times{D}}\) as follows:
\begin{align} \boldsymbol{Z}^{0}=[x_{cls};x_{1}\boldsymbol{E};x_{2}\boldsymbol{E};\cdots;x_{N}\boldsymbol{E}]+\boldsymbol{E}_{pos} \tag{12} \end{align}
where \(\boldsymbol{E}_{pos}\in\mathbb{R}^{{(N + 1)}\times{D}}\) indicates the positional embeddings and \(x_{cls}\) is the [class] token. The video representation further undergoes processing by a sequence of L MACS encoders shown in Figure 2. Each encoder includes the combination of the mixed attention and channel shift operations and the Multi-Layer Perceptron (MLP):
\begin{align} \hat{\boldsymbol{Z}}^{l-1}=MACS(LN(\boldsymbol{Z}^{l-1}))+\boldsymbol{Z}^{l-1} \tag{13} \end{align}
\begin{align} \boldsymbol{Z}^{l}=MLP(LN(\hat{\boldsymbol{Z}}^{l-1}))+\hat{\boldsymbol{Z}}^{l-1} \tag{14} \end{align}
where \(LN(.)\) represents layer normalization [2] and \(MLP(.)\) consists of two linear layers with a GELU activation. We employ a fully connected layer to classify the per-frame [class] tokens \(z^{L}_{0,t}\) from the last encoder. Finally, the per-frame predictions are averaged over the temporal dimension to obtain the action category of the input video:
\begin{align} y=\frac{1}{T}\sum\limits_{t=1}^{T}FC(z^{L}_{0,t}) \tag{15} \end{align}
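The following schematic sketch wires Equations (13)–(15) together in PyTorch. The class name MACSEncoder and the macs_op placeholder (standing in for the mixed attention plus channel shift operation described above) are hypothetical, and the ViT-B dimensions are assumed; it illustrates the encoder layout rather than reproducing the authors' code.

```python
import torch
import torch.nn as nn

class MACSEncoder(nn.Module):
    def __init__(self, dim=768, mlp_ratio=4, macs_op=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder for the mixed attention + channel shift operation.
        self.macs = macs_op if macs_op is not None else nn.Identity()
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):                    # z: (B, T, N+1, D)
        z = z + self.macs(self.norm1(z))     # Eq. (13)
        z = z + self.mlp(self.norm2(z))      # Eq. (14)
        return z

def classify(z_last, fc):
    """Average the per-frame predictions on the [class] tokens, Eq. (15)."""
    cls_tokens = z_last[:, :, 0]             # (B, T, D): one [class] token per frame
    return fc(cls_tokens).mean(dim=1)        # (B, num_classes)
```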

4 Experiments

In this section, we introduce the experimental settings, conduct ablation studies, and compare the action recognition results of the proposed MACS model and the state-of-the-art approaches on both large-scale and small-scale datasets, followed by visualizations.

4.1 Datasets

Kinetics-400 [7], collected from YouTube by DeepMind, is a classic large-scale action-related dataset. It consists of 400 action classes, and each class has at least 400 videos. The dataset contains a variety of action categories, such as Person–Person Actions (e.g., kissing) and Person–Object Actions (e.g., washing dishes). Something-Something V2 [18] is an extensive action-related dataset, encompassing roughly 170K training videos and 25K validation videos. It incorporates 174 categories of fine-grained human–object interaction actions, such as moving something up. Following previous works [4, 53], we train on the training set and evaluate on the validation set of each of these two datasets.
UCF101 [44] and EGTEA Gaze+ [29] are widely used small-scale video datasets for action recognition. UCF101 consists of 13,320 videos from 101 classes, which are clustered into five types, such as Playing Musical Instruments, Sports, and Body-Motion. EGTEA Gaze+ contains 10,321 first-person action videos from 106 categories, with an average length of 3.2 seconds. In line with [62], we adopt the first split for both datasets to train and evaluate the proposed model.

4.2 Implementation Details

We employ the Visformer-S [8] and ViT-B [10] backbones to instantiate our model. We temporally sample frames from the videos at a rate of \(1/8\) as the network input. In the channel shift operation, we set the ratio of shifted channels to \(1/8\). We adopt data augmentation strategies such as spatial scale jittering and random cropping. For example, we resize the shorter side of the input video to a value within the interval of [224, 360] and then randomly crop \(224\times 224\) regions on Kinetics-400. During training, the initial learning rate scales linearly with the batch size [17] and varies depending on factors like the datasets and the input resolutions. For instance, when training on Kinetics-400 with a resolution of \(224\times 224\times 8\) and a batch size of 40, we set the initial learning rate to 0.2. In addition, following [1], we only employ additional regularization methods such as label smoothing and mixup to enhance the performance on Something-Something V2. We train the model for a total of 18 epochs in all experiments, reducing the learning rate by a factor of 10 at the 10th and 15th epochs. For evaluation, we utilize three spatial crops (left, center, and right) and uniformly select one or five clips from the videos, resulting in two test settings of \(3\times 1\) and \(3\times 5\). We also fix the seed for random attention in both the training and testing stages throughout the experiments.
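A minimal sketch of this optimization schedule is given below: the learning rate follows the linear scaling rule [17] from the stated base value (0.2 at batch size 40 on Kinetics-400), and is divided by 10 at epochs 10 and 15 over 18 epochs. The choice of SGD with momentum is our assumption for illustration; the optimizer is not restated in this paragraph.

```python
import torch

def build_optimizer(model, batch_size, base_lr=0.2, base_batch=40):
    # Linear scaling rule [17]: the learning rate grows proportionally with batch size.
    lr = base_lr * batch_size / base_batch
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # optimizer assumed
    # Divide the learning rate by 10 at epochs 10 and 15 (18 epochs in total).
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[10, 15], gamma=0.1)
    return opt, sched

# Training loop sketch:
# opt, sched = build_optimizer(model, batch_size=40)
# for epoch in range(18):
#     train_one_epoch(model, loader, opt)   # hypothetical helper
#     sched.step()
```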

4.3 Ablation Study

We conduct ablation experiments on the Kinetics-400 dataset and report the Top-1 accuracy. If not specified otherwise, we default to employing the Visformer-S backbone network and the input size of 224 \(\times\) 224 \(\times\) 8 under the configuration of 3 \(\times\) 5 (#crops \(\times\) #clips). Table 1 explores the combinations of different attention mechanisms, with Base2D (i.e., spatial attention) and Base3D (i.e., 3D attention) used as baselines. S, T, and R represent spatial, temporal, and random attention, respectively. The symbols “+” and “\(\cup\)” indicate the connection of attention operations in series and parallel, respectively. The results demonstrate that the serial connection is more effective when combining spatial or random with temporal attention, while the parallel connection achieves higher accuracy for the mixture of spatial and random attention. This aligns with previous methods [1, 4] that break down 3D attention into a series connection of spatial and temporal attention. Table 2 compares the proposed mixed attention mechanisms under different designs and shows that, surprisingly, Mix-A (i.e., executing them in parallel) attains the best results when all three attention operations are combined. Based on the assumption that spatial attention provides the essential information among them, we further explore different fusion weights for these attention mechanisms. Ultimately, we obtain the top-1 accuracy of 76.0% with the weights of the three branches being 0.6:0.2:0.2, represented by Mix-A\({}^{\ast}\).
| Method | FLOPs (G) | Params (M) | Top-1 (%) |
| --- | --- | --- | --- |
| Base2D from [62] | 39.1 | 39.8 | 74.0 |
| Base3D from [62] | 46.5 | 39.8 | 76.3 |
| S | 39.1 | 39.8 | 74.1 |
| T | 38.1 | 39.8 | 56.4 |
| R | 40.1 | 39.8 | 73.5 |
| S + T | 39.1 | 39.8 | 75.6 |
| S \(\cup\) T | 39.1 | 39.8 | 75.4 |
| R + T | 40.2 | 39.8 | 73.3 |
| R \(\cup\) T | 40.2 | 39.8 | 68.8 |
| S + R | 41.2 | 39.8 | 75.1 |
| S \(\cup\) R | 41.2 | 39.8 | 75.4 |
Table 1. Results of Different Combinations of Attention Operations
“+” and “\(\cup\)” signify series and parallel connections.
| Method | FLOPs (G) | Params (M) | Top-1 (%) |
| --- | --- | --- | --- |
| Base2D from [62] | 39.1 | 39.8 | 74.0 |
| Base3D from [62] | 46.5 (\(\uparrow\) 18.9%) | 39.8 | 76.3 |
| Mix-A | 41.3 (\(\uparrow\) 5.6%) | 39.8 | 75.9 |
| Mix-B | 41.3 (\(\uparrow\) 5.6%) | 39.8 | 75.8 |
| Mix-C | 41.3 (\(\uparrow\) 5.6%) | 39.8 | 75.0 |
| Mix-A\({}^{\ast}\) | 41.3 (\(\uparrow\) 5.6%) | 39.8 | 76.0 |
Table 2. Results of the Mixed Attention Operation Using Various Designs
Mix-A\({}^{\ast}\) denotes the fusion of the three attention branches in a ratio of 0.6:0.2:0.2.
We study the impact of selecting varying proportions of tokens in random attention when combined with spatial attention, as illustrated in Figure 7. As the sampling rate rises, the recognition accuracy and the Floating Point Operations (FLOPs) gradually increase, which can be attributed to the consideration of the attention relationships between more spatiotemporal tokens. To achieve a balance between performance and computational overhead, we set the sampling rate to \(1/4\) in all other experiments. Table 3 provides the results of different channel shift operations when combined with mixed attention. Due to the low computational cost of 1D depthwise convolution, all channel shift operations result in a negligible increase in FLOPs compared to only using the mixed attention operation. It can be observed that our 1D depthwise convolution with periodic shift initialization exhibits superior recognition accuracy. Our accuracy is close to that of Base3D, but with a much smaller computational increase over Base2D (5.6% vs. 18.9%). The experimental results also indicate that positioning channel shift after mixed attention leads to a slight improvement in accuracy compared to positioning it before (the result is not displayed in the table for simplicity). The possible reason is that the appearance modeling process conducted in mixed attention facilitates the subsequent channel shift operation in learning short-term motion features. Table 4 gives the testing throughput of the proposed model on a single GPU. As expected, the model’s throughput is negatively correlated with its FLOPs. Our MACS model achieves a throughput of 20.6 clips per second, significantly exceeding that of the Base3D model.
Fig. 7. Impact of varying the proportion of sampled tokens in random attention. We set the sampling rate to \(1/4\) to achieve a balance between accuracy and computational cost.
| Attention Operation | Channel Shift | FLOPs (G) | Top-1 (%) |
| --- | --- | --- | --- |
| Base2D from [62] | - | 39.1 | 74.0 |
| Base3D from [62] | - | 46.5 (\(\uparrow\) 18.9%) | 76.3 |
| Mixed attention | Temporal shift | 41.3 (\(\uparrow\) 5.6%) | 75.9 |
| Mixed attention | Periodic shift | 41.3 (\(\uparrow\) 5.6%) | 76.0 |
| Mixed attention | 1D Conv (TS Init) | 41.3 (\(\uparrow\) 5.6%) | 76.0 |
| Mixed attention | 1D Conv (PS Init) | 41.3 (\(\uparrow\) 5.6%) | 76.2 |
Table 3. Results of Combinations of the Mixed Attention and Different Channel Shift Operations
“TS Init” and “PS Init” denote the temporal shift and periodic shift initialization techniques, respectively.
| Method | Throughput (clips/sec) | FLOPs (G) |
| --- | --- | --- |
| Base2D from [62] | 25.3 | 39.1 |
| Base3D from [62] | 15.4 (\(\downarrow\) 39.1%) | 46.5 |
| MACS | 20.6 (\(\downarrow\) 18.6%) | 41.3 |
Table 4. Testing Throughput of the Proposed MACS Model
clips/sec denotes clips per second; higher is better.
We present the performance of the proposed MACS model for different backbone networks and input resolutions in Table 5. For 8-frame inputs and for longer inputs, we adopt the test settings of 3 \(\times\) 5 and 3 \(\times\) 1 (#crops \(\times\) #clips), respectively. The reason is that using more clips cannot significantly enhance the accuracy for long inputs but instead leads to an increase in computation time, as shown in Figure 8. Our model outperforms the Base2D model with inputs of the same size, whether based on the Visformer-S or ViT-B backbone. As the number of input frames and the spatial resolution increase, the recognition accuracy of the proposed method continuously improves. This observation suggests that our method exhibits efficacy in processing inputs of varying resolutions across diverse backbone networks.
| Method | Backbone | #F \(\times\) Res (H \(\times\) W) | Test Setting | Top-1 (%) |
| --- | --- | --- | --- | --- |
| Base2D from [62] | Visformer-S | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 74.0 |
| Base2D from [62] | ViT-B | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 76.4 |
| MACS | Visformer-S | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 76.2 |
| | Visformer-S | 32 \(\times\, 224^{2}\) | \(3\times 1\) | 78.0 |
| | Visformer-S | 32 \(\times\, 320^{2}\) | \(3\times 1\) | 79.8 |
| | Visformer-S | 32 \(\times\, 360^{2}\) | \(3\times 1\) | 80.0 |
| MACS | ViT-B | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 78.1 |
| | ViT-B | 32 \(\times\, 224^{2}\) | \(3\times 1\) | 80.2 |
| | ViT-B | 48 \(\times\, 224^{2}\) | \(3\times 1\) | 80.8 |
Table 5. Results of Our MACS Model with Different Backbones and Input Resolutions
Fig. 8. Impact of different test settings on our MACS method with varying input sizes. For long sequence inputs, the improvement in accuracy from multiple clips is relatively small.

4.4 Comparison with the State-of-the-Art

We perform a comparison of the proposed MACS model with the current state-of-the-art methods on the Kinetics-400 dataset, as illustrated in Table 6. MACS-Visf and MACS-ViT represent variants of our method based on the Visformer-S and ViT-B backbones. (H/L) indicates that higher spatial resolution frames or longer sequences are taken as inputs. In the first two sections of Table 6, we compare the proposed MACS model with the 3D CNN-based methods (e.g., SlowFast [14]) and the 2D CNN-based methods (e.g., TDN [53]). These methods are based on various backbone networks such as ResNet101 and Inception V1 and consider a wide range of input resolutions. Our MACS-ViT model outperforms all of these methods by a large margin. Due to the higher time complexity of the Transformer compared to CNNs, our method incurs more computational overhead than these methods. However, the fewer training epochs (i.e., 18) help mitigate this issue to some extent.
| Method | Backbone | Pre-Train Dataset | Resolution (H \(\times\) W) | Frames | GFLOPs (\(\times\) Views) | Params (M) | Training Epochs | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D CNNs | | | | | | | | | |
| I3D [7] | Inception V1 | ImgNet-1K | 224 \(\times\) 224 | 250 | 108 \(\times\) NA | 12.0 | - | 71.1 | 90.3 |
| Two-Stream I3D [7] | Inception V1 | ImgNet-1K | 224 \(\times\) 224 | 500 | 216 \(\times\) NA | 25.0 | - | 75.7 | 92.0 |
| ARTNet [52] | ResNet18 | None | 112 \(\times\) 112 | 16 | 24 \(\times\) 250 | 35.2 | - | 69.2 | 88.3 |
| S3D-G [58] | Inception V1 | ImgNet-1K | 224 \(\times\) 224 | 250 | 71 \(\times\) NA | 11.5 | 112 | 74.7 | 93.4 |
| Non-Local R101 [54] | ResNet101 | ImgNet-1K | 256 \(\times\) 256 | 128 | 359 \(\times\) 30 | 54.3 | 196 | 77.7 | 93.3 |
| SlowFast\({}_{16\times 8}\) [14] | ResNet101+NL | None | 256 \(\times\) 256 | 32 | 234 \(\times\) 30 | 59.9 | 196 | 79.8 | 93.9 |
| ip-CSN [47] | ResNet152 | Sports1M | 224 \(\times\) 224 | 32 | 109 \(\times\) 30 | - | 45 | 79.2 | 93.8 |
| SmallBigNet [26] | ResNet101 | ImgNet-1K | 224 \(\times\) 224 | 32 | 418 \(\times\) 12 | - | 110 | 77.4 | 93.3 |
| X3D-XL [13] | X2D | None | 356 \(\times\) 356 | 16 | 48 \(\times\) 30 | 11.0 | 256 | 79.1 | 93.9 |
| CorrNet [51] | ResNet101 | None | 224 \(\times\) 224 | 32 | 224 \(\times\) 30 | - | 250 | 79.2 | - |
| 2D CNNs | | | | | | | | | |
| TSM [32] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 16 | 65 \(\times\) 10 | 24.3 | 100 | 74.7 | - |
| TAM [12] | bLResNet50 | Kinetics-400 | 224 \(\times\) 224 | 48 | 93 \(\times\) 9 | 25.0 | 75 | 73.5 | 91.2 |
| TEA [28] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 16 | 70 \(\times\) 30 | 35.3 | 50 | 76.1 | 92.5 |
| TEINet [34] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 16 | 66 \(\times\) 30 | 30.8 | 100 | 76.2 | 92.5 |
| TANet [36] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 16 | 86 \(\times\) 12 | 25.6 | 100 | 76.9 | 92.9 |
| TDN-R101 [53] | ResNet101 | ImgNet-1K | 256 \(\times\) 256 | 24 | 198 \(\times\) 30 | 43.9 | 100 | 79.4 | 94.4 |
| GC-TDN-R50 [20] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 24 | 110 \(\times\) 30 | 27.4 | 100 | 79.6 | 94.1 |
| MDAF [50] | ResNet50 | ImgNet-1K | 224 \(\times\) 224 | 8 | 34 \(\times\) 30 | 24.5 | 150 | 76.2 | 92.0 |
| CANet [16] | ResNet101 | ImgNet-1K | 256 \(\times\) 256 | 8 | 67 \(\times\) 30 | 44.1 | 50 | 77.9 | 93.5 |
| Transformers | | | | | | | | | |
| LAPS [62] | Visformer-S | ImgNet-10K | 224 \(\times\) 224 | 8 | 40 \(\times\) 15 | 39.8 | 18 | 76.0 | 92.6 |
| ViT (Video) [10] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 8 | 135 \(\times\) 30 | 85.9 | 18 | 76.0 | 92.5 |
| TokShift [63] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 16 | 270 \(\times\) 30 | 85.9 | 18 | 78.2 | 93.8 |
| TokShift (MR) [63] | ViT-B | ImgNet-21K | 256 \(\times\) 256 | 8 | 176 \(\times\) 30 | 85.9 | 18 | 77.7 | 93.6 |
| VTN [39] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 250 | 4218 \(\times\) 1 | 114.0 | 25 | 78.6 | 93.7 |
| TimeSformer [4] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 8 | 197 \(\times\) 3 | 121.4 | 15 | 78.0 | 93.7 |
| TimeSformer-HR [4] | ViT-B | ImgNet-21K | 448 \(\times\) 448 | 16 | 1703 \(\times\) 3 | 121.4 | 15 | 79.7 | 94.4 |
| TimeSformer-L [4] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 96 | 2380 \(\times\) 3 | 121.4 | 15 | 80.7 | 94.7 |
| STTM [15] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 112 | 288 \(\times\) 1 | 89.6 | 30 | 80.2 | - |
| MACS-Visf | Visformer-S | ImgNet-10K | 224 \(\times\) 224 | 8 | 41 \(\times\) 15 | 39.8 | 18 | 76.2 | 92.4 |
| MACS-Visf (H) | Visformer-S | ImgNet-10K | 320 \(\times\) 320 | 32 | 472 \(\times\) 3 | 40.0 | 18 | 79.8 | 94.3 |
| MACS-ViT | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 8 | 152 \(\times\) 15 | 86.1 | 18 | 78.1 | 93.7 |
| MACS-ViT (L) | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 48 | 1265 \(\times\) 3 | 86.1 | 18 | 80.8 | 94.8 |
Table 6. Comparison with the State-of-the-Art Methods on the Validation Set of Kinetics-400
Among video Transformers, our MACS-Visf model outperforms LAPS [62] (76.2% vs. 76.0%), which also focuses on designing efficient Transformers and adopts the Visformer-S architecture. The proposed MACS-ViT model surpasses other approaches with the same ViT-B backbone, including TokShift [63], VTN [39], and TimeSformer [4]. For instance, our MACS-ViT model exhibits superior performance to TimeSformer-HR (80.8% vs. 79.7%), with significantly lower GFLOPs (1265 vs. 5110) and fewer parameters (86.1 vs. 121.4). These results also reveal that our method possesses good adaptability to different backbones and has the potential to be integrated with more potent 2D Transformers.
We further fine-tune our model pre-trained on Kinetics-400 on other commonly used datasets, namely Something-Something V2, UCF101, and EGTEA Gaze+. As depicted in Table 7, our MACS-Visf exceeds the methods utilizing the ResNet50 backbone, including TSM [32], SlowFast [14], and SmallBig [26]. Compared to Transformer-based methods, the proposed method achieves improved performance over VidTr-L [27] and TimeSformer-L [4] with the ViT-B backbone, even when employing the lighter and weaker Visformer-S architecture. This suggests that our model demonstrates superior motion encoding capability by jointly modeling short-term and long-term motion features. As a result, it performs well even in motion-related videos that lack scene information.
| Method | Backbone | #F \(\times\) Res (T \(\times\) HW) | Top-1 (%) |
| --- | --- | --- | --- |
| TSM [32] | ResNet50 | 16 \(\times\, 224^{2}\) | 63.4 |
| SlowFast [14] | ResNet50 | 16 \(\times\, 224^{2}\) | 61.7 |
| SmallBig [26] | ResNet50 | 24 \(\times\, 224^{2}\) | 63.3 |
| VidTr-L [27] | ViT-B | 32 \(\times\, 224^{2}\) | 63.0 |
| TimeSformer-HR [4] | ViT-B | 16 \(\times\, 448^{2}\) | 62.2 |
| TimeSformer-L [4] | ViT-B | 96 \(\times\, 224^{2}\) | 62.4 |
| MACS-Visf (H) | Visformer-S | 32 \(\times\, 320^{2}\) | 64.8 |
Table 7. Comparison Results on Something-Something V2
Table 8 demonstrates the superiority of our MACS model over other approaches, such as CNN-based methods (e.g., P3D [42]) and Transformer-based methods (e.g., TokShift [63]), on the first split of the UCF101 dataset. With the same input size (i.e., 32 \(\,\times\,\) 320 \(\,\times\,\) 320) and Visformer-S backbone, the MACS-Visf model surpasses LAPS (H) [62] (97.2% vs. 96.9%). The comparison results on the first split of the EGTEA-GAZE+ dataset are shown in Table 9. Our MACS-ViT model outperforms all listed methods, including the recent TokShift (HR) [63] and LAPS (H) [62] methods. These findings indicate that after being pre-trained on Kinetics-400, our method displays excellent generalization capability on small-scale datasets.
| Method | Pre-Train Dataset | #F \(\times\) Res (T \(\times\) HW) | Top-1 (%) |
| --- | --- | --- | --- |
| P3D [42] | Sports-1M | 16 \(\times\, 224^{2}\) | 84.2 |
| TSM [32] | Kinetics-400 | 8 \(\times\, 256^{2}\) | 95.9 |
| ViT (Video) [10] | ImageNet-21k | 8 \(\times\, 256^{2}\) | 91.5 |
| TokShift [63] | Kinetics-400 | 8 \(\times\, 256^{2}\) | 95.4 |
| TokShift-L (HR) [63] | Kinetics-400 | 8 \(\times\, 384^{2}\) | 96.8 |
| LAPS (H) [62] | Kinetics-400 | 32 \(\times\, 320^{2}\) | 96.9 |
| MACS-Visf (H) | Kinetics-400 | 32 \(\times\, 320^{2}\) | 97.2 |
| MACS-ViT (L) | Kinetics-400 | 48 \(\times\, 224^{2}\) | 97.3 |
Table 8. Comparison Results on Split 1 of UCF101
| Method | Pre-Train Dataset | #F \(\times\) Res (T \(\times\) HW) | Top-1 (%) |
| --- | --- | --- | --- |
| TSM [32] | Kinetics-400 | 8 \(\times\, 224^{2}\) | 63.5 |
| SAP [55] | Kinetics-400 | 64 \(\times\, 256^{2}\) | 64.1 |
| ViT (Video) [10] | ImageNet-21k | 8 \(\times\, 224^{2}\) | 62.6 |
| TokShift (HR) [63] | Kinetics-400 | 8 \(\times\, 384^{2}\) | 65.8 |
| LAPS (H) [62] | Kinetics-400 | 32 \(\times\, 320^{2}\) | 66.1 |
| MACS-Visf (H) | Kinetics-400 | 32 \(\times\, 320^{2}\) | 66.3 |
| MACS-ViT (L) | Kinetics-400 | 48 \(\times\, 224^{2}\) | 67.3 |
Table 9. Comparison Results on Split 1 of EGTEA-GAZE+

4.5 Visualization

Visualizing the Classification Probabilities. Figure 9 shows the top four action classes with the highest predicted probabilities using our MACS Transformer and the Base2D model on the chosen Kinetics-400 clips. The light coral and light steel blue bars indicate the probabilities of the correct and incorrect categories, respectively. Clearly, our MACS model yields more precise predictions compared to Base2D. Even for many easily confused action categories such as \(curling\_hair\) and \(braiding\_hair\), our method still provides accurate classification results.
Fig. 9. Visualization on the examples drawn from Kinetics-400. We showcase the top four classes with the highest probabilities from both the Base2D and our MACS models on these video samples. The light coral and light steel blue bars, respectively, denote the correct and incorrect category predictions.
Visualizing the Attention Weights. In Figure 10, we display the self-attention weights of the mixed attention operation in the first layer. We fuse the self-attention weights of the different types of attention using a weighted summation. Our model focuses on important interactive objects in the video, such as hands and cups, extracting key information from the video for robust spatiotemporal representation learning.
Fig. 10. Illustration of the self-attention weights in the first layer on Something-Something V2. The first and second rows, respectively, present the input frames and the self-attention weights.
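A small sketch of this weighted-summation fusion for visualization is shown below. It scatters each branch's weights back onto the full token grid before summing; the flat token index s·T + t, the reuse of the 0.6:0.2:0.2 branch ratio, and the function name are our assumptions for illustration, not the authors' visualization script.

```python
import torch

def fuse_for_visualization(w_rand, w_space, w_time, idx, alphas=(0.6, 0.2, 0.2)):
    """Scatter each branch's weights onto the full (S, T, S*T) grid and sum them."""
    S, T, _ = w_space.shape
    a_s, a_t, a_r = alphas
    full = torch.zeros(S, T, S * T)
    for s in range(S):
        for t in range(T):
            full[s, t, torch.arange(S) * T + t] += a_s * w_space[s, t]  # same frame
            full[s, t, s * T + torch.arange(T)] += a_t * w_time[s, t]   # same position
            full[s, t, idx] += a_r * w_rand[s, t]                       # random subset
    return full  # reshape to (S, T, S, T) and average over queries for a heatmap
```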
Visualizing the Video Features. We select samples from the 106 classes of the EGTEA Gaze+ dataset and display their deep representations extracted by the Base2D model and our MACS model via t-SNE [49] in Figure 11. We use dots of different colors to represent video samples from diverse classes. Our model obtains features with small within-class distances and large between-class distances. Hence, the proposed MACS model can learn more discriminative representations compared to the Base2D model.
Fig. 11. Feature visualization of the 106 class samples on EGTEA Gaze+ via t-SNE [49]. Points of various colors represent videos belonging to different categories. Our MACS model extracts more discriminative features compared to the Base2D model.
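A generic sketch of this kind of t-SNE plot is given below, assuming features is an (N, D) array of clip-level embeddings and labels their class ids; scikit-learn and matplotlib are used purely for illustration and are not part of the paper's method.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, perplexity=30):
    # Project the (N, D) deep features to 2D and color points by class id.
    emb = TSNE(n_components=2, perplexity=perplexity, init='pca').fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap='tab20')
    plt.axis('off')
    plt.show()
```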

4.6 Discussion

Our goal in this article is to achieve a balance between model performance and computational cost. As shown in Table 3, the proposed model yields accuracy comparable to Base3D while introducing a smaller increase in computational overhead relative to Base2D. This benefit stems from the combination of three lightweight attention mechanisms, as well as the channel shift operation, which introduces minimal additional floating-point computations. Furthermore, the main challenge in action recognition tasks lies in modeling complex motions, encompassing both slow and fast movements. Short-term motion information focuses on inter-frame motion, which is crucial for capturing fast movements. Long-term motion information pertains to overall motion patterns, which are essential for depicting slow movements. The proposed mixed attention and channel shift modules capture these two types of motion information based on the long-range modeling capability of the attention mechanism and the local modeling capability of the convolution operation. We observe that on the motion-related Something-Something V2 dataset, our method consistently outperforms other Transformer-based approaches in Table 7, even when employing a less powerful backbone. Figure 10 also demonstrates our mixed attention’s ability to focus on moving foreground objects, thereby validating its effectiveness in capturing motion information.

5 Conclusion

This article introduces the MACS Transformer, which utilizes a combination of mixed attention and channel shift operations to effectively capture long-term and short-term motion information while maintaining low computational cost. In the proposed mixed attention operation, we employ random attention to address the limitation of temporal and spatial attention, which overlooks a significant portion of the visual regions. We further use the lightweight 1D depthwise convolution initialized with periodic shift to encode short-term dynamics in videos. The experimental results validate the advantages of mixing multiple attention mechanisms, and the channel shift operation additionally enhances the recognition accuracy.
In random attention, we efficiently learn long-range dependencies by randomly selecting a subset of key tokens. However, this approach may struggle to choose tokens that possess rich visual or motion-related information. If we can leverage visual priors to select tokens corresponding to important regions in videos, it has the potential to further enhance the model’s performance.
In future studies, we will focus on exploring how to effectively select critical tokens in attention mechanisms. One promising exploration is to calculate frame differences for motion estimation and subsequently generate motion-sensitive tokens, similar to the motion excitation module [28] in CNNs. We will also apply the proposed MACS module to more advanced frameworks such as Swin-B [35] to further improve the model’s recognition capabilities.

References

[1]
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. ViViT: A video vision transformer. In Proceedings of the IEEE International Conference on Computer Vision, 6836–6846.
[2]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv:1607.06450. Retrieved from https://arxiv.org/abs/1607.06450
[3]
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150. Retrieved from https://arxiv.org/abs/2004.05150
[4]
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning, 813–824.
[5]
Adrian Bulat, Juan Manuel Perez Rua, Swathikiran Sudhakaran, Martinez Brais, and Tzimiropoulos Georgios. 2021. Space-time mixing attention for video transformer. In Proceedings of the Advances in Neural Information Processing Systems, 19594–19607.
[6]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, 213–229.
[7]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.
[8]
Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, and Qi Tian. 2021. Visformer: The vision-friendly transformer. In Proceedings of the IEEE International Conference on Computer Vision, 589–598.
[9]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https://arxiv.org/abs/1810.04805
[10]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https://arxiv.org/abs/2010.11929
[11]
Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. 2021. Multiscale vision transformers. In Proceedings of the IEEE International Conference on Computer Vision, 6824–6835.
[12]
Quanfu Fan, Chun-Fu (Richard) Chen, Hilde Kuehne, Marco Pistoia, and David Cox. 2019. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In Proceedings of the Advances in Neural Information Processing Systems, 2261–2270.
[13]
Christoph Feichtenhofer. 2020. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 203–213.
[14]
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, 6202–6211.
[15]
Zhanzhou Feng, Jiaming Xu, Lei Ma, and Shiliang Zhang. 2024. Efficient video transformers via spatial-temporal token merging for action recognition. ACM Transactions on Multimedia Computing, Communications and Applications 20, 4 (2024), 1–21.
[16]
Xiong Gao, Zhaobin Chang, Xingcheng Ran, and Yonggang Lu. 2024. CANet: Comprehensive attention network for video-based action recognition. Knowledge-Based Systems 296 (2024), 111852.
[17]
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677. Retrieved from https://arxiv.org/abs/1706.02677
[18]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. 2017. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, 5842–5850.
[19]
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-transformer. arXiv:1902.09113. Retrieved from https://arxiv.org/abs/1902.09113
[20]
Yanbin Hao, Hao Zhang, Chong-Wah Ngo, and Xiangnan He. 2022. Group contextualization for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 928–938.
[21]
Aashni Haria, Archanasri Subramanian, Nivedhitha Asokkumar, Shristi Poddar, and Jyothi S. Nayak. 2017. Hand gesture recognition for human computer interaction. Procedia Computer Science 115 (2017), 367–374.
[22]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
[23]
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732.
[24]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60, 6 (2017), 84–90.
[25]
Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. 2023. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023), 12581–12600.
[26]
[26] Xianhang Li, Yali Wang, Zhipeng Zhou, and Yu Qiao. 2020. SmallBigNet: Integrating core and contextual views for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1092–1101.
[27] Xinyu Li, Yanyi Zhang, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic, and Joseph Tighe. 2021. VidTr: Video transformer without convolutions. In Proceedings of the IEEE International Conference on Computer Vision, 13557–13567.
[28] Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. 2020. TEA: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 909–918.
[29] Yin Li, Miao Liu, and James M. Rehg. 2018. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision, 619–635.
[30] Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, and Luc Van Gool. 2022. VRT: A video restoration transformer. arXiv:2201.12288. Retrieved from https://arxiv.org/abs/2201.12288
[31] Jing Lin, Yuanhao Cai, Xiaowan Hu, Haoqian Wang, Youliang Yan, Xueyi Zou, Henghui Ding, Yulun Zhang, Radu Timofte, and Luc Van Gool. 2022. Flow-guided sparse transformer for video deblurring. In Proceedings of the International Conference on Machine Learning, 13334–13343.
[32] Ji Lin, Chuang Gan, and Song Han. 2019. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, 7083–7093.
[33] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE International Conference on Computer Vision, 10012–10022.
[34] Zhaoyang Liu, Donghao Luo, Yabiao Wang, Limin Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Tong Lu. 2020. TEINet: Towards an efficient architecture for video recognition. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 11669–11676.
[35] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video swin transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3202–3211.
[36] Zhaoyang Liu, Limin Wang, Wayne Wu, Chen Qian, and Tong Lu. 2021. TAM: Temporal adaptive module for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, 13708–13718.
[37] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. 2016. Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, 4898–4906.
[38] Vittorio Mazzia, Simone Angarano, Francesco Salvetti, Federico Angelini, and Marcello Chiaberge. 2022. Action transformer: A self-attention model for short-time pose-based human action recognition. Pattern Recognition 124 (2022), 108487.
[39] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. 2021. Video transformer network. In Proceedings of the IEEE International Conference on Computer Vision, 3163–3172.
[40] Leila Panahi and Vahid Ghods. 2018. Human fall detection using machine vision techniques on RGB–D images. Biomedical Signal Processing and Control 44 (2018), 146–153.
[41] Mengshi Qi, Yunhong Wang, Jie Qin, Annan Li, Jiebo Luo, and Luc Van Gool. 2019. StagNet: An attentive semantic RNN for group activity and individual action recognition. IEEE Transactions on Circuits and Systems for Video Technology 30, 2 (2019), 549–565.
[42] Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, 5533–5541.
[43] Qinghongya Shi, Hong-Bo Zhang, Zhe Li, Ji-Xiang Du, Qing Lei, and Jing-Hua Liu. 2022. Shuffle-invariant network for action recognition in videos. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 3 (2022), 1–18.
[44] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Retrieved from https://arxiv.org/abs/1212.0402
[45] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. 2021. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 7262–7272.
[46] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4489–4497.
[47] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. 2019. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 5552–5561.
[48] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6450–6459.
[49] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008), 2579–2605.
[50] Bin Wang, Faliang Chang, Chunsheng Liu, Wenqian Wang, and Ruiyi Ma. 2024. An efficient motion visual learning method for video action recognition. Expert Systems with Applications 255 (2024), 124596.
[51] Heng Wang, Du Tran, Lorenzo Torresani, and Matt Feiszli. 2020. Video modeling with correlation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 352–361.
[52] Limin Wang, Wei Li, Wen Li, and Luc Van Gool. 2018. Appearance-and-relation networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1430–1439.
[53] Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. 2021. TDN: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1895–1904.
[54] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
[55] Xiaohan Wang, Yu Wu, Linchao Zhu, and Yi Yang. 2020. Symbiotic attention with privileged information for egocentric action recognition. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 12249–12256.
[56] Zhengwei Wang, Qi She, and Aljosa Smolic. 2021. Action-net: Multipath excitation for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 13214–13223.
[57] Wangmeng Xiang, Chao Li, Biao Wang, Xihan Wei, Xian-Sheng Hua, and Lei Zhang. 2022. Spatiotemporal self-attention modeling with temporal patch shift for action recognition. In Proceedings of the European Conference on Computer Vision, 627–644.
[58] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. 2018. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision, 305–321.
[59] Haotian Xu, Xiaobo Jin, Qiufeng Wang, Amir Hussain, and Kaizhu Huang. 2022. Exploiting attention-consistency loss for spatial-temporal stream action recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 2 (2022), 1–15.
[60] Jin Yuan, Shikai Chen, Yao Zhang, Zhongchao Shi, Xin Geng, Jianping Fan, and Yong Rui. 2023. Graph attention transformer network for multi-label image classification. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 4 (2023), 1–16.
[61] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. In Proceedings of the Advances in Neural Information Processing Systems, 17283–17297.
[62] Hao Zhang, Lechao Cheng, Yanbin Hao, and Chong-Wah Ngo. 2022. Long-term leap attention, short-term periodic shift for video classification. In Proceedings of the ACM International Conference on Multimedia, 5773–5782.
[63] Hao Zhang, Yanbin Hao, and Chong-Wah Ngo. 2021. Token shift transformer for video classification. In Proceedings of the ACM International Conference on Multimedia, 917–925.
[64] Weigang Zhang, Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, and Qingming Huang. 2023. Temporal dynamic concept modeling network for explainable video event recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6 (2023), 1–22.

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 21, Issue 3, March 2025, 673 pages
    EISSN: 1551-6865
    DOI: 10.1145/3703019

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 10 March 2025
    Online AM: 17 January 2025
    Accepted: 10 December 2024
    Revised: 20 November 2024
    Received: 26 March 2024
    Published in TOMM Volume 21, Issue 3

    Author Tags

    1. Action recognition
    2. mixed attention
    3. random attention
    4. channel shift

    Qualifiers

    • Research-article

    Funding Sources

    • Exploratory Research Project of Zhejiang Lab
    • National Natural Science Foundation of China
    • National Key R&D Program of China
    • Key R&D Program of Xinjiang, China
    • Natural Science Foundation of Shandong Province
