
Mixed Attention and Channel Shift Transformer for Efficient Action Recognition

Published: 10 March 2025

Abstract

The practical use of Transformer-based methods for video processing is constrained by their high computational complexity. Although previous approaches adopt a spatiotemporal decomposition of 3D attention to mitigate the issue, they neglect the majority of visual tokens. This article presents a novel mixed attention operation that subtly fuses the random, spatial, and temporal attention mechanisms. The proposed random attention stochastically samples video tokens in a simple yet effective way, complementing the other attention methods. Furthermore, since the attention operation concentrates on learning long-distance relationships, we employ the channel shift operation to encode short-term temporal characteristics. Our model provides more comprehensive motion representations thanks to the amalgamation of these techniques. Experimental results show that the proposed method produces competitive action recognition results with low computational overhead on both large-scale and small-scale public video datasets.

1 Introduction

With the growing amount of video data, action recognition is playing an increasingly important role in many applications, including sports [41], health care [40], and human–computer interaction [21]. With the emergence of deep learning techniques, numerous Convolutional Neural Network (CNN)-based algorithms have been presented to handle action videos. These techniques employ either 3D/(2 + 1)D convolution [7, 42, 43, 46] or 2D convolution with a light temporal modeling module [28, 32, 56] to learn spatiotemporal representations. To overcome CNNs' shortcoming of focusing mostly on local patterns, some attention-based approaches [4, 11, 15, 35, 39] have recently been put forward to encode global video features, but they also come with higher time complexity.
Decomposing 3D attention into spatial and temporal components is a commonly used method to reduce computational overhead. For instance, TimeSformer [4] and ViViT [1] utilize 2D attention and 1D attention, respectively, to learn appearance and motion features. X-ViT [5] combines spatial attention and temporal attention limited to a local time window. Longformer [3] and BIGBIRD [61] leverage multiple attention (e.g., random attention and sliding window attention) to facilitate the efficient processing of extended sequences in natural language processing tasks. Motivated by their success, we investigate the mixture of several attention mechanisms to establish the long-term dependencies between different regions of videos in a computationally inexpensive manner. As depicted in Figure 1, the blue, orange, and green cells, respectively, represent the random, spatial, and temporal tokens that are utilized to calculate the relations with the central token in the middle frame. Clearly, spatial attention focuses on the correlation between the current token and tokens within the same frame for appearance modeling, while temporal attention considers the correlation between the current token and tokens in the same position across different frames for motion modeling. Random attention explores the correlation between the current token and a set of randomly selected tokens, facilitating efficient learning of long-range dependencies. Hence, using only spatial and temporal attention overlooks a significant amount of relevant token information, while the adoption of random attention provides a convenient way to bring complementary information. We primarily explore how to combine the three types of attention in a suitable way to fully leverage their individual functionalities. Additionally, we also study the proportion of tokens selected in random attention to achieve a balance between performance and computational cost.
Fig. 1. Motivation of the proposed mixed attention operation. In random, spatial, and temporal attention, tokens related to the central token highlighted with a yellow border are represented in different colors. They contain visual semantics from diverse regions, thereby complementing each other to improve modeling effectiveness.
Considering that the adopted attention mechanisms capture long-term motion information in videos, we attempt to learn short-term motion features in an efficient way as a supplement. TSM [32] achieves local motion modeling by shifting partial channels of the feature maps along the temporal direction without adding any computational burden. TEA [28] initializes the 1D depthwise convolution with the temporal shift operation, enabling the extraction of short-term motion information. LAPS [62] incorporates the periodic shift operation into the attention mechanism to perform temporal shifting of features from each head. This article investigates which method is most effective for extracting short-term motion information when combined with the mixed attention and proposes 1D depthwise convolution initialized with the periodic shift operation.
Figure 2 presents an overview of the proposed Mixed Attention with Channel Shift (MACS) Encoder, which combines two techniques for temporal modeling, namely the mixed attention operation and the channel shift operation to extract local and global motion cues, respectively. Experimental results demonstrate that our MACS Transformer yields superior performance to the state-of-the-art methods on multiple datasets with reduced computational cost. We observe that these three attention mechanisms promote each other when connected appropriately, and the accuracy is further boosted by adding channel shift for learning local dynamics. We also compare different connectivity patterns of various attention in the ablation experiments and visualize the classification outputs and the attention weights of our model.
Fig. 2. Overview of the MACS encoder. The mixed attention and channel shift operations synergistically capture both long-term and short-term motion cues in videos. Q, K, and V represent the query, key, and value tensors, respectively. \(\bigoplus\) is the element-wise addition, and MLP indicates multi-layer perceptron.
In summary, the primary contributions of this article consist of the following three aspects:
We propose a novel mixed attention operation, combining the random, spatial, and temporal attention mechanisms for robust spatiotemporal modeling with low computational overhead.
We explore different methods of extracting local motion information and find that when integrated with our mixed attention, the 1D depthwise convolution initialized with periodic shift achieves the best performance.
Extensive experiments show that the proposed MACS operation is applicable to various backbone networks and obtains superior recognition results on several public video datasets.
The structure of the rest of the article is as follows: We review recent related approaches in Section 2 and provide a detailed introduction to the proposed MACS Transformer in Section 3. Following this, we report the experimental results on multiple public datasets in Section 4 and summarize this work in Section 5.

2 Related Work

The methods related to our work are action recognition approaches based on deep neural networks, mainly including the CNN-based and Transformer-based approaches.

2.1 CNNs

Motivated by the impressive performance of CNNs on image-related tasks [22, 24], many researchers have applied them to video modeling [7, 13, 14, 23, 46, 59, 64]. One straightforward approach is to utilize 3D convolution to learn the spatiotemporal representation of videos. C3D [46] treated the spatial and temporal information equally and leveraged 3D convolution with a kernel size of 3 \(\times\) 3 \(\times\) 3 to encode both the appearance and motion features simultaneously. In order to reduce computational overhead, P3D [42] and R(2 + 1)D [48] decomposed 3D convolution into a series of 2D convolution and 1D convolution to process the input videos. SlowFast [14] included two pathways to handle inputs with different frame rates, which learn spatial semantics and rapidly changing motion, respectively.
Another way is to understand videos with 2D convolution and a dedicated temporal modeling module, which works well on action-related datasets such as Something-Something [18]. TSM [32] proposed the channel shift operation to potentially characterize temporal information with spatial convolution. TEA [28] adopted the frame difference technique to estimate motion information and enhanced the motion-related channels of the video representation through feature weighting. TDN [53] improved upon 2D CNN by designing the S-TDM and L-TDM modules to capture both the short-term and long-term temporal structures. However, CNNs are not proficient in capturing long-distance dependencies, which limits the capabilities of these methods for video understanding.

2.2 Visual Transformers

Various attention mechanisms [3, 9, 19, 61] have been widely applied in natural language processing tasks. Inspired by them, researchers exploited the idea of self-attention to solve vision problems [4, 6, 10, 33, 35, 60]. ViT [10] made a pioneering attempt in image classification and took fixed-size image patches as tokens. Swin Transformer [33] constrained the calculation of self-attention in local windows and presented a shifted window partitioning technique to consider cross-window associations. Visformer [8] explored the transition of Transformer-based models into convolution-based models in a step-by-step manner and incorporated the advantages of both approaches. In addition, self-attention was also applied in many downstream image-related tasks, such as detection [6] and segmentation [45].
For video modeling, TimeSformer [4] and ViViT [1] used the combination of spatial and temporal attention to simulate 3D attention with decreased computational burden. MViT [11] adopted various resolutions and channel dimensions at different stages to learn multi-scale features hierarchically. X-ViT [5] restricted the computation of self-attention within a local temporal window. PST [57] proposed a temporal patch shift block, which moved partial patches in the time dimension, enabling spatial attention to potentially learn temporal dynamics. AcT [38] proposed a pose estimation network to generate 2D skeletal representations for action recognition. Video Swin [35] treated 3D patches as tokens and extended the Swin Transformer to the video domain. Uniformer [25] employed 3D convolution and spatiotemporal attention in shallow and deep layers, respectively, to learn local and global dependencies. LAPS [62] proposed leap attention and periodic shift to simultaneously capture long-term and short-term motion information in videos. There were also many works that employed the attention mechanism for handling low-level tasks [30, 31].
Although these methods exhibit better long-range modeling capabilities, they often have higher time complexity and require longer training time compared to CNN-based methods. Hence, this article aims to investigate the fusion of multiple attention mechanisms for optimized spatiotemporal representation learning while reducing the computational overhead.

3 Method

In this section, we present the proposed MACS Transformer. We start by introducing three attention mechanisms and discussing their mixtures. Next, we describe the channel shift operation in detail. Last, we introduce the proposed model instantiated with the ViT-B backbone.

3.1 Random, Spatial, and Temporal Attention Mechanisms

When dealing with videos, considering the dependencies between all spatiotemporal tokens can be time-consuming. However, a simple decomposition of 3D attention into spatial and temporal attention may result in the loss of information from most tokens. To tackle this challenge, we propose a novel mixed attention strategy by leveraging multiple types of attention mechanisms.
The adopted random, spatial, and temporal attention mechanisms are illustrated in Figure 3. Spatial and temporal attention entail the initial grouping of tokens in space and time, followed by the calculation of the attention matrices between the corresponding groups. Conversely, random attention processes the relationships between the query token and a set of randomly chosen key tokens. Therefore, random attention serves to compensate for visual information that would otherwise be lost by spatial and temporal attention, and the sampling of a small proportion of visual tokens mitigates the computational overhead of the model.
Fig. 3. Illustrations of three types of attention mechanisms. In (a) random attention, the key tensor K consists of randomly sampled tokens. (b) Spatial attention and (c) temporal attention are special cases of group attention.
We utilize the projection matrices \(\boldsymbol{W}_{Q}^{l},\boldsymbol{W}_{K}^{l},\boldsymbol{W}_{V}^{l}\in\mathbb{R}^{{D_{h}}\times{D}}\) to compute the query, key, and value vectors, respectively, from the token representation \(z_{s,t}^{l}\in\mathbb{R}^{D}\) at the spatiotemporal position \((s,t)\) and layer l:
\begin{align} q_{s,t}^{l},k_{s,t}^{l},v_{s,t}^{l}=[\boldsymbol{W}_{Q}^{l},\boldsymbol{W}_{K}^{l},\boldsymbol{W}_{V}^{l}]\cdot z_{s,t}^{l}\in\mathbb{R}^{D_{h}},\quad s=1,\cdots,S,\quad t=1,\cdots,T \tag{1} \end{align}
where \(D_{h}\) is the dimension for each attention head. For random attention, we randomly sample \(R=\frac{S\cdot{T}}{\lambda}\) tokens to form the key and the value tensors, where \(1/\lambda\) is the sampling rate. We then derive the self-attention weights for spatial, temporal, and random attention through the following calculations:
\begin{align} \omega_{s,t,r^{\prime}}^{l,random}=Softmax\Big(\frac{q_{s,t}^{l}}{\sqrt{D_{h}}}\cdot k_{r^{\prime}}^{l}\Big),\quad r^{\prime}=1,\cdots,R \tag{2} \end{align}
\begin{align} \omega_{s,t,s^{\prime}}^{l,space}=Softmax\Big(\frac{q_{s,t}^{l}}{\sqrt{D_{h}}}\cdot k_{s^{\prime},t}^{l}\Big),\quad s^{\prime}=1,\cdots,S \tag{3} \end{align}
\begin{align} \omega_{s,t,t^{\prime}}^{l,time}=Softmax\Big(\frac{q_{s,t}^{l}}{\sqrt{D_{h}}}\cdot k_{s,t^{\prime}}^{l}\Big),\quad t^{\prime}=1,\cdots,T,\quad s=1,\cdots,S,\quad t=1,\cdots,T \tag{4} \end{align}
where \(k_{r^{\prime}}^{l}\) denotes the representation of the \(r^{\prime}\)th token in the key tensor. The self-attention weights \(\omega_{s,t,r^{\prime}}^{l,random}\), \(\omega_{s,t,s^{\prime}}^{l,space}\), and \(\omega_{s,t,t^{\prime}}^{l,time}\) will be further employed to manipulate the corresponding value tensors.
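To make the weight computations in Equations (2)–(4) concrete, the following sketch implements them for a single attention head in PyTorch. It is an illustration under our own assumptions rather than the authors' released code: the function name attention_weights, the single random subset shared by all queries, and the default sampling rate of 1/4 are illustrative choices.

```python
import torch
import torch.nn.functional as F

def attention_weights(q, k, sample_rate=0.25):
    """q, k: (S, T, D_h) query/key tensors for one video and one attention head."""
    S, T, Dh = q.shape
    scale = Dh ** 0.5

    # Spatial attention, Eq. (3): each (s, t) attends to all s' within frame t.
    w_space = F.softmax(torch.einsum('std,utd->stu', q, k) / scale, dim=-1)     # (S, T, S)

    # Temporal attention, Eq. (4): each (s, t) attends to position s in every frame t'.
    w_time = F.softmax(torch.einsum('std,sud->stu', q, k) / scale, dim=-1)      # (S, T, T)

    # Random attention, Eq. (2): each (s, t) attends to R tokens sampled from all S*T
    # tokens; here a single subset is shared by all queries (one possible choice).
    R = max(1, int(S * T * sample_rate))
    idx = torch.randperm(S * T)[:R]
    k_rand = k.reshape(S * T, Dh)[idx]                                          # (R, D_h)
    w_rand = F.softmax(torch.einsum('std,rd->str', q, k_rand) / scale, dim=-1)  # (S, T, R)

    return w_rand, w_space, w_time, idx
```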

3.2 Mixed Attention Operation

Previous studies [1, 42] have confirmed that the way the spatial and temporal convolution or attention operations are connected is closely related to model performance. Motivated by them, we conduct a comparative analysis of diverse connection methodologies for multiple attention mechanisms, shown in the right part of Figure 4. Concretely, we compare the following structures:
Mix-A: Random, spatial, and temporal dependencies are independently calculated, and then, the resulting video features are fused together.
Mix-B: The sequences of spatial and temporal attention jointly extract static and dynamic cues, which are subsequently merged with the output of random attention.
Mix-C: Spatial, temporal, and random attention are employed to learn video features in series, hierarchically encoding different correlations.
Fig. 4. Overview of our MACS Transformer. The video patches are linearly projected, concatenated with the [class] token, and added with positional embeddings. They are subsequently passed through multiple layers of Transformer encoders to learn video features. Each encoder contains the proposed mixed attention operation and channel shift operation. Considering different ways of extracting visual information, we design three structures for the mixed attention operation as shown on the right.
Drawing inspiration from the factorization of 3D attention in [1, 4], Mix-B and Mix-C aim to establish a sequential linkage between spatial and temporal attention. However, the ablation experiment in Table 2 yields an interesting outcome: Mix-A with a parallel connection achieves superior performance, even though the serial configuration performs better when only spatial and temporal attention are combined.
The token feature \(y_{s,t}^{l}\) at the spatiotemporal position \((s,t)\) resulting from different connection methods can be calculated through the dot-product between the self-attention weights and the corresponding value tensors:
\begin{align} y_{s,t}^{l,MixA}=\alpha_{1}\sum\limits_{s^{\prime}=1}^{S}\omega_{s,t,s^{\prime}}^{l,space}v_{s^{\prime},t}^{l}+\alpha_{2}\sum\limits_{t^{\prime}=1}^{T}\omega_{s,t,t^{\prime}}^{l,time}v_{s,t^{\prime}}^{l}+\alpha_{3}\sum\limits_{r^{\prime}=1}^{R}\omega_{s,t,r^{\prime}}^{l,random}v_{r^{\prime}}^{l} \tag{5} \end{align}
\begin{align} y_{s,t}^{l,MixB}=\alpha_{1}\sum\limits_{t^{\prime}=1}^{T}\omega_{s,t,t^{\prime}}^{l,time}\Big(\sum\limits_{s^{\prime}=1}^{S}\omega_{s,t,s^{\prime}}^{l,space}v_{s^{\prime},t}^{l}\Big)_{t^{\prime}}+\alpha_{2}\sum\limits_{r^{\prime}=1}^{R}\omega_{s,t,r^{\prime}}^{l,random}v_{r^{\prime}}^{l} \tag{6} \end{align}
\begin{align} y_{s,t}^{l,MixC}=\sum\limits_{r^{\prime}=1}^{R}\omega_{s,t,r^{\prime}}^{l,random}\Big[\sum\limits_{t^{\prime}=1}^{T}\omega_{s,t,t^{\prime}}^{l,time}\Big(\sum\limits_{s^{\prime}=1}^{S}\omega_{s,t,s^{\prime}}^{l,space}v_{s^{\prime},t}^{l}\Big)_{t^{\prime}}\Big]_{r^{\prime}},\quad s=1,\cdots,S,\quad t=1,\cdots,T \tag{7} \end{align}
where \(v_{r^{\prime}}^{l}\) refers to the \(r^{\prime}\)th token representation in the value tensor, and \(\alpha_{i}\) denotes the hyperparameter used to weight the video features of the ith branch during feature fusion.
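As an illustration of the parallel Mix-A combination in Equation (5), the sketch below fuses the three branch outputs with the branch weights \(\alpha_{i}\). It assumes the attention weights were computed as in the previous sketch; the shapes, the function name, and the assignment of the 0.6:0.2:0.2 Mix-A\({}^{\ast}\) ratio to the spatial, temporal, and random branches are our reading of the text rather than the released implementation.

```python
import torch

def mix_a(w_rand, w_space, w_time, idx, v, alphas=(0.6, 0.2, 0.2)):
    """Parallel fusion of the three branches, following Eq. (5).

    w_space: (S, T, S), w_time: (S, T, T), w_rand: (S, T, R) attention weights;
    idx: flat indices of the R sampled tokens; v: (S, T, D_h) value tensor.
    """
    S, T, Dh = v.shape
    y_space = torch.einsum('stu,utd->std', w_space, v)     # spatial branch
    y_time = torch.einsum('stu,sud->std', w_time, v)       # temporal branch
    v_rand = v.reshape(S * T, Dh)[idx]                     # values of the sampled tokens
    y_rand = torch.einsum('str,rd->std', w_rand, v_rand)   # random branch
    a1, a2, a3 = alphas                                    # 0.6:0.2:0.2 is the Mix-A* ratio
    return a1 * y_space + a2 * y_time + a3 * y_rand
```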
Figure 5 displays tokens associated with the upper-left token under different attention mechanisms, providing an intuitive visualization. The random, spatial, and temporal tokens are depicted by the blue, orange, and green cells, respectively. Notably, the proportion of colored cells to all cells roughly reflects the ratio of the computational cost between our mixed attention and 3D attention (with some cells being recalculated in the mixed attention operation). The observation indicates that the proposed methodology effectively lessens the computational overhead to a significant extent.
Fig. 5. Visualization of the associations between the upper-left token marked with the red border and the colored tokens in various attention mechanisms. The proportion of colored cells roughly represents the ratio of computational cost between the proposed mixed attention and 3D attention.
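As a rough back-of-the-envelope check of this reduction (our own estimate, counting only the number of key tokens each query attends to, not end-to-end model FLOPs, which Table 2 reports), consider S = 196 spatial tokens per frame, T = 8 frames, and a sampling rate of 1/4:

```python
# Keys attended per query: full 3D attention vs. the mixed attention (illustrative).
S, T, lam = 196, 8, 4           # (224/16)^2 spatial tokens, 8 frames, sampling rate 1/4
full_3d = S * T                 # every query attends to all S*T tokens
mixed = S + T + (S * T) // lam  # spatial + temporal + random keys (a few may overlap)
print(mixed / full_3d)          # ~0.38, i.e., far fewer key tokens per query
```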

3.3 Channel Shift Operation

In contrast to the mixed attention operation that captures long-range dependencies in videos, the channel shift operation prioritizes encoding local motion patterns. Some prior works propose the temporal shift [32] and periodic shift [62] methods in CNNs and Transformers, respectively, to move specific channels of video features along the time direction, thereby achieving hard channel transfer. The 1D convolution [28] initialized with the temporal shift method provides a flexible and learnable way to manipulate channels in CNNs, which can be regarded as a soft channel shift mechanism. Considering the multi-head attention mechanism of the Transformer network, we employ the periodic shift method to initialize 1D depthwise convolution in order to equally transfer the channels of each head. As shown in Figure 6, we evaluate the performance of combining the following techniques used for learning short-term motion features with the mixed attention operation: (a) temporal shift; (b) periodic shift; (c) 1D depthwise convolution initialized with temporal shift; (d) our 1D depthwise convolution initialized with periodic shift. The temporal shift and periodic shift methods in Figure 6(a) and (b) can be regarded as convolution operations with special weights. Previous research [37] has shown that the effective receptive field of CNNs has a Gaussian distribution and only occupies a fraction of the full theoretical receptive field. Therefore, the different channel shift methods in Figure 6 are considered to focus more on local motion patterns between adjacent frames. This is particularly helpful for the classification of many fine-grained action categories, such as braiding and curling hair.
Fig. 6. Illustrations of various channel shift mechanisms. The feature channels belonging to each attention head are represented using different colors. The letters “d” and “t” indicate the channel and temporal dimensions.
As shown in Figure 2, we perform the channel shift operation after the mixed attention operation. Hence, the token representation at the spatiotemporal position \((s,t)\) after the channel shift operation can be derived as follows:
\begin{align} \hat{z}_{s,t}^{l}=Conv(y_{s,t}^{l}) \tag{8} \end{align}
where \(Conv(.)\) denotes 1D depthwise convolution, which is initialized using periodic shift and trained jointly with the remaining parameters of the network. In periodic shift, the token feature \(y_{s,t}^{l}\) from each head exchanges partial channels between adjacent frames as follows:
\begin{align} y_{s,t}^{l}[d]&=y_{s,t+1}^{l}[d],\quad 0 < d\leq\frac{D_{h}}{\rho},\quad 1\leq t < T \tag{9} \end{align}
\begin{align} y_{s,t}^{l}[d]&=y_{s,t-1}^{l}[d],\quad \frac{D_{h}}{\rho} < d\leq\frac{2D_{h}}{\rho},\quad 1 < t\leq T \tag{10} \end{align}
\begin{align} y_{s,t}^{l}[d]&=y_{s,t}^{l}[d],\quad \frac{2D_{h}}{\rho} < d\leq D_{h},\quad 1\leq t\leq T \tag{11} \end{align}
where \(1/\rho\) denotes the proportion of shifted channels.
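The sketch below shows one way to build such a convolution in PyTorch under our reading of Equations (9)–(11): a depthwise temporal Conv1d whose kernels are initialized so that, within each head, \(1/\rho\) of the channels read from frame t+1, the next \(1/\rho\) from frame t-1, and the remainder stay in place. The kernel is learnable afterwards ("soft" channel shift). The ViT-B-like dimensions and the zero-padded boundary behavior are assumptions; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn

def periodic_shift_conv(num_heads=12, head_dim=64, rho=8, kernel_size=3):
    dim = num_heads * head_dim
    conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2,
                     groups=dim, bias=False)                     # depthwise over time
    with torch.no_grad():
        conv.weight.zero_()
        fold = head_dim // rho                                   # channels shifted per direction
        for h in range(num_heads):
            base = h * head_dim
            conv.weight[base:base + fold, 0, 2] = 1.0                 # read from frame t+1
            conv.weight[base + fold:base + 2 * fold, 0, 0] = 1.0      # read from frame t-1
            conv.weight[base + 2 * fold:base + head_dim, 0, 1] = 1.0  # identity (no shift)
    return conv

# Usage sketch: tokens reshaped to (batch*tokens, channels, T) are convolved along time.
# conv = periodic_shift_conv(); y = conv(x)
```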

3.4 MACS Transformer

The proposed MACS operation applies to various architectures, and we introduce the model instantiated with the ViT [10] backbone, as illustrated in the left part of Figure 4. Following the previous works [4, 10], the initial step involves spatially partitioning the input video \(\boldsymbol{V}\in\mathbb{R}^{{T}\times{H}\times{W}\times{3}}\) into patches (i.e., tokens) of size \(P\times{P}\), where H, W, and T specify the video’s height, width, and length, respectively. We then flatten these patches into a vector \(\boldsymbol{X}\in\mathbb{R}^{{T}\times{N}\times{C}}\), with \(N=\frac{{H}\cdot{W}}{P^{2}}\) and \(C = 3P^{2}\) respectively denoting the number of patches per frame and the token dimension. After that, we linearly map the vector \(\boldsymbol{X}\) with the matrix \(\boldsymbol{E}\in\mathbb{R}^{{C}\times{D}}\) and obtain the embedding vector \(\boldsymbol{Z}^{0}\in\mathbb{R}^{{T}\times{(N + 1)}\times{D}}\) as follows:
\begin{align} \boldsymbol{Z}^{0}=[x_{cls};x_{1}\boldsymbol{E};x_{2}\boldsymbol{E};\cdots;x_{N}\boldsymbol{E}]+\boldsymbol{E}_{pos} \tag{12} \end{align}
where \(\boldsymbol{E}_{pos}\in\mathbb{R}^{{(N + 1)}\times{D}}\) indicates the positional embeddings and \(x_{cls}\) is the [class] token. The video representation further undergoes processing by a sequence of L MACS encoders shown in Figure 2. Each encoder includes the combination of the mixed attention and channel shift operations and the Multi-Layer Perceptron (MLP):
\begin{align} \hat{\boldsymbol{Z}}^{l-1}=MACS(LN(\boldsymbol{Z}^{l-1}))+\boldsymbol{Z}^{l-1} \tag{13} \end{align}
\begin{align} \boldsymbol{Z}^{l}=MLP(LN(\hat{\boldsymbol{Z}}^{l-1}))+\hat{\boldsymbol{Z}}^{l-1} \tag{14} \end{align}
where \(LN(.)\) represents layer normalization [2] and \(MLP(.)\) consists of two linear layers with a GELU activation. We employ a fully connected layer to classify the per-frame [class] tokens \(z^{L}_{0,t}\) from the last encoder. Finally, the per-frame predictions are averaged over the temporal dimension to obtain the action category of the input video:
\begin{align} y=\frac{1}{T}\sum\limits_{t=1}^{T}FC(z^{L}_{0,t}) \tag{15} \end{align}
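The following schematic sketch wires Equations (13)–(15) together in PyTorch. The class name MACSEncoder and the macs_op placeholder (standing in for the mixed attention plus channel shift operation described above) are hypothetical, and the ViT-B dimensions are assumed; it illustrates the encoder layout rather than reproducing the authors' code.

```python
import torch
import torch.nn as nn

class MACSEncoder(nn.Module):
    def __init__(self, dim=768, mlp_ratio=4, macs_op=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder for the mixed attention + channel shift operation.
        self.macs = macs_op if macs_op is not None else nn.Identity()
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):                    # z: (B, T, N+1, D)
        z = z + self.macs(self.norm1(z))     # Eq. (13)
        z = z + self.mlp(self.norm2(z))      # Eq. (14)
        return z

def classify(z_last, fc):
    """Average the per-frame predictions on the [class] tokens, Eq. (15)."""
    cls_tokens = z_last[:, :, 0]             # (B, T, D): one [class] token per frame
    return fc(cls_tokens).mean(dim=1)        # (B, num_classes)
```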

4 Experiments

In this section, we introduce the experimental settings, conduct ablation studies, and compare the action recognition results of the proposed MACS model and the state-of-the-art approaches on both large-scale and small-scale datasets, followed by visualizations.

4.1 Datasets

Kinetics-400 [7], collected from YouTube by DeepMind, is a classic large-scale action-related dataset. It consists of 400 action classes, and each class has at least 400 videos. The dataset contains a variety of action categories, such as Person–Person Actions (e.g., kissing) and Person–Object Actions (e.g., washing dishes). Something-Something V2 [18] is an extensive action-related dataset, encompassing roughly 170K training videos and 25K validation videos. It incorporates 174 categories of fine-grained human–object interaction actions, such as moving something up. Following previous works [4, 53], we train on the training set and evaluate on the validation set of each of these two datasets.
UCF101 [44] and EGTEA Gaze+ [29] are widely used small-scale video datasets for action recognition. UCF101 consists of 13,320 videos from 101 classes, which are clustered into five types, such as Playing Musical Instruments, Sports, and Body-Motion. EGTEA Gaze+ contains 10,321 first-person action videos from 106 categories, with an average length of 3.2 seconds. In line with [62], we adopt the first split for both datasets to train and evaluate the proposed model.

4.2 Implementation Details

We employ the Visformer-S [8] and ViT-B [10] backbones to instantiate our model. We temporally sample frames from the videos at a rate of \(1/8\) as the network input. In the channel shift operation, we set the ratio of shifted channels to \(1/8\). We adopt data augmentation strategies such as spatial scale jittering and random cropping. For example, we resize the shorter side of the input video to a value within the interval of [224, 360] and then randomly crop \(224\times 224\) regions on Kinetics-400. During training, the initial learning rate scales linearly with the batch size [17] and varies depending on factors like the datasets and the input resolutions. For instance, when training on Kinetics-400 with a resolution of \(224\times 224\times 8\) and a batch size of 40, we set the initial learning rate to 0.2. In addition, following [1], we only employ additional regularization methods such as label smoothing and mixup to enhance the performance on Something-Something V2. We train the model for a total of 18 epochs in all experiments, reducing the learning rate by a factor of 10 at the 10th and 15th epochs. For evaluation, we utilize three spatial crops (left, center, and right) and uniformly select one or five clips from the videos, resulting in two test settings of \(3\times 1\) and \(3\times 5\). We also fix the seed for random attention in both the training and testing stages throughout the experiments.
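A minimal sketch of this optimization schedule is given below: the learning rate follows the linear scaling rule [17] from the stated base value (0.2 at batch size 40 on Kinetics-400), and is divided by 10 at epochs 10 and 15 over 18 epochs. The choice of SGD with momentum is our assumption for illustration; the optimizer is not restated in this paragraph.

```python
import torch

def build_optimizer(model, batch_size, base_lr=0.2, base_batch=40):
    # Linear scaling rule [17]: the learning rate grows proportionally with batch size.
    lr = base_lr * batch_size / base_batch
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # optimizer assumed
    # Divide the learning rate by 10 at epochs 10 and 15 (18 epochs in total).
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[10, 15], gamma=0.1)
    return opt, sched

# Training loop sketch:
# opt, sched = build_optimizer(model, batch_size=40)
# for epoch in range(18):
#     train_one_epoch(model, loader, opt)   # hypothetical helper
#     sched.step()
```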

4.3 Ablation Study

We conduct ablation experiments on the Kinetics-400 dataset and report the Top-1 accuracy. If not specified otherwise, we default to employing the Visformer-S backbone network and the input size of 224 \(\times\) 224 \(\times\) 8 under the configuration of 3 \(\times\) 5 (#crops \(\times\) #clips). Table 1 explores the combinations of different attention mechanisms, with Base2D (i.e., spatial attention) and Base3D (i.e., 3D attention) used as baselines. S, T, and R represent spatial, temporal, and random attention, respectively. The symbols “+” and “\(\cup\)” indicate the connection of attention operations in series and parallel, respectively. The results demonstrate that the serial connection is more effective when combining spatial or random with temporal attention, while the parallel connection achieves higher accuracy for the mixture of spatial and random attention. This aligns with previous methods [1, 4] that break down 3D attention into a series connection of spatial and temporal attention. Table 2 compares the proposed mixed attention mechanisms under different designs and shows that, surprisingly, Mix-A (i.e., executing them in parallel) attains the best results when all three attention operations are combined. Based on the assumption that spatial attention provides the essential information among them, we further explore different fusion weights for these attention mechanisms. Ultimately, we obtain the top-1 accuracy of 76.0% with the weights of the three branches being 0.6:0.2:0.2, represented by Mix-A\({}^{\ast}\).
| Method | FLOPs (G) | Params (M) | Top-1 (%) |
| --- | --- | --- | --- |
| Base2D from [62] | 39.1 | 39.8 | 74.0 |
| Base3D from [62] | 46.5 | 39.8 | 76.3 |
| S | 39.1 | 39.8 | 74.1 |
| T | 38.1 | 39.8 | 56.4 |
| R | 40.1 | 39.8 | 73.5 |
| S + T | 39.1 | 39.8 | 75.6 |
| S \(\cup\) T | 39.1 | 39.8 | 75.4 |
| R + T | 40.2 | 39.8 | 73.3 |
| R \(\cup\) T | 40.2 | 39.8 | 68.8 |
| S + R | 41.2 | 39.8 | 75.1 |
| S \(\cup\) R | 41.2 | 39.8 | 75.4 |
Table 1. Results of Different Combinations of Attention Operations
“+” and “\(\cup\)” signify series and parallel connections.
| Method | FLOPs (G) | Params (M) | Top-1 (%) |
| --- | --- | --- | --- |
| Base2D from [62] | 39.1 | 39.8 | 74.0 |
| Base3D from [62] | 46.5 (\(\uparrow\) 18.9%) | 39.8 | 76.3 |
| Mix-A | 41.3 (\(\uparrow\) 5.6%) | 39.8 | 75.9 |
| Mix-B | 41.3 (\(\uparrow\) 5.6%) | 39.8 | 75.8 |
| Mix-C | 41.3 (\(\uparrow\) 5.6%) | 39.8 | 75.0 |
| Mix-A\({}^{\ast}\) | 41.3 (\(\uparrow\) 5.6%) | 39.8 | 76.0 |
Table 2. Results of the Mixed Attention Operation Using Various Designs
Mix-A\({}^{\ast}\) denotes the fusion of the three attention branches in a ratio of 0.6:0.2:0.2.
We study the impact of selecting varying proportions of tokens in random attention when combined with spatial attention, as illustrated in Figure 7. As the sampling rate rises, the recognition accuracy and the Floating Point Operations (FLOPs) gradually increase, which can be attributed to the consideration of the attention relationships between more spatiotemporal tokens. To achieve a balance between performance and computational overhead, we set the sampling rate to \(1/4\) in all other experiments. Table 3 provides the results of different channel shift operations when combined with mixed attention. Due to the low computational cost of 1D depthwise convolution, all channel shift operations result in a negligible increase in FLOPs compared to only using the mixed attention operation. It can be observed that our 1D depthwise convolution with periodic shift initialization exhibits superior recognition accuracy. Our accuracy is close to that of Base3D, but with a much smaller computational increase over Base2D (5.6% vs. 18.9%). The experimental results also indicate that positioning channel shift after mixed attention leads to a slight improvement in accuracy compared to positioning it before (the result is not displayed in the table for simplicity). The possible reason is that the appearance modeling process conducted in mixed attention facilitates the subsequent channel shift operation in learning short-term motion features. Table 4 gives the testing throughput of the proposed model on a single GPU. As expected, the model’s throughput is negatively correlated with its FLOPs. Our MACS model achieves a throughput of 20.6 clips per second, significantly exceeding that of the Base3D model.
Fig. 7. Impact of varying the proportion of sampled tokens in random attention. We set the sampling rate to \(1/4\) to achieve a balance between accuracy and computational cost.
| Attention Operation | Channel Shift | FLOPs (G) | Top-1 (%) |
| --- | --- | --- | --- |
| Base2D from [62] | - | 39.1 | 74.0 |
| Base3D from [62] | - | 46.5 (\(\uparrow\) 18.9%) | 76.3 |
| Mixed attention | Temporal shift | 41.3 (\(\uparrow\) 5.6%) | 75.9 |
| Mixed attention | Periodic shift | 41.3 (\(\uparrow\) 5.6%) | 76.0 |
| Mixed attention | 1D Conv (TS Init) | 41.3 (\(\uparrow\) 5.6%) | 76.0 |
| Mixed attention | 1D Conv (PS Init) | 41.3 (\(\uparrow\) 5.6%) | 76.2 |
Table 3. Results of Combinations of the Mixed Attention and Different Channel Shift Operations
“TS Init” and “PS Init” denote the temporal shift and periodic shift initialization techniques, respectively.
| Method | Throughput (clips/sec) | FLOPs (G) |
| --- | --- | --- |
| Base2D from [62] | 25.3 | 39.1 |
| Base3D from [62] | 15.4 (\(\downarrow\) 39.1%) | 46.5 |
| MACS | 20.6 (\(\downarrow\) 18.6%) | 41.3 |
Table 4. Testing Throughput of the Proposed MACS Model
clips/sec denotes clips per second; higher is better.
We present the performance of the proposed MACS model for different backbone networks and input resolutions in Table 5. For 8-frame inputs and for longer inputs, we adopt the test settings of 3 \(\times\) 5 and 3 \(\times\) 1 (#crops \(\times\) #clips), respectively. The reason is that using more clips cannot significantly enhance the accuracy for long inputs but instead leads to an increase in computation time, as shown in Figure 8. Our model outperforms the Base2D model with inputs of the same size, whether based on the Visformer-S or ViT-B backbone. As the number of input frames and the spatial resolution increase, the recognition accuracy of the proposed method continuously improves. This observation suggests that our method exhibits efficacy in processing inputs of varying resolutions across diverse backbone networks.
| Method | Backbone | #F \(\times\) Res (H \(\times\) W) | Test Setting | Top-1 (%) |
| --- | --- | --- | --- | --- |
| Base2D from [62] | Visformer-S | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 74.0 |
| Base2D from [62] | ViT-B | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 76.4 |
| MACS | Visformer-S | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 76.2 |
| | Visformer-S | 32 \(\times\, 224^{2}\) | \(3\times 1\) | 78.0 |
| | Visformer-S | 32 \(\times\, 320^{2}\) | \(3\times 1\) | 79.8 |
| | Visformer-S | 32 \(\times\, 360^{2}\) | \(3\times 1\) | 80.0 |
| MACS | ViT-B | 8 \(\times\, 224^{2}\) | \(3\times 5\) | 78.1 |
| | ViT-B | 32 \(\times\, 224^{2}\) | \(3\times 1\) | 80.2 |
| | ViT-B | 48 \(\times\, 224^{2}\) | \(3\times 1\) | 80.8 |
Table 5. Results of Our MACS Model with Different Backbones and Input Resolutions
Fig. 8. Impact of different test settings on our MACS method with varying input sizes. For long sequence inputs, the improvement in accuracy from multiple clips is relatively small.

4.4 Comparison with the State-of-the-Art

We perform a comparison of the proposed MACS model with the current state-of-the-art methods on the Kinetics-400 dataset, as illustrated in Table 6. MACS-Visf and MACS-ViT represent variants of our method based on the Visformer-S and ViT-B backbones. (H/L) indicates that higher spatial resolution frames or longer sequences are taken as inputs. In the first two sections of Table 6, we compare the proposed MACS model with the 3D CNN-based methods (e.g., SlowFast [14]) and the 2D CNN-based methods (e.g., TDN [53]). These methods are based on various backbone networks such as ResNet101 and Inception V1 and consider a wide range of input resolutions. Our MACS-ViT model outperforms all of these methods by a large margin. Due to the higher time complexity of the Transformer compared to CNNs, our method incurs more computational overhead than these methods. However, the fewer training epochs (i.e., 18) help mitigate this issue to some extent.
| Method | Backbone | Pre-Train Dataset | Resolution (H \(\times\) W) | Frames | GFLOPs (\(\times\) Views) | Params (M) | Training Epochs | Top-1 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D CNNs | | | | | | | | | |
| I3D [7] | Inception V1 | ImgNet-1K | 224 \(\times\) 224 | 250 | 108 \(\times\) NA | 12.0 | - | 71.1 | 90.3 |
| Two-Stream I3D [7] | Inception V1 | ImgNet-1K | 224 \(\times\) 224 | 500 | 216 \(\times\) NA | 25.0 | - | 75.7 | 92.0 |
| ARTNet [52] | ResNet18 | None | 112 \(\times\) 112 | 16 | 24 \(\times\) 250 | 35.2 | - | 69.2 | 88.3 |
| S3D-G [58] | Inception V1 | ImgNet-1K | 224 \(\times\) 224 | 250 | 71 \(\times\) NA | 11.5 | 112 | 74.7 | 93.4 |
| Non-Local R101 [54] | ResNet101 | ImgNet-1K | 256 \(\times\) 256 | 128 | 359 \(\times\) 30 | 54.3 | 196 | 77.7 | 93.3 |
| SlowFast\({}_{16\times 8}\) [14] | ResNet101+NL | None | 256 \(\times\) 256 | 32 | 234 \(\times\) 30 | 59.9 | 196 | 79.8 | 93.9 |
| ip-CSN [47] | ResNet152 | Sports1M | 224 \(\times\) 224 | 32 | 109 \(\times\) 30 | - | 45 | 79.2 | 93.8 |
| SmallBigNet [26] | ResNet101 | ImgNet-1K | 224 \(\times\) 224 | 32 | 418 \(\times\) 12 | - | 110 | 77.4 | 93.3 |
| X3D-XL [13] | X2D | None | 356 \(\times\) 356 | 16 | 48 \(\times\) 30 | 11.0 | 256 | 79.1 | 93.9 |
| CorrNet [51] | ResNet101 | None | 224 \(\times\) 224 | 32 | 224 \(\times\) 30 | - | 250 | 79.2 | - |
| 2D CNNs | | | | | | | | | |
| TSM [32] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 16 | 65 \(\times\) 10 | 24.3 | 100 | 74.7 | - |
| TAM [12] | bLResNet50 | Kinetics-400 | 224 \(\times\) 224 | 48 | 93 \(\times\) 9 | 25.0 | 75 | 73.5 | 91.2 |
| TEA [28] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 16 | 70 \(\times\) 30 | 35.3 | 50 | 76.1 | 92.5 |
| TEINet [34] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 16 | 66 \(\times\) 30 | 30.8 | 100 | 76.2 | 92.5 |
| TANet [36] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 16 | 86 \(\times\) 12 | 25.6 | 100 | 76.9 | 92.9 |
| TDN-R101 [53] | ResNet101 | ImgNet-1K | 256 \(\times\) 256 | 24 | 198 \(\times\) 30 | 43.9 | 100 | 79.4 | 94.4 |
| GC-TDN-R50 [20] | ResNet50 | ImgNet-1K | 256 \(\times\) 256 | 24 | 110 \(\times\) 30 | 27.4 | 100 | 79.6 | 94.1 |
| MDAF [50] | ResNet50 | ImgNet-1K | 224 \(\times\) 224 | 8 | 34 \(\times\) 30 | 24.5 | 150 | 76.2 | 92.0 |
| CANet [16] | ResNet101 | ImgNet-1K | 256 \(\times\) 256 | 8 | 67 \(\times\) 30 | 44.1 | 50 | 77.9 | 93.5 |
| Transformers | | | | | | | | | |
| LAPS [62] | Visformer-S | ImgNet-10K | 224 \(\times\) 224 | 8 | 40 \(\times\) 15 | 39.8 | 18 | 76.0 | 92.6 |
| ViT (Video) [10] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 8 | 135 \(\times\) 30 | 85.9 | 18 | 76.0 | 92.5 |
| TokShift [63] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 16 | 270 \(\times\) 30 | 85.9 | 18 | 78.2 | 93.8 |
| TokShift (MR) [63] | ViT-B | ImgNet-21K | 256 \(\times\) 256 | 8 | 176 \(\times\) 30 | 85.9 | 18 | 77.7 | 93.6 |
| VTN [39] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 250 | 4218 \(\times\) 1 | 114.0 | 25 | 78.6 | 93.7 |
| TimeSformer [4] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 8 | 197 \(\times\) 3 | 121.4 | 15 | 78.0 | 93.7 |
| TimeSformer-HR [4] | ViT-B | ImgNet-21K | 448 \(\times\) 448 | 16 | 1703 \(\times\) 3 | 121.4 | 15 | 79.7 | 94.4 |
| TimeSformer-L [4] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 96 | 2380 \(\times\) 3 | 121.4 | 15 | 80.7 | 94.7 |
| STTM [15] | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 112 | 288 \(\times\) 1 | 89.6 | 30 | 80.2 | - |
| MACS-Visf | Visformer-S | ImgNet-10K | 224 \(\times\) 224 | 8 | 41 \(\times\) 15 | 39.8 | 18 | 76.2 | 92.4 |
| MACS-Visf (H) | Visformer-S | ImgNet-10K | 320 \(\times\) 320 | 32 | 472 \(\times\) 3 | 40.0 | 18 | 79.8 | 94.3 |
| MACS-ViT | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 8 | 152 \(\times\) 15 | 86.1 | 18 | 78.1 | 93.7 |
| MACS-ViT (L) | ViT-B | ImgNet-21K | 224 \(\times\) 224 | 48 | 1265 \(\times\) 3 | 86.1 | 18 | 80.8 | 94.8 |
Table 6. Comparison with the State-of-the-Art Methods on the Validation Set of Kinetics-400
Among video Transformers, our MACS-Visf model outperforms LAPS [62] (76.2% vs. 76.0%), which also focuses on designing efficient Transformers and adopts the Visformer-S architecture. The proposed MACS-ViT model surpasses other approaches with the same ViT-B backbone, including TokShift [63], VTN [39], and TimeSformer [4]. For instance, our MACS-ViT model exhibits superior performance to TimeSformer-HR (80.8% vs. 79.7%), with significantly lower GFLOPs (1265 vs. 5110) and fewer parameters (86.1 vs. 121.4). These results also reveal that our method possesses good adaptability to different backbones and has the potential to be integrated with more potent 2D Transformers.
We further fine-tune our model pre-trained on Kinetics-400 on other commonly used datasets, namely Something-Something V2, UCF101, and EGTEA Gaze+. As depicted in Table 7, our MACS-Visf exceeds the methods utilizing the ResNet50 backbone, including TSM [32], SlowFast [14], and SmallBig [26]. Compared to Transformer-based methods, the proposed method achieves improved performance over VidTr-L [27] and TimeSformer-L [4] with the ViT-B backbone, even when employing the lighter and weaker Visformer-S architecture. This suggests that our model demonstrates superior motion encoding capability by jointly modeling short-term and long-term motion features. As a result, it performs well even in motion-related videos that lack scene information.
| Method | Backbone | #F \(\times\) Res (T \(\times\) HW) | Top-1 (%) |
| --- | --- | --- | --- |
| TSM [32] | ResNet50 | 16 \(\times\, 224^{2}\) | 63.4 |
| SlowFast [14] | ResNet50 | 16 \(\times\, 224^{2}\) | 61.7 |
| SmallBig [26] | ResNet50 | 24 \(\times\, 224^{2}\) | 63.3 |
| VidTr-L [27] | ViT-B | 32 \(\times\, 224^{2}\) | 63.0 |
| TimeSformer-HR [4] | ViT-B | 16 \(\times\, 448^{2}\) | 62.2 |
| TimeSformer-L [4] | ViT-B | 96 \(\times\, 224^{2}\) | 62.4 |
| MACS-Visf (H) | Visformer-S | 32 \(\times\, 320^{2}\) | 64.8 |
Table 7. Comparison Results on Something-Something V2
Table 8 demonstrates the superiority of our MACS model over other approaches, such as CNN-based methods (e.g., P3D [42]) and Transformer-based methods (e.g., TokShift [63]), on the first split of the UCF101 dataset. With the same input size (i.e., 32 \(\,\times\,\) 320 \(\,\times\,\) 320) and Visformer-S backbone, the MACS-Visf model surpasses LAPS (H) [62] (97.2% vs. 96.9%). The comparison results on the first split of the EGTEA-GAZE+ dataset are shown in Table 9. Our MACS-ViT model outperforms all listed methods, including the recent TokShift (HR) [63] and LAPS (H) [62] methods. These findings indicate that after being pre-trained on Kinetics-400, our method displays excellent generalization capability on small-scale datasets.
| Method | Pre-Train Dataset | #F \(\times\) Res (T \(\times\) HW) | Top-1 (%) |
| --- | --- | --- | --- |
| P3D [42] | Sports-1M | 16 \(\times\, 224^{2}\) | 84.2 |
| TSM [32] | Kinetics-400 | 8 \(\times\, 256^{2}\) | 95.9 |
| ViT (Video) [10] | ImageNet-21k | 8 \(\times\, 256^{2}\) | 91.5 |
| TokShift [63] | Kinetics-400 | 8 \(\times\, 256^{2}\) | 95.4 |
| TokShift-L (HR) [63] | Kinetics-400 | 8 \(\times\, 384^{2}\) | 96.8 |
| LAPS (H) [62] | Kinetics-400 | 32 \(\times\, 320^{2}\) | 96.9 |
| MACS-Visf (H) | Kinetics-400 | 32 \(\times\, 320^{2}\) | 97.2 |
| MACS-ViT (L) | Kinetics-400 | 48 \(\times\, 224^{2}\) | 97.3 |
Table 8. Comparison Results on Split 1 of UCF101
| Method | Pre-Train Dataset | #F \(\times\) Res (T \(\times\) HW) | Top-1 (%) |
| --- | --- | --- | --- |
| TSM [32] | Kinetics-400 | 8 \(\times\, 224^{2}\) | 63.5 |
| SAP [55] | Kinetics-400 | 64 \(\times\, 256^{2}\) | 64.1 |
| ViT (Video) [10] | ImageNet-21k | 8 \(\times\, 224^{2}\) | 62.6 |
| TokShift (HR) [63] | Kinetics-400 | 8 \(\times\, 384^{2}\) | 65.8 |
| LAPS (H) [62] | Kinetics-400 | 32 \(\times\, 320^{2}\) | 66.1 |
| MACS-Visf (H) | Kinetics-400 | 32 \(\times\, 320^{2}\) | 66.3 |
| MACS-ViT (L) | Kinetics-400 | 48 \(\times\, 224^{2}\) | 67.3 |
Table 9. Comparison Results on Split 1 of EGTEA-GAZE+

4.5 Visualization

Visualizing the Classification Probabilities. Figure 9 shows the top four action classes with the highest predicted probabilities using our MACS Transformer and the Base2D model on the chosen Kinetics-400 clips. The light coral and light steel blue bars indicate the probabilities of the correct and incorrect categories, respectively. Clearly, our MACS model yields more precise predictions compared to Base2D. Even for many easily confused action categories such as \(curling\_hair\) and \(braiding\_hair\), our method still provides accurate classification results.
Fig. 9. Visualization on the examples drawn from Kinetics-400. We showcase the top four classes with the highest probabilities from both the Base2D and our MACS models on these video samples. The light coral and light steel blue bars, respectively, denote the correct and incorrect category predictions.
Visualizing the Attention Weights. In Figure 10, we display the self-attention weights of the mixed attention operation in the first layer. We fuse the self-attention weights of the different types of attention using a weighted summation. Our model focuses on important interactive objects in the video, such as hands and cups, extracting key information from the video for robust spatiotemporal representation learning.
Fig. 10. Illustration of the self-attention weights in the first layer on Something-Something V2. The first and second rows, respectively, present the input frames and the self-attention weights.
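A small sketch of this weighted-summation fusion for visualization is shown below. It scatters each branch's weights back onto the full token grid before summing; the flat token index s·T + t, the reuse of the 0.6:0.2:0.2 branch ratio, and the function name are our assumptions for illustration, not the authors' visualization script.

```python
import torch

def fuse_for_visualization(w_rand, w_space, w_time, idx, alphas=(0.6, 0.2, 0.2)):
    """Scatter each branch's weights onto the full (S, T, S*T) grid and sum them."""
    S, T, _ = w_space.shape
    a_s, a_t, a_r = alphas
    full = torch.zeros(S, T, S * T)
    for s in range(S):
        for t in range(T):
            full[s, t, torch.arange(S) * T + t] += a_s * w_space[s, t]  # same frame
            full[s, t, s * T + torch.arange(T)] += a_t * w_time[s, t]   # same position
            full[s, t, idx] += a_r * w_rand[s, t]                       # random subset
    return full  # reshape to (S, T, S, T) and average over queries for a heatmap
```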
Visualizing the Video Features. We select samples from the 106 classes of the EGTEA Gaze+ dataset and display their deep representations extracted by the Base2D model and our MACS model via t-SNE [49] in Figure 11. We use dots of different colors to represent video samples from diverse classes. Our model obtains features with small within-class distances and large between-class distances. Hence, the proposed MACS model can learn more discriminative representations compared to the Base2D model.
Fig. 11. Feature visualization of the 106 class samples on EGTEA Gaze+ via t-SNE [49]. Points of various colors represent videos belonging to different categories. Our MACS model extracts more discriminative features compared to the Base2D model.
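A generic sketch of this kind of t-SNE plot is given below, assuming features is an (N, D) array of clip-level embeddings and labels their class ids; scikit-learn and matplotlib are used purely for illustration and are not part of the paper's method.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, perplexity=30):
    # Project the (N, D) deep features to 2D and color points by class id.
    emb = TSNE(n_components=2, perplexity=perplexity, init='pca').fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap='tab20')
    plt.axis('off')
    plt.show()
```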

4.6 Discussion

Our goal in this article is to achieve a balance between model performance and computational cost. As shown in Table 3, the proposed model yields accuracy comparable to Base3D while introducing a smaller increase in computational overhead relative to Base2D. This benefit stems from the combination of three lightweight attention mechanisms, as well as the channel shift operation, which introduces minimal additional floating-point computations. Furthermore, the main challenge in action recognition tasks lies in modeling complex motions, encompassing both slow and fast movements. Short-term motion information focuses on inter-frame motion, which is crucial for capturing fast movements. Long-term motion information pertains to overall motion patterns, which are essential for depicting slow movements. The proposed mixed attention and channel shift modules capture these two types of motion information based on the long-range modeling capability of the attention mechanism and the local modeling capability of the convolution operation. We observe that on the motion-related Something-Something V2 dataset, our method consistently outperforms other Transformer-based approaches in Table 7, even when employing a less powerful backbone. Figure 10 also demonstrates our mixed attention’s ability to focus on moving foreground objects, thereby validating its effectiveness in capturing motion information.

5 Conclusion

This article introduces the MACS Transformer, which utilizes a combination of mixed attention and channel shift operations to effectively capture long-term and short-term motion information while maintaining low computational cost. In the proposed mixed attention operation, we employ random attention to address the limitation of temporal and spatial attention, which overlooks a significant portion of the visual regions. We further use the lightweight 1D depthwise convolution initialized with periodic shift to encode short-term dynamics in videos. The experimental results validate the advantages of mixing multiple attention mechanisms, and the channel shift operation additionally enhances the recognition accuracy.
In random attention, we efficiently learn long-range dependencies by randomly selecting a subset of key tokens. However, this approach may struggle to choose tokens that possess rich visual or motion-related information. If we can leverage visual priors to select tokens corresponding to important regions in videos, it has the potential to further enhance the model’s performance.
In future studies, we will focus on exploring how to effectively select critical tokens in attention mechanisms. One promising exploration is to calculate frame differences for motion estimation and subsequently generate motion-sensitive tokens, similar to the motion excitation module [28] in CNNs. We will also apply the proposed MACS module to more advanced frameworks such as Swin-B [35] to further improve the model’s recognition capabilities.

References

[1]
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. 2021. ViViT: A video vision transformer. In Proceedings of the IEEE International Conference on Computer Vision, 6836–6846.
[2]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv:1607.06450. Retrieved from https://arxiv.org/abs/1607.06450
[3]
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150. Retrieved from https://arxiv.org/abs/2004.05150
[4]
Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning, 813–824.
[5]
Adrian Bulat, Juan Manuel Perez Rua, Swathikiran Sudhakaran, Martinez Brais, and Tzimiropoulos Georgios. 2021. Space-time mixing attention for video transformer. In Proceedings of the Advances in Neural Information Processing Systems, 19594–19607.
[6]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, 213–229.
[7]
Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.
[8]
Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, and Qi Tian. 2021. Visformer: The vision-friendly transformer. In Proceedings of the IEEE International Conference on Computer Vision, 589–598.
[9]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https://arxiv.org/abs/1810.04805
[10]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https://arxiv.org/abs/2010.11929
[11]
Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. 2021. Multiscale vision transformers. In Proceedings of the IEEE International Conference on Computer Vision, 6824–6835.
[12]
Quanfu Fan, Chun-Fu (Richard) Chen, Hilde Kuehne, Marco Pistoia, and David Cox. 2019. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In Proceedings of the Advances in Neural Information Processing Systems, 2261–2270.
[13]
Christoph Feichtenhofer. 2020. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 203–213.
[14]
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, 6202–6211.
[15]
Zhanzhou Feng, Jiaming Xu, Lei Ma, and Shiliang Zhang. 2024. Efficient video transformers via spatial-temporal token merging for action recognition. ACM Transactions on Multimedia Computing, Communications and Applications 20, 4 (2024), 1–21.
[16]
Xiong Gao, Zhaobin Chang, Xingcheng Ran, and Yonggang Lu. 2024. CANet: Comprehensive attention network for video-based action recognition. Knowledge-Based Systems 296 (2024), 111852.
[17]
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677. Retrieved from https://arxiv.org/abs/1706.02677
[18]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. 2017. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, 5842–5850.
[19]
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-transformer. arXiv:1902.09113. Retrieved from https://arxiv.org/abs/1902.09113
[20]
Yanbin Hao, Hao Zhang, Chong-Wah Ngo, and Xiangnan He. 2022. Group contextualization for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 928–938.
[21]
Aashni Haria, Archanasri Subramanian, Nivedhitha Asokkumar, Shristi Poddar, and Jyothi S. Nayak. 2017. Hand gesture recognition for human computer interaction. Procedia Computer Science 115 (2017), 367–374.
[22]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
[23]
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732.
[24]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60, 6 (2017), 84–90.
[25]
Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. 2023. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023), 12581–12600.
[26]
[26] Xianhang Li, Yali Wang, Zhipeng Zhou, and Yu Qiao. 2020. SmallBigNet: Integrating core and contextual views for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1092–1101.
[27] Xinyu Li, Yanyi Zhang, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic, and Joseph Tighe. 2021. VidTr: Video transformer without convolutions. In Proceedings of the IEEE International Conference on Computer Vision, 13557–13567.
[28] Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. 2020. TEA: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 909–918.
[29] Yin Li, Miao Liu, and James M. Rehg. 2018. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision, 619–635.
[30] Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, and Luc Van Gool. 2022. VRT: A video restoration transformer. arXiv:2201.12288. Retrieved from https://arxiv.org/abs/2201.12288
[31] Jing Lin, Yuanhao Cai, Xiaowan Hu, Haoqian Wang, Youliang Yan, Xueyi Zou, Henghui Ding, Yulun Zhang, Radu Timofte, and Luc Van Gool. 2022. Flow-guided sparse transformer for video deblurring. In Proceedings of the International Conference on Machine Learning, 13334–13343.
[32] Ji Lin, Chuang Gan, and Song Han. 2019. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, 7083–7093.
[33] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE International Conference on Computer Vision, 10012–10022.
[34] Zhaoyang Liu, Donghao Luo, Yabiao Wang, Limin Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Tong Lu. 2020. TEINet: Towards an efficient architecture for video recognition. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 11669–11676.
[35] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video swin transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3202–3211.
[36] Zhaoyang Liu, Limin Wang, Wayne Wu, Chen Qian, and Tong Lu. 2021. TAM: Temporal adaptive module for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, 13708–13718.
[37] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. 2016. Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, 4898–4906.
[38] Vittorio Mazzia, Simone Angarano, Francesco Salvetti, Federico Angelini, and Marcello Chiaberge. 2022. Action transformer: A self-attention model for short-time pose-based human action recognition. Pattern Recognition 124 (2022), 108487.
[39] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. 2021. Video transformer network. In Proceedings of the IEEE International Conference on Computer Vision, 3163–3172.
[40] Leila Panahi and Vahid Ghods. 2018. Human fall detection using machine vision techniques on RGB–D images. Biomedical Signal Processing and Control 44 (2018), 146–153.
[41] Mengshi Qi, Yunhong Wang, Jie Qin, Annan Li, Jiebo Luo, and Luc Van Gool. 2019. StagNet: An attentive semantic RNN for group activity and individual action recognition. IEEE Transactions on Circuits and Systems for Video Technology 30, 2 (2019), 549–565.
[42] Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, 5533–5541.
[43] Qinghongya Shi, Hong-Bo Zhang, Zhe Li, Ji-Xiang Du, Qing Lei, and Jing-Hua Liu. 2022. Shuffle-invariant network for action recognition in videos. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 3 (2022), 1–18.
[44] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Retrieved from https://arxiv.org/abs/1212.0402
[45] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. 2021. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 7262–7272.
[46] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4489–4497.
[47] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. 2019. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 5552–5561.
[48] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6450–6459.
[49] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008), 2579–2605.
[50] Bin Wang, Faliang Chang, Chunsheng Liu, Wenqian Wang, and Ruiyi Ma. 2024. An efficient motion visual learning method for video action recognition. Expert Systems with Applications 255 (2024), 124596.
[51] Heng Wang, Du Tran, Lorenzo Torresani, and Matt Feiszli. 2020. Video modeling with correlation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 352–361.
[52] Limin Wang, Wei Li, Wen Li, and Luc Van Gool. 2018. Appearance-and-relation networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1430–1439.
[53] Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. 2021. TDN: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1895–1904.
[54] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
[55] Xiaohan Wang, Yu Wu, Linchao Zhu, and Yi Yang. 2020. Symbiotic attention with privileged information for egocentric action recognition. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 12249–12256.
[56] Zhengwei Wang, Qi She, and Aljosa Smolic. 2021. Action-net: Multipath excitation for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 13214–13223.
[57] Wangmeng Xiang, Chao Li, Biao Wang, Xihan Wei, Xian-Sheng Hua, and Lei Zhang. 2022. Spatiotemporal self-attention modeling with temporal patch shift for action recognition. In Proceedings of the European Conference on Computer Vision, 627–644.
[58] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. 2018. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision, 305–321.
[59] Haotian Xu, Xiaobo Jin, Qiufeng Wang, Amir Hussain, and Kaizhu Huang. 2022. Exploiting attention-consistency loss for spatial-temporal stream action recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 2 (2022), 1–15.
[60] Jin Yuan, Shikai Chen, Yao Zhang, Zhongchao Shi, Xin Geng, Jianping Fan, and Yong Rui. 2023. Graph attention transformer network for multi-label image classification. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 4 (2023), 1–16.
[61] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. In Proceedings of the Advances in Neural Information Processing Systems, 17283–17297.
[62] Hao Zhang, Lechao Cheng, Yanbin Hao, and Chong-Wah Ngo. 2022. Long-term leap attention, short-term periodic shift for video classification. In Proceedings of the ACM International Conference on Multimedia, 5773–5782.
[63] Hao Zhang, Yanbin Hao, and Chong-Wah Ngo. 2021. Token shift transformer for video classification. In Proceedings of the ACM International Conference on Multimedia, 917–925.
[64] Weigang Zhang, Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, and Qingming Huang. 2023. Temporal dynamic concept modeling network for explainable video event recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6 (2023), 1–22.

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 21, Issue 3, March 2025, 673 pages
    EISSN: 1551-6865
    DOI: 10.1145/3703019

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 10 March 2025
    Online AM: 17 January 2025
    Accepted: 10 December 2024
    Revised: 20 November 2024
    Received: 26 March 2024
    Published in TOMM Volume 21, Issue 3

    Author Tags

    1. Action recognition
    2. mixed attention
    3. random attention
    4. channel shift

    Qualifiers

    • Research-article

    Funding Sources

    • Exploratory Research Project of Zhejiang Lab
    • National Natural Science Foundation of China
    • National Key R&D Program of China
    • Key R&D Program of Xinjiang, China
    • Natural Science Foundation of Shandong Province
