ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos
1 Introduction
The remarkable advancements in deep learning have revolutionized action recogni-
tion, particularly with the advent of supervised learning protocols. However, acquir-
ing a substantial number of annotated videos remains a challenge in practice since
* This research is supported by the Global Research Excellence Scholarship, Monash Univer-
sity, Malaysia. This research is also supported, in part, by the Global Excellence and Mobility
Scholarship (GEMS), Monash University, Malaysia & Australia.
Fig. 1: (a) Class-wise difference in performance between ResNet-B and VIT-S, evaluated under a supervised training scenario with only 1% labeled videos in the Kinetics-400 dataset. (b) Class-wise difference in performance between VIT-B and VIT-S under the same setting.
it is time-consuming and expensive [16, 38]. Each day, video-sharing platforms like
YouTube and Instagram witness millions of new video uploads. Leveraging this vast
pool of unlabeled videos presents a significant opportunity for semi-supervised learn-
ing approaches, promising substantial benefits for advancing action recognition capa-
bilities [19, 36].
A typical method for leveraging unlabeled data involves assigning pseudo-labels to
them and effectively treating them as ground truth during training [12, 21, 29]. Current
methodologies typically involve training a model on annotated data and subsequently
employing it to make predictions on unlabeled videos. When predictions exhibit high
confidence levels, they are adopted as pseudo-labels for the respective videos, guid-
ing further network training. However, the efficacy of this approach hugely depends on
the quantity and accuracy of the pseudo-labels generated. Unfortunately, the inherent
limitations in discriminating patterns from a scant amount of labeled data often result
in subpar pseudo-labels, ultimately impeding the potential benefits gleaned from unla-
beled data.
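For concreteness, a minimal PyTorch-style sketch of this confidence-thresholded pseudo-labeling step is given below; the function name, tensor shapes, and threshold value are illustrative assumptions rather than any particular method's implementation.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, weak_clips, strong_clips, tau=0.8):
    """Confidence-thresholded pseudo-labeling on a batch of unlabeled clips.

    weak_clips / strong_clips: two augmented views of the same unlabeled
    videos, shaped (B, C, T, H, W); `model` is any video classifier.
    """
    with torch.no_grad():                      # pseudo-labels receive no gradient
        probs = F.softmax(model(weak_clips), dim=1)
        conf, pseudo = probs.max(dim=1)        # confidence and hard pseudo-label
        mask = (conf >= tau).float()           # keep only confident predictions

    logits = model(strong_clips)               # train on the strongly augmented view
    loss = F.cross_entropy(logits, pseudo, reduction="none")
    return (mask * loss).mean()
```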
To enhance the utilization of unlabeled videos, our approach draws inspiration from
recent studies, particularly from [33], which introduced an auxiliary model to provide
complementary learning. We also introduce complementary learning but with notable
advancements. Firstly, we introduce a cross-architecture strategy, leveraging both 3D
CNNs and transformer models’ strengths, unlike CMPL [33], which relies solely on
3D CNNs. This is because both 3D CNNs and video transformers (VIT) offer distinct
advantages in action recognition. As shown in Fig. 1a, videos for activities such as "playing the guitar" in the Kinetics-400 dataset exhibit short-range temporal dependencies: they involve actions or events that occur over a relatively short duration and require capturing temporal context within a limited time-frame, and they perform better with 3D CNNs. This is because 3D CNNs excel at capturing spatial features
and local dependencies in the temporal domain due to their intrinsic property, which in-
volves processing spatio-temporal information through convolutions.
On the other hand, transformer architectures, leveraging self-attention mechanisms,
can naturally capture long-range dependencies by allowing each token to learn atten-
tion across the entire sequence. As shown in Fig. 1a, videos such as the "yoga" class in the Kinetics-400 dataset exhibit long-range temporal dependencies, involving actions or events that unfold gradually and require capturing temporal context over extended periods; these classes perform better with the transformer model. This intrinsic property enables transformers to capture complex relationships
and interactions between distant frames, leading to a more holistic understanding of the
action context. This capability enables transformers to encode meaningful context in-
formation into video representations, facilitating a deeper understanding of the temporal
dynamics and interactions within the video sequence.
Besides that, CMPL [33] also suggests that smaller models excel at capturing tem-
poral dynamics in action recognition. In comparison, larger models are more adept at
learning spatial semantics to differentiate between various action instances. Motivated
by this approach, we chose to leverage the advantages of a smaller transformer model,
VIT-S, over its larger counterpart, VIT-B. As depicted in Fig. 1b and further studied
in Section S2 in the Supplementary Material, a smaller model, despite its smaller ca-
pacity, does obtain significant improvements over a bigger model in certain classes.
While VIT-B excels at capturing spatial semantics, it is essential to note that our pri-
mary model, 3D-ResNet50, already possesses these strong capabilities. The 3D con-
volutional nature of ResNet-50 makes it well-suited for extracting spatial features and
local dependencies within the temporal domain. Therefore, the inclusion of VIT-S as an
auxiliary model complements the strengths of our primary model by focusing on cap-
turing temporal dynamics, which aligns with our primary objective of addressing action
recognition in videos. This strategic combination allows our ActNetFormer framework
to achieve a balanced representation learning, leveraging the spatial semantics captured
by 3D-ResNet50 and the temporal dynamics captured by VIT-S. As demonstrated in our
ablation study (Section 7.2), this integration of VIT-S as an auxiliary model consistently
leads to better results compared to adopting VIT-B. Hence, while VIT-B remains essen-
tial, its role is effectively supported by the capabilities of our primary model, thereby
justifying our choice of prioritizing VIT-S within the ActNetFormer framework.
Furthermore, our method also incorporates video-level contrastive learning, en-
abling the model to glean stronger representations at the spatio-temporal level. Hence,
our cross-architecture pseudo-labeling approach is utilized to capture distinct aspects
of action representation from both the 3D CNNs and transformer architectures, while
cross-architecture contrastive learning aims explicitly to align the representations and
discover mutual information in global high-level representations across these architec-
tures. More experimental details about the cross-architecture strategy are included in
Section S1.1 in the Supplementary Material.
The main contributions of this work are twofold and are listed as follows:
[Fig. 2: Overview of the ActNetFormer framework. Weakly and strongly augmented clips are processed by the primary model (3D CNN, 3D-ResNet50) and the auxiliary model (video transformer, VIT-S). Each model pseudo-labels the other's strongly augmented view under an unsupervised loss with stop-gradient (SG), while cross-architecture contrastive learning maximizes agreement between the two architectures' representations of the same video and minimizes agreement otherwise.]
2 Related works
Action recognition has advanced significantly with deep learning architectures like
CNNs, Recurrent Neural Networks (RNNs), Long Short-term Memory (LSTM), and
Transformers. CNNs capture spatial information, while RNNs capture temporal dependencies. Meanwhile, Transformers, well known from NLP tasks, are excellent at capturing long-range dependencies. Varshney et al. [27] proposed a CNN model combining spatial and temporal information using different fusion schemes for human activity recognition in videos. Bilal et al. [3] employ hybrid CNN-LSTM models and transfer
learning for recognizing overlapping human actions in long videos. Vision Transformer
(ViT) [8] treats images as sequences of patches, achieving competitive performance on
image classification tasks. Arnab et al. [1] extend Transformers to video classification,
while Bertasius et al. [2] introduce TimeSformer, a convolution-free approach to video classification built exclusively on self-attention over space and time. TimeSformer achieves state-of-the-art (SOTA) results on action recognition
benchmarks like Kinetics-400 and Kinetics-600, offering faster training and higher effi-
ciency. Besides that, TimeSformer can also achieve good results even without pretrain-
ing. However, achieving these results may require more extensive data augmentation
and longer training periods.
Action recognition in computer vision is vital across various applications, yet it of-
ten suffers from limited labeled data. Semi-supervised learning (SSL) methods pro-
vide a solution by utilizing both labeled and unlabeled data to enhance model per-
formance [21, 33]. These approaches exploit the abundance of unlabeled video data
available online. Wu et al. [30] proposed NCCL, a neighbor-guided consistent and contrastive learning method for semi-supervised video-based action recogni-
tion. Xu et al. [33] introduced CMPL, employing cross-model predictions to gener-
ate pseudo-labels and improve model performance. Singh et al. [20] leverage unsuper-
vised videos played at different speeds to address limited labeled data. Xiao et al. [31]
enhance semi-supervised video action recognition by incorporating temporal gradient
information alongside RGB data. Jing et al. [13] use pseudo-labels from CNN confi-
dences and normalized probabilities to guide training, achieving impressive results with
minimal labeled data. Gao et al. [10] introduced an end-to-end semi-supervised Differ-
entiated Auxiliary guided Network (DANet) for action recognition. Xiong et al. [32]
introduce multi-view pseudo-labeling, leveraging appearance and motion cues for im-
proved SSL. Tong et al. [25] propose TACL, employing temporal action augmentation,
action consistency learning, and action curriculum pseudo-labeling for enhanced SSL.
These advancements demonstrate the potential of SSL techniques in boosting action
recognition performance, especially in scenarios with limited labeled data.
Contrastive learning has become a popular approach, especially in computer vision [15].
Unlike supervised methods, contrastive learning operates on unlabeled data, maximiz-
ing agreement between similar samples while minimizing it between dissimilar ones [18,
23]. It fosters a feature space where similar instances are clustered and dissimilar ones
are separated. By optimizing a similarity metric using positive (similar) and negative
(dissimilar) sample pairs, contrastive learning extracts meaningful features beneficial
for tasks like classification and object detection. Its advantage lies in learning from vast
unlabeled data, making it suitable for scenarios with limited labeled data [7, 24]. Guo
et al. [11] propose AimCLR, a contrastive learning-based self-supervised action repre-
sentation framework. They enhance positive sample diversity and minimize distribution
divergence, achieving superior performance. The method in [37] also proposes a hierar-
chical matching model for few-shot action recognition, leveraging contrastive learning
to enhance video similarity measurements across multiple levels. Rao et al. [17] intro-
duce AS-CAL, a contrastive learning-based approach for action recognition using 3D
skeleton data. It learns inherent action patterns across various transformations, facilitat-
ing effective representation of human actions.
3 Method
Given a labeled dataset X containing N_l videos, each paired with a corresponding label (x_i, y_i), and an unlabeled dataset U comprising N_u videos, ActNetFormer efficiently learns an action recognition model by utilizing both labeled and unlabeled data. Typically, the size of the unlabeled dataset N_u is greater than that of the labeled dataset N_l. We provide a brief description of the pseudo-labeling method in Section 3.3, introduce the proposed ActNetFormer framework and how contrastive learning operates within it in Section 3.4, and delve into the implementation details of ActNetFormer in Section 4.
In equation (1), B_u denotes the batch size, τ is the threshold used to indicate whether a prediction is reliable, 1(·) denotes the indicator function, q_i = Z(G_w(u_i)) represents the class distribution, and ŷ_i = arg max(q_i) denotes the pseudo-label. G_s(·) and G_w(·) respectively denote the strong and weak augmentation processes. H(·, ·) represents the standard cross-entropy loss. L_u represents the loss on the unlabeled data, while the loss on the labeled data is the cross-entropy loss typically used in action recognition.
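Although equation (1) itself is not reproduced in this excerpt, the symbol definitions above (together with Eq. (4) below) imply the standard confidence-thresholded form; a hedged reconstruction is:

```latex
% Reconstruction of the unlabeled-data loss referred to as Eq. (1),
% inferred from the symbol definitions above; not copied verbatim.
L_u = \frac{1}{B_u} \sum_{i=1}^{B_u}
      \mathbb{1}\left(\max(q_i) \ge \tau\right)
      H\left(\hat{y}_i,\, Z(G_s(u_i))\right) \quad (1)
```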
L_s^Z = \frac{1}{B_l} \sum_{i=1}^{B_l} H\left(y_i,\, Z(G_n^Z(x_i))\right) \quad (2)

L_s^A = \frac{1}{B_l} \sum_{i=1}^{B_l} H\left(y_i,\, A(G_n^A(x_i))\right) \quad (3)
where Gn (·) denotes the conventional data augmentation method employed in [9,
28].
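As a small illustration, the two supervised terms amount to standard cross-entropy on the conventionally augmented labeled clips; the sketch below is an assumption-laden simplification (in practice each model receives its own augmented view G_n^Z or G_n^A).

```python
import torch.nn.functional as F

def supervised_losses(primary, auxiliary, clips, labels):
    """Illustrative supervised terms L_s^Z and L_s^A (Eqs. 2-3) on labeled clips."""
    loss_primary = F.cross_entropy(primary(clips), labels)      # L_s^Z
    loss_auxiliary = F.cross_entropy(auxiliary(clips), labels)  # L_s^A
    return loss_primary, loss_auxiliary
```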
Training on unlabeled data. When presented with an unlabeled video u_i, the auxiliary model A(·) generates predictions on the weakly augmented version of u_i, producing category-wise probabilities q_i^A = A(G_w(u_i)). If the maximum probability, max(q_i^A), exceeds a predefined threshold τ, the prediction is considered reliable. In such cases, we use q_i^A to infer the pseudo ground-truth label ŷ_i^A = arg max(q_i^A) for the strongly augmented u_i, allowing the primary model Z(·) to learn effectively:
L_u^Z = \frac{1}{B_u} \sum_{i=1}^{B_u} \mathbb{1}\left(\max(q_i^A) \ge \tau\right) H\left(\hat{y}_i^A,\, Z(G_s(u_i))\right) \quad (4)
where B_u represents the batch size and H(·, ·) denotes the cross-entropy loss. Similarly, the primary model produces a prediction q_i^Z = Z(G_w(u_i)), which is then utilized to create a labeled pair (ŷ_i^Z, G_s(u_i)) for the auxiliary model:
L_u^A = \frac{1}{B_u} \sum_{i=1}^{B_u} \mathbb{1}\left(\max(q_i^Z) \ge \tau\right) H\left(\hat{y}_i^Z,\, A(G_s(u_i))\right) \quad (5)
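A minimal PyTorch-style sketch of this cross-architecture pseudo-labeling (Eqs. (4)-(5)) follows; for brevity both models receive the same augmented tensors here, whereas the actual framework also varies frame rates between them (Section 4), and all names and shapes are assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def cross_pseudo_losses(primary, auxiliary, weak, strong, tau=0.8):
    """Each model pseudo-labels the weakly augmented clips for the *other*
    model, which is then trained on the strongly augmented clips."""
    with torch.no_grad():                       # stop-gradient on pseudo-labels
        q_a = F.softmax(auxiliary(weak), dim=1) # auxiliary predictions
        q_z = F.softmax(primary(weak), dim=1)   # primary predictions
        conf_a, y_a = q_a.max(dim=1)
        conf_z, y_z = q_z.max(dim=1)

    # L_u^Z: primary learns from the auxiliary's confident pseudo-labels
    loss_z = (F.cross_entropy(primary(strong), y_a, reduction="none")
              * (conf_a >= tau).float()).mean()
    # L_u^A: auxiliary learns from the primary's confident pseudo-labels
    loss_a = (F.cross_entropy(auxiliary(strong), y_z, reduction="none")
              * (conf_z >= tau).float()).mean()
    return loss_z, loss_a
```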
Contrastive learning. The goal is to train the primary and auxiliary models using
limited supervision initially, which can effectively analyze a vast collection of unla-
beled videos to enhance activity understanding. Our cross-architecture pseudo-labeling
approach already leverages two different architectures to capture different aspects of
action representations as mentioned in Section 3.4. Contrastive learning is incorporated
to encourage the models further to extract complementary features from the input data,
leading to more comprehensive representations of actions. A 3D CNN and a video transformer process the input video clip differently and produce a unique representation of
the video content. In other words, the features extracted by each architecture capture
different aspects of the video, such as spatial and temporal information. This diversity
in representations can be advantageous as it allows the model to learn from multiple
perspectives, potentially leading to a more comprehensive understanding of the action
sequences in the videos. Therefore, cross-architecture contrastive learning is employed
to discover the mutual information that coexists between both the representation en-
coding generated by the 3D CNN and the video transformer model. It is worth noting
that our framework uses weakly augmented samples from each architecture for cross-
architecture contrastive learning, inspired by [31].
Consider a mini-batch with B_u unlabeled videos. Here, m(u_i^Z) represents the video clip processed by the primary model, while m(u_i^A) represents the video clip processed by the auxiliary model; m can thus be interpreted as the function that generates representations of the input videos through the respective models. These two representations form the positive pair. For the remaining B_u − 1 videos, m(u_i^Z) and m(u_k^q) form negative pairs, where the representation of the k-th video can come from either architecture (i.e., q ∈ {Z, A}). Given that the negative pairs comprise different videos with distinct content, the representations of different videos within each architecture are pushed apart.
This is facilitated by utilizing a contrastive loss (Lca ) adapted from [5, 20], as outlined
below.
L_{ca}(u_i^Z, u_i^A) = -\log \frac{h\left(m(u_i^Z),\, m(u_i^A)\right)}{h\left(m(u_i^Z),\, m(u_i^A)\right) + \sum_{k=1}^{B_u} \sum_{q \in \{Z,A\}} \mathbb{1}_{\{k \neq i\}}\, h\left(m(u_i^Z),\, m(u_k^q)\right)} \quad (6)

where h(u, v) = \exp\left(\frac{u^{\top} v}{\lVert u \rVert_2 \lVert v \rVert_2} / \tau\right) represents the exponential of the cosine similarity between vectors u and v, and τ denotes the temperature hyperparameter.
The final contrastive loss is calculated for all positive pairs (u_i^Z, u_i^A), where u_i^Z and u_i^A denote the clip views fed to the primary and auxiliary models, respectively. The loss function is engineered to reduce the similarity not just among different videos processed within individual architectures but also across both architectural models.
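A compact sketch of such a cross-architecture contrastive loss, in the spirit of Eq. (6), is shown below; the projection of clips to feature vectors, the temperature value, and the function name are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_arch_contrastive_loss(z_primary, z_auxiliary, temperature=0.1):
    """z_primary, z_auxiliary: (B, D) clip representations of the same B videos
    produced by the two architectures."""
    z = F.normalize(torch.cat([z_primary, z_auxiliary], dim=0), dim=1)  # (2B, D)
    sim = torch.exp(z @ z.t() / temperature)                            # pairwise h(u, v)
    b = z_primary.size(0)

    # Positive pairs: the same video seen through the other architecture.
    pos = torch.cat([sim.diag(b), sim.diag(-b)], dim=0)                 # (2B,)

    # Denominator: all other representations from either architecture (no self-similarity).
    mask = ~torch.eye(2 * b, dtype=torch.bool, device=z.device)
    denom = (sim * mask).sum(dim=1)
    return (-torch.log(pos / denom)).mean()
```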
L = \left(L_s^Z + L_s^A\right) + \gamma \cdot \left(L_u^Z + L_u^A\right) + \beta \cdot L_{ca} \quad (7)

where γ and β are the weights of the cross-architecture unsupervised losses and the contrastive learning loss, respectively.
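Putting the pieces together, the overall objective of Eq. (7) can be sketched as follows (γ = β = 2 as reported in Section 5; the variable names are illustrative):

```python
def total_loss(ls_z, ls_a, lu_z, lu_a, l_ca, gamma=2.0, beta=2.0):
    """Combine the terms of Eq. (7): supervised, cross-architecture
    unsupervised, and contrastive losses, weighted by gamma and beta."""
    return (ls_z + ls_a) + gamma * (lu_z + lu_a) + beta * l_ca
```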
4 Implementation
4.1 Auxiliary Model
As mentioned in Section 3.4, the auxiliary model should possess distinct learning ca-
pabilities compared to the primary model in order to offer complementary representa-
tions. Hence, we utilize VIT-S, which is the smaller version of the bigger transformer
model (VIT-B). Comprehensive ablation studies (in the next section) show the supe-
riority of VIT-S w.r.t. the transformer model (VIT-B) and the smaller 3D CNN model
(3D-ResNet18). Unless otherwise specified, we utilize 3D-ResNet50 as the primary and
VIT-S as the auxiliary models, respectively. More details of these models are included
in Section S3 in the Supplementary Material.
Our ActNetFormer framework incorporates variations in frame rates for temporal data
augmentations inspired by prior research in [20, 34]. While the primary model operates
at a lower frame rate, the auxiliary model is provided with a higher one. This variation
in frame rates allows for exploring different speeds in video representations. Despite
the differences in playback speeds, the videos maintain the same semantics, maximiz-
ing the similarity between their representations. This approach offers complementary
benefits by leveraging both slower and faster frame rates between the primary and aux-
iliary models. Consequently, this contributes to improving the overall performance of
our ActNetFormer framework in action recognition. Additional analysis of the spatial and temporal augmentations is provided in Section S1.2 in the Supplementary Material.
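A rough sketch of this frame-rate variation is given below; the specific strides and frame counts are assumptions for illustration, not the paper's exact sampling settings.

```python
import torch

def sample_two_rate_clips(video, num_frames=8, slow_stride=8, fast_stride=2):
    """Sample the same decoded video (C, T, H, W) at two frame rates: a lower
    rate (larger stride) for the primary model and a higher rate for the
    auxiliary model, so both clips share the same semantics."""
    T = video.shape[1]
    def sample(stride):
        span = num_frames * stride
        start = torch.randint(0, max(T - span, 1), (1,)).item()
        idx = torch.arange(start, start + span, stride).clamp(max=T - 1)
        return video[:, idx]                          # (C, num_frames, H, W)
    return sample(slow_stride), sample(fast_stride)   # primary clip, auxiliary clip
```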
5 Experiments
5.1 Dataset
The Kinetics-400 dataset [14] comprises a vast collection of human action videos, en-
compassing around 245,000 training samples and 20,000 validation samples across 400
distinct action categories. Following established methodologies like MvPL [32] and
CMPL [33], we adopt a labeling rate of 1% or 10%, selecting 6 or 60 labeled training
videos per category. Additionally, the UCF-101 dataset [22] offers 13,320 video sam-
ples spread across 101 categories. We also sample 1 or 10 samples in each category as
the labeled set following CMPL [33].
5.2 Baseline
For our primary model, we utilize the 3D-ResNet50 from [9]. We employ the ViT [8]
extended with the video TimeSformer [2] as the auxiliary model in our ActNetFormer
approach. While most hyperparameters remain consistent with the baseline, we utilize
the divided space-time attention mechanism, as mentioned in TimeSformer [2]. How-
ever, TimeSformer only offers the larger transformer model (VIT-B); hence, we adopt a smaller transformer model (VIT-S) inspired by DeiT-S [26], with an embedding dimension of 384 and 6 attention heads. More details on the structure of the primary and auxiliary models are
included in Section S3 in the Supplementary Material.
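For reference, the auxiliary-model configuration implied by DeiT-S and Table S4 in the Supplementary Material can be summarized as follows; the field names are illustrative and do not mirror the TimeSformer API.

```python
# Minimal sketch of the VIT-S auxiliary-model configuration (see Table S4).
vit_s_config = dict(
    patch_size=16,                   # 16 x 16 spatial patches
    depth=12,                        # number of transformer blocks
    embed_dim=384,                   # token embedding size
    num_heads=6,                     # attention heads
    mlp_dim=1536,                    # hidden size of the MLP in each block
    attention="divided_space_time",  # as in TimeSformer
    num_classes=400,                 # 400 for Kinetics-400, 101 for UCF-101
)
```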
During training, we utilize a stochastic gradient descent (SGD) optimizer with a mo-
mentum of 0.9 and a weight decay of 0.001. The confidence score threshold τ is set to 0.8. Parameters γ and β are both set to 2. Based on insights from the ablation study
in Section 7.1, we employ a batch ratio of 1:5 for labeled to unlabeled data, ensuring a
balanced and effective training process. A total of 250 training epochs are used. During
testing, consistent with the inference method employed in MvPL [32] and CMPL [33],
we uniformly sample five clips from each video and generate three distinct crops to
achieve a resolution of 224 × 224, covering various spatial areas within the clips. The
final prediction is obtained by averaging the softmax probabilities of these 5 × 3 predic-
tions. While both the primary and auxiliary models are optimized jointly during train-
ing, only the primary model is utilized for inference, thereby incurring no additional
inference cost. It is noteworthy that our ActNetFormer approach does not rely on pre-
training or pre-trained weights, setting it apart from other methods and underscoring its
uniqueness in the field of action recognition in videos.
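The multi-view inference protocol described above can be sketched as follows; the tensor layout and helper name are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_view_inference(primary, clips):
    """5 uniformly sampled clips x 3 spatial crops per video, with softmax
    scores averaged (Section 5). `clips` holds the 15 pre-cropped views of
    one video as a (15, C, T, 224, 224) tensor."""
    with torch.no_grad():
        probs = F.softmax(primary(clips), dim=1)  # (15, num_classes)
    return probs.mean(dim=0).argmax().item()      # final class prediction
```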
6 Results
The backbone column in Table 1 denotes the primary model used in the respective
methods. We present the top-1 accuracy as our chosen evaluation metric. The "Input" category indicates the data format utilized during training, with "V" representing raw RGB video, "F" denoting optical flow, and "G" indicating temporal gradient. ActNet-
Former consistently performs better than various SOTA methods, including FixMatch,
TCL, MvPL, TACL, CMPL, NCCL, DANet, and LTG, across both datasets and la-
beling rates. The inclusion of contrastive learning in our approach yields a significant performance improvement, particularly at the 1% labeled-data setting: we observe an increase of approximately 4.60% on the UCF-101 and 4.37% on the Kinetics-400 dataset. This enhancement underscores the effectiveness of
incorporating contrastive learning, resulting in more robust representations. ActNet-
Former outperforms FixMatch by a large margin due to its novel cross-architecture
strategy, which leverages the strengths of both 3D CNN and VIT models, whereas Fix-
Match relies solely on its own architecture for label generation, potentially limiting its
adaptability. Our approach shares similarities with the CMPL approach. However, it
surpasses CMPL in several vital aspects. Firstly, our approach incorporates video-level
contrastive learning, which enables the model to learn more robust representations at the
video level. This enhanced representation leads to better performance in action recogni-
tion. Additionally, our approach leverages a cross-architecture strategy, combining the
strengths of both 3D CNN and VIT models. In contrast, CMPL leverages a cross-model
strategy which utilizes the strength of 3D CNN alone. By integrating spatial feature
extraction capabilities from CNNs with the attention mechanisms of transformers, our
approach achieves a more comprehensive understanding of both spatial and temporal
aspects of video data. Besides that, our approach achieves 80.0% on UCF-101 with 10% labeled data, and incorporating contrastive learning boosts this to 80.6%, slightly surpassing the 80.5% achieved by MvPL. Notably, our
approach relies solely on one modality, whereas MvPL exploits three modalities. De-
spite this discrepancy in input modalities, our approach demonstrates comparable per-
formance, indicating its efficiency in leveraging single-modality information for video
understanding tasks. This suggests that our approach may offer a more streamlined so-
lution than MvPL, which relies on multiple modalities to achieve similar performance
levels.
7 Ablation Studies
The confidence threshold τ acts as a criterion for determining which samples are considered confidently predicted by the model and thus eligible for inclusion in the training process.
Therefore, if the threshold is excessively high, fewer unlabeled samples may meet this
criterion, leading to under-utilization of unlabeled data and potentially compromising
model performance. Hence, we utilize 0.8 as the threshold for all the experiments in
this study.
Next, we evaluate the impact of the ratio between labeled and unlabeled samples in
a mini-batch on the final outcome. Specifically, we fix the number of labeled samples Bl
at 1 and randomly sample Bu unlabeled samples to form a mini-batch, where Bu varies
from {1, 3, 5, 7}. The outcomes are depicted in Fig. 3 (b), indicating that the model
performs best when Bu = 5. Lastly, we explore the selection of the loss weights γ and
β, as shown in Fig. 3 (c) and Fig. 3 (d) for the cross-architecture loss and contrastive
learning loss, respectively. We find that the optimal value for both γ and β is 2. Hence, we utilize γ = 2 and β = 2 for all the experiments.
8 Conclusion
References
1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: A video vision
transformer. In: Proceedings of the IEEE/CVF international conference on computer vision.
pp. 6936–6946 (2021)
2. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video under-
standing? In: ICML. vol. 3, p. 4 (2021)
3. Bilal, M., Maqsood, M., Yasmin, S., Hasan, N.U., Rho, S.: A transfer learning-based efficient
spatiotemporal human action recognition framework for long and overlapping action classes.
The Journal of Supercomputing 78(2), 2873–2908 (2022)
4. Bouthillier, X., Konda, K., Vincent, P., Memisevic, R.: Dropout as data augmentation (2016)
5. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning
of visual representations. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International
Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp.
1597–1607. PMLR (13–18 Jul 2020), https://proceedings.mlr.press/v119/
chen20j.html
6. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning aug-
mentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) (June 2019)
7. Dave, I., Gupta, R., Rizve, M.N., Shah, M.: Tclr: Temporal contrastive learning for video
representation. Computer Vision and Image Understanding 219, 103406 (2022)
8. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., De-
hghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words:
Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
9. Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In:
Proceedings of the IEEE/CVF international conference on computer vision. pp. 6202–6211
(2019)
10. Gao, G., Liu, Z., Zhang, G., Li, J., Qin, A.K.: Danet: Semi-supervised differentiated auxil-
iaries guided network for video action recognition. Neural Networks 158, 121–131 (2023)
11. Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learn-
ing from extremely augmented skeleton sequences for self-supervised action recogni-
tion. Proceedings of the AAAI Conference on Artificial Intelligence 36(1), 762–770
(Jun 2022). https://doi.org/10.1609/aaai.v36i1.19957, https://ojs.
aaai.org/index.php/AAAI/article/view/19957
12. Hu, Z., Yang, Z., Hu, X., Nevatia, R.: Simple: Similar pseudo label exploitation for semi-supervised classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15099–15107 (June 2021)
13. Jing, Y., Wang, F.: Tp-vit: A two-pathway vision transformer for video action recognition.
In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). pp. 2185–2189. IEEE (2022)
14. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F.,
Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint
arXiv:1705.06950 (2017)
15. Le-Khac, P.H., Healy, G., Smeaton, A.F.: Contrastive representation learning: A framework
and review. IEEE Access 8, 193907–193934 (2020)
16. Pareek, P., Thakkar, A.: A survey on video-based human action recognition: recent updates,
datasets, challenges, and applications. Artificial Intelligence Review 54, 2259–2322 (2021)
17. Rao, H., Xu, S., Hu, X., Cheng, J., Hu, B.: Augmented skeleton based contrastive action
learning with momentum lstm for unsupervised action recognition. Information Sciences
569, 90–109 (2021)
18. Shah, A., Sra, S., Chellappa, R., Cherian, A.: Max-margin contrastive learning. In: Proceed-
ings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 8220–8230 (2022)
19. Shen, H., Yan, Y., Xu, S., Ballas, N., Chen, W.: Evaluation of semi-supervised learning
method on action recognition. Multimedia Tools and Applications 74, 523–542 (2015)
20. Singh, A., Chakraborty, O., Varshney, A., Panda, R., Feris, R., Saenko, K., Das, A.: Semi-supervised action recognition with temporal contrastive learning. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10389–10399
(2021)
21. Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Ku-
rakin, A., Li, C.L.: Fixmatch: Simplifying semi-supervised learning with consistency and
confidence. Advances in neural information processing systems 33, 596–608 (2020)
22. Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from
videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
23. Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., Isola, P.: What makes for good views
for contrastive learning? Advances in neural information processing systems 33, 6837–6839
(2020)
24. Tian, Y.: Understanding deep contrastive learning via coordinate-wise optimization. Ad-
vances in Neural Information Processing Systems 35, 19511–19522 (2022)
25. Tong, A., Tang, C., Weng, W.: Semi-supervised action recognition from temporal aug-
mentation using curriculum learning. IEEE Transactions on Circuits and Systems for Video
Technology 33(3), 1301–1312 (2023). https://doi.org/10.1109/TCSVT.2022.
3310271
26. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-
efficient image transformers & distillation through attention. In: Meila, M., Zhang, T.
(eds.) Proceedings of the 38th International Conference on Machine Learning. Proceed-
ings of Machine Learning Research, vol. 139, pp. 10347–10357. PMLR (18–24 Jul 2021),
https://proceedings.mlr.press/v139/touvron21a.html
27. Varshney, N., Bakariya, B.: Deep convolutional neural model for human activities recog-
nition in a sequence of video by combining multiple cnn streams. Multimedia Tools and
Applications 81(29), 42117–42129 (2022)
28. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
29. Wang, Y., Wang, H., Shen, Y., Fei, J., Li, W., Jin, G., Wu, L., Zhao, R., Le, X.: Semi-
supervised semantic segmentation using unreliable pseudo-labels. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4248–
4257 (June 2022)
30. Wu, J., Sun, W., Gan, T., Ding, N., Jiang, F., Shen, J., Nie, L.: Neighbor-guided consistent
and contrastive learning for semi-supervised action recognition. IEEE Transactions on Image
Processing (2023)
31. Xiao, J., Jing, L., Zhang, L., He, J., She, Q., Zhou, Z., Yuille, A., Li, Y.: Learning from temporal gradient for semi-supervised action recognition. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 3252–3262 (2022)
32. Xiong, B., Fan, H., Grauman, K., Feichtenhofer, C.: Multiview pseudo-labeling for semi-
supervised learning from video. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV). pp. 7209–7219 (October 2021)
33. Xu, Y., Wei, F., Sun, X., Yang, C., Shen, Y., Dai, B., Zhou, B., Lin, S.: Cross-model pseudo-
labeling for semi-supervised action recognition. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition. pp. 2959–2968 (2022)
34. Yang, C., Xu, Y., Dai, B., Zhou, B.: Video representation learning with visual tempo consistency. arXiv preprint arXiv:2006.15599 (2020)
35. Zhang, B., Wang, Y., Hou, W., Wu, H., Wang, J., Okumura, M., Shinozaki, T.: Flexmatch:
Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural
Information Processing Systems 34, 18408–18419 (2021)
36. Zhang, T., Liu, S., Xu, C., Lu, H.: Boosted multi-class semi-supervised learning for human
action recognition. Pattern recognition 44(10-11), 2334–2342 (2011)
37. Zheng, S., Chen, S., Jin, Q.: Few-shot action recognition with hierarchical matching and
contrastive learning. In: European Conference on Computer Vision. pp. 297–313. Springer
(2022)
38. Zhu, Y., Li, X., Liu, C., Zolfaghari, M., Xiong, Y., Wu, C., Zhang, Z., Tighe, J., Man-
matha, R., Li, M.: A comprehensive study of deep video action recognition. arXiv preprint
arXiv:2012.06567 (2020)
ActNetFormer: Transformer-ResNet Hybrid Pipeline
for Semi-Supervised Action Recognition in Videos
(Supplementary Material) *
1 School of Information Technology, Monash University, Malaysia
2 Robotics and Autonomous Systems Group, TCS Research, India
{sharana.sureshdass, hrishav.barua, ganesh.krishnasamy,
raveendran.paramesran, raphael.phan}@monash.edu
The impact of spatial augmentation [1, 2] and temporal augmentation [4, 7] is as-
sessed in Table S2. The baseline denotes the removal of both spatial augmentation tech-
niques and temporal augmentation techniques across all branches. Under this condition,
there is a significant decrease in experimental performance. However, performance im-
proves when spatial or temporal augmentation is individually applied to the baseline.
The best result is achieved when both spatial and temporal augmentation are applied.
Fig. S1: Training accuracy curves over the 250 training epochs, illustrating the performance of the primary model of ActNetFormer (blue), the model used in FixMatch (red), and the auxiliary model of ActNetFormer (yellow). Evaluation is conducted on samples assigned pseudo-labels by the auxiliary model.
S2 Empirical Analysis
Table S3 shows the architecture of the primary model adapted from [3]. The net-
work architecture consists of one ResNetBasicStem layer responsible for initial con-
volution and pooling operations, followed by four ResStage blocks, each containing
multiple ResBlock layers for performing the main residual computations. Finally, the
architecture concludes with one ResNetBasicHead layer, which performs average pool-
ing, dropout, and linear projection to output the classes. The structure of our primary model, the 3D-ResNet50 network, is characterized by convolutional kernels
[Figure: Transformer encoder block of VIT-S, applied to the embedded patches: layer normalization followed by multi-head attention (6 heads) and a second layer normalization followed by an MLP (embedding size 384), each with a residual connection.]
Component | Description
Model Type | Video Transformer (VIT-S)
Patch Size | Input images are divided into patches of size 16 × 16
Number of Layers | 12
Embedding Size | 384
Number of Attention Heads | 6
Dropout Probability | 0.0
Activation Function | GELU (Gaussian Error Linear Unit)
Normalization | Layer normalization with ϵ = 1e−06
Positional Encoding | Positional encoding is implicit in the patch positions
Attention Mechanism | Multi-head self-attention
MLP Hidden Layer Size | 1536
Output | Linear layer with output size 101 for UCF-101 or 400 for Kinetics-400

Table S4: Architecture of the auxiliary model (VIT-S).
The input frames are divided into patches of size 16 × 16 pixels using a PatchEmbed layer, fa-
cilitating spatial decomposition. Following this, both positional and temporal dropout
layers are applied to the embedded patches, aiding in regularizing the model during
training.
The core of the model consists of a series of blocks, each of which comprises several
components. Within each block, layer normalization is applied to the input embeddings,
ensuring stable training dynamics. The self-attention mechanism is employed within
each block to capture spatial and temporal relationships from the input video data.
This attention mechanism is complemented by multi-layer perceptron (MLP) modules,
which enable the model to learn complex non-linear mappings between input and output
representations.
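A minimal PyTorch sketch of one such encoder block, using the dimensions listed in Table S4, is shown below; plain joint attention over the token sequence is used here for brevity, whereas the actual auxiliary model uses TimeSformer's divided space-time attention.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Transformer encoder block with the VIT-S dimensions from Table S4
    (embedding size 384, 6 heads, MLP hidden size 1536)."""
    def __init__(self, dim=384, heads=6, mlp_dim=1536):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, eps=1e-6)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, eps=1e-6)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):                                    # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # residual over attention
        return x + self.mlp(self.norm2(x))                   # residual over MLP
```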
Overall, the auxiliary model (VIT-S) leverages the power of self-attention and MLPs
to process both spatial and temporal information in video data. The final output of the
model is obtained by passing the transformed representations through a linear layer.
References
1. Bouthillier, X., Konda, K., Vincent, P., Memisevic, R.: Dropout as data augmentation (2016)
2. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmen-
tation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) (June 2019)
3. Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October
2019)
4. Singh, A., Chakraborty, O., Varshney, A., Panda, R., Feris, R., Saenko, K., Das, A.: Semi-supervised action recognition with temporal contrastive learning. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 10389–10399 (2021)
5. Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Ku-
rakin, A., Li, C.L.: Fixmatch: Simplifying semi-supervised learning with consistency and
confidence. Advances in neural information processing systems 33, 596–608 (2020)
6. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-
efficient image transformers & distillation through attention. In: International conference on
machine learning. pp. 11347–11357. PMLR (2021)
7. Yang, C., Xu, Y., Dai, B., Zhou, B.: Video representation learning with visual tempo consis-
tency. arXiv preprint arXiv:2006.15599 (2020)