Abstract
Video action recognition must capture subtle differences in spatio-temporal features to distinguish similar actions. We propose the rethinking spatio-temporal cross attention transformer (STAR++), a multi-modal transformer-based model that uses both RGB and skeleton information and extends the original STAR-Transformer. STAR++ unifies the encoder-decoder structure of the base spatio-temporal cross attention transformer (STAR-Transformer) into a single encoder and introduces a new way of applying interval attention as spatio-temporal cross attention. As the layers deepen, interval attention shifts from local to global features, which suits the representational properties of transformers and improves performance. STAR++ also proposes a deformable 3D token selection that dynamically selects the tokens used in the attention operation, allowing tokens to be learned efficiently. The proposed STAR++ achieves competitive performance compared with state-of-the-art models on the Penn Action and NTU RGB+D 60 and 120 action recognition benchmarks. An ablation study further confirms that each proposed module contributes substantially to the performance improvement.
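To make the mechanisms named above concrete, the following is a minimal PyTorch sketch, not the authors' released implementation; all module names, parameter names, and tensor shapes are illustrative assumptions. It shows (i) a deformable 3D token selection that samples a small set of tokens from a (T, H, W) feature volume at learned normalized locations and (ii) one cross attention block in which RGB and skeleton token streams attend to each other.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableTokenSelect3D(nn.Module):
    """Predict normalized (t, y, x) sampling locations from a pooled query and
    sample the 3D feature volume there, yielding dynamically chosen tokens.
    (Deformable attention usually predicts offsets relative to reference
    points; absolute locations are used here only to keep the sketch short.)"""
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.num_tokens = num_tokens
        self.offset_head = nn.Linear(dim, num_tokens * 3)  # 3 coords per token

    def forward(self, volume):                            # volume: (B, C, T, H, W)
        b, c, t, h, w = volume.shape
        query = volume.mean(dim=(2, 3, 4))                # (B, C) global context
        locs = self.offset_head(query).tanh()             # (B, K*3) in [-1, 1]
        grid = locs.view(b, self.num_tokens, 1, 1, 3)     # normalized sampling grid
        sampled = F.grid_sample(volume, grid, align_corners=True)  # (B, C, K, 1, 1)
        return sampled.reshape(b, c, self.num_tokens).transpose(1, 2)  # (B, K, C)

class CrossAttentionBlock(nn.Module):
    """One encoder block: RGB tokens attend to skeleton tokens and vice versa."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_skel = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_skel = nn.LayerNorm(dim)

    def forward(self, rgb, skel):
        # queries come from one modality, keys/values from the other
        rgb = rgb + self.attn_rgb(self.norm_rgb(rgb), skel, skel)[0]
        skel = skel + self.attn_skel(self.norm_skel(skel), rgb, rgb)[0]
        return rgb, skel

# Usage: select K = 49 tokens from a clip feature volume, then fuse them
# with 25 skeleton joint tokens (all shapes are arbitrary examples).
volume = torch.randn(2, 256, 8, 14, 14)   # (B, C, T, H, W) backbone features
skel = torch.randn(2, 25, 256)            # (B, joints, C) skeleton tokens
rgb_tokens = DeformableTokenSelect3D(256, num_tokens=49)(volume)
rgb_out, skel_out = CrossAttentionBlock(256)(rgb_tokens, skel)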
Code Availability
Researchers or other interested parties are welcome to contact the corresponding author, B.C.K., for further explanation; the Python code may also be provided upon request.
Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2022R1I1A3058128).
Funding
This study was funded by the Basic Science Research Program through the National Research Foundation of Korea (NRF), Ministry of Education (2022R1I1A3058128).
Author information
Authors and Affiliations
Contributions
D.A. and S.K. were responsible for the design and overall investigation. B.C.K. was responsible for the data curation, supervision, and writing and editing of the manuscript.
Corresponding author
Correspondence to B.C. Ko.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Ethical approval
Not applicable
Consent to participate
Not applicable
Consent for publication
Not applicable
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ahn, D., Kim, S. & Ko, B.C. STAR++: Rethinking spatio-temporal cross attention transformer for video action recognition. Appl Intell 53, 28446–28459 (2023). https://doi.org/10.1007/s10489-023-04978-7
DOI: https://doi.org/10.1007/s10489-023-04978-7