CANet: Comprehensive Attention Network for video-based action recognition

Published: 19 July 2024

Abstract

Attention mechanisms play a crucial role in improving action recognition performance. A video, as a type of 3D data, can be effectively explored with attention mechanisms along the temporal, spatial, and channel dimensions. However, existing methods based on 2D CNNs tend to handle complex spatiotemporal information along only one or two of these dimensions, which ultimately hampers their overall performance. In this paper, we propose a novel Comprehensive Attention Network (CANet) to adaptively model spatiotemporal information in all three dimensions. CANet is composed of three core plug-and-play components, namely the Global Guided Short-term Motion Module (GG-SMM), the Second-order Guided Long-term Motion Module (SG-LMM), and the Spatial Motion Adaptive Module (SMAM). Specifically, (1) the GG-SMM module represents local motion cues in the short-term temporal dimension, improving the classification accuracy of fast-tempo actions; (2) the SG-LMM module jointly excites fine-grained motion information in the long-term temporal and channel dimensions, thereby facilitating the discrimination of long-term motions; and (3) the SMAM module highlights motion-sensitive regions in the spatial dimension by learning spatial object offsets. Extensive experiments have been conducted on four widely used action recognition benchmarks, namely Something-Something V1, Kinetics-400, UCF-101, and HMDB-51. Experimental results demonstrate that the proposed CANet achieves excellent performance compared with other state-of-the-art methods.
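The abstract describes the three modules only at a high level. As a rough illustration, the sketch below (a minimal PyTorch example under our own assumptions, not the authors' implementation) shows how plug-and-play attention components of this kind are commonly inserted into a 2D-CNN residual block; a generic channel-excitation gate stands in for GG-SMM, SG-LMM, and SMAM, whose actual designs are defined in the paper.

```python
# Hypothetical sketch, not the authors' code: it only illustrates how plug-and-play
# attention modules (the paper's GG-SMM, SG-LMM, and SMAM) are typically dropped into
# a 2D-CNN residual block, with a generic SE-style gate standing in for all three.
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """SE-style channel excitation, used here only as a placeholder attention module."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (N*T, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # global average pool over space, then gate
        return x * w.unsqueeze(-1).unsqueeze(-1)

class AttentionResBlock(nn.Module):
    """Residual block with a pluggable attention slot where CANet's modules would sit."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.attn = ChannelGate(channels)      # placeholder for GG-SMM / SG-LMM / SMAM

    def forward(self, x):
        out = self.relu(self.bn(self.conv(x)))
        out = self.attn(out)                   # attention applied inside the block
        return self.relu(out + x)              # residual connection keeps it plug-and-play

# Frames of a clip are folded into the batch dimension, as in TSM/TEA-style 2D CNNs.
clip = torch.randn(2 * 8, 64, 56, 56)         # (batch * T, C, H, W), T = 8 sampled frames
print(AttentionResBlock(64)(clip).shape)      # torch.Size([16, 64, 56, 56])
```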

Highlights

The GG-SMM is proposed to represent short-term temporal motion cues.
The SG-LMM is used to excite long-term temporal and channel motion features.
The SMAM is designed to focus on the spatial motion-sensitive regions.
This is the first attempt to apply attention mechanisms across all dimensions of a video.



Published In

Knowledge-Based Systems, Volume 296, Issue C
Jul 2024
972 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 19 July 2024

Author Tags

  1. Action recognition
  2. Comprehensive attention
  3. Short-term motion modeling
  4. Long-term motion modeling
  5. Spatial motion modeling

Qualifiers

  • Research-article
