Abstract
In this paper, we propose a new attention model that integrates multi-scale features to recognize human actions. We obtain multi-scale features by applying convolution kernels of different sizes in both the spatial and temporal domains. The spatial attention model considers the relationship between the local details and the whole of a human action, so our model can focus on the significant parts of the action in the spatial domain. The temporal attention model considers the speed of the action, so our model can concentrate on the pivotal clips of the action in the temporal domain. We verify the validity of multi-scale features on benchmark action recognition datasets, including UCF-101 (\(88.8\%\)), HMDB-51 (\(60.0\%\)) and Penn (\(96.3\%\)), where the accuracy of our model surpasses that of previous methods.
The first author of this paper is an undergraduate.
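The abstract does not specify implementation details, so the following is only a minimal PyTorch sketch of the general idea: spatial attention built from parallel 2D convolutions of several kernel sizes, and temporal attention built from parallel 1D convolutions over frame features. The kernel sizes (1, 3, 5), the channel counts, and the sum-then-sigmoid/softmax fusion are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch only: kernel sizes, feature dimensions, and the fusion
# scheme are assumptions; the paper's exact model may differ.
import torch
import torch.nn as nn

class MultiScaleSpatialAttention(nn.Module):
    """Builds a spatial attention map from convolutions of several kernel
    sizes, so both fine details and larger context weight each location."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, 1, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):                                  # x: (N, C, H, W)
        scores = torch.stack([b(x) for b in self.branches]).sum(0)
        attn = torch.sigmoid(scores)                       # (N, 1, H, W)
        return x * attn                                    # re-weight locations

class MultiScaleTemporalAttention(nn.Module):
    """Scores each frame with 1D convolutions of several temporal extents,
    so actions performed at different speeds are both covered."""
    def __init__(self, feat_dim, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(feat_dim, 1, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):                                  # x: (N, T, D)
        z = x.transpose(1, 2)                              # (N, D, T)
        scores = torch.stack([b(z) for b in self.branches]).sum(0)  # (N, 1, T)
        weights = torch.softmax(scores.squeeze(1), dim=1)  # attention over time
        return (x * weights.unsqueeze(-1)).sum(1)          # pooled clip feature

# Quick shape check with random data:
frames = torch.randn(2, 64, 14, 14)        # (N, C, H, W) backbone feature maps
clip = torch.randn(2, 16, 128)             # (N, T, D) per-frame descriptors
print(MultiScaleSpatialAttention(64)(frames).shape)   # torch.Size([2, 64, 14, 14])
print(MultiScaleTemporalAttention(128)(clip).shape)   # torch.Size([2, 128])
```

In a two-stream or CNN+RNN pipeline of the kind cited in this line of work, modules like these would typically sit after the backbone: the spatial branch re-weights convolutional feature maps per frame, and the temporal branch pools the resulting per-frame descriptors into one clip-level feature for classification.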
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61773377 and 61573352).
Cite this paper
Zhang, Q., Yan, H., Wang, L. (2019). Multi-scale Spatial-Temporal Attention for Action Recognition. In: Lin, Z., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2019. Lecture Notes in Computer Science, vol. 11857. Springer, Cham. https://doi.org/10.1007/978-3-030-31654-9_3