Abstract
Many traditional action recognition methods rely mainly on appearance information to recognize human activities. In the real world, however, activities often take place in complex environments, which makes accurate recognition challenging. An effective way to address this challenge is to excite valuable features from multiple perspectives (e.g., appearance information, temporal relations, and channel relations). Based on this idea, we propose a Group Excitation (GE) block that excites features from different perspectives along different channel groups in parallel. The GE block enhances the ability to capture complementary information, including temporal and spatial context, while maintaining a relatively low computational cost. In particular, we design a set of excitation paths whose axial contexts are dynamically aggregated from the other axes to contextualize the feature channel groups. We equip ResNet-50 with GE blocks to form GENet, a simple yet effective network with limited extra computational cost. GENet captures contextual information from multiple perspectives, making it more robust when recognizing human activities in complex environments. Extensive experiments on Something-Something V1, V2, and UCF101 show that GENet achieves competitive performance.
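To make the grouped-excitation idea concrete, here is a minimal PyTorch sketch of a block in this style: the channel axis is split into three parallel groups, and each group is gated by context aggregated from the other axes (channel, temporal, and spatial, mirroring the perspectives named in the abstract). The class name GroupExcitation and all layer choices below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GroupExcitation(nn.Module):
    """Illustrative group-excitation style block (hypothetical sketch).

    Splits the channel axis of a video feature map (N, C, T, H, W) into
    three parallel groups and gates each group with context aggregated
    from the other axes: channel, temporal, and spatial excitation.
    Layer choices and hyper-parameters are assumptions for illustration,
    not the paper's exact design.
    """

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 3 == 0, "channels must split into 3 groups"
        c = channels // 3
        # Channel path: squeeze-and-excitation style gate.
        self.channel_fc = nn.Sequential(
            nn.Linear(c, max(c // 4, 1)),
            nn.ReLU(inplace=True),
            nn.Linear(max(c // 4, 1), c),
            nn.Sigmoid(),
        )
        # Temporal path: depthwise 1D conv along the time axis.
        self.temporal_conv = nn.Conv1d(c, c, kernel_size=3, padding=1, groups=c)
        # Spatial path: depthwise 2D conv over the spatial axes.
        self.spatial_conv = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, t, h, w = x.shape
        g1, g2, g3 = x.chunk(3, dim=1)  # three parallel channel groups
        c = g1.shape[1]

        # Channel context: pool over (T, H, W), gate each channel.
        gate = self.channel_fc(g1.mean(dim=(2, 3, 4)))      # (N, c)
        y1 = g1 * gate.view(n, c, 1, 1, 1)

        # Temporal context: pool over (H, W), excite along time.
        t_ctx = self.temporal_conv(g2.mean(dim=(3, 4)))     # (N, c, T)
        y2 = g2 * torch.sigmoid(t_ctx).view(n, c, t, 1, 1)

        # Spatial context: pool over T, excite over space.
        s_ctx = self.spatial_conv(g3.mean(dim=2))           # (N, c, H, W)
        y3 = g3 * torch.sigmoid(s_ctx).view(n, c, 1, h, w)

        return torch.cat([y1, y2, y3], dim=1)               # (N, C, T, H, W)


# Minimal usage: an 8-frame clip feature from an intermediate ResNet stage.
block = GroupExcitation(channels=96)
feat = torch.randn(2, 96, 8, 14, 14)  # (batch, channels, frames, height, width)
out = block(feat)                     # same shape as the input
```

Because each path uses only pooling, a small MLP, or a depthwise convolution on one channel group, a block of this form adds little overhead per ResNet stage, consistent with the abstract's claim of limited extra computational cost.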
References
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Chen, Y., Kalantidis, Y., Li, J., Yan, S., Feng, J.: A²-Nets: double attention networks. Adv. Neural Inf. Process. Syst. 31 (2018)
Fan, L., et al.: RubiksNet: learnable 3D-shift for efficient video action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12364, pp. 505–521. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58529-7_30
Goyal, R., et al.: The “something something” video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850 (2017)
Hao, Y., Zhang, H., Ngo, C.W., He, X.: Group contextualization for video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 928–938 (2022)
He, D., et al.: StNet: local and global spatial-temporal modeling for action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8401–8408 (2019)
Jiang, B., Wang, M., Gan, W., Wu, W., Yan, J.: STM: spatiotemporal and motion encoding for action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2000–2009 (2019)
Jiang, Z., Zhang, Y., Hu, S.: ESTI: an action recognition network with enhanced spatio-temporal information. Int. J. Mach. Learn. Cybern. 14(9), 3059–3070 (2023)
Li, X., Wang, Y., Zhou, Z., Qiao, Y.: SmallBigNet: integrating core and contextual views for video classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1092–1101 (2020)
Li, X., Xie, M., Zhang, Y., Ding, G., Tong, W.: Dual attention convolutional network for action recognition. IET Image Process. 14(6), 1059–1065 (2020)
Li, X., Shuai, B., Tighe, J.: Directional temporal modeling for action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 275–291. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_17
Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., Wang, L.: TEA: temporal excitation and aggregation for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918 (2020)
Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093 (2019)
Liu, Y., Yuan, J., Tu, Z.: Motion-driven visual tempo learning for video-based action recognition. IEEE Trans. Image Process. 31, 4104–4116 (2022)
Liu, Z., et al.: TEINet: towards an efficient architecture for video recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11669–11676 (2020)
Liu, Z., Wang, L., Wu, W., Qian, C., Lu, T.: TAM: temporal adaptive module for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13708–13718 (2021)
Luo, C., Yuille, A.L.: Grouped spatial-temporal aggregation for efficient action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5512–5521 (2019)
Mahdisoltani, F., Berger, G., Gharbieh, W., Fleet, D., Memisevic, R.: On the effectiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235 (2018)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
Ryu, S., Hong, S., Lee, S.: Making TSM better: preserving foundational philosophy for efficient action recognition. ICT Express (2023)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
Wang, H., Tran, D., Torresani, L., Feiszli, M.: Video modeling with correlation networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 352–361 (2020)
Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
Wang, X., Gupta, A.: Videos as space-time region graphs. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 413–431. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_25
Wang, Z., She, Q., Smolic, A.: ACTION-Net: multipath excitation for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13214–13223 (2021)
Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_19
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, X., Yang, W., Cui, Z. (2025). Action Recognition Based on Multi-perspective Feature Excitation. In: Hadfi, R., Anthony, P., Sharma, A., Ito, T., Bai, Q. (eds.) PRICAI 2024: Trends in Artificial Intelligence. Lecture Notes in Computer Science, vol. 15283. Springer, Singapore. https://doi.org/10.1007/978-981-96-0122-6_18
DOI: https://doi.org/10.1007/978-981-96-0122-6_18
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0121-9
Online ISBN: 978-981-96-0122-6
eBook Packages: Computer Science (R0)