Abstract
Many traditional action recognition methods rely mainly on appearance information to recognize human activities. In the real world, however, activities often take place in complex environments, which makes accurate recognition challenging. An effective way to address this challenge is to excite valuable features from multiple perspectives (e.g., appearance information, temporal relations, and channel relations). Based on this idea, we propose a Group Excitation (GE) block that excites features from different perspectives along different channel groups in parallel. The GE block enhances the ability to capture complementary information, including temporal and spatial context, while maintaining a relatively low computational cost. In particular, we design a set of excitation paths whose axial contexts are dynamically aggregated from the other axes to contextualize the feature channel groups. We equip ResNet-50 with GE blocks to form GENet, a simple yet effective network with limited extra computational cost. GENet captures contextual information from multiple perspectives, making it more robust when recognizing human activities in complex environments. Extensive experiments on Something-Something V1, V2, and UCF101 show that GENet achieves competitive performance.
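To make the grouped-excitation idea concrete, here is a minimal PyTorch sketch of a block in this style: the channel axis is split into three parallel groups, and each group is gated by context aggregated from the other axes (channel, temporal, and spatial, mirroring the perspectives named in the abstract). The class name GroupExcitation and all layer choices below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GroupExcitation(nn.Module):
    """Illustrative group-excitation style block (hypothetical sketch).

    Splits the channel axis of a video feature map (N, C, T, H, W) into
    three parallel groups and gates each group with context aggregated
    from the other axes: channel, temporal, and spatial excitation.
    Layer choices and hyper-parameters are assumptions for illustration,
    not the paper's exact design.
    """

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 3 == 0, "channels must split into 3 groups"
        c = channels // 3
        # Channel path: squeeze-and-excitation style gate.
        self.channel_fc = nn.Sequential(
            nn.Linear(c, max(c // 4, 1)),
            nn.ReLU(inplace=True),
            nn.Linear(max(c // 4, 1), c),
            nn.Sigmoid(),
        )
        # Temporal path: depthwise 1D conv along the time axis.
        self.temporal_conv = nn.Conv1d(c, c, kernel_size=3, padding=1, groups=c)
        # Spatial path: depthwise 2D conv over the spatial axes.
        self.spatial_conv = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, t, h, w = x.shape
        g1, g2, g3 = x.chunk(3, dim=1)  # three parallel channel groups
        c = g1.shape[1]

        # Channel context: pool over (T, H, W), gate each channel.
        gate = self.channel_fc(g1.mean(dim=(2, 3, 4)))      # (N, c)
        y1 = g1 * gate.view(n, c, 1, 1, 1)

        # Temporal context: pool over (H, W), excite along time.
        t_ctx = self.temporal_conv(g2.mean(dim=(3, 4)))     # (N, c, T)
        y2 = g2 * torch.sigmoid(t_ctx).view(n, c, t, 1, 1)

        # Spatial context: pool over T, excite over space.
        s_ctx = self.spatial_conv(g3.mean(dim=2))           # (N, c, H, W)
        y3 = g3 * torch.sigmoid(s_ctx).view(n, c, 1, h, w)

        return torch.cat([y1, y2, y3], dim=1)               # (N, C, T, H, W)


# Minimal usage: an 8-frame clip feature from an intermediate ResNet stage.
block = GroupExcitation(channels=96)
feat = torch.randn(2, 96, 8, 14, 14)  # (batch, channels, frames, height, width)
out = block(feat)                     # same shape as the input
```

Because each path uses only pooling, a small MLP, or a depthwise convolution on one channel group, a block of this form adds little overhead per ResNet stage, consistent with the abstract's claim of limited extra computational cost.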
References
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Chen, Y., Kalantidis, Y., Li, J., Yan, S., Feng, J.: A²-Nets: double attention networks. Adv. Neural Inf. Process. Syst. 31 (2018)
Fan, L., et al.: RubiksNet: learnable 3D-shift for efficient video action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12364, pp. 505–521. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58529-7_30
Goyal, R., et al.: The “something something” video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850 (2017)
Hao, Y., Zhang, H., Ngo, C.W., He, X.: Group contextualization for video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 928–938 (2022)
He, D., et al.: StNet: local and global spatial-temporal modeling for action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8401–8408 (2019)
Jiang, B., Wang, M., Gan, W., Wu, W., Yan, J.: STM: spatiotemporal and motion encoding for action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2000–2009 (2019)
Jiang, Z., Zhang, Y., Hu, S.: ESTI: an action recognition network with enhanced spatio-temporal information. Int. J. Mach. Learn. Cybern. 14(9), 3059–3070 (2023)
Li, X., Wang, Y., Zhou, Z., Qiao, Y.: SmallBigNet: integrating core and contextual views for video classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1092–1101 (2020)
Li, X., Xie, M., Zhang, Y., Ding, G., Tong, W.: Dual attention convolutional network for action recognition. IET Image Process. 14(6), 1059–1065 (2020)
Li, X., Shuai, B., Tighe, J.: Directional temporal modeling for action recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 275–291. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_17
Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., Wang, L.: TEA: temporal excitation and aggregation for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918 (2020)
Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093 (2019)
Liu, Y., Yuan, J., Tu, Z.: Motion-driven visual tempo learning for video-based action recognition. IEEE Trans. Image Process. 31, 4104–4116 (2022)
Liu, Z., et al.: TEINet: towards an efficient architecture for video recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11669–11676 (2020)
Liu, Z., Wang, L., Wu, W., Qian, C., Lu, T.: TAM: temporal adaptive module for video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13708–13718 (2021)
Luo, C., Yuille, A.L.: Grouped spatial-temporal aggregation for efficient action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5512–5521 (2019)
Mahdisoltani, F., Berger, G., Gharbieh, W., Fleet, D., Memisevic, R.: On the effectiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235 (2018)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
Ryu, S., Hong, S., Lee, S.: Making TSM better: preserving foundational philosophy for efficient action recognition. ICT Express (2023)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
Wang, H., Tran, D., Torresani, L., Feiszli, M.: Video modeling with correlation networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 352–361 (2020)
Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
Wang, X., Gupta, A.: Videos as space-time region graphs. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 413–431. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_25
Wang, Z., She, Q., Smolic, A.: ACTION-Net: multipath excitation for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13214–13223 (2021)
Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_19
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, X., Yang, W., Cui, Z. (2025). Action Recognition Based on Multi-perspective Feature Excitation. In: Hadfi, R., Anthony, P., Sharma, A., Ito, T., Bai, Q. (eds.) PRICAI 2024: Trends in Artificial Intelligence. Lecture Notes in Computer Science, vol. 15283. Springer, Singapore. https://doi.org/10.1007/978-981-96-0122-6_18
DOI: https://doi.org/10.1007/978-981-96-0122-6_18
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0121-9
Online ISBN: 978-981-96-0122-6
eBook Packages: Computer Science (R0)