Abstract
Human activity recognition (HAR) in video streams has become a thriving research area in computer vision and pattern recognition. Recognizing activities in real-world video is demanding owing to limited data, variations in motion and style, and cluttered backgrounds. Current HAR approaches primarily apply pre-trained weights of various deep learning (DL) models for the appearance description of frames during the learning phase, which hampers the assessment of feature discrepancies such as the separation between temporal and visual cues. To address this issue, this article introduces a residual deep gated recurrent unit (RD-GRU)-enabled attention framework with a dilated convolutional neural network (DiCNN). The approach specifically targets salient information in the input video frames to recognize distinct activities. The DiCNN captures the crucial, discriminative features; within this network, skip connections are employed so that the updated information retains more knowledge than a shallow layer alone. These features are then fed into an attention module that captures additional high-level discriminative action patterns and cues. The attention mechanism is followed by an RD-GRU that learns long video sequences to further enhance performance. Accuracy, precision, recall, and F1-score are used to evaluate the proposed model on four diverse benchmark datasets: UCF11, UCF Sports, JHMDB, and THUMOS, on which it achieves accuracies of 98.54%, 99.31%, 82.47%, and 95.23%, respectively. These results demonstrate the effectiveness of the proposed work compared with state-of-the-art (SOTA) methods.
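For illustration only, the minimal PyTorch-style sketch below shows how a pipeline of this kind could be wired together: per-frame dilated convolutions with skip connections, a frame-level soft attention module, and a GRU with a residual path feeding a classifier. The class names (DilatedBlock, RDGRUAttentionHAR), layer sizes, and the simple attention form are assumptions made for exposition, not the authors' implementation.

# Minimal sketch (assumed layer sizes and names, not the authors' code):
# dilated conv blocks with skip connections -> temporal attention -> residual GRU.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedBlock(nn.Module):
    """Dilated conv block with a skip connection to retain shallow-layer detail."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return F.relu(x + self.bn(self.conv(x)))  # residual (skip) connection

class RDGRUAttentionHAR(nn.Module):
    """Per-frame DiCNN features -> temporal attention -> residual GRU -> class scores."""
    def __init__(self, num_classes, channels=64, hidden=256):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3)
        self.dicnn = nn.Sequential(DilatedBlock(channels, 1),
                                   DilatedBlock(channels, 2),
                                   DilatedBlock(channels, 4))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attn = nn.Linear(channels, 1)        # frame-level attention scores
        self.gru = nn.GRU(channels, hidden, batch_first=True)
        self.proj = nn.Linear(channels, hidden)   # residual path around the GRU
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)               # (B*T, 3, H, W)
        feats = self.pool(self.dicnn(self.stem(frames))).flatten(1)
        feats = feats.view(b, t, -1)              # (B, T, C)
        weights = torch.softmax(self.attn(feats), dim=1)
        feats = feats * weights                   # emphasize informative frames
        out, _ = self.gru(feats)
        out = out + self.proj(feats)              # residual connection over the GRU
        return self.fc(out[:, -1])                # classify from the last time step

if __name__ == "__main__":
    model = RDGRUAttentionHAR(num_classes=11)         # e.g., 11 classes for UCF11
    logits = model(torch.randn(2, 16, 3, 112, 112))   # 2 clips of 16 frames each
    print(logits.shape)                               # torch.Size([2, 11])

In such a sketch, the dilated convolutions enlarge the receptive field without pooling, the skip connections let shallow detail reach deeper layers, and the residual path around the GRU eases learning over long frame sequences.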
Data availability
The data supporting the findings of this manuscript are available at https://www.crcv.ucf.edu/data/UCF_YouTube_Action.php for the UCF11 dataset, https://www.crcv.ucf.edu/data/UCF_Sports_Action.php for the UCF Sports Action dataset, and http://jhmdb.is.tue.mpg.de/dataset for the JHMDB dataset.
Funding
No funding was received from any source for the research reported in this manuscript.
Author information
Authors and Affiliations
Contributions
AP designed the experimental setup platform, conducted the experiments, and performed the statistical analysis, whereas PiK wrote the abstract and literature survey. AP and PiK wrote the first draft of the manuscript and contributed to the investigation and framing of the results. PiK edited the first draft of this paper. Both authors participated in reviewing and approving the final version of the manuscript. Ajeet Pandey and Piyush Kumar contributed equally to this work.
Corresponding author
Ethics declarations
Conflict of interest
Both authors declare that they have no conflict of interest concerning the research, authorship, and/or publication of this article.
Ethical approval
This article does not contain any studies with human participants performed by the authors. Therefore, this section does not apply to this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Pandey, A., Kumar, P. Residual deep gated recurrent unit-based attention framework for human activity recognition by exploiting dilated features. Vis Comput 40, 8693–8712 (2024). https://doi.org/10.1007/s00371-024-03266-w