Abstract
Human activity recognition (HAR) in video streams has become a thriving research area in computer vision and pattern recognition. Recognizing activities in real-world video is demanding owing to limited data, variations in motion and style, and cluttered backgrounds. Current HAR approaches primarily apply pre-trained weights of various deep learning (DL) models for the appearance description of frames during the learning phase, which hampers the assessment of feature discrepancies such as the separation between temporal and visual cues. To address this issue, this article introduces a residual deep gated recurrent unit (RD-GRU)-enabled attention framework with a dilated convolutional neural network (DiCNN). The approach specifically targets salient information in the input video frames to recognize distinct activities. The DiCNN captures the crucial, discriminative features; within this network, skip connections are employed so that the updated information retains more knowledge than a shallow layer alone. These features are then fed into an attention module that captures additional high-level discriminative action patterns and cues. The attention mechanism is followed by an RD-GRU that learns long video sequences to further enhance performance. Accuracy, precision, recall, and F1-score are used to evaluate the proposed model on four diverse benchmark datasets: UCF11, UCF Sports, JHMDB, and THUMOS, on which it achieves accuracies of 98.54%, 99.31%, 82.47%, and 95.23%, respectively. These results demonstrate the effectiveness of the proposed work compared with state-of-the-art (SOTA) methods.
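For illustration only, the minimal PyTorch-style sketch below shows how a pipeline of this kind could be wired together: per-frame dilated convolutions with skip connections, a frame-level soft attention module, and a GRU with a residual path feeding a classifier. The class names (DilatedBlock, RDGRUAttentionHAR), layer sizes, and the simple attention form are assumptions made for exposition, not the authors' implementation.

# Minimal sketch (assumed layer sizes and names, not the authors' code):
# dilated conv blocks with skip connections -> temporal attention -> residual GRU.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedBlock(nn.Module):
    """Dilated conv block with a skip connection to retain shallow-layer detail."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return F.relu(x + self.bn(self.conv(x)))  # residual (skip) connection

class RDGRUAttentionHAR(nn.Module):
    """Per-frame DiCNN features -> temporal attention -> residual GRU -> class scores."""
    def __init__(self, num_classes, channels=64, hidden=256):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3)
        self.dicnn = nn.Sequential(DilatedBlock(channels, 1),
                                   DilatedBlock(channels, 2),
                                   DilatedBlock(channels, 4))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attn = nn.Linear(channels, 1)        # frame-level attention scores
        self.gru = nn.GRU(channels, hidden, batch_first=True)
        self.proj = nn.Linear(channels, hidden)   # residual path around the GRU
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)               # (B*T, 3, H, W)
        feats = self.pool(self.dicnn(self.stem(frames))).flatten(1)
        feats = feats.view(b, t, -1)              # (B, T, C)
        weights = torch.softmax(self.attn(feats), dim=1)
        feats = feats * weights                   # emphasize informative frames
        out, _ = self.gru(feats)
        out = out + self.proj(feats)              # residual connection over the GRU
        return self.fc(out[:, -1])                # classify from the last time step

if __name__ == "__main__":
    model = RDGRUAttentionHAR(num_classes=11)         # e.g., 11 classes for UCF11
    logits = model(torch.randn(2, 16, 3, 112, 112))   # 2 clips of 16 frames each
    print(logits.shape)                               # torch.Size([2, 11])

In such a sketch, the dilated convolutions enlarge the receptive field without pooling, the skip connections let shallow detail reach deeper layers, and the residual path around the GRU eases learning over long frame sequences.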
Data availability
The data supporting the findings of this manuscript are available at https://www.crcv.ucf.edu/data/UCF_YouTube_Action.php for the UCF11 dataset, https://www.crcv.ucf.edu/data/UCF_Sports_Action.php for the UCF Sports Action dataset, and http://jhmdb.is.tue.mpg.de/dataset for the JHMDB dataset.
Funding
No funding was received from any source for the research reported in this manuscript.
Author information
Authors and Affiliations
Contributions
AP designed the experimental setup platform, conducted the experiments, and performed the statistical analysis, whereas PiK wrote the abstract and literature survey. AP and PiK wrote the first draft of the manuscript and contributed to the investigation and framing of the results. PiK edited the first draft of this paper. Both authors participated in reviewing and approving the final version of the manuscript. Ajeet Pandey and Piyush Kumar contributed equally to this work.
Corresponding author
Ethics declarations
Conflict of interest
Both authors declare that they have no conflict of interest concerning the research, authorship, and/or publication of this article.
Ethical approval
This article does not contain any studies with human participants performed by the authors. Therefore, this section does not apply to this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Pandey, A., Kumar, P. Residual deep gated recurrent unit-based attention framework for human activity recognition by exploiting dilated features. Vis Comput 40, 8693–8712 (2024). https://doi.org/10.1007/s00371-024-03266-w