
DTR-HAR: deep temporal residual representation for human activity recognition

Published: 01 March 2022

Abstract

Human activity recognition (HAR) is a highly valued application in the pattern recognition and computer vision fields. Deep neural networks have recently attracted considerable attention in computer vision and image processing and have produced significant results. In this paper, we propose a deep temporal residual system for daily living activity recognition that aims to enhance spatiotemporal feature representation and thereby improve the performance of the HAR system. To this end, we adopt a deep residual convolutional neural network (RCN) to retain discriminative visual features related to appearance, and a long short-term memory (LSTM) network to capture the long-term temporal evolution of actions. The LSTM models the time dependencies that arise while an activity is carried out; it enriches the features extracted by the RCN with temporal information and casts dynamic activity recognition as a sequence labeling task. The deep temporal residual model for human activity recognition is evaluated on two publicly available benchmark datasets, MSRDailyActivity3D and CAD-60, and achieves very competitive results compared with the state of the art.
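The architecture pairs a residual CNN that encodes each frame with an LSTM that models the temporal order of those frame encodings. The sketch below illustrates that general CNN-LSTM pattern in PyTorch; the ResNet-50 backbone, the LSTM width, and the 16-class output (e.g. the 16 activities of MSRDailyActivity3D) are assumptions made for illustration, not the authors' published configuration.

import torch
import torch.nn as nn
from torchvision import models


class CNNLSTMClassifier(nn.Module):
    """Per-frame residual CNN features followed by an LSTM over time."""

    def __init__(self, num_classes: int, hidden_size: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # assumed residual backbone, not the paper's exact RCN
        backbone.fc = nn.Identity()               # keep the 2048-d pooled feature vector per frame
        self.backbone = backbone
        self.lstm = nn.LSTM(input_size=2048, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.view(b * t, c, h, w)).view(b, t, -1)  # per-frame features
        _, (h_n, _) = self.lstm(feats)        # final hidden state summarizes the sequence
        return self.classifier(h_n[-1])       # (batch, num_classes)


if __name__ == "__main__":
    model = CNNLSTMClassifier(num_classes=16)  # e.g. the 16 activities of MSRDailyActivity3D
    clip = torch.randn(2, 8, 3, 224, 224)      # two clips of eight RGB frames
    print(model(clip).shape)                   # torch.Size([2, 16])

In practice a pretrained backbone or a deeper (possibly bidirectional) LSTM could be substituted without changing this overall spatial-then-temporal pattern.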

        Published In

        The Visual Computer: International Journal of Computer Graphics  Volume 38, Issue 3
        Mar 2022
        403 pages

        Publisher

        Springer-Verlag

        Berlin, Heidelberg

        Publication History

        Published: 01 March 2022
        Accepted: 05 January 2021

        Author Tags

        1. Daily living activity recognition
        2. Convolutional neural network (CNN)
        3. Long short-term memory (LSTM)
        4. Video surveillance

        Qualifiers

        • Research-article

        Cited By

        • (2024) Patch excitation network for boxless action recognition in still images. The Visual Computer: International Journal of Computer Graphics 40(6), 4099–4113. DOI: 10.1007/s00371-023-03071-x. Online publication date: 1 Jun 2024.
        • (2023) Effective framework for human action recognition in thermal images using capsnet technique. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 45(6), 11737–11755. DOI: 10.3233/JIFS-230505. Online publication date: 1 Jan 2023.
        • (2023) Spatiotemporal Self-Attention Mechanism Driven by 3D Pose to Guide RGB Cues for Daily Living Human Activity Recognition. Journal of Intelligent and Robotic Systems 109(1). DOI: 10.1007/s10846-023-01926-y. Online publication date: 17 Aug 2023.
        • (2023) Implementation of Parallel Evolutionary Convolutional Neural Network for Classification in Human Activity and Image Recognition. Advances in Computational Intelligence, 327–345. DOI: 10.1007/978-3-031-47765-2_24. Online publication date: 13 Nov 2023.
        • (2022) A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer: International Journal of Computer Graphics 38(8), 2939–2970. DOI: 10.1007/s00371-021-02166-7. Online publication date: 1 Aug 2022.
        • (2022) Hybrid UNET Model Segmentation for an Early Breast Cancer Detection Using Ultrasound Images. Computational Collective Intelligence, 464–476. DOI: 10.1007/978-3-031-16014-1_37. Online publication date: 28 Sep 2022.
