
MMTSA: Multi-Modal Temporal Segment Attention Network for Efficient Human Activity Recognition

Published: 27 September 2023

Abstract

Multimodal sensors provide complementary information for developing accurate machine-learning methods for human activity recognition (HAR), but they introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs), called the Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first transforms IMU sensor data into temporal- and structure-preserving gray-scale images using the Gramian Angular Field (GAF), representing the inherent properties of human activities. It then applies a multimodal sparse sampling method to reduce data redundancy, and finally adopts an inter-segment attention module for efficient multimodal fusion. We evaluated MMTSA's effectiveness and efficiency on three well-established public datasets. Results show that our method achieves superior performance (an 11.13% improvement in cross-subject F1-score on the MMAct dataset) over previous state-of-the-art (SOTA) methods. The ablation study and analysis demonstrate MMTSA's effectiveness in fusing multimodal data for accurate HAR. An efficiency evaluation on an edge device showed that MMTSA achieves higher accuracy, lower computational load, and lower inference latency than SOTA methods.
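
For concreteness, here is a minimal NumPy sketch of the two preprocessing ideas the abstract names: the Gramian Angular Field (GAF) encoding, which turns a 1-D IMU channel into a temporal- and structure-preserving gray-scale image, and TSN-style sparse segment sampling. This is an illustrative sketch under assumed details (min-max normalization to [-1, 1], one random index per segment, and made-up function names), not the paper's exact implementation.

    import numpy as np

    def gaf_image(x):
        """Gramian Angular Summation Field of a 1-D sensor channel.
        x: shape (T,). Returns a (T, T) gray-scale image in [-1, 1]."""
        # Rescale to [-1, 1] so the polar angle arccos(x) is defined.
        x = 2.0 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1.0
        x = np.clip(x, -1.0, 1.0)
        # GASF[i, j] = cos(phi_i + phi_j) with phi = arccos(x), which
        # expands to x_i * x_j - sqrt(1 - x_i^2) * sqrt(1 - x_j^2).
        s = np.sqrt(1.0 - x ** 2)
        return np.outer(x, x) - np.outer(s, s)

    def sparse_sample(seq_len, num_segments, rng):
        """TSN-style sparse sampling: split a sequence into equal temporal
        segments and draw one index per segment to reduce redundancy."""
        bounds = np.linspace(0, seq_len, num_segments + 1, dtype=int)
        return np.array([rng.integers(lo, hi)
                         for lo, hi in zip(bounds[:-1], bounds[1:])])

    # Example: a 3-axis accelerometer window -> three GAF image channels,
    # plus 8 sparsely sampled frame indices for the paired RGB clip.
    rng = np.random.default_rng(0)
    imu = rng.standard_normal((128, 3))
    gaf = np.stack([gaf_image(imu[:, k]) for k in range(3)])  # (3, 128, 128)
    frames = sparse_sample(seq_len=300, num_segments=8, rng=rng)

Because each row and column of a GAF image follows temporal order, ordinary 2-D image backbones can be reused on IMU data, which is what makes segment-level fusion with the RGB stream straightforward.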




        Published In

        Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 7, Issue 3
        September 2023
        1734 pages
        EISSN: 2474-9567
        DOI: 10.1145/3626192
        This work is licensed under a Creative Commons Attribution 4.0 International License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 27 September 2023
        Published in IMWUT Volume 7, Issue 3

        Author Tags

        1. Human activity recognition
        2. edge computing
        3. multimodal sensing
        4. neural network
        5. ubiquitous computing

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Funding Sources

        • Tsinghua University Initiative Scientific Research Program
        • Institute for Artificial Intelligence, Tsinghua University
        • Natural Science Foundation of China
        • Young Elite Scientists Sponsorship Program by CAST
        • Beijing Key Lab of Networked Multimedia
        • Beijing National Research Center for Information Science and Technology (BNRist)


        Article Metrics

        • Downloads (last 12 months): 792
        • Downloads (last 6 weeks): 70
        Reflects downloads up to 22 Sep 2024

        Cited By
        • (2024) G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-33. https://doi.org/10.1145/3659623. Online publication date: 15-May-2024.
        • (2024) The EarSAVAS Dataset. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-26. https://doi.org/10.1145/3659616. Online publication date: 15-May-2024.
        • (2024) CrossHAR: Generalizing Cross-dataset Human Activity Recognition via Hierarchical Self-Supervised Pretraining. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-26. https://doi.org/10.1145/3659597. Online publication date: 15-May-2024.
        • (2024) RepMobile: A MobileNet-Like Network With Structural Reparameterization for Sensor-Based Human Activity Recognition. IEEE Sensors Journal 24(15), 24224-24237. https://doi.org/10.1109/JSEN.2024.3412736. Online publication date: 1-Aug-2024.
        • (2023) Integrating Gaze and Mouse Via Joint Cross-Attention Fusion Net for Students' Activity Recognition in E-learning. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(3), 1-35. https://doi.org/10.1145/3610876. Online publication date: 27-Sep-2023.
