
MMTSA: Multi-Modal Temporal Segment Attention Network for Efficient Human Activity Recognition

Published: 27 September 2023

Abstract

Multimodal sensors provide complementary information for developing accurate machine-learning methods for human activity recognition (HAR), but they introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs), called the Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first transforms IMU sensor data into temporal- and structure-preserving gray-scale images using the Gramian Angular Field (GAF), representing the inherent properties of human activities. It then applies a multimodal sparse sampling method to reduce data redundancy, and finally adopts an inter-segment attention module for efficient multimodal fusion. We evaluated MMTSA's effectiveness and efficiency on three well-established public datasets. Results show that our method achieves superior performance (an 11.13% improvement in cross-subject F1-score on the MMAct dataset) over previous state-of-the-art (SOTA) methods. The ablation study and analysis demonstrate MMTSA's effectiveness in fusing multimodal data for accurate HAR. An efficiency evaluation on an edge device showed that MMTSA achieves higher accuracy, lower computational load, and lower inference latency than SOTA methods.
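
For concreteness, here is a minimal NumPy sketch of the two preprocessing ideas the abstract names: the Gramian Angular Field (GAF) encoding, which turns a 1-D IMU channel into a temporal- and structure-preserving gray-scale image, and TSN-style sparse segment sampling. This is an illustrative sketch under assumed details (min-max normalization to [-1, 1], one random index per segment, and made-up function names), not the paper's exact implementation.

    import numpy as np

    def gaf_image(x):
        """Gramian Angular Summation Field of a 1-D sensor channel.
        x: shape (T,). Returns a (T, T) gray-scale image in [-1, 1]."""
        # Rescale to [-1, 1] so the polar angle arccos(x) is defined.
        x = 2.0 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1.0
        x = np.clip(x, -1.0, 1.0)
        # GASF[i, j] = cos(phi_i + phi_j) with phi = arccos(x), which
        # expands to x_i * x_j - sqrt(1 - x_i^2) * sqrt(1 - x_j^2).
        s = np.sqrt(1.0 - x ** 2)
        return np.outer(x, x) - np.outer(s, s)

    def sparse_sample(seq_len, num_segments, rng):
        """TSN-style sparse sampling: split a sequence into equal temporal
        segments and draw one index per segment to reduce redundancy."""
        bounds = np.linspace(0, seq_len, num_segments + 1, dtype=int)
        return np.array([rng.integers(lo, hi)
                         for lo, hi in zip(bounds[:-1], bounds[1:])])

    # Example: a 3-axis accelerometer window -> three GAF image channels,
    # plus 8 sparsely sampled frame indices for the paired RGB clip.
    rng = np.random.default_rng(0)
    imu = rng.standard_normal((128, 3))
    gaf = np.stack([gaf_image(imu[:, k]) for k in range(3)])  # (3, 128, 128)
    frames = sparse_sample(seq_len=300, num_segments=8, rng=rng)

Because each row and column of a GAF image follows temporal order, ordinary 2-D image backbones can be reused on IMU data, which is what makes segment-level fusion with the RGB stream straightforward.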




        Published In

        Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 7, Issue 3
        September 2023
        1734 pages
        EISSN: 2474-9567
        DOI: 10.1145/3626192
        This work is licensed under a Creative Commons Attribution 4.0 International License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 27 September 2023
        Published in IMWUT Volume 7, Issue 3

        Author Tags

        1. Human activity recognition
        2. edge computing
        3. multimodal sensing
        4. neural network
        5. ubiquitous computing

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Funding Sources

        • Tsinghua University Initiative Scientific Research Program
        • Institute for Artificial Intelligence, Tsinghua University
        • Natural Science Foundation of China
        • Young Elite Scientists Sponsorship Program by CAST
        • Beijing Key Lab of Networked Multimedia
        • Beijing National Research Center for Information Science and Technology (BNRist)


        Article Metrics

        • Downloads (last 12 months): 792
        • Downloads (last 6 weeks): 70
        Reflects downloads up to 22 Sep 2024

        Cited By
        • (2024) G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-33. https://doi.org/10.1145/3659623. Online publication date: 15-May-2024.
        • (2024) The EarSAVAS Dataset. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-26. https://doi.org/10.1145/3659616. Online publication date: 15-May-2024.
        • (2024) CrossHAR: Generalizing Cross-dataset Human Activity Recognition via Hierarchical Self-Supervised Pretraining. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-26. https://doi.org/10.1145/3659597. Online publication date: 15-May-2024.
        • (2024) RepMobile: A MobileNet-Like Network With Structural Reparameterization for Sensor-Based Human Activity Recognition. IEEE Sensors Journal 24(15), 24224-24237. https://doi.org/10.1109/JSEN.2024.3412736. Online publication date: 1-Aug-2024.
        • (2023) Integrating Gaze and Mouse Via Joint Cross-Attention Fusion Net for Students' Activity Recognition in E-learning. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(3), 1-35. https://doi.org/10.1145/3610876. Online publication date: 27-Sep-2023.
