1. Introduction
With the emergence of the Internet of Medical Things (IoMT), the continuous monitoring of patients has become increasingly accessible in everyday life [1,2,3]. IoMT integrates IoT communication protocols with medical equipment and systems, enabling remote, real-time, and intelligent patient monitoring and treatment [4,5]. Physicians are able to treat more patients with real-time patient monitoring, and patients are reassured that someone is always watching out for them. Rapid improvements in wearable technologies have helped to develop intelligent, real-time healthcare services, including Parkinson’s disease monitoring, Alzheimer’s disease monitoring, and fall detection [6,7,8,9]. Wearable technologies can detect physiological states immediately and accurately, but some acute and dormant illnesses, such as lumbago and neuralgia, remain difficult to detect or prohibitively expensive to treat [10].
Computer vision (CV) methods can uncover such symptoms as part of standard medical assessment, provided that comfort and functionality are taken into account [11]. Real-time patient monitoring systems can use human action recognition (HAR) as a context-aware application. With HAR in smart healthcare environments, actions can be recognized from visual data as well as from sensor data, such as Microsoft Kinect skeleton data. For real-time operation, HAR models need to be deployed on GPU-enabled edge devices, such as the Jetson Nano, Jetson TX2, and Jetson AGX Xavier.
Most hospitals, clinics, and healthcare centers today have video cameras that can be used to monitor patients. However, monitoring patients and manually detecting their conditions in real time is time-consuming and expensive, and informing the appropriate authority in the event of an emergency can take additional time. An automated action recognition system in an intelligent healthcare environment, on the other hand, can do so almost immediately.
An intelligent patient monitoring ecosystem is illustrated in Figure 1. A visual sensor, such as a Microsoft Kinect, collects skeleton points and passes them to the CV module, where an HAR model predicts the action; the CV module can thus recognize actions from sensor data in real time. A router connects the whole system to the cloud so that the authorities can be notified in case of an emergency. In this way, a vision-based real-time patient monitoring system can be realized. In a home surveillance setting, notifications can also be sent to a mobile application. The CV module also includes a storage system for the predicted footage.
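The loop below is a minimal sketch of this ecosystem, assuming a buffered stream of skeleton frames, a trained HAR model, and a notification hook; the `sensor`, `notify`, and `labels` interfaces are hypothetical placeholders, not part of the proposed system.

```python
# Illustrative monitoring loop; sensor, har_model, notify, and labels are
# hypothetical placeholders standing in for the components in Figure 1.
from collections import deque

import numpy as np
import torch

WINDOW = 64                      # number of frames fed to the HAR model at once
EMERGENCY_ACTIONS = {"falling"}  # actions that should trigger an alert

def monitor(sensor, har_model, notify, labels):
    """Read skeleton frames, predict actions, and alert the authority on emergencies."""
    buffer = deque(maxlen=WINDOW)
    for frame in sensor:                              # frame: (V, C) array of 3D joints
        buffer.append(frame)
        if len(buffer) < WINDOW:
            continue
        # Shape the clip as (N, C, T, V, M) = (1, 3, WINDOW, joints, 1 person).
        clip = np.stack(buffer, axis=0)               # (T, V, C)
        clip = clip.transpose(2, 0, 1)[None, ..., None]
        with torch.no_grad():
            logits = har_model(torch.from_numpy(clip).float())
        action = labels[int(logits.argmax(dim=1))]
        if action in EMERGENCY_ACTIONS:
            notify(action)                            # e.g., push a cloud/mobile notification
```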
Skeleton data consist of 3D joint points obtained from motion cameras or pose estimation technology and can be used to analyze human behavior. Since skeleton data have far fewer dimensions than traditional RGB videos, they are a computationally more efficient representation of human dynamics. Furthermore, they are resilient to illumination changes, flickering clips, motion blur, and complex backgrounds [12]. We present a skeleton-based action recognition method that can be used in smart healthcare systems to monitor patients.
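As a concrete illustration of this compactness, the snippet below lays out one skeleton sample in the (C, T, V, M) format commonly used in skeleton-based HAR code; the frame count of 300 is only an example, while the 25 joints and up to 2 bodies follow the NTU RGB+D convention.

```python
# A single skeleton sample as commonly laid out in skeleton-based HAR code:
# C = 3 coordinate channels (x, y, z), T = frames, V = joints, M = bodies.
# NTU RGB+D, for example, provides 25 joints per body and up to 2 bodies.
import numpy as np

C, T, V, M = 3, 300, 25, 2
sample = np.zeros((C, T, V, M), dtype=np.float32)

# Compared with an RGB clip of the same length (300 x 224 x 224 x 3 values),
# the skeleton representation stores only C * T * V * M = 45,000 numbers,
# which is why it is far cheaper to process and robust to appearance changes.
print(sample.size)  # 45000
```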
Different deep learning (DL)-based approaches have been proposed to solve this problem. Traditionally, skeleton points are represented as joint-coordinate vectors and passed to recurrent neural networks (RNNs) [13,14], or pseudo-images constructed from skeleton data are passed to convolutional neural networks (CNNs) [15,16]. The full potential of skeleton points, however, can be exploited when they are represented as graph structures. Graph convolutional networks (GCNs), which perform convolution operations on graphs, have therefore gained considerable attention [17,18].
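The core idea shared by these GCN-based methods is to aggregate each joint's features from its neighbors in the skeleton graph. The sketch below shows a minimal spatial graph convolution of this kind; the exact formulation, normalization, and partitioning strategy differ across the cited works, so this is only the common building block, not any specific published layer.

```python
# Minimal spatial graph convolution over skeleton joints: features at each
# joint are aggregated from its neighbours via a normalized adjacency matrix,
# then mixed across channels with a 1x1 convolution.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        # A: (V, V) adjacency of the skeleton graph; add self-loops and
        # symmetrically normalize it.
        A = A + torch.eye(A.size(0))
        D_inv_sqrt = torch.diag(A.sum(dim=1).pow(-0.5))
        self.register_buffer("A_hat", D_inv_sqrt @ A @ D_inv_sqrt)
        self.theta = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V); aggregate over the joint dimension, then mix channels.
        x = torch.einsum("nctv,vw->nctw", x, self.A_hat)
        return self.theta(x)
```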
In recent years, GCNs have been successfully applied to skeleton-based recognition tasks [19,20]. The existing methods, however, are computationally inefficient, suffering from slow execution speeds and large model sizes. Some methods combine multiple input streams, such as bone and joint data, which makes the models even heavier and restricts their use in real-life applications, including patient monitoring. Modern GCNs construct spatial graphs from the skeleton data and pass them to spatial graph convolutional layers, whose outputs are then passed to a temporal convolutional layer to obtain temporal features. During this sequential feature extraction, the spatial GCN may discard significant temporal features, so information is lost because the spatial and temporal features are not extracted from the same spatio-temporal feature state.
This paper introduces a novel architecture, a redefined spatio-temporal graph convolutional network (STGCN), for skeleton-based HAR, which independently extracts relevant spatial and temporal information, merges them, and detects the action. We propose spatial and temporal adaptive graph convolution operations [20] that extract significant spatial and temporal features independently from skeleton joint data, as illustrated in Figure 2. In the proposed model, spatial and temporal adaptive graph convolutional layers are combined so that spatial and temporal features are extracted from the same spatio-temporal position. A further benefit of our model is that it uses only a single input stream, as opposed to other multi-channel methods, which use multiple input streams. Compared with such methods, in which the same type of layer is applied multiple times to multiple input streams and the results are combined later, our method ensures better feature extraction.
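To make the parallel design more concrete, the block below sketches one way spatial and temporal branches can read the same spatio-temporal state and be merged afterwards. It is not the exact published layer: the adaptive adjacency, temporal kernel size, and fusion by summation are assumptions made purely for illustration.

```python
# Illustrative parallel spatial/temporal block: both branches receive the same
# input state, and their outputs are merged, in contrast to the sequential
# spatial-then-temporal design discussed above.
import torch
import torch.nn as nn

class ParallelSTBlock(nn.Module):
    def __init__(self, in_c, out_c, A, t_kernel=9):
        super().__init__()
        # Learnable (adaptive) adjacency initialized from the skeleton graph.
        self.A = nn.Parameter(A.clone())
        self.spatial = nn.Conv2d(in_c, out_c, kernel_size=1)
        self.temporal = nn.Conv2d(in_c, out_c, kernel_size=(t_kernel, 1),
                                  padding=((t_kernel - 1) // 2, 0))
        self.bn = nn.BatchNorm2d(out_c)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (N, C, T, V). Both branches see the same spatio-temporal state.
        s = self.spatial(torch.einsum("nctv,vw->nctw", x, self.A))  # joint-wise aggregation
        t = self.temporal(x)                                        # frame-wise convolution
        return self.relu(self.bn(s + t))            # merge spatial and temporal features
```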
To demonstrate the effectiveness of our proposed model, we performed extensive experiments on a skeleton-based action recognition dataset, namely the NTU-RGBD [13] dataset, on which our model achieves state-of-the-art results. Furthermore, we demonstrate the applicability of our model in a real-world environment by measuring its performance on edge devices such as the Nvidia Jetson Nano. Due to its low computational complexity, reduced parameter count, and fast processing speed, our model is well suited to dynamic detection and deployment in the real-time monitoring of patients in intelligent healthcare ecosystems.
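A simple way to obtain such edge-device figures is to time repeated forward passes after a warm-up phase, as sketched below; the input shape and repetition count are placeholders, and the same routine can be run unchanged on a workstation GPU or a Jetson board.

```python
# Generic inference-latency measurement usable on an edge device such as a
# Jetson Nano; input_shape and runs are placeholders, not reported settings.
import time
import torch

def measure_latency(model, input_shape=(1, 3, 300, 25, 2), runs=50, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(10):                      # warm-up iterations
            model(x)
        if "cuda" in str(device):
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if "cuda" in str(device):
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs  # average seconds per forward pass
```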
Our main contributions are summarized as follows.
In the context of IoMT, we introduce an efficient spatial and temporal feature extraction framework for HAR, together with a scheme for utilizing the extracted features.
A novel architecture, STGCN, is proposed to enable the independent extraction of spatial and temporal features. Thanks to its efficient feature extraction method, our model extracts spatial and temporal features from joint-level information alone, with a reduced number of parameters.
Finally, we provide a strong framework for skeleton-based HAR. We demonstrate through extensive experimentation and analysis that our models achieve accuracy competitive with state-of-the-art models. The baselines we establish should be useful for future research on skeleton-based HAR and vision-based patient monitoring.
The rest of the paper is organized as follows. Section 2 presents an overview of the related work. Section 3 explains the detailed architecture of STGCN. Details of our experimental setup are described in Section 4. Then, we show the results and analysis of our experiments in Section 5. Finally, our conclusion is provided in Section 6.