3.1. Dataset and Experimental Setup
The dataset contributed by Sztyler et al. [8] was adopted to test the proposed methods, because it is up to date and, according to its authors, is the most complete, realistic, and transparent dataset for on-body position detection currently available [8]. This dataset contains the acceleration data of 15 subjects (age 31.9 ± 12.4 years, height 173.1 ± 6.9 cm, and weight 74.1 ± 13.8 kg; eight males and seven females) performing eight activities: climbing stairs down (A1), climbing stairs up (A2), jumping (A3), lying (A4), jogging (A5), standing (A6), sitting (A7), and walking (A8). For each activity, acceleration was recorded simultaneously at seven body positions: chest (P1), forearm (P2), head (P3), shin (P4), thigh (P5), upper arm (P6), and waist (P7). The subjects performed each activity for roughly 10 min, except for jumping (about 1.7 min) owing to its physical exertion. In total, the dataset covers 1065 min of acceleration data for each position and axis, with a sampling rate of 50 Hz. We filtered and reorganized the dataset to make it suitable for training deep learning models; the detailed processing method and the prepared datasets are available online [32].
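To illustrate the reorganization step, the sketch below merges the simultaneous per-position recordings of one subject and activity into a single array. The file names and column labels here are hypothetical assumptions for illustration only; the actual processing pipeline is documented online [32].

```python
import numpy as np
import pandas as pd

# Hypothetical layout: one CSV per body position, with columns
# attr_time, attr_x, attr_y, attr_z (an assumption, not the published layout).
POSITIONS = ["chest", "forearm", "head", "shin", "thigh", "upperarm", "waist"]

def load_subject_activity(directory, activity):
    """Merge the seven simultaneous position recordings of one subject and
    activity into a single (n_samples, 21) array (7 positions x 3 axes)."""
    frames = []
    for pos in POSITIONS:
        df = pd.read_csv("%s/acc_%s_%s.csv" % (directory, activity, pos))
        frames.append(df[["attr_x", "attr_y", "attr_z"]].values)
    n = min(len(f) for f in frames)            # truncate to a common length
    return np.hstack([f[:n] for f in frames])  # one column per position/axis
```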
As shown in Figure 1, the experiments in this study were implemented on a computer equipped with a four-core Intel Core i5-4460 3.2 GHz CPU, an AMD Barts Pro Radeon HD 6850 Graphics Processing Unit (GPU), and 12 GB of random-access memory (RAM). The operating system was the 64-bit version of Ubuntu Linux 16.04. On top of this hardware ran a software combination of RStudio and TensorFlow.
The data preprocessing was performed with RStudio, including data segmentation, data transformation, and shallow feature extraction. The details and complete code are also available in R-markdown format [32]. The deep learning model training and testing were conducted with TensorFlow (Version 1.0), and the model was built in Python (Version 2.7). TensorFlow is an interface for expressing machine learning algorithms and an implementation for executing such algorithms, including training and inference algorithms for deep neural network models. More specifically, the TF.Learn module of TensorFlow was adopted for creating, configuring, training, and evaluating the deep learning model. TF.Learn is a high-level Python module for distributed machine learning inside TensorFlow. It integrates a wide range of state-of-the-art machine learning algorithms built on top of TensorFlow's low-level APIs for small- to large-scale supervised and unsupervised problems. The details of building deep learning models with TensorFlow are provided online, and some of the trained models are also available [32].
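For concreteness, a minimal sketch of the TF.Learn create/configure/train/evaluate cycle is given below, using randomly generated stand-in data; the network architecture and hyper-parameters are illustrative assumptions rather than the exact configuration trained in this study.

```python
import numpy as np
import tensorflow as tf
from tensorflow.contrib import learn  # the TF.Learn module in TensorFlow 1.0

# Stand-in data: 315 shallow features per segment, 8 activity classes.
# The shapes match the study; the random values are placeholders only.
train_x = np.random.rand(1000, 315).astype(np.float32)
train_y = np.random.randint(0, 8, 1000)
test_x = np.random.rand(100, 315).astype(np.float32)
test_y = np.random.randint(0, 8, 100)

feature_columns = learn.infer_real_valued_columns_from_input(train_x)
classifier = learn.DNNClassifier(feature_columns=feature_columns,
                                 hidden_units=[256, 128],  # assumed sizes
                                 n_classes=8)
classifier.fit(x=train_x, y=train_y, batch_size=128, steps=200)
print(classifier.evaluate(x=test_x, y=test_y)["accuracy"])
```

Note that the study's models take the transformed images as input; this flat-feature example only demonstrates the TF.Learn training workflow.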
3.2. Results and Discussion
There are two evaluation schemes for an activity recognition model: a person-dependent method and a person-independent, leave-one-out method [17]. In person-dependent evaluation, the data from the same subject are separated into training and testing samples. In person-independent evaluation, the data of one or more subjects are excluded from the training process and used for testing. In our study, considering the small number of subjects and in order to compare with a previous study [8], we used the person-dependent method. The classifiers were trained and evaluated for each subject individually. The data of each subject were segmented with a non-overlapping method to avoid the over-fitting caused by data duplication between training and testing datasets. Ten percent of the segmented samples were used as testing data, and the remaining samples were used as training data. The samples were selected sequentially in time in order to avoid the over-fitting caused by predicting the past based on the future. All segment lengths were powers of 2 so that the short-time Fourier transform (STFT) could be performed efficiently when generating spectrogram images.
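The sketch below illustrates this non-overlapping segmentation followed by a sequential in-time split; the array shapes and synthetic data are assumptions for illustration.

```python
import numpy as np

def segment_and_split(signal, labels, seg_len=512, test_frac=0.10):
    """Non-overlapping segmentation followed by a sequential (in-time) split:
    the last 10% of segments become test data, so the model never predicts
    the past from the future."""
    n_seg = len(signal) // seg_len
    segments = np.stack([signal[i * seg_len:(i + 1) * seg_len]
                         for i in range(n_seg)])
    seg_labels = np.array([labels[i * seg_len] for i in range(n_seg)])
    split = int(n_seg * (1.0 - test_frac))
    return (segments[:split], seg_labels[:split],
            segments[split:], seg_labels[split:])

# Example with synthetic data: 10 min at 50 Hz, 21 channels (7 positions x 3 axes).
acc = np.random.randn(50 * 600, 21)
lab = np.zeros(50 * 600, dtype=int)
tr_x, tr_y, te_x, te_y = segment_and_split(acc, lab)
```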
These segments were transformed into raw acceleration plots, multichannel plots, and spectrogram images, according to the preprocessing methods introduced above. For each segment, the 15 shallow features listed in Table 1 were extracted for each position and axis. Since each segment contains the acceleration data of three axes and seven positions, 315 shallow features were extracted per segment. The details of the data transformation and feature extraction are available online [32].
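To make the transformation and feature extraction concrete, the sketch below computes a spectrogram image for a single channel and a few representative per-channel statistics. The STFT parameters are illustrative assumptions, and only a subset of statistics is shown; the actual 15-feature set is the one listed in Table 1.

```python
import numpy as np
from scipy import signal as sps
from scipy import stats

def channel_spectrogram(channel, fs=50, nperseg=64):
    """Spectrogram image of one axis/position channel via STFT.
    nperseg is an illustrative assumption, not the study's exact setting."""
    freqs, times, Sxx = sps.spectrogram(channel, fs=fs, nperseg=nperseg)
    return np.log(Sxx + 1e-10)  # log scale improves the dynamic range

def shallow_features(channel):
    """A few representative per-channel statistics; the study extracts the
    15 features of Table 1 per channel (15 x 21 channels = 315 features)."""
    return np.array([channel.mean(), channel.std(), channel.min(),
                     channel.max(), np.median(channel),
                     stats.skew(channel), stats.kurtosis(channel)])

def segment_features(segment):
    """segment: (seg_len, 21) array covering 7 positions x 3 axes."""
    return np.concatenate([shallow_features(segment[:, c])
                           for c in range(segment.shape[1])])

seg = np.random.randn(512, 21)        # placeholder segment
img = channel_spectrogram(seg[:, 0])  # spectrogram of one channel
feats = segment_features(seg)         # 7 x 21 here; 15 x 21 = 315 in the study
```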
Different deep learning models were built and trained for each combination of the five segmentation options and four data transformation methods. The introduced methods were evaluated for each individual subject.
Table 2 presents the aggregated classification results of all 15 subjects, based on different segmentation and transformation combinations. The highest overall accuracy was 97.19%, using the multichannel method based on a segment length of 512 (10.24 s).
The results show that the multichannel method achieved the best performance for all segment lengths. For each of the four transformation methods, performance improved as the segment length increased from 64 (1.28 s) to 512 (10.24 s). There was an accuracy decrease from segment length 512 (10.24 s) to 1024 (20.48 s); a possible explanation is the significant drop in the number of training samples. The accuracy of the multichannel method was also more stable across segment lengths than that of the other methods: its performance variance was smaller and its classification accuracy depended less on the segment length, which implies that this method is more suitable for short-time HAR tasks. Moreover, the introduction of shallow features did not increase performance as expected. In fact, it slightly decreased performance compared with the spectrogram method. One possible explanation is that the number of shallow features (315) was too large, and they were confounded with the features extracted by the deep learning models.
With the same data preprocessing method, the classification accuracies of different individuals differed owing to variations in data quality, dataset size, and individual behavior. Table 3 summarizes the overall classification accuracies of the 15 subjects, based on a segment length of 512 (10.24 s) with the four data preprocessing methods.
Setting aside the impact of segment length, the four models based on the segment length of 512 (10.24 s) were compared in detail.
Table 4 presents the classification accuracy that the four models produced for each of the eight activities.
Regarding training time, the multichannel method also achieved outstanding performance. As shown in Figure 6, the multichannel model took only 40 min to reach its highest accuracy, whereas the other methods required at least 360 min. This demonstrates that the multichannel method provided the best performance in this case, in terms of both accuracy and training time.
Considering the classification accuracy of each activity, the multichannel method perfectly classified the 68 climbing stairs down (A1) samples, as presented in Table 5. It produced a relatively lower accuracy for the jogging activity (A5), where 5 out of 105 jogging samples were misclassified as standing (A6).
The classification above is based on the acceleration data collected from all seven body positions. In real-life scenarios, it is difficult to obtain such a complete dataset. Therefore, activity classification using the data from each single position was also undertaken in this study. The combination of segment length 512 (10.24 s) and the multichannel method was used, to allow comparison with the above-mentioned results. Figure 7 shows the overall classification accuracy for the eight activities. The data from the head produced the lowest accuracy (79.32%), whereas the data collected from the shin provided the highest accuracy (90.51%). This result agrees with the practical experience that the movements of the head are more stable than those of other body positions, whereas the movements of the shin are more closely related to different activities, especially dynamic ones such as jogging, jumping, and climbing stairs up and down. By combining the data from the two positions with the highest accuracies, the shin and forearm, an overall accuracy of 93.00% was achieved. This is close to the result obtained with the data from all seven positions, which was 97.19%.
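As an illustration, the sketch below selects the channels of one or more body positions from the stacked multichannel input; the channel ordering is an assumption for illustration.

```python
import numpy as np

POSITIONS = ["chest", "forearm", "head", "shin", "thigh", "upperarm", "waist"]

def position_channels(segments, positions):
    """Select the three axis channels of the given body positions; segments
    is assumed to be shaped (n_segments, seg_len, 21)."""
    idx = [POSITIONS.index(p) * 3 + a for p in positions for a in range(3)]
    return segments[:, :, idx]

segments = np.random.randn(100, 512, 21)           # placeholder data
shin_only = position_channels(segments, ["shin"])  # single-position input
shin_forearm = position_channels(segments, ["shin", "forearm"])  # combined
```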
Compared with traditional classification techniques, such as ANN, DT, k-NN, NB, SVM, and RF, the deep learning method improved the classification accuracy significantly. Figure 8 shows a comparison between the results achieved by the proposed multichannel deep learning method (marked as DL), based on the segment length of 64 (1.28 s), and the results reported in [8], which used the same dataset with a similar segment length of one second. The deep learning method achieved an overall classification accuracy 7.22% higher than that of RF.
Besides the dataset used above, in order to verify its feasibility, the proposed multichannel data preprocessing method was also applied to three other public HAR datasets: WISDM v1.1 (daily activity data collected by a smartphone in a laboratory, with a sampling rate of 20 Hz) [33], WISDM v2.0 (daily activity data collected by a smartphone in an uncontrolled environment, with a sampling rate of 20 Hz) [34,35], and Skoda (manipulative gestures performed in a car maintenance scenario, with a sampling rate of 98 Hz) [36]. These datasets were used by Ravì et al. [23], and we used the same segment lengths as they did: a non-overlapping window of 4 s for the Skoda dataset and 10 s for the WISDM v1.1 and WISDM v2.0 datasets.
The comparison of the per-class precision and recall values obtained by the proposed multichannel transformation method (abbreviated as MCT in the tables) against the results produced by [23] is presented in Table 6. The results show that the proposed method outperforms the spectrogram-with-shallow-features method for most activities, except walking and jogging in the WISDM v1.1 dataset and walking in the WISDM v2.0 dataset. In terms of the multi-sensor Skoda dataset, the proposed method perfectly classified most activities, except the open and close left front door activities. This comparison reveals that the proposed multichannel method is more suitable for multi-source data, although it can also achieve good results on single-sensor data.