The results of the experiments are presented in this section. All the results were calculated using the leave-one-subject-out method, meaning that one study subject’s data is used for testing and all the other data is used for training, with the process then repeated in turn (Figure 4). Due to this, the trained models are user-independent. When the results for all the study subjects were obtained, they were combined as one sequence, and the MSE and R² values were calculated from these combined sequences. As most of the models used in this article contain random elements, the models were trained five times. All of the results presented in this section are averages from these runs, with the standard deviation between the runs shown in parentheses. The scale of the target variables was [−4, 4]; if an estimated value was outside of this scale, it was replaced with −4 or 4.
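The evaluation protocol above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the subject identifiers, window counts, and clipping helper names are hypothetical.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (train_idx, test_idx) pairs: each subject in turn is the test set."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield np.where(~test_mask)[0], np.where(test_mask)[0]

def clip_to_scale(predictions, low=-4.0, high=4.0):
    """Replace estimates outside the [-4, 4] target scale with -4 or 4."""
    return np.clip(predictions, low, high)

# Toy example: three subjects with two samples each -> three folds
subjects = [1, 1, 2, 2, 3, 3]
splits = list(leave_one_subject_out(subjects))
print(len(splits))                                 # one fold per subject
print(clip_to_scale(np.array([-5.2, 0.3, 4.7])))   # out-of-scale values clipped
```

Because every fold excludes the test subject's data from training, the resulting models are user-independent by construction.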
5.1. Comparison of Prediction Models and Normalization Methods
The results presented in Table 3 show how well different classification and regression models can predict valence and arousal levels based on raw sensor data and how the normalization of signals and target values affects the recognition rates. These results show that it is possible to reliably estimate valence and arousal levels based on data from wrist-worn wearable sensors and up-to-date prediction models, and that the LSTM outperforms the other classification and regression algorithms. The best results were obtained using the LSTM regression model with baseline reduction as the normalization method: for valence level estimation, MSE = 0.43 and R² = 0.71, and for arousal level estimation, MSE = 0.59 and R² = 0.81.
A comparison of the results from the classification and regression models shows that, in general, the regression models performed better than the classification models; only in very few cases did a classification model perform better than a regression model. This is not surprising, as valence and arousal are continuous rather than discrete phenomena, meaning that they should be analyzed using regression methods rather than classification methods. However, in certain cases classification using LSTM worked very well. For instance, when the valence level is recognized, the LSTM-based classification model with baseline reduction normalization (mean MSE = 0.57 and R² = 0.55) performs nearly as well as the LSTM-based regression model with baseline reduction normalization, which has the overall best MSE score (0.43). In addition, when the arousal level is predicted using the LSTM-based classification model with baseline reduction normalization, the performance of the model is nearly as good as when using the LSTM-based regression model with baseline reduction normalization (MSE = 0.81 and R² = 0.75 compared to MSE = 0.59 and R² = 0.81). Therefore, it is not possible to conclude based on MSE and R² alone that LSTM-based regression models are better than LSTM-based classification models. To study the performance of the LSTM-based models in more detail and compare their classification and regression versions, Table 4 presents a comparison using MSE and R² along with RMSE and MAE. According to these results, baseline reduction is the best normalization method, supporting the findings based on the results of Table 3. Moreover, according to Table 4, in the case of valence recognition the difference between the LSTM-based classification and regression models with baseline reduction is small when the MSE, R², RMSE, and MAE values are compared. Nonetheless, when all four performance metrics are compared, the LSTM regression model with baseline reduction is better than the most similar classification model according to three metrics out of four. In the case of arousal recognition, the difference is clear, and again the LSTM-based regression model with baseline reduction is the best model according to three metrics out of four.
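The four performance metrics used in this comparison have standard definitions, which can be computed on the combined prediction sequence as sketched below (a minimal illustration with hypothetical toy values; the reported numbers come from the actual WESAD experiments):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R² computed over one combined prediction sequence."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    # R² compares the model against always predicting the mean of the targets
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

# Toy example: predictions that track the targets closely give R² near 1
print(regression_metrics([0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.2, 2.8]))
```

MSE, RMSE, and MAE are error measures (lower is better), while R² measures explained variance (higher is better, with 1 being a perfect fit), which is why the four metrics together give a fuller picture than MSE alone.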
According to Table 4, the two best models are the LSTM regression and classification models with baseline reduction. To obtain more insight into these models, Figure 5 and Figure 6 illustrate how the predicted valence and arousal estimates follow the user-reported target variables when these models are used in prediction. The figures are drawn based on the results of the best runs; in the case of the regression model, the MSE and R² for valence estimation were 0.38 and 0.74, respectively, while for arousal estimation they were 0.51 and 0.84, respectively. For the classification model, the MSE and R² for valence estimation were 0.42 and 0.69, respectively, while for arousal estimation they were 0.68 and 0.80, respectively. In the figures, predictions using the LSTM-based regression model are shown with a blue line, those using the classification model are shown with a green line, and the true valence or arousal level is shown in orange. Due to subjective differences, the estimation is not equally good for all subjects; however, these figures show that in general the prediction is highly accurate with both models. In fact, for a number of study subjects the prediction is almost perfect. However, while the difference between the LSTM regression and classification models according to MSE and R² is minimal, Figure 5 and Figure 6 reveal differences. It can be noted that the WESAD data does not contain many samples in which the level of valence is very high or very low, and it contains very few negative arousal cases. In fact, Figure 5 shows that the models have difficulty detecting high valence values; in particular, the classification model seems to suffer from this lack of training data for high valence values. According to Figure 5, the classification model performs badly for samples in which valence is above zero, while the regression model has fewer such problems. Similarly, Figure 6 shows that the classification model has problems detecting high arousal values; here, the problems are not as severe as in the case of valence recognition, as the training data contain more cases with high values for arousal than for valence. In addition, according to Figure 6, neither model detects negative arousal samples.
Earlier results have already shown that baseline reduction is the most effective normalization method. However, when different normalization methods are compared, it is especially interesting to see the effects they have on the LSTM models, as these outperform the other models. This is visualized in Figure 7. The results in this figure are taken from Table 4 by calculating the average performance of each normalization method when LSTM classification and regression models are used to detect valence and arousal levels. The figure clearly shows that there are large differences between the normalization methods; no matter which performance metric is used, baseline reduction always provides the best results. With baseline reduction, the MSE, RMSE, and MAE errors are the lowest and the R² value is the highest. In fact, according to Table 4, for both valence and arousal the best results are obtained when using baseline reduction as the normalization method. Both classification and regression models benefit from it, showing that normalization should be used instead of analyzing raw data. The low performance of z-score normalization is surprising; it only rarely provides good results, and in this study the only good results using z-score normalization were obtained when the valence level was detected using the LSTM-based regression model (MSE = 0.70 and R² = 0.75; see Table 4). While z-score normalization does not perform well compared to baseline reduction, it is a much better option than analyzing data without any normalization. In fact, Figure 7 shows that, on average, the worst results were obtained from raw data, with the performance of non-normalized data being especially bad according to the RMSE value.
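The two normalization methods compared here can be sketched as follows. This is a minimal, hypothetical illustration of the general idea: the signal values and the choice of a per-subject resting period as the baseline are assumptions for the example, not the exact procedure of this study.

```python
import numpy as np

def baseline_reduction(signal, baseline_period):
    """Subtract the mean of the subject's baseline period from the signal."""
    return np.asarray(signal, dtype=float) - np.mean(baseline_period)

def z_score(signal):
    """Standardize the signal to zero mean and unit standard deviation."""
    signal = np.asarray(signal, dtype=float)
    return (signal - signal.mean()) / signal.std()

# Toy heart-rate-like signal: a resting (baseline) period and a task period
baseline_period = [60.0, 62.0, 61.0]
task_period = [70.0, 75.0, 80.0]
print(baseline_reduction(task_period, baseline_period))  # change relative to rest
print(z_score(task_period))                              # scale-free representation
```

Baseline reduction keeps the original units but expresses each value relative to the subject's own resting level, which plausibly explains why it helps user-independent models: it removes between-subject offsets while preserving the magnitude of the reaction.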
Figure 5 and
Figure 6 show that the valence and arousal levels can be estimated with high reliability when studied separately, and an LSTM-based regression model with baseline reduction is the best method to do it. However, the most important thing is to understand how well emotions can be estimated when valence and arousal estimations are combined and visualized using Russell’s circumplex model of emotions (see
Figure 1).
Figure 8 shows this visualization for different emotion classes; these estimations are from the run that provided the best results when the target values were normalized using baseline reduction. Therefore, they are the same ones shown in
Figure 5 and
Figure 6 for the LSTM-based regression model. In
Figure 8, the estimated values are shown in blue and the target values provided by the study subjects are visualized using red dots. As baseline reduction is used, in the case of the baseline class the target value for valence and arousal is zero.
Figure 8a shows that the baseline emotion can be estimated with high accuracy, as almost all the estimations are close to the origin. In this case, the average estimated valence is −0.01 and the average estimated arousal is 0.04. According to
Figure 1, strong negative emotions are located at the top left quarter of Russell’s circumplex model of emotions, which is exactly where estimations of stress-class observations are located based on the models presented in this article (see
Figure 8b). Moreover, the target values obtained from the study subjects are located in the same place. In fact, the predicted values and target values are very close to each other. Observations from the amusement class are estimated to be located close to the origin (
Figure 8c) or in the bottom right quarter of Russell’s circumplex model of emotions, where relaxed emotions are located. While the model estimates only slightly relaxed emotions during the amusement class, and the detected emotions are not as strong as those recognized from the stress class, this does not mean that the model performs badly in this case. Indeed, when the predicted values are compared to the target values, it can again be noted that they are distributed in the same area of the valence–arousal graph. Therefore, prediction models based on the LSTM regression model and baseline reduction can estimate the valence and arousal levels for each emotion class with high accuracy; based on this analysis, the models can be considered emotion-independent.
5.2. Subject-Wise Results
Subject-wise valence and arousal level estimation results from the best-performing regression models are presented in Table 5, where LSTM models without any normalization and with baseline reduction are compared to AdaBoost and Random Forest models with baseline reduction. It should be noted that, according to Table 3, the AdaBoost and Random Forest models perform much worse on average than the LSTM models. The results in Table 5 show that for most of the study subjects the levels of valence and arousal can be predicted not only with the LSTM models but also with the AdaBoost and Random Forest models; there are even cases in which AdaBoost and Random Forest perform better than LSTM. However, the largest difference between AdaBoost, Random Forest, and LSTM is that in certain cases AdaBoost and Random Forest perform very badly, while the variance between the prediction rates for different study subjects is much smaller using LSTM. For instance, when the valence of subject 11 was predicted using the AdaBoost regression model, the R²-score was −131.95, and for Random Forest the R²-score was −127.81. Such outliers naturally have a huge effect on the average values presented in Table 3.
The results in Table 5 show that certain study subjects have data that are more difficult to predict. For instance, each model has difficulty predicting the valence of study subjects 14 and 17 and the arousal of study subjects 2, 14, and 17. There may be problems with the data of study subjects 14 and 17, or their bodies may react differently to stimuli compared to the other study subjects. If the differences are caused by different reactions to stimuli, this suggests that it would be possible to obtain better results via model personalization. In addition, there are model-specific differences. For instance, the valence level of study subject 4 is not predicted well by the LSTM when the model uses raw data; however, when the same person’s data is predicted with the LSTM model trained using baseline-reduction-normalized data, the prediction is highly accurate. This shows the importance of normalization. Moreover, while the LSTM performs well in most cases, for certain subjects the R²-score is negative.
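A negative R²-score means the model performs worse than a trivial predictor that always outputs the mean of the target values, and the score is unbounded below, which is how extreme values such as −131.95 can arise for a single subject. A minimal sketch with hypothetical toy values:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than the mean predictor."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

targets = np.array([-1.0, 0.0, 1.0])
print(r2_score(targets, np.array([-0.9, 0.1, 0.9])))  # close to 1: good fit
print(r2_score(targets, np.array([2.0, 2.0, 2.0])))   # negative: worse than the mean
```

Because the denominator is fixed by the targets while the numerator can grow without limit, a model that is badly wrong for one subject can drag the subject-wise average far below zero, as seen for AdaBoost and Random Forest.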
For this experiment, all of the models were trained five times; the results presented in this section are averages from these runs, with the standard deviation across the runs for each individual presented in parentheses in Table 5. When the standard deviations are studied in detail, it can be noted that for certain study subjects the results differ a great deal between runs, especially when it comes to valence level detection. For instance, for study subjects 2, 5, and 14 the standard deviation of the R²-score is greater than 1 when valence is detected using LSTM and baseline reduction.