1. Introduction
A great diversity of time series has been successfully analysed in the last several decades, since the widespread availability of digital computers and the development of efficient data acquisition and processing methods: biological time series [1], econometric records [2], environmental science data [3], industrial process and manufacturing information [4], and many more. The case of non-linear methods, capable of extracting elusive features from almost any type of time series, is especially remarkable. However, these methods can sometimes be difficult to customize for a specific purpose, and some signal classification problems remain unsolved or scarcely studied. In this regard, this paper addresses the problem of physiological temperature record classification. This problem has only recently begun to be studied [5], with only marginally significant differences found so far [6]. Instead of trying to find a single optimal non-linear measure or parameter configuration, we propose a new approach based on a combination of several sub-optimal methods.
Electroencephalographic (EEG) and heart rate variability (HRV) records are probably the two types of physiological time series most analyzed in signal classification studies using non-linear methods. The rationale of this scientific popularity is twofold. On the one hand, these records are frequently used in clinical practice, convenient and affordable monitoring devices abound, and a growing body of publicly available data has been created in the last several decades. On the other hand, recently developed non-linear measures suit well the features and requirements of these records with regard to a number of samples and noise levels.
As a consequence, there is a myriad of scientific papers describing successful classification approaches for different types of EEG or HRV signals. For instance, in [7], EEG signals were classified using several entropy statistics under noisy conditions. Despite high noise levels, most of the entropy methods were able to find differences among signals acquired from patients with disparate clinical backgrounds. The study in [8] used two of these measures to group epileptic recordings of very short length, only 868 samples. The work in [9] also used EEG records to detect Alzheimer's disease based on changes in the regularity of the signals. One of the two methods assessed was able to find significant differences between pathological and healthy subjects. Regarding RR time series, studies such as [10] classified congestive heart failure and normal sinus rhythm records, again using two entropy measures, and assessed the influence of the input parameters on these measures. One of the very first applications of sample entropy (SampEn) was the analysis of neonatal HRV [11]. Other approaches, based on ordinal patterns instead of amplitude differences, have also been successful in classifying HRV records, in this case for the diagnosis of cardiovascular autonomic neuropathy [12]. In summary, EEG and HRV records have been extensively processed using approximate entropy (ApEn), SampEn, distribution entropy (DistEn), fuzzy entropy (FuzzyEn), permutation entropy (PE), and many more, in isolation or in comparative studies.
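As an illustration of the amplitude-based family of measures named above, the following is a minimal SampEn sketch in Python (not the implementation used in any of the cited studies); as usual, the tolerance is expressed as a fraction r of the standard deviation of the record:

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    """SampEn: negative logarithm of the conditional probability that two
    sequences similar for m samples remain similar for m + 1 samples.
    Chebyshev distance, tolerance r times the standard deviation,
    self-matches excluded."""
    x = np.asarray(x, dtype=float)
    tol = r * np.std(x)
    n = len(x)

    def count_matches(dim):
        # All overlapping templates of length `dim` (N - m of them, so the
        # counts for lengths m and m + 1 are comparable).
        templates = np.array([x[i:i + dim] for i in range(n - m)])
        matches = 0
        for i in range(len(templates)):
            # Distance from template i to every *later* template only,
            # which excludes self-matches and counts each pair once.
            d = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
            matches += np.sum(d <= tol)
        return matches

    return -np.log(count_matches(m + 1) / count_matches(m))
```

A regular signal (e.g., a sinusoid) yields a much lower SampEn than white noise of the same length, which is the intuition exploited in the classification studies cited above.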
Conversely, other biomedical records, such as blood or interstitial glucose, arterial blood pressure, or body temperature data, have not been studied as extensively. Despite their convenience and demonstrated diagnostic potential [13,14], the use of entropy, complexity, or regularity measures is still lacking in these contexts. These records are more often found in clinical settings as single readings instead of time series, and if continuous readings are available, they are usually very short and sampled at very low frequencies. As a result of this lack of scientific literature and data on which further studies can be based, the selection of methods and parameters is more difficult. Thus, the quest for suitable classification methods may become a brute-force search, and the results achieved are often at the borderline of significance, at most. The question that arises in these cases is: do the methods find no differences because the records from the classes do not exhibit any distinguishing feature, or because the methods have not been used to their full potential?
Obviously, when a single feature is not sufficient for a clear classification of objects, more features can be included in the classification function, following the general pattern recognition principles for feature extraction and selection. For instance, in [14], in addition to a non-linear measure for early fever peak detection, the classification function used other parameters such as the temperature gradient between core and peripheral body temperature. To detect atrial fibrillation in very short RR records, the authors of [15] proposed adding the heart rate as a predictor variable along with the entropy estimate. In these and other examples, a single non-linear method was combined with other parameters to improve the accuracy of the classifier employed. Along these lines, in this paper, we propose the use of two variables for classifying body temperature records. However, instead of using a single non-linear method combined with other unrelated parameters [15], the main novelty of our work is the utilization of two uncorrelated entropy measures as explanatory variables of the same equation: PE and ApEn/SampEn. There are other works with comparative analyses using several entropy measures independently, and some authors have even recommended applying more than one method together to reveal different features of the underlying dynamics [16], but our method combines more than one measure in a single function after a correlation analysis. There are a few studies using pattern recognition techniques and more than one entropy statistic, such as [17], to improve the classification accuracy of a single method.
For many years, temperature recordings in standard clinical practice have been limited to scarce measurements (once per day or once per shift), which provide very little information about the processes underlying body temperature regulation [18,19]. For these reasons, physicians are only capable of distinguishing between febrile and afebrile patients. However, information from continuous body temperature recordings may be helpful in improving our understanding of body temperature disorders in patients with fever [5,20,21].
ApEn and SampEn are arguably the two families of statistics most extensively used in the non-linear biosignal processing realm, with ApEn accounting for more than 1100 citations in PubMed and SampEn for almost 800. PE is not that common yet, since it is a more recent method, but it is probably the best representative of the tools based on sample order differences instead of sample amplitude differences, as is the case for ApEn and SampEn. Different values of SampEn/ApEn and PE between healthy individuals and patients with fever are likely to reflect subtle changes in body temperature regulation that may be more relevant than the mere identification of a fever peak. It seems reasonable to believe that the process of body temperature regulation may be altered during infectious diseases and that it may return to normal during the recovery phase [22]. Therefore, information obtained by non-linear methods could be useful to evaluate the response to antimicrobial treatments or to adjust the length of those treatments.
Each method separately provides a borderline body temperature time series classification, as is the case in many other studies, but the two combined improve the accuracy significantly. The results of our study show that logistic models including SampEn/ApEn and PE have an accuracy that is acceptable for classifying temperature time series from patients with fever and healthy individuals. The ability of the models developed in this work to classify body temperature time series seems to be a first step towards giving temperature recordings a more significant role in clinical practice. As has been shown with other clinical signals, such as heart rate or glycaemia [10,13], many diseases reflect a deep disturbance of complex physiological systems, which can be measured by non-linear statistics. This scheme could therefore be exported to other similar situations where several methods are assessed but none of them reaches the desired significance level. The solution to many of these problems probably lies in an approach similar to that described in the present paper, whose main contributions are an improvement in body temperature classification accuracy and the introduction of a logistic model to perform such a classification.
3. Experiments and Results
The length of the time series was fixed to cover the 8 h interval stated above. ApEn and SampEn were also first tested using different values for their input parameters in the vicinity of the usual recommended configuration of m = 2 and r = 0.2 [47]. Specifically, the values for m were 1 and 2, and r varied between 0.1 and 0.25 in 0.05 steps. Except for one configuration, with a relatively low classification performance of 64%, all the tested parameter values yielded a very similar accuracy, around 70%, with the finally selected combination offering a slightly superior performance. This final parameter configuration is very similar to that used in previous similar studies [5].
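The parameter sweep above can be reproduced with a straightforward ApEn estimator; the sketch below (illustrative only, using a hypothetical stand-in record rather than the actual temperature data) evaluates the grid m ∈ {1, 2}, r ∈ {0.10, …, 0.25}:

```python
import numpy as np

def approximate_entropy(x, m=2, r=0.2):
    """ApEn = phi(m) - phi(m + 1); self-matches are included, which is the
    source of the well-known ApEn bias."""
    x = np.asarray(x, dtype=float)
    tol = r * np.std(x)
    n = len(x)

    def phi(dim):
        templates = np.array([x[i:i + dim] for i in range(n - dim + 1)])
        # For each template, the fraction of templates within tolerance
        # (the template itself included).
        c = [np.mean(np.max(np.abs(templates - t), axis=1) <= tol)
             for t in templates]
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)

# Sweep the configurations tested in the text on a hypothetical record.
record = np.sin(np.linspace(0, 20 * np.pi, 480))  # stand-in for an 8 h series
for m in (1, 2):
    for r in (0.10, 0.15, 0.20, 0.25):
        print(f"m={m}, r={r:.2f}: ApEn={approximate_entropy(record, m, r):.3f}")
```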
The influence of the embedding dimension on PE was also analysed, with m ranging from 3 up to 8. The classification results for each value are shown in Table 1. The embedding dimension for PE was finally set at m = 8. This configuration was found to be optimal for the same time series in terms of classification performance and computational cost [48]. However, since m = 8 does not satisfy the usual recommendation relating the record length to the number of possible ordinal patterns, a smaller embedding dimension was also used in the computation of the final model.
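PE itself can be sketched as follows (an illustrative implementation, not the authors' code); the recommendation alluded to above is that the number of windows should be large compared with the m! possible ordinal patterns, which is why a large m such as 8 can be problematic for short records:

```python
import math
import numpy as np

def permutation_entropy(x, m=3, normalize=True):
    """PE: Shannon entropy of the relative frequencies of the ordinal
    patterns of length m found in the series."""
    x = np.asarray(x, dtype=float)
    counts = {}
    for i in range(len(x) - m + 1):
        pattern = tuple(np.argsort(x[i:i + m]))  # rank order of the window
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    h = -np.sum(p * np.log(p))
    # Normalizing by log(m!) maps PE to [0, 1]; reliable estimates need
    # far more windows than the m! possible patterns.
    return h / math.log(math.factorial(m)) if normalize else h
```

A monotone series produces a single ordinal pattern and therefore a PE of 0, while white noise approaches the maximum normalized value of 1.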
The three statistics, ApEn, SampEn, and PE, were first computed for each record using the parameter configurations described above. The results are shown in Table 2.
The next step was to assess the independence of the input variables used to build the model. This step was carried out using a correlation matrix and by computing the p-values of the correlation test between variable pairs, as described in Table 3 and Table 4. This correlation analysis was used to assess the degree of association between the information provided by ApEn and SampEn, in order to avoid possible redundancy in the fitted models and to provide a rationale for not using both measures in the same model.
As expected, ApEn and SampEn are strongly correlated. However, PE exhibits very low correlations and high p-values, which suggests that there is no correlation between PE and either of the other two measures, ApEn or SampEn. This may be due to the fact that PE is based on ordinal differences, whereas the other two are based on amplitude differences, as was hypothesized.
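This kind of pairwise analysis can be reproduced as below; the numbers here are synthetic stand-ins for the Table 2 results, and SciPy's pearsonr is assumed for the correlation test:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the per-record entropy results of Table 2.
rng = np.random.default_rng(1)
apen = rng.normal(1.0, 0.10, 30)
sampen = apen + rng.normal(0.0, 0.02, 30)  # amplitude-based, close to ApEn
pe = rng.normal(0.8, 0.05, 30)             # ordinal-based, unrelated

for name, a, b in [("ApEn-SampEn", apen, sampen),
                   ("ApEn-PE", apen, pe),
                   ("SampEn-PE", sampen, pe)]:
    r, p = stats.pearsonr(a, b)  # Pearson correlation and its p-value
    print(f"{name}: r = {r:+.3f}, p = {p:.3g}")
```

With data like this, the amplitude-based pair shows a strong, significant correlation, while the pairs involving the ordinal measure do not.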
In the following sections, the predictive capability of each of the measures is assessed using a logistic model, for all the variables and their combinations, discarding the correlated cases. PE, ApEn, or SampEn are the temperature time series features used for classification.
3.1. Individual Models
Table 5 shows the results of the model using only PE. This model, whose assessment parameters are summarized in Table 6, achieves a significant classification performance, with 83.3% correctly classified records (Table 7) and an average classification performance of 77.6% using the leave-one-out (LOO) method (Table 6). The LOO method leaves out one time series of each class (validation set), and a model is built using the remaining data (training set). This model is used to make a prediction for the validation set, and the final LOO classification performance is obtained by averaging all the partial results. The accuracy achieved is expected to be lower than that for the entire dataset, since the training and test sets are different, but it provides a good picture of the generalization capabilities of the model.
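This leave-one-per-class scheme can be sketched with scikit-learn's LogisticRegression on synthetic two-feature data (hypothetical values, standing in for the (PE, ApEn/SampEn) pairs of the real records):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the dataset: two entropy features per record.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0.70, 0.90], 0.05, (15, 2)),   # Class 0
               rng.normal([0.80, 1.10], 0.05, (15, 2))])  # Class 1
y = np.repeat([0, 1], 15)

idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
correct = total = 0
for i in idx0:                 # leave out one record of each class,
    for j in idx1:             # train on the rest, predict the held-out pair
        train = np.setdiff1d(np.arange(len(y)), [i, j])
        model = LogisticRegression().fit(X[train], y[train])
        correct += np.sum(model.predict(X[[i, j]]) == y[[i, j]])
        total += 2
accuracy = correct / total
print(f"LOO accuracy: {accuracy:.1%}")
```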
The percentages in Table 7 account for sensitivity (percentage of correctly classified Class 0 records), specificity (percentage of correctly classified Class 1 records), and overall classification accuracy (total). The same confusion matrix layout is repeated for the other models.
Replacing the values obtained for the model coefficients, the predicted probability of belonging to Class 1 can then be computed for each record.
For illustrative purposes, Table 8 shows the predicted probabilities obtained for all the PE results in Table 2. If the predicted probability is above the 0.5 threshold, the time series is classified as Class 1, and as Class 0 otherwise. According to this threshold, there are 2 classification errors in Class 0 and 3 in Class 1. This process can be repeated for all the models fitted in this study.
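The classification rule just described amounts to evaluating the logistic function and comparing it with 0.5; in the sketch below, b0 and b1 are placeholders, not the fitted coefficients of the model:

```python
import numpy as np

def logistic_classify(pe, b0, b1, threshold=0.5):
    """Predicted probability of Class 1 from a univariate logistic model,
    and the corresponding 0/1 decision."""
    prob = 1.0 / (1.0 + np.exp(-(b0 + b1 * pe)))
    return prob, int(prob >= threshold)

# Example with placeholder coefficients (NOT the fitted model).
prob, label = logistic_classify(0.95, b0=-10.0, b1=12.0)
```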
The results of the model using only SampEn are shown in Table 9. In contrast to the results with PE, this model, whose parameters are summarized in Table 10, achieves only a borderline classification performance, with 70% correctly classified records (Table 11), but only 57.1% for Class 1 records. The average classification performance was 68.7% using the LOO method (Table 10).
The last individual model, using only ApEn, achieves a better performance than that of SampEn. Its modelling results are shown in Table 12, and its summary in Table 13. The classification performance is also on the verge of significance (p-values of 0.014 and 0.021), with an overall accuracy of 73.3%, but with a better Class 1 classification of 64.3% (Table 14). The average classification performance was 69.7% using the LOO method (Table 13). The better performance of ApEn over SampEn with temperature records, although counter-intuitive, is in accordance with other similar studies [5].
3.2. Joint Models
The joint models correspond to models where PE and SampEn, or PE and ApEn, are combined to improve the classification performance of the models described in the previous section. The results of the model using PE and SampEn are shown in Table 15, Table 16 and Table 17. In comparison with the previous individual results for PE or SampEn, there is a compelling performance improvement, from 83.3% to 90% classification accuracy, even though the performance for SampEn alone was only 70%. Arguably, there is a synergy between PE and SampEn, as expected. The average classification performance was 87.2% using the LOO method (Table 16).
Figure 1 summarizes the ROC plots of all the models studied. This figure makes apparent how the performance increases significantly for the joint models.
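ROC curves such as those in Figure 1 are obtained by sweeping the decision threshold over the predicted probabilities; a minimal sketch with synthetic labels and scores (not the study's data) using scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic labels and predicted probabilities (stand-ins for a model's output).
y_true = np.repeat([0, 1], 15)
rng = np.random.default_rng(3)
y_score = y_true * 0.3 + rng.uniform(0.0, 0.7, 30)  # classes partially overlap

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")
```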
Visually, the separability of the classes using SampEn and PE combined in a logistic model is shown in Figure 2. As numerically described in Table 17, only 1 or 2 objects are located in the opposite group.
The results of the model using PE and ApEn are shown in Table 18, Table 19 and Table 20. As for PE with SampEn, in comparison with the previous individual results, there is a compelling performance improvement, from 83.3% to 93.3% classification accuracy, even though the performance for ApEn alone was 73.3%. Again, there appears to be a synergy between PE and ApEn. The average classification performance was 90.1% using the LOO method (Table 19).
This is the model with the highest classification accuracy. Replacing the values obtained for the coefficients, the predicted probability of the fitted logistic model can be computed from the PE and ApEn results of each time series, as done for the univariate PE model.
The separability of the classes using ApEn and PE combined in a logistic model is depicted in Figure 3. As numerically described in Table 20, only one object of each class is located in the opposite group.
The LOO analysis was also performed using these joint models, omitting a record from each class in each experiment and averaging the classification results obtained. For the model with PE and SampEn, the classification accuracy dropped from 90% to 87.2%. For the model with PE and ApEn, it also dropped, from 93.3% to 90.1%. These performance decrements can be expected in any LOO analysis, and a 3% difference can be considered small enough to assume a reasonable generalization capability for the joint models.
Table 21 summarizes the performance of all the models studied.
The computation of the final model was repeated using the PE results achieved with the smaller embedding dimension, as described in Table 1, re-estimating the model parameters accordingly. Using this model, there were 4 classification errors in Class 0 and 1 error in Class 1, with a global accuracy of 83.3%. This is the same performance as using PE alone, but with a more conservative approach in terms of m. The classification was also improved by 10% in comparison with the results achieved by PE and ApEn in isolation with the same parameter configuration.
Finally, in order to further validate the approach proposed in this study, we applied the same scheme to EEG records from the Bonn database [49]. This database is publicly available and has been used in many studies, including ours [7,48,50] and others that have also proposed using more than one entropy statistic simultaneously [16,17] to improve classification performance. We therefore omit the details of this database, which can be obtained from those papers, since it is not the focus of the present study.
We applied the same SampEn and ApEn configuration as in [17], and the PE configuration used here. There is great variation in classification performance across pairs of classes, but the segmentation of EEGs from healthy subjects with eyes open (Group A in [17]) and from subjects with epilepsy during a seizure-free period, recorded from the epileptogenic zone (Group C in [17]), yielded a borderline significant classification performance (52% for PE, 78% for SampEn, and 72% for ApEn) that suited very well the case studied in the present paper.
A model including PE and SampEn was created as described above. Applying that model in a similar way to that in Equation (9), only 7 objects of Group A and 4 of Group C were misclassified. Overall, the classification performance increased up to 94.5%.
4. Discussion
PE could initially be supposed to look at signal properties different from those that ApEn or SampEn capture. Indeed, the correlation analysis in Table 3 presents very low values (−0.2374 and −0.1342), whereas, as expected, ApEn and SampEn were strongly correlated. This initial test suggested that only models combining PE with either SampEn or ApEn should be studied. In addition, all the coefficients obtained were reasonably similar, without very large standard errors, which confirms that the S-shaped logistic function is a suitable model for the data (there are no separation problems [51]).
Individual models were first computed for each measure in order to assess their performance independently. The classification results are acceptable for PE, but at most borderline for ApEn and SampEn. The per-class results were 87.5% and 78.6% for PE, but only 81.3% and 64.3% for ApEn, and 81.3% and 57.1% for SampEn. While the classification accuracy for Class 0 is similar for all measures, it is very poor for Class 1 using ApEn or SampEn.
Two joint models were studied, using PE and ApEn, and PE and SampEn, namely, pairs of uncorrelated explanatory variables according to Table 3. The model with PE and ApEn improved on the best individual performances in all cases, up to 93.8% and 92.9%. The model with PE and SampEn also improved on the individual results, but to a lesser extent: 87.5% and 92.9%. Therefore, the classification results indicate that PE and ApEn are the best choice for a model in this case, confirmed by the minimum value achieved by the AIC (Table 21). According to these results, ApEn outperforms SampEn, which may seem counter-intuitive, but this also happened in a similar study with temperature records [5]. Moreover, the LOO analysis yielded a very similar classification performance, with at most a 3% drop and still well above the individual performances. Specifically, the classification dropped from 83.3% to 77.6% using PE, from 70% to 68.7% using SampEn, and from 73.3% to 69.7% using ApEn. Regarding the joint models, it dropped from 93.3% to 90.1% using PE and ApEn, and from 90% to 87.2% using PE and SampEn. Therefore, it can be concluded that the models are able to generalize well, given the small dataset available.
The Nagelkerke coefficients were smaller than 0.5 for the individual models using only ApEn or SampEn (0.493 and 0.399, respectively), whereas for PE it was 0.588. These values also confirm that the individual results can only be considered significant for PE, although ApEn almost reached significance in terms of this coefficient. The two joint models also improved with regard to this parameter, with values higher than 0.77.
In terms of class balance, the individual model based solely on SampEn yields 6 and 3 errors for Class 1 and Class 0, respectively, a slightly unbalanced result. The errors are more equally distributed for the other two individual models (3 and 2 errors for PE, and 5 and 3 errors for ApEn). For the joint models, the classification is balanced, with 2 and 1 errors, or even a single error in each class for the model proposed. This can be considered another advantage of the proposed method, since the classification is not only more accurate but also more equally distributed.
5. Conclusions
Entropy measures are sometimes unable to find significant differences among time series from disjoint clusters. This can be due to a sub-optimal parameter configuration, to specific signal features, or simply because the method chosen is not appropriate for that purpose in that specific context. However, even when no statistically significant differences are found, classification results are frequently well above simple guessing, falling just short of significance. Taking advantage of the fact that each measure usually focuses on a specific region of the parameter space, we hypothesized that a combination of uncorrelated statistics could improve the classification results achieved by each one independently and reach a suitable significance level.
With that purpose in mind, we analyzed the classification performance of a logistic model built from two entropy statistics, PE and ApEn/SampEn. These two measures look at different relationships in the time series information: ordinal or amplitude variations. Separately, they were less capable of tackling the difficult problem of body temperature time series classification (83% and 73% accuracy, respectively), but together the classification accuracy rose to 93%, and to 90% using a LOO approach. It is important to note that the main goal of this work was not to determine the exact percentage of correctly assigned objects, but to demonstrate that a combined approach can improve the baseline performance, however high or low that already is.
This scheme could be applied to other classification problems where independent measures achieve borderline results when applied in isolation. The exploitation of possible synergies between different methods is a novel approach that has not been applied very extensively so far and could open the door to more accurate methods.