The primary aim of this article is to investigate whether XAI methods can enhance explainability of ML predictions in clinical gait classification. In this section, the classification results are analyzed, compared, and interpreted in terms of classification accuracy and relevance-based explanations. These explanations are, furthermore, evaluated from a statistical and clinical viewpoint. Additionally, we discuss dependencies, influences, and interesting observations with respect to different classification methods, tasks, normalization methods, and signal components (horizontal forces and affected/unaffected leg signals).
6.2 Explainability Results
In the following, we discuss several aspects related to our first leading research question: Which input features or signal regions are most relevant for automatic gait classification? The visualizations for all classification tasks and classification methods can be found in supplementary Figures S1–S12.
Which input features are relevant for the classification of functional gait disorders? LRP identified several regions of high relevance in the GRF signals for all classification tasks.
The ML models often used regions (and not single time-discrete values) encompassing peaks and valleys in the GRF signals to distinguish between the different classes, e.g., for task HC/GD using the CNN (see Figure 5) in the vertical GRF of the affected and unaffected side (all three local maxima and minima), the affected anterior-posterior GRF (both peaks), the unaffected anterior-posterior GRF (first peak), the affected medio-lateral GRF (first lateral peak), and the unaffected medio-lateral GRF (both lateral peaks). The highest total relevance scores are present in the signals of the affected side and most commonly in the vertical GRF for all investigated classification tasks. This is in line with earlier studies, e.g., where the peaks and valley (as time-discrete parameters) of the affected vertical GRF showed the highest discriminatory power [66].
Are signal regions of the unaffected side important for the classification of functional gait disorders? Across all classification tasks, relevant regions are also pronounced in the GRF signals of the unaffected side, but less so than in those of the affected side. In earlier studies [67, 68], we showed that omitting the unaffected side during classification negatively affected classification accuracy. The explainability results confirm this observation. The unaffected side seems to capture complementary information relevant to the classification task under consideration. In particular, the identified relevant regions in the GRF signals occur at similar relative (e.g., in both peaks of the vertical GRF) or absolute (e.g., the second peak of the affected vertical GRF and the first peak of the unaffected vertical GRF) time points of the stance phases of the unaffected and affected side.
Are the anterior-posterior and medio-lateral forces relevant for the task? While the highest total relevance scores can be observed in the vertical GRF in most cases, relevant regions are always also observed in the horizontal GRF signals (anterior-posterior and medio-lateral). However, the locations and degree of relevance within the horizontal signals vary across classification tasks, e.g., for task HC/A, the highest relevance scores occur in the affected anterior-posterior (and medio-lateral) GRF and hardly any relevant regions exist in the vertical GRF (see supplementary Figure S10), while for task HC/H the highest relevance score within the horizontal signals appears at the beginning of the stance phase of the affected side (see supplementary Figure S4).
What is the impact of normalization on explainability results? Normalization of input data is a standard procedure prior to classification with ML models to ensure equal numerical ranges of different signals [14, 31]. XAI methods such as LRP make it possible to visualize the effects of normalization on the predictions of ML models directly at the level of the input signals. To gain a deeper understanding of these effects and the underlying data, we also conducted experiments without normalization of the input data (see supplementary Figures S13–S24). For the classification of non-normalized GRF signals, the most relevant input values are located in the vertical GRF, i.e., especially the two peaks and the valley in between are relevant for the tasks. A minimal degree of relevance can be observed in the peaks of the affected and unaffected anterior-posterior GRF signals.
The reason for the absence of relevant regions in the horizontal forces could be their small value range. Their rather small range compared to the vertical component may lead to a smaller influence on the training of the classification models. Explainability results for min-max normalized input data show that highly relevant regions are identified in the horizontal forces of the affected and unaffected side (e.g., Figure 5). Thus, normalization amplifies the relevance of values in the horizontal forces and thereby makes them similarly important as the vertical GRF. Based on the LRP relevance scores, we conclude that normalization is important to obtain unbiased predictions of ML models (bias introduced by different signal amplitudes).
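To make the normalization step concrete, the following minimal sketch applies min-max scaling to time-normalized GRF signals so that all components share the range [0, 1]. The array shape, the function name, and the choice to scale each component of each trial individually are illustrative assumptions rather than a description of our actual preprocessing pipeline (scaling could equally be performed per component over the whole training set).

```python
import numpy as np

def min_max_normalize(signals: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each GRF component of each trial to the range [0, 1].

    Assumed shape: (n_trials, n_components, n_time), e.g., the vertical,
    anterior-posterior, and medio-lateral GRF of both sides, each
    time-normalized to the stance phase.
    """
    mins = signals.min(axis=-1, keepdims=True)
    maxs = signals.max(axis=-1, keepdims=True)
    return (signals - mins) / (maxs - mins + eps)

# Example: 5 trials, 6 signal components (3 GRF components x 2 sides), 100 time points.
grf = np.random.default_rng(0).normal(size=(5, 6, 100))
grf_norm = min_max_normalize(grf)
assert grf_norm.min() >= 0.0 and grf_norm.max() <= 1.0
```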
Are all identified relevant regions necessary for the task? For all classification tasks and classification methods with min-max normalized input data, many regions of the GRF signals are identified as relevant for classification according to LRP. The classification performance with and without normalization does, however, not differ significantly for the binary classification tasks (see classification results in Section 5.1). This raises the question of whether all regions identified as relevant are necessary to achieve peak classification performance or whether some of them are redundant (i.e., not yielding an increase in classification performance when combined). Note that the assumption of redundancy is supported by the fact that the three GRF components represent individual dimensions of the same three-dimensional physical process. Thus, a strong correlation is a priori given in the data.
To answer this question, we conducted additional experiments with occluded parts of the input vector and evaluated the changes in classification performance. Occlusion is realized by replacing the horizontal forces (anterior-posterior and medio-lateral) of both sides (affected and unaffected) with zero values. Table 2 shows the classification results for the experiments with occluded input signals as deviations from the mean classification accuracy of the experiments with non-occluded input signals. The results decrease on average when the horizontal forces are occluded (except for tasks HC/GD and HC/A using the CNN). Thus, relevant regions in the horizontal forces cannot be completely redundant to those in the vertical GRF and, therefore, also represent complementary information. This is in line with previous quantitative performance evaluations [67, 68]. However, the classification results of the binary classification tasks are not influenced by the occlusion of horizontal forces in a statistically significant way. This was confirmed by several dependent t-tests (p > 0.05) with Bonferroni-Holm correction [25]. Our results indicate that the relevant regions identified by LRP may represent an over-complete set, which exhibits a certain degree of redundancy, as removing relevant sections does not necessarily lead to reduced classification performance. However, redundancy is not necessarily a negative property, as it may help to achieve higher robustness to noise and possibly also to outliers and missing data [29].
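As a sketch of how such an occlusion analysis can be set up (under an assumed channel ordering and with hypothetical per-fold accuracies, not the actual values from Table 2), the horizontal force channels are zeroed and the paired accuracy differences are tested with dependent t-tests and Bonferroni-Holm correction:

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

# Assumed channel layout per trial: (n_channels, n_time) with
# channels 0-2 = affected V/AP/ML and channels 3-5 = unaffected V/AP/ML.
HORIZONTAL_CHANNELS = [1, 2, 4, 5]  # anterior-posterior and medio-lateral, both sides

def occlude_horizontal(signals: np.ndarray) -> np.ndarray:
    """Replace the horizontal GRF channels with zeros (occlusion)."""
    occluded = signals.copy()
    occluded[:, HORIZONTAL_CHANNELS, :] = 0.0
    return occluded

# Hypothetical per-fold accuracies for two binary tasks (illustration only).
acc_full = {"HC/GD": [0.81, 0.79, 0.83], "HC/K": [0.77, 0.75, 0.78]}
acc_occl = {"HC/GD": [0.81, 0.80, 0.82], "HC/K": [0.74, 0.73, 0.76]}

# Dependent (paired) t-test per task, then Bonferroni-Holm correction across tasks.
p_values = [ttest_rel(acc_full[t], acc_occl[t]).pvalue for t in acc_full]
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(dict(zip(acc_full, zip(p_corrected, reject))))
```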
Do different ML methods rely on different patterns? A comparison of the three employed classification methods is depicted in Figure 6. Across all binary classification tasks, the relevant signal regions are largely consistent for all three classification methods, especially with respect to their location. Minor differences exist in the amplitude of the relevance scores, e.g., at the beginning of the stance phase and around the second peak of the affected side (see Figure 6). The similarities between the MLP and the SVM are more pronounced. The remaining binary classification tasks, i.e., HC/H (see supplementary Figures S4, S5, and S6), HC/K (see supplementary Figures S7, S8, and S9), and HC/A (see supplementary Figures S10, S11, and S12), confirm these findings.
Although LRP clearly shows where a prediction is grounded, it cannot explain why these patterns are important. However, it allows us to identify and compare the learning strategies of different classification methods.
Can we derive additional properties of the models from the explanations, e.g., different learning strategies? Explanations provided by local XAI methods, such as LRP, inform about a model's reasoning on individual samples. A more general understanding of the model's learned patterns can be obtained by evaluating larger sets of sample-specific explanations [34]. In the previous sections, we achieved this by averaging relevance patterns across all samples of a given class. To perform a more detailed analysis that is able to identify different learning strategies of the ML models, we propose the use of SpRAy [35], as described in [5], for clinical gait data. The basic idea of this approach is to cluster the relevance patterns obtained for different samples and classes and to analyze the resulting clusters and subclusters.
SpRAy is a statistical analysis method for the explorative discovery of a model's characteristic prediction strategies from XAI-based relevance patterns. With spectral clustering [43, 47] at its core, the method discovers structure within the set of given relevance patterns and yields, among its outputs, a spectral embedding together with suggested groupings within the embedding in the form of k cluster labels. Here, the embedding directly corresponds to the individual relevance patterns, under consideration of their local, global, and potentially non-linear affinity structure. Sets of samples with similar relevance patterns are tightly grouped together in the spectral embedding space, while samples with dissimilar patterns are located far apart. Together with the suggested cluster labels, the analytically derived solution in the spectral embedding space can then be visualized in two dimensions, e.g., via a t-SNE projection [5, 39]. We implemented and evaluated SpRAy using the CoRelAy framework [4] for Python.
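The following sketch illustrates the core idea with standard scikit-learn components instead of the actual CoRelAy-based pipeline [4] used in our experiments: flattened relevance maps are spectrally embedded, clustered into k candidate strategies, and projected to two dimensions with t-SNE for inspection. All shapes, parameters, and the random relevance maps are illustrative assumptions; the sketch only mimics the overall flow.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding, TSNE
from sklearn.cluster import SpectralClustering

# Hypothetical LRP relevance maps: one flattened pattern per sample
# (e.g., 6 signal components x 100 time points = 600 values).
rng = np.random.default_rng(0)
relevance_maps = rng.normal(size=(200, 600))

k = 4  # assumed number of prediction strategies to look for

# Spectral embedding captures the (potentially non-linear) affinity structure
# of the relevance patterns; similar patterns end up close together.
embedding = SpectralEmbedding(n_components=8, affinity="nearest_neighbors",
                              n_neighbors=10).fit_transform(relevance_maps)

# Group the relevance patterns into k candidate strategy clusters.
cluster_labels = SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                                    n_neighbors=10).fit_predict(relevance_maps)

# Project the embedding to two dimensions for visual inspection (cf. Figure 7).
xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embedding)
print(xy.shape, np.bincount(cluster_labels))
```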
Figure 7 shows exemplary SpRAy results for task HC/GD (with min-max normalized GRF signals) using the CNN as classification method. Based on the clustering provided in Figures 7(C) and 7(F), we see that the relevance patterns are grouped into clusters. This indicates that the ML model learned different classification strategies. Considering the ground truth class labels (see Figure 7(D)), we see that the model's explanations for the overall gait disorder (GD) class are grouped into distinct clusters that contain samples from the individual gait disorder classes (H, K, and A), even though the model was never explicitly trained to do so in this classification task. This means that the model learned different strategies for the different pathological subclasses within GD. Considering the participant labels (see Figures 7(B) and 7(E)), we can see that the relevance patterns of the five trials of a participant are often clustered together. This means that the model learns similar strategies for the samples belonging to one participant. From a biomechanical perspective, this is plausible because each individual has unique gait patterns that differ from the gait patterns of other individuals [30]. For clinical experts, it is important to see that the model is able to reflect such patterns.
In conclusion, SpRAy demonstrates the ability of ML models to learn patterns and dependencies in the data without explicit label information. For the clinical domain, this ability is of great value, since pathologies have various manifestations (that are sometimes even beyond the expertise of a clinical expert).
6.4 Clinical Evaluation
To what extent are the input features or signal regions identified as relevant for a given gait classification task in line with clinical assessment? This question is answered in the following by two clinical experts in human gait analysis. To assist the reader in following the discussion and to facilitate the interpretation of the input signals, the domain-specific terms and gait cycle definitions are described in Figure 8. For further details on the principles of human gait and its clinical implications, the interested reader is referred to the literature, such as Perry and Burnfield [53] or Winter [79].
The explainability results for the classification of healthy controls (HC) and the aggregated class of all three gait disorders (GD) based on min-max normalized GRF signals illustrate clinically meaningful patterns (see Figure 5). High LRP relevance scores occurred during loading response, terminal stance, and pre-swing in the vertical and anterior-posterior GRF, as well as during loading response, mid-stance, terminal stance, and pre-swing in the medio-lateral GRF.
These phases are especially sensitive toward gait anomalies, as loading response requires the absorption of body weight and terminal stance plays an essential role for forward propulsion [33]. Both aspects are affected in case of gait impairments due to a diminished walking speed (requiring less absorption or push-off) as well as factors that go along with an injury, such as the presence of pain, a decreased range of motion, and/or lessened muscle strength [64, 78]. When analyzing the explainability results in more detail, one can identify specific gait dynamics that can be traced back to an impairment at a certain joint level.
For classification task HC/A (see supplementary Figure S10), we can observe pronounced peaks in the total relevance curves of the anterior-posterior and medio-lateral GRF caused by alterations in the terminal stance and pre-swing phase of the affected side. This is in agreement with the observations of Son et al. [69], who found a significantly increased propulsive force (anterior-posterior GRF in terminal stance) for patients with chronic ankle instability. They also identified an increased medio-lateral GRF during late terminal stance (push-off) compared to healthy controls, which is also in line with the relevance scores obtained in our study. Neither our explainability results nor the study of Son et al. [69] indicated any relevance or difference to healthy controls in the vertical GRF.
For classification task HC/K, high LRP relevance scores are present across several GRF components, with the most pronounced regions in the vertical GRF (see supplementary Figure S7). Changes in the vertical GRF may result from lessened knee flexibility that hinders typical knee dynamics over the entire course of the stance phase. More precisely, healthy walking requires a slightly flexed knee joint during initial contact, followed by further knee flexion thereafter, which by definition is called loading response. During the mid-stance phase, the walker's center of gravity is shifted forward and thus demands further knee extension. This is in line with the study of Cook et al. [15], who analyzed the effects of restricted knee flexion and walking speed on the vertical GRF. According to their results, the loading rate (slope during loading response), unloading rate (slope during pre-swing), and peak vertical GRF of the restricted leg showed significant speed-knee flexion restriction interactions.
The highest LRP relevance values for classification task HC/H are obtained during loading response and terminal stance in the vertical GRF of the affected side (see supplementary Figure S4). McCrory et al. [41] and Martinez-Ramirez et al. [40] identified the vertical GRF as an objective measure of gait for patients following hip arthroplasty. McCrory et al. [41] found significant differences between patients and healthy controls in several variables of the vertical GRF, such as the first and second local peaks, impulse, and stance time. They also identified that the unaffected side holds relevant information, as significant differences were found in the vertical GRF of the unaffected side either compared to the control group or to the affected side. This is also visible in our LRP relevance scores for classification task HC/H, where two distinct relevance peaks are present at the first and second peak of the vertical GRF of the affected side. These results are also in agreement with Martinez-Ramirez et al. [40], who demonstrated that patients after successful hip arthroplasty still show significantly altered GRF patterns for both the affected and unaffected leg, including a continuing asymmetry between both sides.
With regard to our second research question, we conclude that signal regions with high relevance according to LRP can be largely associated with clinical gait analysis literature and are plausible from a clinical point of view according to two domain experts.
6.5 On the Usefulness of XAI Methods for Clinical Gait Analysis
XAI methods increase transparency and can make the decision process of ML models more comprehensible for clinical experts. Transparency of state-of-the-art ML models is crucial to promote the acceptance of such systems in clinical practice, allowing clinicians to benefit from the high, and in some cases already better-than-human [16, 21, 42], classification accuracy that ML models achieve.
In the previous subsections (i.e., Sections 6.3 and 6.4), we showed that the explainability results are consistent from a statistical and a domain experts' point of view. In particular, regions of high relevance according to LRP are highly discriminatory according to SPM, and the clinical experts associated these regions with clinical explanations. Having evaluated the explainability results, we now want to address the question:
What is the added value that XAI methods can provide to clinical practice? The two experts reported that they mainly focus on regions in the vertical GRF signals during the evaluation of patients in clinical practice. In particular, the evaluation of the unaffected vertical GRF is very important for the clinicians. The main motivation for this is that many compensatory patterns manifest in this signal, i.e., as patients try to put as little weight on the affected leg as possible, they take shorter steps with the unaffected leg. This is reflected in a reduced slope of the unaffected vertical GRF during loading response.
Our explainability results show that, in addition to regions in the vertical GRF, regions in the anterior-posterior and medio-lateral GRF are also highly relevant for the classification tasks. These signals receive less consideration in clinical practice. However, the relevant regions in the anterior-posterior and medio-lateral GRF provide additional information relevant to the classification of pathological gait patterns.
Explainability approaches can lead to novel insights and a deeper understanding of the models and the underlying data, as illustrated in the following example. In the clinical evaluation of the explainability results, the experts also identified regions that are relevant for the ML models but, according to their personal expertise and the literature, are not directly related to the specific functional gait disorders. The experts assumed that, e.g., the relevant regions in the affected and unaffected vertical GRF, in particular during mid-stance, terminal stance, and pre-swing, are strongly influenced by differences in walking speed between healthy controls and patients. From this observation, the clinical experts derived the hypothesis that the trained ML models might be biased by walking speed.
Using the HC/K classification task as an example, we examined whether there is a significant difference in walking speed between HC and K. An independent samples t-test revealed a statistically significant difference in walking speed between HC and K (p < 0.001). The differences in walking speed affect the shape of the signals (although the signals were time-normalized), and the ML models could have learned these dissimilarities. To assess the influence of walking speed on the ML models, we repeated the experiment for task HC/K on a subsample of the original data. This subsample does not exhibit a statistically significant difference with respect to walking speed (independent samples t-test; p = 0.068). A comparison of the explainability results obtained for task HC/K (with min-max normalized GRF signals) using CNNs trained on the original and on the walking speed-matched data is presented in Figure 9. The results for the walking speed-matched data clearly show that most of the relevant regions according to LRP agree with the regions obtained for the original data (with only small changes in amplitude). However, relevant regions in the unaffected vertical GRF after loading response are less relevant for the model trained on walking speed-matched data. Thus, in contrast to the model trained on the original data, this model barely takes these regions into account. The conclusion that can be drawn is that these regions are related to differences in walking speed.
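A minimal sketch of such a walking speed bias check is given below: an independent samples t-test on walking speed, followed by a simple (and deliberately naive) speed-matching step that subsamples one group and re-tests the difference. The metadata columns, group sizes, and matching heuristic are assumptions for illustration only and do not reproduce the actual subsample used in our experiment.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical trial metadata: class label and walking speed (m/s) per trial.
rng = np.random.default_rng(0)
meta = pd.DataFrame({
    "label": ["HC"] * 300 + ["K"] * 200,
    "speed": np.concatenate([rng.normal(1.30, 0.15, 300),
                             rng.normal(1.10, 0.15, 200)]),
})

hc = meta[meta.label == "HC"]
k = meta[meta.label == "K"]
print("all trials:", ttest_ind(hc.speed, k.speed).pvalue)

# Naive speed matching: keep only HC trials whose walking speed lies within the
# central 80% of the K group's speed distribution, then re-test the difference.
lo, hi = k.speed.quantile([0.1, 0.9])
hc_matched = hc[hc.speed.between(lo, hi)]
print("speed-matched:", ttest_ind(hc_matched.speed, k.speed).pvalue)
```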
Using our XAI approach, we were able to show that some degree of walking speed-related bias was learned by the original models, but that this influence was not as strong as assumed by the clinical experts. Another interesting aspect of the experiment concerns the SPM results. While the trends of effect size and total relevance remain similar, the statistically significant regions are clearly reduced (compare the gray-shaded areas for both settings in Figure 9), showing the sensitivity of SPM to the alpha level.
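For reference, SPM comparisons of this kind can be computed with the open-source spm1d package; the sketch below runs a two-sample SPM{t} test on two groups of one-dimensional signals and reports the supra-threshold clusters that would appear as gray-shaded regions. The synthetic signals, group sizes, and parameter choices are assumptions and do not reproduce the analysis behind Figure 9.

```python
import numpy as np
import spm1d

# Hypothetical time-normalized vertical GRF signals: (n_subjects, n_time).
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, np.pi, 100))
grf_hc = rng.normal(0.0, 1.0, size=(30, 100)) + base
grf_k = rng.normal(0.2, 1.0, size=(30, 100)) + base

# Two-sample SPM{t} test over the whole stance phase.
t = spm1d.stats.ttest2(grf_hc, grf_k, equal_var=False)
ti = t.inference(alpha=0.05, two_tailed=True)

# Supra-threshold clusters correspond to the statistically significant regions.
print("critical threshold:", ti.zstar)
print("number of significant clusters:", len(ti.clusters))
```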
Overall, we showed that our proposed XAI approach exhibits substantial usefulness for the clinical setting, as we were able to demonstrate that: (i) regions in the signals that receive less attention in the literature and in clinical evaluation, i.e., the anterior-posterior and medio-lateral GRF, also contain informative and relevant regions that can be associated with the underlying pathology, (ii) ML models learn different strategies for different samples and patient groups (experiment with SpRAy; see Section 6.2), and (iii) XAI methods allow the identification of biases in ML models, e.g., with respect to normalization or walking speed-related differences between classes.
The increased transparency provides additional insights into the working mechanisms of the trained ML models, enabling clinicians to better understand them and to increase their level of trust [70].
6.6 Limitations and Future Work
A fundamental problem in evaluating the explainability results is the absence of a ground truth. A challenge in interpreting the explainability results is that alterations of the input signals can be caused not only by the influence of a pathology, but also by other independent parameters, e.g., a lower walking speed or an increased body mass. To minimize potential biases introduced by independent parameters on prediction explanations, future research should attempt to develop normalization procedures for input signals that compensate for such influencing factors, or classification models that inherently learn the relationship between influencing factors and input signals.
Another limiting factor is that we solely used GRF signals for classification. This does not perfectly reflect best practice in clinical gait analysis, where clinicians usually base medical decisions on a combination of GRF and 3D kinematic data [9]. The additional use of kinematic data is expected to improve the classification accuracy to a level appropriate for clinical application, in particular for multi-class classification tasks. However, 3D kinematic data are prone to several difficulties, such as inconsistencies due to inter-assessor and inter-laboratory differences [20, 60]. This makes it more difficult to create a homogeneous, large-scale, and real-world dataset compared to using simpler data, such as GRF signals. Thus, the utilized GaitRec dataset [28] provides a large-scale and easy-to-comprehend clinical example, which allows us to showcase how XAI methods can support the transparency of ML models and their predictions.
Besides visual explanations as presented in this article, a translation into human-understandable textual explanations would be desired for clinical application. An interesting direction for future research is the generation of textual explanations based on biomechanical parameters estimated from the input signals. This would enable approaches that exceed pure explainability and provide deeper interpretations for clinical experts in the form of, e.g., “there is a high probability of a pathology in the knee due to a limited knee extension during the mid-stance phase.”
We will conduct further research to compare different explanation methods and rule-based approaches [32] for different classification tasks and datasets. In addition, we want to point out that quantitative and objective methods are necessary to assess the quality of prediction explanations [57], including datasets with respective ground truth explanations.