1 Introduction

Progressive neurological disorders (e.g., Alzheimer’s disease) affect 40 million people worldwide [1] and are a common cause of death [2, 3]. However, only 25 % of affected people receive a diagnosis. There are multiple reasons for this, including stigma and a lack of awareness and resources [4, 5]. The number of older adults with Alzheimer’s disease is predicted to rise to 150 million by 2050 [6]. Consequently, detecting pre-illness stages and analyzing the progression of neurological disorders using cost-effective and efficient intelligent systems is crucial to ensure timely diagnosis, risk assessment, and early intervention [3, 7].

Numerous cognitive tests (ct, e.g., Alzheimer’s Disease Assessment Scale-Cognition - adascog, Clinical Dementia Rating - cdr, Mini-Mental State Examination - mmse, Montreal Cognitive Assessment - moca, etc.) are currently used to diagnose and monitor neurological disorders, but they need to be applied manually and are therefore costly [5, 8, 9]. These tests can be automated with the help of Artificial Intelligence (ai), enabling more frequent screening of target populations [10, 11] and the conduct of longitudinal studies. The proposal described in this work consists of an automatic, continuous evaluation of cognitive performance based on engaging dialogues established with end users. Engagement fosters evaluation over time, which is of interest to longitudinal studies. The dialogues are supported by advanced ai techniques.

The development of ai-based solutions for clinical assessment is a fast-growing research field, and numerous studies have already analyzed applications in the fields of progressive neurological disorders and dementia [12,13,14,15,16]. Machine Learning (ml) models [1, 17,18,19] (e.g., Convolutional Neural Networks - cnn [20]) and Natural Language Processing (nlp) techniques [15, 21,22,23] have been applied to textual and voice data. These techniques can effectively predict cognitive impairment, and both content-dependent and context-independent features can be leveraged to improve the predictive performance of ml models in this area.

Among the multiple bio-markers available for the study of cognitive decline [24,25,26], language data acquisition is an inexpensive, non-invasive, and readily accessible option [9]. However, language conceals large volumes of information within complex relationships that can be difficult to decipher. Large Language Models (llms) are better equipped to navigate and process such complex information and have therefore received increasing attention in the medical field [1, 27, 28]. llms have been used to analyze images and prescribe medical treatments and have also proven capable of passing medical accreditation exams [29]. However, personalized medical treatment recommendations provided by llms remain unreliable [30].

llms have broader language generation and understanding capabilities [9, 31], compared to domain-dependent models, when adequate prompt engineering is applied [32]. One noteworthy recent example of the enhanced text generation capabilities of llms is gpt-4 [33]. This llm is used by Chatgpt [34] to generate coherent dialogues with human users. However, there is still limited research on whether llms outperform niche solutions based on highly focused, context- or domain-dependent ml models, particularly in the clinical field [33]. Our proposal combines both paradigms. It employs an llm to create a friendly conversational assistant environment. The dialogue is augmented with additional services (e.g., weather reports and medication reminders) to generate interest among the end users and enable a transparent longitudinal evaluation of cognitive decline. The llm is also used to generate high-level reasoning, content-independent side features for a specialized ml model. This combined usage of an llm and an ml model for conducting explainable cognitive evaluation is an entirely novel contribution.

In general, establishing trust in ai is necessary to fight the common perception that models are black boxes, leaving end users and developers in the dark about their decision-making processes [35]. This issue is especially relevant in medical practice, particularly in personalized treatments [9]. Explainable ai (xai) [36] exploits the intrinsic interpretability of certain ml algorithms (e.g., tree-based models such as Random Forest - rf) or implements methods to bypass the opacity of non-interpretable models [37]. xai techniques include feature importance [38], counterfactual explanations [39], natural language descriptions [40], and visual representations [41]. Explainability in llms is far from being solved. Our proposal combines llms with explainable ml classifiers to generate explanations of the predictions that caregivers or healthcare experts will be able to understand.

Summing up, the main contribution of this work is a hybrid ml solution for assessing cognitive state, which combines a specialized ml classifier with an llm that generates high-level reasoning, context-independent features. These features capture the behavior of a user who dialogues with a conversational assistant on a leisure topic. Moreover, they greatly enhance the explainability of the system’s decisions about cognitive state compared to rigid content-dependent features.

The rest of this paper is organized as follows. Section 2 reviews the relevant competing works on cognitive decline detection involving llms, and Section 2.1 summarizes the contribution of this work beyond state of the art. Section 3 explains the proposed solution. Section 4 describes the experimental data set, our implementations, and the results obtained. Finally, Section 5 concludes the paper and proposes future research.

2 Related work

ai solutions for healthcare seek to enhance diagnosis and treatment while simultaneously optimizing resources [42]. One promising line in this regard is the development of intelligent assistants or chatbots [43, 44]. A representative example is the work by Kurtz et al. [45]. They presented a cognitive decline detection solution based on a voice assistant. The authors exploited lexical and semantic features, and embeddings for that purpose.

Although many nlp-based speech analysis solutions exist for the early prediction of Alzheimer’s disease [46, 47], there is limited, preliminary research on the use of llms in the field, apart from specific use cases [9] and medical applications [48].

Table 1 Comparison of related solutions taking into account the llm and ml models used (n/a: not applicable), the context of the input data (ct: cognitive tests, ehrs: electronic health records), the features involved (hlr: high-level reasoning), and explainability capability (Exp.)

llms have the ability to offer powerful nlp functionalities that can be utilized in various medical tasks [49]. They have demonstrated clinical reasoning abilities [50], passed medical licensing exams [51], provided medical advice [52], and even generated clinical notes [53].

The following prior works explore using llms for cognitive evaluation.

Yuan et al. [54] applied bert and ernie models to model disfluencies and language problems in patients with Alzheimer’s disease. The experimental data consisted of transcriptions from cognitive tests in the adress (Alzheimer’s Dementia Recognition through Spontaneous Speech) data set. However, after fine-tuning, they relied mainly on word embeddings extracted from the llms. The only side (content-independent) features considered were word frequency and speech pauses, thereby limiting the generalizability of the results when non-standard tests were used. Qiao et al. [19] extracted disfluency measures and combined them into a stacking classification approach. The absence of content- and context-independent side features makes this approach less robust than ours in settings prone to contextual changes.

Roshanzamir et al. [28] combined transformer-based deep neural network language models (bert, xlm, xlnet) with ml (Logistic Regression - lr, Long Short-Term Memory - lstm, cnn) to detect Alzheimer’s disease using image description testing. Their approach exclusively used embeddings extracted from transcripts and content-dependent features. Zhu et al. [55] screened for dementia using both non-semantic (speech pauses) and semantic features (word embeddings) extracted from cognitive test speech data by bert. As in the work by Yuan et al. [54], the solution was fine-tuned with non-semantic information. However, compared to our solution, the approaches in both [28] and [55] were highly dependent on semantic information.

Li et al. [56] estimated the ability of Llama and Chatgpt llms to detect cognitive impairment from electronic health record (ehr) notes. The prediction outcome, combined with manual assessments by experts, was used to fine-tune the bert model. However, no information is provided about the system’s performance without expert backup.

Agbavor and Liang [9] predicted dementia from standard cognitive tests driven by gpt-3. The authors both classified and predicted the severity of Alzheimer’s disease. For the classification task, they employed traditional ml classifiers (i.e., lr, Support Vector Classifier - svc, and rf) with acoustic and word embedding features. Regression analysis was then performed using the same features to predict mmse scores. Therefore, the training process was based on user interactions in standard tests. Mao et al. [5] followed a similar approach to that used in [28] but used a Linear Classifier (lc). These studies were based on clinical tests or doctors’ notes, limiting their generalizability to non-clinical data. As in previous research, they were strongly dependent on context-dependent word embeddings. Conversely, our proposal uses content-independent side features to focus on longitudinal dialogues on all topics of interest.

Fig. 1 System scheme

Instead of word embeddings extracted with llms, we propose using high-level reasoning features generated in response to questions presented to the llms, inspired by human reasoning and independent of particular conversations (e.g., the language register of the user, i.e., adult, child, elder, etc.). Wang et al. [33] followed a similar approach. However, they tasked Chatgpt with extracting high-level features from the DementiaBank data set rather than from free dialogues, as in our case. None of the previous works discussed provided explainability capabilities.

2.1 Contributions

We propose using the Chatgpt llm to extract high-level reasoning, content-independent side features from loosely driven entertainment dialogues with a chatbot (as explained in Section 4.1). We identified relevant features by analyzing previous research on cognitive function decline (e.g., changes in the content, comprehension, decreased awareness, increased distraction, memory problems, etc.) [1, 3, 8, 28, 57]. In addition to being context-independent and thus adequate for free dialogue, unlike word-embedding or content-dependent features, the side features used in our proposal support much more interpretable descriptions of the decisions on cognitive decline. As shown in Table 1, few studies have applied xai techniques to our target problem. Although Bellantuono et al. [58] did not use llms, they combined an rf classification model for dementia with SHapley Additive ExPlanations (shap) techniques to infer the contributions of the features to the model’s predictive performance. A similar approach was applied by Lombardi et al. [59], who stated that xai was still in its infancy in computational neuroscience.

Summing up, our work is the first to apply ml techniques to high-level reasoning features extracted from free-form dialogues using an llm to detect cognitive decline. The ml techniques used were Naive Bayes - nb, Decision Tree - dt, and rf, and we further describe the decisions using explainability techniques.

3 Methodology

Figure 1 illustrates the proposed solution for assessing cognitive impairment using free dialogues with older adults. The solution is composed of (i) a preprocessing module to prepare the content for further analysis, (ii) a feature engineering module that generates an appropriate and comprehensive set of features using nlp techniques to train the cognitive decline classification models, (iii) a feature analysis and selection module, and (iv) a classification module, which is evaluated using standard ml metrics (accuracy, precision, recall) and provides explainability results. Finally, the most representative features of the ml model are selected to evaluate whether they can enhance the direct cognitive impairment prediction capabilities of the Chatgpt llm by means of prompt engineering.

To ensure reproducibility, the specific methods used to implement all the steps mentioned and their configuration parameters are specified in Section 4.

3.1 Preprocessing module

The conversations used in this research are free dialogues between users and an intelligent conversational assistant [60], organized into daily sessions and automatically transcribed into text. The preprocessing module is essential for ensuring the quality of the input data in the feature engineering process involving both prompt engineering and n-gram generation. For prompt engineering, nlp techniques are applied to remove emoticons and hashtags. For n-gram generation, images, links, and special characters (e.g., &, $, €) are removed using regular expressions. Then, the textual content is tokenized and lemmatized. Because this use case is based on free dialogues, an ad hoc stopword list was created to exclude the terms ‘no’, ‘yes’ (sí), ‘more’ (más), ‘but’ (pero), ‘very’ (muy), ‘without’ (sin), ‘much’ (mucho), ‘little’ (poco), and ‘nothing’ (nada). The final step is the removal of numeric values, isolated characters, and accents.

3.2 Feature engineering module

Once the textual content of the user’s utterances is processed, the feature engineering module generates a broad set of features, detailed in Table 2. Each entry in the data set corresponds to a user dialogue session. Special attention is paid to user engagement (features 1-8), emotional state (features 9-12), language used (features 13-22), and other linguistic aspects (features 23-26), such as the use of polar responses and bad/complex words. The measurements that produce features 1-24 lie within the value range (0, 1), where 0 is the minimum and 1 the maximum (except for feature 11, whose measurements take values \(\{0,0.5,1\}\)Footnote 1). Features 25-26 are directly based on counters. Features 1-26 are calculated using an llm and prompt engineering. The measurements of features 1-24 are first computed for each machine-user utterance pair in the session dialogue. Then, each final feature is a list comprising a) the mean, maximum, minimum, and three quartile values (25 %, median, 75 %) of the corresponding measurements across all the machine-user utterance pairs of the current user session, plus the value of the measurement computed on the whole session dialogue, and b) the mean, maximum, minimum, and three quartile values of each of these statistics across the current session and all past sessions of the same user (session history). Features 25-26 are first calculated as lists of the mean, maximum, minimum, and three quartile values only for the user utterances in the session (that is, ignoring the machine utterances) and then extended with the mean, maximum, minimum, and three quartile values of each of these statistics across the session history of the same user. In other words, features 1-24 are lists of forty-nine real values (which we call components of those features), while features 25-26 are lists of forty-two real values.
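For illustration purposes only, the following minimal Python sketch shows one way in which the forty-nine components of a feature 1-24 could be assembled from the per-pair measurements; the helper names and toy values are assumptions of this sketch, not part of our implementation.

import numpy as np

def stats6(values):
    # mean, maximum, minimum and the three quartiles (25 %, median, 75 %) of a list of measurements
    v = np.asarray(values, dtype=float)
    return [v.mean(), v.max(), v.min(),
            np.percentile(v, 25), np.percentile(v, 50), np.percentile(v, 75)]

def feature_components(pair_scores, whole_dialogue_score, past_sessions):
    # pair_scores: one llm measurement per machine-user utterance pair of the current session
    # whole_dialogue_score: the same measurement computed on the whole session dialogue
    # past_sessions: list of 7-component vectors previously computed for the same user
    current = stats6(pair_scores) + [whole_dialogue_score]              # 7 components (part a)
    history = np.array(past_sessions + [current])                       # session history incl. current
    longitudinal = [s for column in history.T for s in stats6(column)]  # 7 x 6 = 42 components (part b)
    return current + longitudinal                                       # 49 components in total

# Toy example: three utterance pairs in the current session, two past sessions
print(len(feature_components([0.8, 0.6, 0.7], 0.75,
                             [[0.5, 0.9, 0.2, 0.4, 0.5, 0.7, 0.6],
                              [0.6, 0.8, 0.3, 0.5, 0.6, 0.7, 0.65]])))  # -> 49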

Content-independent features 1-26 can be used to assess cognitive deterioration based on free dialogues with the conversational assistant. Linguistic aspects (features 23-26 in Table 2) and characteristics of the language used (features 13-22) are independent of the conversation topic but allow us to measure the quality and coherence of the user responses. Exclusively for comparison purposes, these features are supplemented with content-dependent features based on char-grams and word-grams extracted from all user utterances from the current session (features 27-28). Algorithm 1 describes the complete feature engineering process.

Algorithm 1: Feature engineering.

Table 2 Features engineered to detect cognitive decline in free dialogues

3.3 Feature analysis & selection module

In the feature analysis and selection stage, irrelevant features and other features that could degrade the system’s performance are removed to ensure optimal training of the ml classifiers. We have implemented two feature analysis and selection techniques: (i) a relevance-based technique using a tree algorithm and (ii) correlation analysis with the Pearson coefficient.

A meta-transformer wrapper based on a tree-based ml model is applied to select the most significant feature components for training the classification module, regardless of the specific ml model used (thus, we follow a model-agnostic strategy [61]). The wrapper, using the Mean Decrease in Impurity (mdi) technique, leverages importance weights to identify and eliminate features based on their impurity contributions [62]. This technique averages, over all the trees in the model, the reduction in node impurity attributable to each feature, weighted by the proportion of samples reaching the corresponding nodes. Subsequently, feature components with an mdi lower than the average mdi are excluded from further consideration in the classification module.
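For illustration, a minimal sketch of this selection step using scikit-learn’s SelectFromModel with a Random Forest and the default mean-importance threshold; the toy data and parameter values are assumptions of the sketch.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Toy data standing in for the engineered feature components (one row per session)
rng = np.random.default_rng(0)
X = rng.random((523, 1260))          # 24 x 49 + 2 x 42 components (features 1-26)
y = rng.integers(0, 2, 523)          # 1: cognitive impairment, 0: otherwise

selector = SelectFromModel(RandomForestClassifier(random_state=0), threshold="mean")
X_selected = selector.fit_transform(X, y)        # keeps components whose mdi importance >= the mean mdi
kept_idx = selector.get_support(indices=True)    # indices of the retained feature components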

To select the final set of feature components for the llm prompt engineering stage (scenario 3 in the next section), we calculated the Pearson correlation coefficient [63] between each of the most relevant feature components (as selected above) and the target (cognitive decline); values of -1 and 1 indicate the strongest possible inverse and direct relationships, respectively. We then selected the components whose correlation exceeded a given threshold.

3.4 Classification module

Three different scenarios are considered:

Scenario 1 trains ml classifiers based exclusively on content-dependent textual features derived from user utterances (features 27-28 in Table 2). This scenario is intended to evaluate n-grams as a source, as in prior works [19, 21, 22], and serves as a baseline for our study.

Scenario 2 trains ml classifiers considering context-independent side features 1-26 as defined in Table 2, where the base measurements are calculated by applying an llm to user utterances, machine-user utterance pairs, and the whole dialogue, depending on the case.

Scenario 3 evaluates the performance of the llm as a classifier of cognitive impairment [33, 56] using two approaches: (a) the llm is directly used to analyze the dialogue of a user session as a whole and guess the target “cognitive decline” with prompt engineering, and (b) the same setting plus prompt enhancements based on a pick of the most relevant features from scenario 2, using the Pearson coefficient as described in the previous section.

In summary, scenario 1 is our baseline, and scenario 2 is used to evaluate our approach. Scenario 3 uses an llm exclusively to directly classify cognitive impairment and compare its performance with our proposal, in which the llm is employed to generate high-level reasoning features.

In scenarios 1 and 2, we use the meta-transformer wrapper to identify the most significant features (see Section 3.3).

In scenarios 2 and 3, we apply prompt engineering. The prompts are divided into a context section, a request, and the output format. Specific prompts are used in scenario 2 to generate the measurements for the context-independent side features from the textual input (both for features 1-24 and counter-type features 25-26). In scenario 3, two prompts directly address the cognitive impairment level. The first prompt directly applies to the whole dialogue to assess cognitive decline. The second adds a pick of the most relevant features from scenario 2, as previously explained, to the context section of the first prompt. Algorithms 2, 3, 4 and 5 respectively describe the step-by-step processes in scenarios 1, 2, 3a and 3b.

The nb, dt, and rf ml algorithms were selected based on their favorable performance in related studies in the literature [64, 65].

Algorithm 2: Scenario 1 pseudocode.

Algorithm 3: Scenario 2 pseudocode.

Algorithm 4: Scenario 3a pseudocode.

Algorithm 5: Scenario 3b pseudocode.

3.5 Explainability module

Natural language explanations about decisions on users’ cognitive state are based on the relevant feature components in scenario 2 (the components selected by the meta-transformer wrapper). The relevant features are arranged in descending order of relevance. Note that counter-type features 25-26 are normalized by the number of words. For each decision to be explained, the six components of features 1-26 with the most extreme values are employed: the three highest and the three lowest. In the case of a tie, components are chosen randomly for the explanation template.

4 Evaluation and discussion

This section provides an overview of the experimental data set, describes the implementations to facilitate their reproducibility, and presents the results achieved. The experiments were conducted on a computer with the following specifications:

  • Operating System: Ubuntu 22.10 lts 64 bits

  • Processor: intel® Core i7-12700K

  • RAM: 32 gb ddr4

  • Disk: 1 tb ssd

4.1 Experimental data set

The dialogues were collected via the Celia web applicationFootnote 2 (see Fig. 2), and transcribed to text using the Google Cloud Speech-to-Text libraryFootnote 3. Celia has been designed to entertain and accompany elderly people. A rich dialogue is achieved by combining questions generated through an llm with services such as weather, curiosities, and saint days, which vary throughout the year. This allows the free generation of a dynamic conversation on varying topics based on user preferences.

Fig. 2 Celia web application (English example)

The experimental data setFootnote 4 consists of 8220 utterances in Spanish from 523 sessions held with 42 users registered in the application. Of these users, 14 had cognitive impairment. Each user participated in an average of 12.45 sessions, with a standard deviation of ± 6.32 sessions, and each session had on average 15.72 utterances, with a standard deviation of ± 4.89 utterances. Each session had 67.95 words on average, with a standard deviation of ± 70.14 words. Sessions held with people with cognitive impairment comprised 49.55 words ± 43.74 on average. This represents a reduction of 38.41 % compared with people without this condition (80.45 words ± 81.15). Table 3 details the percentages of sessions held with users with and without cognitive impairment. As seen, the data set was rather balanced.

Table 3 Experimental data set distribution

4.2 Preprocessing module

Hashtags, images, links, and special characters were identified using the regular expressions in Listing 1 and removed. Emoticons and accents were deleted using the nfkd Unicode normalization formFootnote 5. The nltk stop word list was used to delete stop wordsFootnote 6. Finally, textual content was lemmatized using the es_core_news_md core model from the spaCy Python libraryFootnote 7.

[Listing 1 (preprocessing regular expressions) appears here]
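For illustration, the following minimal sketch reproduces the preprocessing path for n-gram generation. The regular expressions are simplified stand-ins for those of Listing 1, adding the ad hoc terms to the stop word list is one possible reading of Section 3.1, and the sketch assumes that the nltk stopwords corpus and the es_core_news_md model have been downloaded.

import re
import unicodedata
import spacy
from nltk.corpus import stopwords

nlp = spacy.load("es_core_news_md")
AD_HOC = {"no", "si", "mas", "pero", "muy", "sin", "mucho", "poco", "nada"}  # terms listed in Section 3.1
STOPWORDS = set(stopwords.words("spanish")) | AD_HOC

def preprocess(utterance):
    # Remove links and special characters (illustrative patterns, not the exact ones in Listing 1)
    text = re.sub(r"https?://\S+|[#&$€@]", " ", utterance)
    # Delete emoticons and accents through NFKD normalization
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Tokenize, lemmatize, and drop stop words, numbers and isolated characters
    doc = nlp(text.lower())
    return [t.lemma_ for t in doc
            if t.is_alpha and len(t.text) > 1 and t.text not in STOPWORDS]

print(preprocess("Hoy fui al médico a las 10, ¡qué día! 😊"))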

4.3 Feature engineering module

Side features (1-26 in Table 2) were obtained with the gpt-3.5-turboFootnote 8 llm, selected for its performance and cost-effectiveness ($0.0005/1,000 tokens for prompt text input and $0.0015/1,000 tokens for model text output at the time this paper was written).

Listings 2 and 3 respectively illustrate the prompt used for feature engineering and the measurements obtained from the llm for features 1-24 in scenario 2 for the particular machine-user utterance pair example at the bottom of Listing 2. The same prompt was used for the whole dialogue (for features 1-24). Listings 4 and 5 respectively show the prompt and the results obtained for features 25-26, for the user utterance at the bottom of Listing 4.
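For illustration, the following minimal sketch shows how such a prompt could be submitted for each machine-user utterance pair with the openai Python package (v1 interface); the context and request wording, the aspect names, and the JSON output convention are illustrative assumptions, not the exact text of Listing 2.

import json
from openai import OpenAI

client = OpenAI()  # assumes an api key available in the environment

CONTEXT = ("You are an evaluator of dialogues between an assistant and an older adult. "
           "Score each requested aspect of the user's reply between 0 and 1.")
REQUEST = ("Score the following aspects and return a JSON object: "
           "engagement, coherence, distraction, memory_problems.")

def measure_pair(machine_utterance, user_utterance):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic scoring
        messages=[{"role": "system", "content": CONTEXT},
                  {"role": "user", "content": f"{REQUEST}\n"
                                              f"Machine: {machine_utterance}\nUser: {user_utterance}"}])
    return json.loads(response.choices[0].message.content)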

Listings 6 and 7 respectively show the prompt for the baseline in scenario 3 (a) and an example of the corresponding response by the llm. Finally, Listing 8 shows the prompt designed for the enhanced scenario 3 (b), including the most relevant features selected from scenario 2 as described in Section 3.3. The ultimate goal is to improve the ability of the llm to detect cognitive decline directly.

Content-dependent features 27-28 in Table 2 were generated with the CountVectorizer classFootnote 9 from the scikit-learn Python library. The optimal parameters in Listings 9 and 10 were calculated using the GridSearchCVFootnote 10 function from the scikit-learn Python library. A total of 1282 features were obtained.
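For illustration, a minimal sketch of how the char-gram and word-gram features can be built and concatenated; the n-gram ranges and toy session documents are placeholders, not the values tuned with GridSearchCV in Listings 9 and 10.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# One document per session: the preprocessed user utterances joined together (toy examples here)
session_texts = ["hoy hacer sol pasear parque", "recordar tomar medicacion tarde"]

vectorizer = FeatureUnion([
    ("char_grams", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))),   # feature 27
    ("word_grams", CountVectorizer(analyzer="word", ngram_range=(1, 2))),      # feature 28
])
X_ngrams = vectorizer.fit_transform(session_texts)   # sparse document-term matrix
print(X_ngrams.shape)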

[Listings 2-10 appear here]

Table 4 Pearson correlation results

4.4 Feature analysis & selection module

In scenarios 1 and 2, features were selected with the SelectFromModelFootnote 11 transformer wrapper using rf, with default parameter settings. The result of the selection was:

  • Scenario 1: 435 char-gram and word-gram features.

  • Scenario 2: 262 feature components comparing statistics from the current session with those of other sessions plus 7 feature components corresponding to statistics from the current session.

The Pearson correlation coefficientFootnote 12 was employed in scenario 3 to further select, from the features retained in scenario 2, those with a stronger direct or inverse relationship with the target. Only features with a correlation over 0.45 were selected based on empirical tests (see Table 4). All the selected features were components comparing statistics from the current session with those from other sessions.
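For illustration, a minimal sketch of this correlation filter; treating direct and inverse relationships symmetrically through the absolute value is our reading of the selection criterion, and the toy data are invented.

import numpy as np
from scipy.stats import pearsonr

def pearson_filter(X_selected, y, threshold=0.45):
    # Keep components whose absolute Pearson correlation with the target exceeds the threshold
    keep = []
    for j in range(X_selected.shape[1]):
        r, _ = pearsonr(X_selected[:, j], y)
        if abs(r) > threshold:
            keep.append(j)
    return keep

# Toy example: only the second component correlates with the labels and is retained
X_toy = np.array([[0.1, 0.9], [0.4, 0.8], [0.2, 0.1], [0.3, 0.2]])
print(pearson_filter(X_toy, np.array([1, 1, 0, 0])))   # -> [1]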

4.5 Classification module

The implementations of the ml algorithms were:

  • nb. Gaussian Naive BayesFootnote 13 from the scikit-learn Python library.

  • dt. DecisionTreeClassifierFootnote 14 from the scikit-learn Python library.

  • rf. RandomForestClassifierFootnote 15 from the scikit-learn Python library.

For each implementation, optimal hyperparameters were tuned with the aforementioned GridSearchCV method using 10-fold cross-validation. Listings 11-13 show the ranges of values explored for scenarios 1 and 2. The final parameters selected were the following:

Scenario 1

  • DT: splitter=random, class_weight=None, max_features=log2, max_depth=100, min_samples_split=0.1, min_samples_leaf=0.001

  • NB: var_smoothing=1e-9

  • RF: n_estimators=75, class_weight=None, max_features=log2, max_depth=100, min_samples_split=5, min_samples_leaf=1

Scenario 2

  • DT: splitter=best, class_weight=None, max_features=None, max_depth=100, min_samples_split=0.001, min_samples_leaf=0.001

  • NB: var_smoothing=1e-6

  • RF: n_estimators=50, class_weight=balanced, max_features=log2, max_depth=10, min_samples_split=2, min_samples_leaf=1

Table 5 Classification results for scenario 1 (Yes: cognitive impairment, No: otherwise). Best values are marked in bold
[Listings 11-13 appear here]

Tables 5 and 6 respectively present the results of the performance evaluation for scenarios 1 and 2 with 10-fold cross-validation, implemented with the scikit-learn Python libraryFootnote 16. This evaluation method divides the data set into 10 folds (9 for training and 1 for testing). This split is repeated 10 times using different partitions without overlapping testing sets to minimize evaluation bias. The final results are an average of results for the different splits. Our experiments used 471 and 52 samples to train and test the models in each split.
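For illustration, a minimal sketch of the evaluation loop for the rf model in scenario 2 with the hyperparameters reported above; the toy data, stratification, and shuffling are assumptions of the sketch.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Toy stand-ins for the selected scenario 2 components (262 + 7, see Section 4.4) and the session labels
rng = np.random.default_rng(0)
X_selected = rng.random((523, 269))
y = rng.integers(0, 2, 523)

rf = RandomForestClassifier(n_estimators=50, class_weight="balanced", max_features="log2",
                            max_depth=10, min_samples_split=2, min_samples_leaf=1)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # 10 folds, as in our evaluation
scores = cross_validate(rf, X_selected, y, cv=cv, scoring=("accuracy", "precision", "recall"))
print({metric: round(values.mean(), 4) for metric, values in scores.items() if metric.startswith("test_")})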

Table 6 Classification results for scenario 2 (Yes: cognitive impairment, No: otherwise). Best values are marked in bold

In scenarios 1 and 2, the best accuracy levels were achieved by the rf model, 76.67 % and 98.47 %, respectively. In all cases, scenario 2 outperformed scenario 1. This is interesting for free dialogues from a practical perspective since scenario 2 is context-independent.

The results for scenario 3 were significantly worse than those obtained in scenario 2. The baseline approach based only on Chatgpt directly applied to free dialogues only attained 57.17 % accuracy. However, when knowledge about context-independent high-level reasoning features was transferred to the llm by prompt engineering, the accuracy rate rose to 61.19 %. We concluded that pre-trained models cannot replace specialized models but are extremely useful for producing valuable context-independent features.

In terms of related solutions in the literature, our proposal was more accurate (+26.19 %) than the approach described by Agbavor and Liang [9]. Moreover, it was associated with respective improvements of +16.33 % and +13.65 % for the detection of absent and existing cognitive impairment compared to the solution described by Qiao et al. [19]. On comparing our findings with those of Yuan et al. [54], our proposal was 3.19 % and 14.48 % more accurate for detecting absent and existing cognitive impairment classes, respectively. The corresponding improvements over the proposal of Zhu et al. [55] were +10.01 % and +20.64 %. Finally, our proposal achieved better results than Wang et al. [33] (+6.67 % accuracy), the most closely related work, when different types of content (chatbot-human dialogues and doctor-patient dialogues) were analyzed. Nevertheless, when Chatgpt rather than hlr features derived from doctor-patient dialogues were used as the content source, the improvement grew to +37.28 %.

Our system’s mean token consumption was calculated as follows. In scenario 2 of our analysis, we used 1400 and 498 characters for the prompts in Listings 2 and 4, respectively. On average, the complete dialogue by the conversational assistant with a user comprised 15.72 utterances with 909 characters. Also, on average, each artificially generated utterance had 57.82 characters. The generated outputs approximately consumed 483 characters for the first prompt (see Listing 3) and 147 characters for the second (see Listing 5). Therefore, the average total numbers of input and output characters were 33,325.19 (1,400 + 909 + (1,400 + 57.82) \(\times \) 15.72 + (498 + 17.22) \(\times \) 15.72) and 10,386.60 (483 \(\times \) 16.72 + 147 \(\times \) 15.72), respectively. Since the character/token ratio in Open aiFootnote 17 is approximately 4/1, the cost per user session was moderate (i.e., $0.0081). It did not compromise the scalability of the study nor its possible transfer to industry.
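The same computation expressed in a few lines of Python, using only the figures given in the preceding text:

pairs = 15.72                                            # machine-user utterance pairs per session
input_chars = 1400 + 909 + (1400 + 57.82) * pairs + (498 + 17.22) * pairs
output_chars = 483 * (pairs + 1) + 147 * pairs           # +1 call for the whole session dialogue
input_tokens, output_tokens = input_chars / 4, output_chars / 4   # ~4 characters per token
cost = input_tokens * 0.0005 / 1000 + output_tokens * 0.0015 / 1000
print(round(input_chars, 2), round(output_chars, 2), round(cost, 4))   # 33325.19 10386.6 0.0081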

Fig. 3 Celia application with explainability capabilities

4.6 Explainability module

To explain the model’s predictions, we used the template in Listing 14. This template details the most relevant <features> used to classify a <user> from a given <conversation> (a dialogue session in our case). The features used in this template are the six relevant features selected using the meta-transformer wrapper, comprising several components as described in Section 3.2. Only the components with the highest and lowest values were considered for explainability, after discarding the minimum and maximum statistics. The values of these components were sorted, and the three largest and the three smallest were used.
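For illustration, a minimal sketch of how the template slots could be filled; the component names and values are invented for the example, and ties are broken by insertion order here rather than randomly as in our implementation.

def explanation_slots(components, k=3):
    # components: dict mapping a relevant feature component to its (normalized) value,
    # with the minimum and maximum statistics already discarded
    ranked = sorted(components.items(), key=lambda item: item[1])
    return ranked[-k:][::-1], ranked[:k]       # three largest and three smallest values

highest, lowest = explanation_slots({
    "relaxed and natural (median, history)": 0.9,
    "adult register (mean, history)": 0.85,
    "engagement (mean, session)": 0.8,
    "distraction (median, history)": 0.2,
    "concise answers (q25, history)": 0.1,
    "memory problems (mean, history)": 0.05})
print("highest:", highest)    # e.g., fills the 'most present' slots of the explanation template
print("lowest:", lowest)      # e.g., fills the 'least present' slots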

[Listing 14 (explanation template) appears here]

Listing 15 and Fig. 3 show a real-life example of a generated explanation for a classified conversation. In this case, the system did not detect cognitive impairment because, among other reasons, the user was relaxed and natural, used an adult register, and did not respond concisely (recall that the explainability template only uses the most relevant features to classify a particular individual). This explanation in natural language is descriptive and understandable, even without expert knowledge in healthcare or ml.

[Listing 15 (example of a generated explanation) appears here]

5 Conclusions

This work is the first to apply ml techniques to high-level reasoning features extracted using llms from free dialogues to detect cognitive decline with the Celia entertainment chatbot. High-level reasoning features are context-independent and are, therefore, more widely applicable for characterizing free dialogues than word-embedding or content-dependent features. They also support significantly more interpretable descriptions of the decisions using explainability techniques. The essential advantage of free dialogues with engaging systems is that they may encourage end users to participate in longitudinal studies, which are currently scarcely feasible in healthcare systems due to the high cost of manual screening.

A mixed approach combining a specialized ml model with feature extraction using the Chatgpt llm outperformed the direct evaluation of cognitive decline by Chatgpt, even when prompt engineering was used to boost Chatgpt with the best features of the ml model. The performance levels achieved are remarkable.

In future work, we plan to train our models using a streaming mode that incorporates the session history. We will also employ new methodologies (e.g., reinforcement learning) to study different motivational topics and analyze how new features influence prediction outcomes. Note that we only use an llm to obtain precise answers on the semantic relationships between different pieces of text. Thus, our approach is less affected by certain issues of generative llms (e.g., context and memory management, layer pruning, hallucination issues, etc.) than the direct application of an llm in scenario 3. However, in future work, we plan to analyze the sensitivity of our approach to these issues.