Abstract
Cognitive and neurological impairments are very common, but only a small proportion of affected individuals are diagnosed and treated, partly because of the high costs associated with frequent screening. Detecting pre-illness stages and analyzing the progression of neurological disorders through effective and efficient intelligent systems can be beneficial for timely diagnosis and early intervention. We propose using Large Language Models to extract features from free dialogues to detect cognitive decline. These features are high-level reasoning, content-independent features (such as comprehension, decreased awareness, increased distraction, and memory problems). Our solution comprises (i) preprocessing, (ii) feature engineering via Natural Language Processing techniques and prompt engineering, (iii) feature analysis and selection to optimize performance, and (iv) classification, supported by automatic explainability. We also explore how to improve ChatGPT's direct cognitive impairment prediction capabilities using the best features in our models. The evaluation metrics obtained endorse the effectiveness of a mixed approach that combines feature extraction with ChatGPT and a specialized Machine Learning model to detect cognitive decline within free-form conversational dialogues with older adults. Ultimately, our work may facilitate the development of an inexpensive, non-invasive, and rapid means of detecting and explaining cognitive decline.
1 Introduction
Progressive neurological disorders (e.g., Alzheimer’s disease) affect 40 million people worldwide [1] and are a common cause of death [2, 3]. However, only 25 % of affected people receive a diagnosis. There are multiple reasons for this, including stigma and a lack of awareness and resources [4, 5]. The number of older adults with Alzheimer’s disease is predicted to rise to 150 million by 2050 [6]. Consequently, detecting pre-illness stages and analyzing the progression of neurological disorders using cost-effective and efficient intelligent systems is crucial to ensure timely diagnosis, risk assessment, and early intervention [3, 7].
Numerous cognitive tests (CTs; e.g., the Alzheimer's Disease Assessment Scale-Cognition - ADAS-Cog, the Clinical Dementia Rating - CDR, the Mini-Mental State Examination - MMSE, and the Montreal Cognitive Assessment - MoCA) are currently used to diagnose and monitor neurological disorders, but they must be administered manually and are therefore costly [5, 8, 9]. These tests can be automated with the help of Artificial Intelligence (AI), enabling more frequent screening of target populations [10, 11] and the conduct of longitudinal studies. The proposal described in this work consists of an automatic, continuous evaluation of cognitive performance based on engaging dialogues established with end users. Engagement fosters evaluation over time, which is of interest to longitudinal studies. The dialogues are supported by advanced AI techniques.
AI-based solutions for clinical assessment are a fast-growing research field, and numerous studies have already analyzed applications in the fields of progressive neurological disorders and dementia [12,13,14,15,16]. Machine Learning (ML) models [1, 17,18,19] (e.g., Convolutional Neural Networks - CNNs [20]) and Natural Language Processing (NLP) techniques [15, 21,22,23] have been applied to textual and voice data. These techniques can effectively predict cognitive impairment, and both content-dependent and context-independent features can be leveraged to improve the predictive performance of ML models in this area.
Among the multiple biomarkers available for the study of cognitive decline [24,25,26], the acquisition of language data is an inexpensive, non-invasive, and readily accessible option [9]. However, language conceals large volumes of information within complex relationships that can be difficult to decipher. Large Language Models (LLMs) are better equipped to navigate and process such complex information and have therefore received increasing attention in the medical field [1, 27, 28]. LLMs have been used to analyze images and prescribe medical treatments, and have also proven capable of passing medical accreditation exams [29]. However, personalized medical treatment recommendations provided by LLMs remain unreliable [30].
LLMs have broader language generation and understanding capabilities than domain-dependent models [9, 31] when adequate prompt engineering is applied [32]. One noteworthy recent example of LLM-enhanced text generation capabilities is GPT-4 [33], the LLM used by ChatGPT [34] to generate coherent dialogues with human users. However, there is still limited research on whether LLMs outperform niche solutions based on highly focused, context- or domain-dependent ML models, particularly in the clinical field [33]. Our solution combines both: it employs an LLM to create a friendly conversational assistant environment. The dialogue is augmented with additional services (e.g., weather reports and medication reminders) to generate interest among end users and enable a transparent longitudinal evaluation of cognitive decline. The LLM is also used to generate high-level reasoning, content-independent side features for a specialized ML model. This combined use of an LLM and an ML model for conducting explainable cognitive evaluation is an entirely novel contribution.
In general, establishing trust in AI is necessary to fight the common perception that models are black boxes, leaving end users and developers in the dark about their decision-making processes [35]. This issue is especially relevant in medical practice, particularly in personalized treatments [9]. Explainable AI (XAI) [36] exploits the intrinsic interpretability of certain ML algorithms (e.g., tree-based models such as Random Forest - RF) or implements methods to bypass the opacity of non-interpretable models [37]. XAI techniques include feature importance [38], counterfactual explanations [39], natural language descriptions [40], and visual representations [41]. Explainability in LLMs is far from solved. Our proposal combines LLMs with explainable ML classifiers to generate explanations of the predictions that caregivers and healthcare experts can understand.
Summing up, the main contribution of this work is a hybrid ML solution for assessing cognitive state, which combines context-dependent features with an LLM that generates high-level reasoning, context-independent features. The latter capture high-level reasoning about the behavior of a user who dialogues with a conversational assistant on a leisure topic. Moreover, these features greatly enhance the explainability of the system's decisions about cognitive state compared to rigid content-dependent features.
The rest of this paper is organized as follows. Section 2 reviews the relevant competing works on cognitive decline detection involving LLMs, and Section 2.1 summarizes the contribution of this work beyond the state of the art. Section 3 explains the proposed solution. Section 4 describes the experimental data set, our implementations, and the results obtained. Finally, Section 5 concludes the paper and proposes future research.
2 Related work
AI solutions for healthcare seek to enhance diagnosis and treatment while simultaneously optimizing resources [42]. One promising line in this regard is the development of intelligent assistants or chatbots [43, 44]. A representative example is the work by Kurtz et al. [45], who presented a cognitive decline detection solution based on a voice assistant, exploiting lexical and semantic features as well as embeddings for that purpose.
Although many NLP-based speech analysis solutions exist for the early prediction of Alzheimer's disease [46, 47], there is only limited, preliminary research on the use of LLMs in the field, apart from specific use cases [9] and medical applications [48].
LLMs offer powerful NLP functionalities that can be utilized in various medical tasks [49]. They have demonstrated clinical reasoning abilities [50], passed medical licensing exams [51], provided medical advice [52], and even generated clinical notes [53].
The following prior works explore the use of LLMs for cognitive evaluation.
Yuan et al. [54] applied BERT and ERNIE models to model disfluencies and language problems in patients with Alzheimer's disease. The experimental data consisted of transcriptions from cognitive tests in the ADReSS (Alzheimer's Dementia Recognition through Spontaneous Speech) data set. However, after fine-tuning, they relied mainly on word embeddings extracted from the LLMs. The only side (content-independent) features considered were word frequency and speech pauses, limiting the generalizability of the results when non-standard tests are used. Qiao et al. [19] extracted disfluency measures and combined them in a stacking classification approach. The absence of content- and context-independent side features makes this approach less robust than ours in settings prone to contextual changes.
Roshanzamir et al. [28] combined transformer-based deep neural network language models (BERT, XLM, XLNet) with ML (Logistic Regression - LR, Long Short-Term Memory - LSTM, CNN) to detect Alzheimer's disease using image description testing. Their approach exclusively used embeddings extracted from transcripts and content-dependent features. Zhu et al. [55] screened for dementia using both non-semantic (speech pauses) and semantic features (word embeddings) extracted from cognitive test speech data by BERT. As in the work by Yuan et al. [54], the solution was fine-tuned with non-semantic information. However, compared to our solution, the approaches in both [28] and [55] were highly dependent on semantic information.
Li et al. [56] estimated the ability of the Llama and ChatGPT LLMs to detect cognitive impairment from electronic health record (EHR) notes. The prediction outcome, combined with manual assessments by experts, was used to fine-tune a BERT model. However, no information is provided about the system's performance without expert backup.
Agbavor and Liang [9] predicted dementia from standard cognitive tests driven by GPT-3. The authors both classified and predicted the severity of Alzheimer's disease. For the classification task, they employed traditional ML classifiers (i.e., LR, Support Vector Classifier - SVC, and RF) with acoustic and word embedding features. Regression analysis was then performed using the same features to predict MMSE scores. The training process was thus based on user interactions in standard tests. Mao et al. [5] followed a similar approach to that used in [28] but employed a Linear Classifier (LC). These studies were based on clinical tests or doctors' notes, which limits their generalizability to non-clinical data. As in previous research, they were strongly dependent on context-dependent word embeddings. Conversely, our proposal uses content-independent side features, so it can focus on longitudinal dialogues on any topic of interest.
Instead of word embeddings extracted with LLMs, we propose using high-level reasoning features generated in response to questions presented to the LLMs, inspired by human reasoning and independent of the particulars of each conversation (e.g., the language register of the user: adult, child, elder, etc.). Wang et al. [33] followed a similar approach; however, they tasked ChatGPT with extracting high-level features from the DementiaBank data set instead of from free dialogues, as in our case. None of the previous works discussed provided explainability capabilities.
2.1 Contributions
We propose using the ChatGPT LLM to extract high-level reasoning, content-independent side features from loosely driven entertainment dialogues with a chatbot (as explained in Section 4.1). We identified relevant features by analyzing previous research on cognitive function decline (e.g., changes in content, comprehension, decreased awareness, increased distraction, memory problems, etc.) [1, 3, 8, 28, 57]. In addition to being context-independent and thus adequate for free dialogue, unlike word-embedding or content-dependent features, the side features used in our proposal support much more interpretable descriptions of the decisions on cognitive decline. As shown in Table 1, few studies have applied XAI techniques to our target problem. Although Bellantuono et al. [58] did not use LLMs, they combined an RF classification model for dementia with SHapley Additive exPlanations (SHAP) techniques to infer the contributions of the features to the model's predictive performance. A similar approach was applied by Lombardi et al. [59], who stated that XAI is still in its infancy in computational neuroscience.
Summing up, our work is the first to apply ML techniques to high-level reasoning features extracted from free-form dialogues using an LLM to detect cognitive decline. The ML techniques used were Naive Bayes (NB), Decision Tree (DT), and RF, and we further describe the decisions using explainability techniques.
3 Methodology
Figure 1 illustrates the proposed solution for assessing cognitive impairment using free dialogues with older adults. The solution is composed of (i) a preprocessing module that prepares the content for further analysis, (ii) a feature engineering module that generates an appropriate and comprehensive set of features using NLP techniques to train the cognitive decline classification models, (iii) a feature analysis and selection module, and (iv) a classification module, which is evaluated using standard ML metrics (accuracy, precision, recall) and provides explainability results. Finally, the most representative features of the ML model are selected to evaluate whether they can enhance the direct cognitive impairment prediction capabilities of the ChatGPT LLM by means of prompt engineering.
To ensure reproducibility, the specific methods used to implement all the steps mentioned and their configuration parameters are specified in Section 4.
3.1 Preprocessing module
The conversations used in this research are free dialogues between users and an intelligent conversational assistant [60], organized into daily sessions and automatically transcribed into text. The preprocessing module is essential for ensuring the quality of the input data in the feature engineering process, which involves both prompt engineering and n-gram generation. For prompt engineering, NLP techniques are applied to remove emoticons and hashtags. For n-gram generation, images, links, and special characters (e.g., &, $, €) are removed using regular expressions. Then, the textual content is tokenized and lemmatized. Because this use case is based on free dialogues, an ad hoc stopword list was created to exclude the terms 'no', 'yes' (sí), 'more' (más), 'but' (pero), 'very' (muy), 'without' (sin), 'much' (mucho), 'little' (poco), and 'nothing' (nada). The final step is the removal of numeric values, isolated characters, and accents.
3.2 Feature engineering module
Once the textual content of the user's utterances is processed, the feature engineering module generates a broad set of features, detailed in Table 2. Each entry in the data set corresponds to a user dialogue session. Special attention is paid to user engagement (features 1-8), emotional state (features 9-12), language used (features 13-22), and other linguistic aspects (features 23-26), such as the use of polar responses and bad/complex words. The measurements that produce features 1-24 lie within the value range (0, 1), where 0 is the minimum and 1 the maximum (except for feature 11, whose measurements take values \(\{0,0.5,1\}\)Footnote 1). Features 25-26 are directly based on counters. Features 1-26 are calculated using an LLM and prompt engineering.

The measurements of features 1-24 are first computed for each machine-user utterance pair in the session dialogue. Each final feature is then a list comprising (a) the mean, maximum, minimum, and three quartile values (25 %, median, 75 %) of the corresponding measurements across all the machine-user utterance pairs of the current user session, plus the whole dialogue in the session, and (b) the mean, maximum, minimum, and three quartile values of these same statistics over the current session and all past sessions of the same user (session history). Features 25-26 are first calculated as lists of the mean, maximum, minimum, and three quartile values only for the user utterances in the session (that is, ignoring the machine utterances) and then computed again as the mean, maximum, minimum, and three quartile values of these same statistics over the session history of the same user. In other words, features 1-24 are lists of forty-nine real values (which we call components of those features), while features 25-26 are lists of forty-two real values.
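For illustration, the following minimal Python sketch (with hypothetical helper names; the actual procedure is given in Algorithm 1) shows how the forty-nine components of one of features 1-24 could be assembled from the per-pair measurements returned by the LLM:

```python
import numpy as np

def summarize(values):
    """The six summary statistics used throughout: mean, max, min, quartiles."""
    return [np.mean(values), np.max(values), np.min(values),
            np.percentile(values, 25), np.median(values), np.percentile(values, 75)]

def feature_components(pair_scores, dialogue_score, history):
    """Build the 49 components of one of features 1-24 for a session.

    pair_scores    -- LLM measurements for each machine-user utterance pair
    dialogue_score -- LLM measurement for the whole session dialogue
    history        -- the 7-value session statistic lists of past sessions
    """
    session_stats = summarize(pair_scores) + [dialogue_score]   # 7 values
    # For each of the 7 session statistics, summarize its evolution over the
    # current and all past sessions of the same user: 7 x 6 = 42 values.
    trajectories = [[s[i] for s in history + [session_stats]] for i in range(7)]
    history_stats = [v for t in trajectories for v in summarize(t)]
    return session_stats + history_stats                        # 49 components
```

The same scheme, without the whole-dialogue value, yields the forty-two components of features 25-26.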
Content-independent features 1-26 can be used to assess cognitive deterioration based on free dialogues with the conversational assistant. Linguistic aspects (features 23-26 in Table 2) and characteristics of the language used (features 13-22) are independent of the conversation topic but allow us to measure the quality and coherence of the user responses. Exclusively for comparison purposes, these features are supplemented with content-dependent features based on char-grams and word-grams extracted from all user utterances in the current session (features 27-28). Algorithm 1 describes the complete feature engineering process.
3.3 Feature analysis & selection module
In the feature analysis and selection stage, irrelevant features and other features that could degrade the system's performance are removed to ensure optimal training of the ML classifiers. We implemented two feature analysis and selection techniques: (i) a relevance-based technique using a tree algorithm and (ii) correlation analysis with the Pearson coefficient.
A meta-transformer wrapper based on a tree-based ML model is applied to select the most significant feature components for training the classification module, regardless of the specific ML model used (thus, we follow a model-agnostic strategy [61]). The wrapper, using the Mean Decrease in Impurity (MDI) technique, leverages importance weights to identify and eliminate features based on their impurity contributions [62]. This technique calculates the average proportion of sample splits involving each feature at each node across all the trees of the tree-based ML model. Feature components with an MDI lower than the average MDI are then excluded from further consideration in the classification module.
To select the final set of feature components for the LLM prompt engineering stage (scenario 3 in the next section), we calculated the Pearson correlation coefficient [63] between each of the most relevant feature components (selected as described above) and the target (cognitive decline); coefficients of -1 and 1 respectively indicate the strongest possible inverse and direct relationships. We then selected the components that exceeded a correlation threshold.
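Both selection steps can be sketched with the scikit-learn primitives used in Section 4.4 (SelectFromModel and r_regression); X and y below are assumed to hold the feature component matrix and the cognitive decline labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, r_regression

# (i) Relevance-based selection: SelectFromModel keeps the components whose
# MDI importance exceeds the mean importance across all components.
selector = SelectFromModel(RandomForestClassifier(random_state=0), threshold="mean")
X_relevant = selector.fit_transform(X, y)

# (ii) Correlation-based selection for the scenario 3 prompts: keep the
# components whose Pearson correlation with the target exceeds the threshold.
correlations = r_regression(X_relevant, y)
X_prompt = X_relevant[:, np.abs(correlations) > 0.45]  # 0.45: empirical threshold
```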
3.4 Classification module
Three different scenarios are considered:
Scenario 1 trains an ML classifier based exclusively on content-dependent textual features derived from user utterances (features 27-28 in Table 2). This scenario is intended to evaluate n-grams as a source, as in prior works [19, 21, 22], and serves as the baseline for our study.
Scenario 2 trains an ML classifier considering the context-independent side features 1-26 defined in Table 2, where the base measurements are calculated by applying an LLM to the user utterances, the machine-user utterance pairs, and the whole dialogue, depending on the case.
Scenario 3 evaluates the performance of the LLM as a classifier of cognitive impairment [33, 56] using two approaches: (a) the LLM is directly used to analyze the dialogue of a user session as a whole and guess the target "cognitive decline" through prompt engineering, and (b) the same setting plus prompt enhancements based on a selection of the most relevant features from scenario 2, using the Pearson coefficient as described in the previous section.
In summary, scenario 1 is our baseline, and scenario 2 evaluates our approach. Scenario 3 uses an LLM exclusively to classify cognitive impairment directly, so that its performance can be compared with our proposal, in which the LLM is employed to generate high-level reasoning features.
In scenarios 1 and 2, we use the meta-transformer wrapper to identify the most significant features (see Section 3.3).
In scenarios 2 and 3, we apply prompt engineering. The prompts are divided into a context section, a request, and the output format. Specific prompts are used in scenario 2 to generate the measurements for the context-independent side features from the textual input (both for features 1-24 and for counter-type features 25-26). In scenario 3, two prompts directly address the cognitive impairment level: the first is applied to the whole dialogue to assess cognitive decline, and the second adds a selection of the most relevant features from scenario 2, as previously explained, to the context section of the first prompt. Algorithms 2, 3, 4 and 5 respectively describe the step-by-step processes in scenarios 1, 2, 3a, and 3b.
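As an illustration of this three-part prompt structure, the following sketch issues one measurement request through the OpenAI Python client; the prompt wording and the measure helper are hypothetical simplifications (the actual prompts are reproduced in Listings 2-8):

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Illustrative prompt with the three sections described above.
PROMPT = (
    # context
    "You are an assistant analyzing a dialogue between a conversational "
    "agent and an older adult.\n"
    # request
    "Rate the user's comprehension, distraction and memory problems.\n"
    # output format
    'Answer as JSON with values in [0, 1], e.g. {"comprehension": 0.8}.'
)

def measure(utterance_pair: str) -> str:
    """Return the LLM measurements for one machine-user utterance pair."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic output is preferable for measurements
        messages=[{"role": "system", "content": PROMPT},
                  {"role": "user", "content": utterance_pair}],
    )
    return response.choices[0].message.content
```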
The NB, DT, and RF ML algorithms were selected based on their favorable performance in related studies in the literature [64, 65].
3.5 Explainability module
Natural language explanations of the decisions on users' cognitive state are based on the relevant feature components in scenario 2 (the components selected by the meta-transformer wrapper). The relevant features are arranged in descending order of relevance. Note that counter-type features 25-26 are normalized by the number of words. For each decision to be explained, the six components of features 1-26 with the highest and lowest values, three of each type, are employed. In the case of a tie, components are chosen randomly for the explanation template.
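A minimal sketch of this selection logic follows (hypothetical names; the random tie-breaking described above is not handled here):

```python
def explain(components: dict[str, float], template: str,
            user: str, conversation: str) -> str:
    """Fill the explanation template with the six most extreme relevant
    feature components: the three highest and the three lowest values."""
    ranked = sorted(components.items(), key=lambda kv: kv[1])
    chosen = ranked[-3:] + ranked[:3]  # three highest, then three lowest
    features = ", ".join(f"{name} ({value:.2f})" for name, value in chosen)
    return template.format(user=user, conversation=conversation, features=features)
```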
4 Evaluation and discussion
This section provides an overview of the experimental data set, describes the implementations to facilitate their reproducibility, and presents the results achieved. The experiments were conducted on a computer with the following specifications:

- Operating System: Ubuntu 22.10 LTS, 64 bits
- Processor: Intel® Core™ i7-12700K
- RAM: 32 GB DDR4
- Disk: 1 TB SSD
4.1 Experimental data set
The dialogues were collected via the Celia web applicationFootnote 2 (see Fig. 2) and transcribed to text using the Google Cloud Speech-to-Text libraryFootnote 3. Celia has been designed to entertain and accompany elderly people. A rich dialogue is achieved by combining questions generated through an LLM with services such as weather, curiosities, and saints' days, which vary throughout the year. This allows the free generation of a dynamic conversation on varying topics based on user preferences.
The experimental data setFootnote 4 consists of 8220 utterances in Spanish from 523 sessions held with 42 users registered in the application. Of these users, 14 had cognitive impairment. Each user participated in an average of 12.45 sessions, with a standard deviation of ± 6.32 sessions, and each session had on average 15.72 utterances, with a standard deviation of ± 4.89 utterances. Each session had 67.95 words on average, with a standard deviation of ± 70.14 words. Sessions of people with cognitive impairment comprised 49.55 ± 43.74 words on average, a reduction of 38.41 % compared with people without this condition (80.45 ± 81.15 words). Table 3 details the percentages of sessions held with users with and without cognitive impairment. As can be seen, the data set was rather balanced.
4.2 Preprocessing module
Hashtags, images, links, and special characters were identified using the regular expressions in Listing 1 and removed. Emoticons and accents were deleted using the NFKD Unicode normalization formFootnote 5. The NLTK stopword list was used to delete stopwordsFootnote 6. Finally, the textual content was lemmatized using the es_core_news_md core model from the spaCy Python libraryFootnote 7.
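The pipeline can be sketched as follows; the regular expression is an illustrative stand-in for those of Listing 1, and treating the ad hoc terms of Section 3.1 as exceptions kept out of the standard stopword list is our reading of that description:

```python
import re
import unicodedata

import spacy
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

nlp = spacy.load("es_core_news_md")

# Assumption: the ad hoc list of Section 3.1 keeps these informative
# polarity/quantity terms that the standard Spanish list would remove.
KEPT = {"no", "si", "sí", "mas", "más", "pero", "muy", "sin", "mucho", "poco", "nada"}
STOPWORDS = set(stopwords.words("spanish")) - KEPT

def preprocess(text: str) -> list[str]:
    # Remove links, hashtags, and special symbols (illustrative pattern).
    text = re.sub(r"https?://\S+|#\w+|[&$€]", " ", text)
    doc = nlp(text.lower())
    # Tokenize and lemmatize; drop numbers, isolated characters, stopwords.
    lemmas = [t.lemma_ for t in doc
              if t.is_alpha and len(t) > 1 and t.text not in STOPWORDS]
    # Final step: strip accents via NFKD decomposition.
    return ["".join(c for c in unicodedata.normalize("NFKD", lemma)
                    if not unicodedata.combining(c)) for lemma in lemmas]
```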
4.3 Feature engineering module
Side features (1-26 in Table 2) were obtained with the gpt-3.5-turbo LLMFootnote 8, selected for its performance and cost-effectiveness ($0.0005/1,000 tokens for prompt text input and $0.0015/1,000 tokens for model text output at the time this paper was written).
Listings 2 and 3 respectively illustrate the prompt used for feature engineering and the measurements obtained from the LLM for features 1-24 in scenario 2, for the particular machine-user utterance pair example at the bottom of Listing 2. The same prompt was used for the whole dialogue (for features 1-24). Listings 4 and 5 respectively show the prompt and the results obtained for features 25-26, for the user utterance at the bottom of Listing 4.
Listings 6 and 7 respectively show the prompt for the baseline in scenario 3(a) and an example of the corresponding response by the LLM. Finally, Listing 8 shows the prompt designed for the enhanced scenario 3(b), including the effect of the relevant features selected from scenario 2 as described in Section 3.3. The ultimate goal is to improve the ability of the LLM to detect cognitive decline directly.
Content-dependent features 27-28 in Table 2 were generated with the CountVectorizerFootnote 9 class from the scikit-learn Python library. The optimal parameters in Listings 9 and 10 were calculated using the GridSearchCVFootnote 10 function from the same library. A total of 1282 features were obtained.
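A sketch of this step, with illustrative rather than the tuned parameter values of Listings 9 and 10:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative configurations; the actual parameters were tuned with GridSearchCV.
char_grams = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=700)
word_grams = CountVectorizer(analyzer="word", ngram_range=(1, 2), max_features=600)

# session_texts: one preprocessed string per user session
X_char = char_grams.fit_transform(session_texts)
X_word = word_grams.fit_transform(session_texts)
```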
4.4 Feature analysis & selection module
In scenarios 1 and 2, features were selected with the SelectFromModelFootnote 11 transformer wrapper using RF, with default parameter settings. The result of the selection was:

- Scenario 1: 435 char-gram and word-gram features.
- Scenario 2: 262 feature components comparing statistics from the current session with those of other sessions, plus 7 feature components corresponding to statistics from the current session.
The Pearson correlation coefficientFootnote 12 was employed in scenario 3 to further select those features from scenario 2 with a stronger direct or inverse relationship with the target. Only features with a correlation over 0.45 were selected, based on empirical tests (see Table 4). All the selected features were components comparing statistics from the current session with those from other sessions.
4.5 Classification module
The implementations of the ML algorithms were:

- NB: Gaussian Naive BayesFootnote 13 from the scikit-learn Python library.
- DT: DecisionTreeClassifierFootnote 14 from the scikit-learn Python library.
- RF: RandomForestClassifierFootnote 15 from the scikit-learn Python library.
For each implementation, optimal hyperparameters were tuned with the aforementioned GridSearchCV method using 10-fold cross-validation. Listings 11-13 show the ranges of values explored for scenarios 1 and 2. The final parameters selected were the following:
Scenario 1

- DT: splitter=random, class_weight=None, max_features=log2, max_depth=100, min_samples_split=0.1, min_samples_leaf=0.001
- NB: var_smoothing=1e-9
- RF: n_estimators=75, class_weight=None, max_features=log2, max_depth=100, min_samples_split=5, min_samples_leaf=1
Scenario 2

- DT: splitter=best, class_weight=None, max_features=None, max_depth=100, min_samples_split=0.001, min_samples_leaf=0.001
- NB: var_smoothing=1e-6
- RF: n_estimators=50, class_weight=balanced, max_features=log2, max_depth=10, min_samples_split=2, min_samples_leaf=1
Tables 5 and 6 respectively present the results of the performance evaluation for scenarios 1 and 2 with 10-fold cross-validation, implemented with the scikit-learn Python libraryFootnote 16. This evaluation method divides the data set into 10 folds (9 for training and 1 for testing). The split is repeated 10 times using different, non-overlapping testing sets to minimize evaluation bias, and the final results are averages over the different splits. Our experiments used 471 and 52 samples to train and test the models in each split.
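The tuning and evaluation procedure can be sketched as follows (grid values are illustrative; the explored ranges appear in Listings 11-13):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, cross_val_predict

# Hyperparameter tuning with 10-fold cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [50, 75, 100],
                "max_depth": [10, 100],
                "class_weight": [None, "balanced"],
                "max_features": ["log2", "sqrt"]},
    cv=10, scoring="accuracy",
)
grid.fit(X, y)

# Out-of-fold predictions for the evaluation tables (cf. Footnote 16).
y_pred = cross_val_predict(grid.best_estimator_, X, y, cv=10)
print(classification_report(y, y_pred))
```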
In scenarios 1 and 2, the best accuracy levels were achieved by the RF model: 76.67 % and 98.47 %, respectively. In all cases, scenario 2 outperformed scenario 1. From a practical perspective, this is an interesting result for free dialogues, since scenario 2 is context-independent.
The results for scenario 3 were significantly worse than those obtained in scenario 2. The baseline approach, based only on ChatGPT directly applied to free dialogues, attained just 57.17 % accuracy. However, when knowledge about context-independent high-level reasoning features was transferred to the LLM through prompt engineering, the accuracy rose to 61.19 %. We conclude that pre-trained models cannot replace specialized models but are extremely useful for producing valuable context-independent features.
Regarding related solutions in the literature, our proposal was more accurate (+26.19 %) than the approach described by Agbavor and Liang [9]. Moreover, it achieved respective improvements of +16.33 % and +13.65 % in the detection of absent and existing cognitive impairment compared to the solution described by Qiao et al. [19]. Compared with the findings of Yuan et al. [54], our proposal was 3.19 % and 14.48 % more accurate in detecting the absent and existing cognitive impairment classes, respectively. The corresponding improvements over the proposal of Zhu et al. [55] were +10.01 % and +20.64 %. Finally, our proposal achieved better results than Wang et al. [33] (+6.67 % accuracy), the most closely related work, when different types of content (chatbot-human dialogues and doctor-patient dialogues) were analyzed. Nevertheless, when ChatGPT rather than high-level reasoning features derived from doctor-patient dialogues was used as the content source, the improvement grew to +37.28 %.
We now describe how our system's mean token consumption was calculated. In scenario 2, we used 1400 and 498 characters for the prompts in Listings 2 and 4, respectively. On average, the complete dialogue of the conversational assistant with a user comprised 15.72 utterances with 909 characters, and each artificially generated utterance had 57.82 characters. The generated outputs consumed approximately 483 characters for the first prompt (see Listing 3) and 147 characters for the second (see Listing 5). Therefore, the average total numbers of input and output characters were 33,325.19 (1,400 + 909 + (1,400 + 57.82) \(\times \) 15.72 + (498 + 17.22) \(\times \) 15.72) and 10,386.60 (483 \(\times \) 16.72 + 147 \(\times \) 15.72), respectively. Since the character/token ratio in OpenAIFootnote 17 is approximately 4/1, the cost per user session was moderate (i.e., $0.0081). It compromises neither the scalability of the study nor its possible transfer to industry.
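The estimate can be reproduced in a few lines of Python from the figures above (reading the 16.72 factor as the 15.72 utterance pairs plus the whole-dialogue call is our interpretation):

```python
# Reproduces the per-session cost estimate above, using the stated figures.
pairs = 15.72                       # average machine-user utterance pairs per session
input_chars = 1400 + 909 + (1400 + 57.82) * pairs + (498 + 17.22) * pairs
output_chars = 483 * (pairs + 1) + 147 * pairs  # first prompt also covers the whole dialogue
tokens_in, tokens_out = input_chars / 4, output_chars / 4   # ~4 characters per token
cost = tokens_in / 1000 * 0.0005 + tokens_out / 1000 * 0.0015
print(f"${cost:.4f} per session")   # -> $0.0081
```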
4.6 Explainability module
To explain the model's predictions, we used the template in Listing 14. This template details the most relevant <features> used to classify a <user> from a given <conversation> (a dialogue session in our case). The features used in this template are the six relevant features selected using the meta-transformer wrapper, comprising the components described in Section 3.2. Only the components with the highest and lowest values were considered for explainability, after discarding the minimum and maximum statistics. The values of these components were sorted, and the three largest and three smallest were used.
Listing 15 and Fig. 3 show a real-life example of a generated explanation for a classified conversation. In this case, the system did not detect cognitive impairment because, among other reasons, the user was relaxed and natural, used an adult register, and did not respond concisely (recall that the explainability template only uses the features most relevant to classifying a particular individual). This explanation in natural language is descriptive and understandable, even without expert knowledge of healthcare or ML.
5 Conclusions
This work is the first to apply ML techniques to high-level reasoning features extracted with LLMs from free dialogues to detect cognitive decline, using the Celia entertainment chatbot. High-level reasoning features are context-independent and are, therefore, more widely applicable for characterizing free dialogues than word-embedding or content-dependent features. They also support significantly more interpretable descriptions of the decisions through explainability techniques. The essential advantage of free dialogues with engaging systems is that they may encourage end users to participate in longitudinal studies, which are currently hardly feasible in healthcare systems due to the high cost of manual screening.
A mixed approach combining a specialized ML model with feature extraction using the ChatGPT LLM outperformed the direct evaluation of cognitive decline by ChatGPT, even when prompt engineering was used to boost ChatGPT with the best features of the ML model. The performance levels achieved are remarkable.
In future work, we plan to train our models in a streaming mode that incorporates the session history. We will also employ new methodologies (e.g., reinforcement learning) to study different motivational topics and analyze how new features influence prediction outcomes. Note that we only use an LLM to obtain precise answers about the semantic relationships between different pieces of text. Thus, our approach is less affected by certain issues of generative LLMs (e.g., context and memory management, layer pruning, hallucination issues, etc.) than the direct application of an LLM in scenario 3. In future work, we also plan to analyze the sensitivity of our approach to these issues.
Data Availability
Data will be made available on reasonable request.
Notes
Respectively corresponding to the typical sentiment categories \(\{-1,0,1\}\) to equalize the ranges of all the measurements.
Available at https://celiatecuida.com, February 2024.
Available at https://cloud.google.com/speech-to-text, February 2024.
Data are available on request from the authors. The data set also includes information on visual, movement, and hearing impairments and information about medication intake.
Available at https://unicode.org/reports/tr15, February 2024.
Available at https://www.nltk.org, February 2024.
Available at https://spacy.io/models/es, February 2024.
Available at https://platform.openai.com/docs/models/gpt-3-5, February 2024.
Available at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html, February 2024.
Available at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html, February 2024.
Available at https://scikitlearn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html, February 2024.
Available at https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.r_regression.html, February 2024.
Available at https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html, February 2024.
Available at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html, February 2024.
Available at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html, February 2024.
Available at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html, February 2024.
Available at https://platform.openai.com/tokenizer, February 2024.
References
Balagopalan A, Eyre B, Robin J et al (2021) Comparing pre-trained and feature-based models for prediction of Alzheimer’s disease based on speech. Front Aging Neurosci 13:1–12. https://doi.org/10.3389/fnagi.2021.635945
Association A (2021) 2021 Alzheimer’s disease facts and figures. Alzheimer’s Dement 17:327–406. https://doi.org/10.1002/alz.12328
Savaş S (2022) Detecting the stages of Alzheimer’s disease with pre-trained deep learning architectures. Arab J Sci Eng 47:2201–2218. https://doi.org/10.1007/s13369-021-06131-3
Jammeh EA, Carroll CB, Pearson SW et al (2018) Machine-learning based identification of undiagnosed dementia in primary care: a feasibility study. BJGP Open 2:1–13. https://doi.org/10.3399/bjgpopen18X101589
Mao C, Xu J, Rasmussen L et al (2023) AD-BERT: Using pre-trained language model to predict the progression from mild cognitive impairment to Alzheimer’s disease. J Biomed Inform 144:104,442–104,449. https://doi.org/10.1016/j.jbi.2023.104442
Lynch C (2020) World Alzheimer report 2019: attitudes to dementia, a global survey. Alzheimer’s Dement 16:1–1. https://doi.org/10.1002/alz.038255
Savaş S, Topaloğlu N, Kazcı Ö et al (2019) Classification of carotid artery intima media thickness ultrasound images with deep learning. J Med Syst 43:273–284. https://doi.org/10.1007/s10916-019-1406-2
Guo Z, Ling Z, Li Y (2019) Detecting Alzheimer’s disease from continuous speech using language models. J Alzheimer’s Dis 70:1163–1174. https://doi.org/10.3233/JAD-190452
Agbavor F, Liang H (2022) Predicting dementia from spontaneous speech using large language models. PLOS Digit Health 1:1–14. https://doi.org/10.1371/journal.pdig.0000168
Graham SA, Lee EE, Jeste DV et al (2020) Artificial intelligence approaches to predicting and detecting cognitive decline in older adults: a conceptual review. Psychiatry Res 284:112,732–112,745. https://doi.org/10.1016/j.psychres.2019.112732
Alshayeji MH, Abed S (2023) Lung cancer classification and identification framework with automatic nodule segmentation screening using machine learning. Appl Intell 53:19,724–19,741. https://doi.org/10.1007/s10489-023-04552-1
Salvatore C, Cerasa A, Castiglioni I (2018) MRI characterizes the progressive course of AD and predicts conversion to Alzheimer’s dementia 24 months before probable diagnosis. Front Aging Neurosci 10:1–13. https://doi.org/10.3389/fnagi.2018.00135
Lee J, Lim JM (2022) Factors associated with the experience of cognitive training apps for the prevention of dementia: cross-sectional study using an extended health belief model. J Med Internet Res 24(1):1–9. https://doi.org/10.2196/31664
Merkin A, Krishnamurthi R, Medvedev ON (2022) Machine learning, artificial intelligence and the prediction of dementia. Curr Opin Psychiatry 35:123–129. https://doi.org/10.1097/YCO.0000000000000768
Amini S, Hao B, Zhang L et al (2023) Automated detection of mild cognitive impairment and dementia from voice recordings: a natural language processing approach. Alzheimer’s Dement 19:946–955. https://doi.org/10.1002/alz.12721
Sirilertmekasakul C, Rattanawong W, Gongvatana A et al (2023) The current state of artificial intelligence-augmented digitized neurocognitive screening test. Front Hum Neurosci 17:1–8. https://doi.org/10.3389/fnhum.2023.1133632
Hernández-Domínguez L, Ratté S, Sierra-Martínez G et al (2018) Computer-based evaluation of Alzheimer’s disease and mild cognitive impairment patients during a picture description task. Alzheimer’s Dement Diagn Assess Dis Monit 10:260–268. https://doi.org/10.1016/j.dadm.2018.02.004
Liu L, Zhao S, Chen H et al (2020) A new machine learning method for identifying Alzheimer’s disease. Simul Model Pract Theory 99:102,023–102,034. https://doi.org/10.1016/j.simpat.2019.102023
Qiao Y, Yin X, Wiechmann D et al (2021) Alzheimer’s disease detection from spontaneous speech through combining linguistic complexity and (Dis)fluency features with pretrained language models. In: Proceedings of the interspeech, vol 6. ISCA, pp 3805–3809. https://doi.org/10.21437/Interspeech.2021-1415
Lin W, Tong T, Gao Q et al (2018) Convolutional neural networks-based MRI image analysis for the Alzheimer’s disease prediction from mild cognitive impairment. Front Neurosci 12:1–13. https://doi.org/10.3389/fnins.2018.00777
Orimaye SO, Wong JSM, Golden KJ et al (2017) Predicting probable Alzheimer’s disease using linguistic deficits and biomarkers. BMC Bioinform 18:3–13. https://doi.org/10.1186/s12859-016-1456-0
Képešiová Z, Ružický E, Kozák Š et al (2023) Application of advanced machine learning algorithms for early detection of mild cognitive impairment and Alzheimer’s disease. In: Proceedings of the international scientific conference on computer science. IEEE, pp 1–5. https://doi.org/10.1109/COMSCI59259.2023.10315946
Ying Y, Yang T, Zhou H (2023) Multimodal fusion for Alzheimer’s disease recognition. Appl Intell 53:16,029–16,040. https://doi.org/10.1007/s10489-022-04255-z
Molinuevo JL, Ayton S, Batrla R et al (2018) Current state of Alzheimer’s fluid biomarkers. Acta Neuropathol 136:821–853. https://doi.org/10.1007/s00401-018-1932-x
Villa C, Lavitrano M, Salvatore E et al (2020) Molecular and imaging biomarkers in Alzheimer’s disease: a focus on recent insights. J Pers Med 10:61–90. https://doi.org/10.3390/jpm10030061
Zhang T, Liao Q, Zhang D et al (2021) Predicting MCI to AD conversion using integrated sMRI and rs-fMRI: machine learning and graph theory approach. Front Aging Neurosci 13:1–17. https://doi.org/10.3389/fnagi.2021.688926
Lee J, Yoon W, Kim S et al (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36:1234–1240. https://doi.org/10.1093/bioinformatics/btz682
Roshanzamir A, Aghajan H, Baghshah MS (2021) Transformer-based deep neural network language models for Alzheimer’s disease risk assessment from targeted speech. BMC Med Inform Decis Mak 21:92–106. https://doi.org/10.1186/s12911-021-01456-3
Hadi MU, Al-Tashi Q, Qureshi R et al (2023) Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Prepr, pp 1–44. https://doi.org/10.36227/TECHRXIV.23589741.V4
Alberts IL, Mercolli L, Pyka T et al (2023) Large language models (LLM) and ChatGPT: what will the impact on nuclear medicine be? Eur J Nucl Med Mol Imaging 50:1549–1552. https://doi.org/10.1007/s00259-023-06172-w
Chen Z, Liu Z (2023) Fixed global memory for controllable long text generation. Appl Intell 53:13,993–14,007. https://doi.org/10.1007/s10489-022-04197-6
Chen Z, Li Z, Zeng Y et al (2024) GAP: a novel generative context-aware prompt-tuning method for relation extraction. Expert Syst Appl 248:123478. https://doi.org/10.1016/j.eswa.2024.123478
Wang C, Liu S, Li A et al (2023) Text dialogue analysis based ChatGPT for primary screening of mild cognitive impairment. medRxiv pp 1–19. https://doi.org/10.1101/2023.06.27.23291884
Deng J, Lin Y (2023) The benefits and challenges of ChatGPT: an overview. Front Comput Intell Syst 2:81–83. https://doi.org/10.54097/fcis.v2i2.4465
Dutt M, Redhu S, Goodwin M et al (2023) SleepXAI: an explainable deep learning approach for multi-class sleep stage identification. Appl Intell 53:16,830–16,843. https://doi.org/10.1007/s10489-022-04357-8
Barredo Arrieta A, Díaz-Rodríguez N, Del Ser J et al (2020) Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf Fusion 58:82–115. https://doi.org/10.1016/j.inffus.2019.12.012
Akbar MA, Khan AA, Mahmood S et al (2023) Trustworthy artificial intelligence: a decision-making taxonomy of potential challenges. Softw Pract Exp 1–30. https://doi.org/10.1002/spe.3216
de Arriba-Pérez F, García-Méndez S, González-Castaño FJ et al (2022) Explainable machine learning multi-label classification of Spanish legal judgements. J King Saud Univ Comput Inf Sci 34:10,180–10,192. https://doi.org/10.1016/j.jksuci.2022.10.015
Wachter S, Mittelstadt B, Russell C (2017) Counterfactual explanations without opening the black box: automated decisions and the GDPR. SSRN Electron J 842–861. https://doi.org/10.2139/ssrn.3063289
Ehsan U, Tambwekar P, Chan L et al (2019) Automated rationale generation: a technique for explainable AI and its effects on human perceptions. In: Proceedings of the international conference on intelligent user interfaces, vol part F147615. Association for Computing Machinery, pp 263–274. https://doi.org/10.1145/3301275.3302316
Spinner T, Schlegel U, Schafer H et al (2019) explAIner: a visual analytics framework for interactive and explainable machine learning. IEEE Trans Vis Comput Graph 26:1064–1074. https://doi.org/10.1109/TVCG.2019.2934629
Yu KH, Beam AL, Kohane IS (2018) Artificial intelligence in healthcare. Nat Biomed Eng 2:719–731. https://doi.org/10.1038/s41551-018-0305-z
Kim YM, Lee TH, Na SO (2023) Constructing novel datasets for intent detection and NER in a Korean healthcare advice system: guidelines and empirical results. Appl Intell 53:941–961. https://doi.org/10.1007/s10489-022-03400-y
Padovan M, Cosci B, Petillo A et al (2023) ChatGPT in occupational medicine: a comparative study with human experts. medRxiv pp 1–9. https://doi.org/10.1101/2023.05.17.23290055
Kurtz E, Zhu Y, Driesse T et al (2023) Early detection of cognitive decline using voice assistant commands. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing. IEEE, pp 1–5. https://doi.org/10.1109/ICASSP49357.2023.10095825
Eyigoz E, Mathur S, Santamaria M et al (2020) Linguistic markers predict onset of Alzheimer’s disease. EClinicalMedicine 28:100,583–100,591. https://doi.org/10.1016/j.eclinm.2020.100583
Voleti R, Liss JM, Berisha V (2020) A review of automated speech and language features for assessment of cognitive and thought disorders. IEEE J Sel Top Signal Process 14:282–298. https://doi.org/10.1109/JSTSP.2019.2952087
Hristidis V, Ruggiano N, Brown EL et al (2023) ChatGPT vs Google for queries related to dementia and other cognitive decline: comparison of results. J Med Internet Res 25:1–13. https://doi.org/10.2196/48966
Jethani N, Jones S, Genes N et al (2023) Evaluating ChatGPT in information extraction: a case study of extracting cognitive exam dates and scores. medRxiv pp 1–27. https://doi.org/10.1101/2023.07.10.23292373
Lee P, Bubeck S, Petro J (2023) Benefits, limits, and risks of GPT-4 as an AI Chatbot for medicine. N Engl J Med 388:1233–1239. https://doi.org/10.1056/NEJMsr2214184
Wang H, Wu W, Dou Z et al (2023) Performance and exploration of ChatGPT in medical examination, records and education in Chinese: pave the way for medical AI. Int J Med Inform 177:105,173–105,178. https://doi.org/10.1016/j.ijmedinf.2023.105173
Ayers JW, Poliak A, Dredze M et al (2023) Comparing physician and artificial intelligence Chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 183:589–596. https://doi.org/10.1001/jamainternmed.2023.1838
Cascella M, Montomoli J, Bellini V et al (2023) Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J Med Syst 47:1–5. https://doi.org/10.1007/s10916-023-01925-4
Yuan J, Bian Y, Cai X et al (2020) Disfluencies and fine-tuning pre-trained language models for detection of Alzheimer’s disease. In: Interspeech 2020. Proceedings of the international speech communication association, pp 2162–2166. https://doi.org/10.21437/Interspeech.2020-2516
Zhu Y, Obyat A, Liang X et al (2021) WavBERT: exploiting semantic and non-semantic speech using wav2vec and BERT for dementia detection. In: Proceedings of the interspeech, vol 2021. NIH Public Access, pp 3790–3794. https://doi.org/10.21437/interspeech.2021-332
Li R, Wang X, Yu H (2023) Two directions for clinical data generation with large language models: data-to-label and label-to-data. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 7129–7143. https://doi.org/10.18653/v1/2023.findings-emnlp.474
Mueller KD, Koscik RL, Hermann BP et al (2018) Declines in connected language are associated with very early mild cognitive impairment: results from the Wisconsin registry for Alzheimer’s prevention. Front Aging Neurosci 9:1–14. https://doi.org/10.3389/fnagi.2017.00437
Bellantuono L, Monaco A, Amoroso N et al (2022) Worldwide impact of lifestyle predictors of dementia prevalence: an eXplainable artificial intelligence analysis. Front Big Data 5:1–17. https://doi.org/10.3389/fdata.2022.1027783
Lombardi A, Diacono D, Amoroso N et al (2022) A robust framework to investigate the reliability and stability of explainable artificial intelligence markers of mild cognitive impairment and Alzheimer’s disease. Brain Inform 9:1–17. https://doi.org/10.1186/s40708-022-00165-5
de Arriba-Pérez F, García-Méndez S, González-Castaño FJ et al (2022) Automatic detection of cognitive impairment in elderly people using an entertainment chatbot with natural language processing capabilities. J Ambient Intell Humaniz Comput 1–16. https://doi.org/10.1007/s12652-022-03849-2
Burkart N, Huber MF (2021) A survey on the explainability of supervised machine learning. J Artif Intell Res 70:245–317. https://doi.org/10.1613/JAIR.1.12228
Breiman L, Friedman JH, Olshen RA et al (2017) Classification and regression trees. Routledge. https://doi.org/10.1201/9781315139470
Benesty J, Chen J, Huang Y et al (2009) Pearson correlation coefficient. In: Springer topics in signal processing, vol 2. Springer, p 37–40. https://doi.org/10.1007/978-3-642-00296-0_5
Xu S (2018) Bayesian Naïve Bayes classifiers to text classification. J Inf Sci 44:48–59. https://doi.org/10.1177/0165551516677946
Kadhim AI (2019) Survey on supervised machine learning techniques for automatic text classification. Artif Intell Rev 52:273–292. https://doi.org/10.1007/s10462-018-09677-1
Funding
This work was partially supported by (i) Xunta de Galicia grants ED481B-2021-118, ED481B-2022-093, and ED431C 2022/04, Spain; (ii) Ministerio de Ciencia e Innovación grant TED2021-130824B-C21, Spain; and (iii) University of Vigo/CISUG for open access charge.
Author information
Contributions
Francisco de Arriba-Pérez: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration, Funding acquisition. Silvia García-Méndez: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration, Funding acquisition. Javier Otero-Mosquera: Software, Data Curation, Writing - Review & Editing. Francisco J. González-Castaño: Conceptualization, Methodology, Writing - Review & Editing, Supervision, Funding acquisition.
Ethics declarations
Competing Interests
The authors have no competing interests to declare relevant to this article’s content.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
de Arriba-Pérez, F., García-Méndez, S., Otero-Mosquera, J. et al. Explainable cognitive decline detection in free dialogues with a Machine Learning approach based on pre-trained Large Language Models. Appl Intell 54, 12613–12628 (2024). https://doi.org/10.1007/s10489-024-05808-0