1 Introduction
Journals serve as a written record of an individual’s past events, thoughts, and feelings, allowing genuine expression [
89,
90]. Journaling helps people describe experiences and express emotions related to both negative [
70,
71] and positive experiences (
e.g., growth potential) [
9,
35], thereby reducing stress, anxiety, and depression. Prior work has shown the advantages of journaling in clinical mental health contexts, as journals frequently capture patients’ daily experiences, symptoms, and other contextual data that are challenging to gather during brief hospital visits [
27,
100]. Furthermore, these patient journals can enhance mental health professionals (MHPs) comprehension of their patient’s conditions, leading to improved treatment quality [
95]. However, writing about one’s past feelings and thoughts can be a complex process because people differ in their ability to understand, identify, and verbalize their emotions [
78]. In addition, patient under psychotherapy struggle with constructing a narrative and understanding their past [
23,
72].
Conversational AIs, or chatbots, have the potential as an alternative form of journaling, easing the collection of personal data. Researchers in the field of Human-Computer Interaction (HCI) have shown that chatbots can help individuals articulate and share their daily experiences. For instance, chatbots to elicit people’s self-disclosure can ease the process of emotional expression by providing a safe and supportive environment for individuals to share their experiences and emotions [
16,
52,
53,
68]. Furthermore, a machine’s inherent trait of not showing fatigue can make people more confident to share their stories truthfully and comfortably [
44,
68]. However, existing chatbot prototypes have commonly employed rule-based or retrieval-driven approaches [
1], which have limited capability of generating versatile responses following up serendipitous topics during conversation [
38,
41,
51].
This trend presents missed opportunities and a lack of understanding regarding conversational AIs that assist with journaling by suggesting, questioning, and empathizing based on the user’s diverse experiences.The recent achievement of Natural Language Processing in large language models (LLMs) opened up new opportunities for bootstrapping chatbots that can carry on more naturalistic conversation [
8,
14,
41,
77,
94]. Their capabilities accelerated the development of chatbots in varied topics that can benefit from open-ended conversation, such as regular check-up calls [
8,
41], personal health tracking [
94], and personal events and emotions [
80]. Despite such opportunities, LLMs’ inherent uncertainty in control of response generation calls for precautions to handle unintended or inaccurate responses [
26,
41,
47,
93]. If applied to clinical and mental health domains, LLM’s behaviors should be designed in collaboration with domain experts regarding the relevance and safety of responses.
In this work, we present a case of collaborative design, development, and evaluation of an LLM-infused conversational AI system designed to facilitate the self-reflection of patients and communication with MHPs. We designed and developed
MindfulDiary (Figure
1), which consists of (1) a mobile conversational AI with which patients can converse about daily experiences and thoughts and (2) a web dashboard that allows MHPs to review their patients’ dialogue history with the AI. MindfulDiary incorporates LLMs to generate a response, prompting patients differently according to the conversational phase. The conversation records are automatically summarized and presented on a clinician dashboard so MHPs can obtain insights about the patient.
As a multi-disciplinary research team, which included HCI researchers, AI engineers, and psychiatrists, we iteratively designed MindfulDiary and conducted a four-week field study involving 28 psychiatric patients diagnosed with major depressive disorder (MDD) and five psychiatrists who care for them. During the study, the patients freely used MindfulDiary to record daily conversations, and the psychiatrists used the clinician dashboard during regular clinical visits. Through this study, we found that the versatility, narrative-building capability, and diverse perspectives provided by MindfulDiary assisted patients in consistently enriching their daily records. Furthermore, MindfulDiary supported patients in overcoming the challenges of detailed record-keeping and expression, often hindered by feelings of apathy and cognitive burdens. The psychiatrists reported that enhanced records provided by MindfulDiary offered a more nuanced understanding of their patients, fostering empathy. In addition, MindfulDiary supplemented their consultation by eliciting candid thoughts from patients that may be invasive to be asked by the MHPs.
The key contributions of this work are:
(1)
Design and development of MindfulDiary, an LLM-driven journal designed to document psychiatric patients’ daily experiences through naturalistic conversations, designed in collaboration with MHPs.
(2)
Empirical findings from a four-week field study involving 28 patients and five psychiatrists, demonstrating how MindfulDiary supported patients in keeping their daily logs and assisted psychiatrists in monitoring and comprehending patient states. We also explore how MindfulDiary enhances the quality of patient-provider communication, emphasizing the role of LLMs in prompting deeper self-exploration, which can be instrumental in clinical settings.
(3)
Implications for designing and instrumenting LLM-infused conversational AIs in clinical mental health settings.
3 Formative Study: Focus Group Interview
To inform the design of MindfulDiary, we first conducted a Focus Group Interview (FGI) with MHPs. The goal of the FGI was to understand MHPs’ perspectives, expectations, values, and precautions in utilizing LLMs in the clinical mental health context. Based on this understanding, we aimed to design the functions and interactions that the system should provide. This was an essential process in our overall approach, not just technology-centered system design, but creating a system meaningful to users and stakeholders [
87].
3.1 Procedure and Analysis
We distributed recruitment flyers in the Department of Psychiatry at a local university hospital, inviting Mental Health Professionals (MHPs) working in departments of psychiatry and mental health care centers to participate. We recruited six MHPs (E1–6; two males and four females)—four clinical psychologists and two psychiatrists whose careers varied from 1 to 11 years. Four were clinical psychologists responsible for counseling and daily monitoring and intervention of at-risk patient groups in local mental health centers and university hospitals, and two were psychiatrists in charge of outpatient and inpatient ward treatment in the psychiatry department of university hospitals (see Table
1).
We invited participants to two 1-hour remote sessions on Zoom. Two researchers participated in the sessions. We first provided an overview of language model technologies and LLM’s natural language understanding and generation capabilities until we shared a common understanding of the principles, applications, opportunities, and limitations of LLMs. Considering that we were designing a system for individuals with mental health challenges, we thoroughly covered the drawbacks of LLMs, such as uncertainty in control and hallucinations.
After the overview, we went through group discussions on how LLMs could be utilized in the current patient treatment process. As a probe, we asked participants a focused set of questions on (1) the challenges MHPs currently face during patient treatment and counseling sessions, and (2) their expectations and envisioned opportunities of LLMs’ role in clinical mental health settings. We sought to understand the experts’ perspectives through questions such as, ‘What are the difficulties or challenges patients face in their daily lives between treatments (or counseling)?’, ‘What are the important considerations in self-care that patients perform in their daily lives?’, and ‘What questions or conversational techniques do you use to encourage patients to share about their daily lives and moods?’. The session was video recorded and later transcribed. We open-coded the transcripts to identify emerging themes. In the following, we cover the findings from the FGI.
3.2 Findings from the Interviews
3.2.1 Challenges in Eliciting Responses from Patients with Depression.
Participants indicated that eliciting disclosure from patients’ inner thoughts during a limited consultation time requires significant effort. Many patients with depression experience difficulty describing and expressing their feelings and thoughts to providers due to a sense of apathy, which is a common psychiatric symptom involved in Major Depressive Disorder: “In the consultation room, even if they sit like this, they often just remain silent for a long time.” (E5) Thus, providers often end up spending a substantial amount of time asking standardized and repetitive questions about mood, sleep, and major events to understand patients’ current states.
Participants also noted that they had their patients engaged in paper-based diary writing methods but most demonstrated low participation rates and low engagement: “We tried a diary method on paper(in the inpatient ward), and several patients did write. What we saw was quite trivial, like, ‘I just felt bad today.’ But we learned there were significant events upon consultation, like having a big argument with other patients, which they did not record. Because patients with depression, or those who have had suicidal or self-harming incidents, often have a dulled state in expressing their emotions or feel apathetic, they tend to find such expressions very difficult.” (E3)
3.2.2 LLMs as a Bridge for Enhanced Patient Communication.
Our participants envisioned LLMs as a transformative tool in mental health care, particularly for enhancing interactions with patients who struggle to express themselves. They recognized that the natural and flexible conversational abilities of LLMs could bridge communication gaps, offering a more nuanced understanding of patients’ conditions. This could be particularly beneficial in cases where patients have difficulty articulating their feelings due to symptoms like apathy or social phobia. Additionally, participants noted that using LLMs could be significantly more interactive and engaging than traditional paper-based approaches, potentially increasing compliance and participation in the therapeutic process.
Participants especially underscored the importance of capturing the continuum of thoughts leading up to a particular emotional state, such as fear, in the journaling process. They envisioned the need for using LLMs to introspect deeper into the patient’s psyche, revealing underlying thoughts and emotions that the patient might not be consciously aware of. E5 mentioned, "It would be good if the journal continuously records the flow of thoughts. For example, it would be beneficial to document the various thoughts and detailed reasons leading up to certain feelings like fear. Like, ’I feel scared when I’m in a place with many people,’ and then digging deeper into ’Why do I feel scared?’—I think a process that gets more specific like this would be good." This approach not only aids in a more comprehensive self-examination but also enriches the therapeutic dialogue between the patient and the MHP.
3.2.3 LLMs for Analytical Insights and Personalized Mental Health Support.
The participants further suggested that LLMs could analyze journal entries to identify key themes, words, or sentiments expressed over time, offering patients tangible feedback on their emotional patterns and progress. Such analytical capabilities could empower patients with a greater sense of control and awareness of their mental health journey, potentially motivating them towards self-management and active participation in their treatment. Additionally, the analysis could assist MHPs in a deeper understanding of their patient’s emotional states and thought processes by examining the tone, choice of words, and speech or writing patterns. The participants envisioned that insights derived from LLMs about patient journaling habits could inform MHPs about the most effective counseling approaches for each individual. They suggested, "Observing how patients react to different forms of communication can provide valuable information. Some patients might find solace in simple reassurance, while others may benefit from more straightforward, targeted feedback."
3.3 Improvements after the Interviews
Based on the lessons from the FGI, we refined the initial concept of MindfulDiary. We leveraged the conversational abilities of LLMs to help patients document their daily experiences between clinical visits. MHPs had access to the collected data to inform their clinical decision-making. Furthermore, both MHPs and the research team concur that LLMs should not act solely as the primary intervention due to their inherent limitations but should function as supportive tools for clinical consultations. The subsequent section outlines the design and development process of our system.
4 MindfulDiary
Informed by the findings from FGI with MHPs, we designed and developed MindfulDiary, which consists of two main components: (1) a patient mobile app for daily record-keeping and (2) a clinician dashboard that allows professionals to access and use these daily records in a clinical setting (See Figure
1). Below, we present a fictional usage scenario to demonstrate how the system works.
Jane, diagnosed with chronic anxiety, frequently grapples with panic attacks. To keep track of her daily experiences, her psychiatrist recommends trying MindfulDiary as part of her treatment plan..
Every evening, Jane converses with the MindfulDiary app regarding her daily activities, emotions, and thoughts. The AI leads the conversation with Jane by asking prompted and follow-up questions about her day. After a session, the app summarizes the dialogue into a journal-style essay, on which she can revisit and reflect later. She can explore the summarized essays whenever she wants to reflect on past events or thoughts..
Three weeks later, during a consultation, her psychiatrist uses the expert interface of MindfulDiary to review a data-driven summary of Jane’s entries. The data helped the psychiatrist identify patterns that Jane’s anxiety often spikes during her work commute. Based on this insight, the psychiatrist refines advice and introduces specific coping strategies, fostering a more personalized approach to care..
4.1 MindfulDiary App
The MindfulDiary app for patients aims to support people who might have difficulty journaling due to apathy and cognitive load through naturalistic conversation driven by an LLM. The app consists of a home screen containing an introduction and guide to the system (Figure
2a), a journal writing screen (Figure
2b 2c), and a screen to review the diary entries (Figure
2d).
4.1.1 Journaling User Interface.
Figure
3 illustrates the overall use flow of the journaling session, which begins with a Pre-Journaling Assessment (Figure
3-
①) that asks to fill out a questionnaire for mental health. The questionnaire comprised the modified PHQ-9 [
48] and a custom open-ended question inquiring about recent attempts of self-harm or suicide. This assessment prevents users who provided any clues of suicidal or self-harm from journaling on the same day. (We cover this feature in detail in section
5.4.)
On the next screen, the user converses with MindfulDiary, documenting the events of the day (See Figure
2b). After three turns, MindfulDiary provides a summary of the conversation as an essay. Users can edit this automatically generated summary any time. When the user ends the session by pressing the end button (Figure
2b 2c). Users can also leave a reflection message there. Lastly, users can browse their past records in the Diary Review menu (See Figure
2d).
4.1.2 Conversation Design.
We designed the chatbot’s conversational behavior based on insights from psychiatry literature [
66,
67], which covers foundational techniques and considerations for conducting clinical interviews. We also incorporated the hands-on clinical experiences of practicing psychiatrists.
As a result, we designed the conversation of a journaling session to follow a sequence of three stages:
Rapport building,
Exploration, and
Wrap-up. The
Rapport Building state is an ice-breaker, centered on casual exchanges about a user’s day. In this state, the assistant also shares bits of information to encourage users’ openness. This approach is based on previous research findings that a chatbot’s self-disclosure positively impacts user disclosure [
52] and leverages the natural story-building ability of LLMs [
99]. Overall, in this stage, our goal is to create an environment where users can comfortably share their stories. As we progress to the
Exploration state, the emphasis shifts to a comprehensive understanding of the user’s daily events, feelings, and thoughts, facilitated by a mix of open-ended and closed-ended queries that ensure users remain engaged and in control of the dialogue. While open-ended queries are intended to facilitate increase the expression of feelings and emotion and less judgemental, closed-ended queries is for specific and detailed description of the experiences [
66,
67].The conversation then transitions to the
Wrap-up, emphasizing completion and ensuring users have fully voiced their experiences while the system remains empathetic and receptive to any lingering topics.
Besides the three main stages, we also incorporated the Sensitive Topic state that handles the most sensitive subjects, such as self-harm or suicidal ideation. When this state is triggered, psychiatrists receive instant notifications. This allows them to oversee the conversation in real-time and step in to assist the patient if necessary. Here, the system begins by empathizing with the user, recognizing their struggles, and offering a reassuring message. Following this, the system gently probes the depth of their suicidal or self-harm thoughts. If the user expresses intense or specific plans related to self-harm or suicide, the system urges them to seek prompt assistance, either at a hospital or via the local helpline.
4.1.3 Conversational Pipeline.
Lengthy and complex input prompts for LLMs are known to cause poor task performance [
14] by partly omitting latent concepts [
96]. To steer the LLM to comply with the conversational design we intended diligently, we designed MindfulDiary’s dialogue system as a state machine. Each conversation stage is carried on with a dedicated input prompt, which is more succinct and clear than a single master prompt containing instructions for all stages.
Figure
4 illustrates our conversation pipeline that runs each time a new user message is received. The pipeline incorporates two LLM-driven modules, a
dialogue analyzer and the
response generator.
The
dialogue analyzer handles the stage transition, returning the stage suggestion—whether to stay or move to a new stage—and a summary paragraph of the current dialogue from the current dialogue information. The dialogue analyzer receives an input prompt that consists of the current number of turns in the conversation (progress level), the most recent stage information, and a list of criteria for recommending each stage (See
② in Figure
4). Based on this information, the underlying LLM performs a summarization task that yields a summarized paragraph of the current dialogue, a recommendation for the next stage based on the summary, turns, and the most recent stage information. For example, the system decides to move to the Wrap-up stage when the user expresses a desire to conclude or say goodbye.
The system then formulates an LLM prompt, combining a dedicated prompt for the current stage, the dialogue summary, and the recent six messages (i.e., three turn pairs) (See ③ in Figure 4). Receiving the prompt as an input, the
response generator generates an AI message. The stage prompt consists of the description of the task that the LLM is supposed to perform in the current stage, and the speaking rules describing the attitude that the module exhibits in the conversation. For example, the task description of Exploration stage instructs to “
ask questions that encourage users to reflect on their personal stories regarding daily events, thoughts, emotions, challenges, and etc.” The speaking rules for the Rapport-building stage instruct to keep conversations simple and friendly and reply in an empathetic way.
4.2 Pilot Evaluation
To ensure that MindfulDiary is reliable and safe for conversing with psychiatric patients, we underwent multiple rounds of pilot evaluation. First, we invited five psychiatrists and three clinical psychologists to test the conversational pipeline. The experts provided feedback on the instructions in the model prompts, focusing on their clinical relevance and the embedded terminology and strategies. Then, the experts inspected the chatbot’s behavior by chatting with it while role-playing as a patient persona. In particular, we examined the chatbot’s reactions to subtle implications of suicide or self-harm in user messages.
After iterating on the conversational pipeline, we conducted a pilot lab study with five patients admitted to a university hospital but about to be discharged soon. To ensure safety against risky messages from an LLM, we used a test platform where the participant’s clinician monitored the generated messages in real-time, approving them or sending better messages manually.
4.3 Clinician Dashboard
The clinician dashboard (c.f., Supplementary video) is a desktop application designed to facilitate monitoring patient’s journal entries and to provide analysis of the entries to help clinicians identify significant events, reactions, and emotions. The dashboard consists of the following components:
User Engagement. This section visualizes the participant’s overall engagement with MindfulDiary, including the number of journals written, the date and time they were written, and their length. The modified PHQ-9 scores for each session are also visualized, allowing professionals to track the user’s mental health trends using a validated tool.
Journals. This section displays the content of the journals written by patients. The information is presented in a card format, where each card offers a summary of the journal, including timestamps, total time taken to write the journal, and associated PHQ-9 score. The interaction logs between the patient and MindfulDiary are also provided in this section.
Insights. To assist professionals in browsing through the diary, this section visualizes (1) a word cloud to understand frequent terms that the participant used at a glance, (2) a summary of major events to highlight significant happenings and (3) summary of emotions to gauge the mood based on user input. When a specific period is selected for review, a comprehensive summary is generated. We used GPT-4 for most summarization tasks. To generate the word frequency data for the word cloud, we combined GPT-4 and a Korean morphological analysis package named Kiwi [
50] to filter only nouns and verbs from the GPT output. Due to the limitations of language model-driven analysis, there might be occasional inaccuracies in the generated content. First-time users of this interface are alerted about possible inaccuracies. An in-interface tooltip also reminds users that the summarized outcomes might not be accurate.
4.4 Technical Implementation
MindfulDiary’s interface is developed using React, a JavaScript-based framework. The server, responsible for interfacing with the LLM and overseeing database operations, is implemented in Python. Google Firebase handles user authentication, data storage, and retrieval tasks. The conversational capabilities of MindfulDiary are powered by
gpt-4, accessible through OpenAI’s API
1. We specifically used
gpt-4-0613 model. For parameter setting, we consistently set the temperature to 0.7 and both a presence penalty and frequency penalty to 0.5.
5 Field Deployment Study
Using MindfulDiary, we conducted a four-week field deployment study with 28 patients undergoing outpatient treatment. Through the study, we aimed to explore how patients and MHPs utilize MindfulDiary and what opportunities and challenges arise from its real-world use. The study protocol was approved by the Institutional Review Board of a university hospital.
5.1 Recruitment
We targeted outpatients from the Department of Mental Health at a University Hospital. Participants were selected based on specific criteria: (1) those who had been diagnosed with MDD and (2) those who did not exhibit heightened impulsive tendencies or harbor specific intentions towards self-harm or suicide. Key exclusion criteria included a history of psychotic disorders, substance-related disorders, neurodevelopmental disorders, and neurological disorders. Eligible participants were identified through evaluations conducted by psychiatrists. Flyers and consent forms were distributed to eligible patients. For minors, the consent form process was adhered to only when they were accompanied by a guardian at the hospital.
We compensated participants on a weekly basis of participation: For participating every seven days from the starting date, participants received 15,000 KRW (approx. 11 USD). If they completed the entire four-week study process, they received 20,000 KRW as a bonus (i.e., 80,000 KRW—approx. 60 USD—in total). We did not tie the number of dialogue entries to the compensation to ensure natural data entry behavior.
As a minimum requirement for study completion, we instructed the participants not to miss four consecutive days without conversing with MindfulDiary. If a participant missed three consecutive days, an experimenter sent a reminder. In cases where participants did not respond to these reminders, their participation in the study was discontinued. This procedure was implemented to ensure active monitoring and communication. Considering that our system is designed for individuals with mental health challenges, it was crucial to maintain contact with participants and ensure their adherence to the study protocol.
Initially, 36 patients started using MindfulDiary. During the deployment, eight dropped out as they did not meet the minimum data collection requirement. These participants were disengaged from MindfulDiary due to the lack of time or decreased interest. As a result, 28 participants (P1–28; 11 males and 17 females) completed the 4-week field study and were included in the analysis. The majority of participants were adolescents and adults, with ages ranging from 12 to 28 years, with an average age of 17.6 (
SD = 3.26). Table
2 presents the demographic details and severity of depressive symptoms of the study participants. These scores are derived from psychiatric evaluations conducted within one week before the starting dates.
5.2 Procedure
Figure
5 illustrates the procedure of the field deployment study. All interviews took place remotely on Zoom.
5.2.1 MindfulDiary App.
We deployed the MindfulDiary app to our patient participants. The patient protocol consisted of three parts: (1) an introductory session, (2) deployment, and (3) interviews.
Introductory Session. We first invited each participant to a remote introductory session. A researcher went through our study goal, the motivation of the MindfulDiary system, and the overall procedure of the study. We then played a demo video demonstrating how to use the MindfulDiary app. The session took about 15 minutes.
Deployment. The day following the introductory session, participants started using MindfulDiary for four weeks. We instructed participants to engage with the app whenever they have anything noteworthy but encouraged them to use it at the end of the day. We collected all data from their interactions with the MindfulDiary and the raw input content and outputs from the LLM. We asked our participants to fill out online surveys three times, at the beginning of Week 1, after Week 2, and after the deployment, to measure participants’ mental health status and their self-help capability in managing their mental health. The surveys utilized the PHQ-9 [
48], GAD-7 [
85], and Coping Strategies Scale [
98]. (The survey results from the scale are outside the scope of this investigation.)
Mid-study and Debriefing Interviews. We conducted two 15-minute interviews, after the second and fourth weeks, with each participant to understand their experiences and learn how they used MindfulDiary on a daily basis. Considering the characteristics of depression patients, who may struggle to focus for long periods of time, the interview session was divided into two shorter sessions.
5.2.2 Clinician Dashboard.
Most patient participants had a clinical visit during Week 2 through Week 4 of the deployment period. We deployed MindfulDiary’s clinician dashboard to five psychiatrists who are in charge of the participants.
Deployment of Clinician Dashboard in Clinic. We provided instructions to clinicians covering the main components of the clinician dashboard and how to interact with them. To explore the opportunities and limitations of the dashboard, we did not offer explicit instructions for utilizing the clinician dashboard in their workflow. However, we advised psychiatrists to be cautious with the LLM-driven analysis due to potential inaccuracies, emphasizing the importance of verifying data through the interaction logs. The psychiatrists autonomously utilized the clinician dashboard, making sure it didn’t disrupt their current treatment methods and preparation routines.
Debriefing Interviews. We interviewed psychiatrists who treated the patient participants to understand how they used the clinician dashboard in clinical settings. We further gathered feedback from the psychiatrists on the opportunities and limitations of MindfulDiary, as well as suggestions for improvements. The interviews with the psychiatrists were conducted offline for about one hour after the deployment study concluded.
5.3 Analysis
To explore participants’ usage patterns with MindfulDiary, we first conducted a descriptive statistics analysis. To determine any shifts in participants’ adherence over time, we examined weekly writing frequencies using a one-way repeated measures ANOVA (RM-ANOVA) with Greenhouse-Geisser correction. To gain a deeper qualitative understanding of the messages produced by MindfulDiary and interviews with patients and psychiatrists, we used open coding paired with thematic analysis [
13]. For a more in-depth qualitative analysis of the messages produced by MindfulDiary and the interviews with patients and psychiatrists, we employed open coding paired with thematic analysis [
13]. All interviews were audio-recorded and transcribed for this purpose.
The qualitative analysis was conducted by the first author, a PhD student in HCI, who open-coded the interview transcripts and interaction log data through multiple rounds of iteration. Another author who holds a PhD degree in HCI also contributed to this coding process. Following the initial coding, two psychiatrists reviewed the coded data to provide clinical insights and ensure the accuracy of interpretations. Through discussions among the research team, including these diverse perspectives, overarching themes were identified, enhancing the depth and validity of our qualitative findings.
5.4 Ethical Considerations
Conducting this study, we are fully aware of the inherent risks associated with our research, particularly given the characteristics of participants diagnosed with MDD. To mitigate the risks, we first carefully screened participants, relying on evaluations conducted by psychiatrists. Individuals displaying heightened impulsive tendencies or harboring specific intentions towards self-harm or suicide were excluded from the study. In addition, participants were asked to take the PHQ-9 before interacting with MindfulDiary, along with an additional set of questions probing their recent attempts at self-harm or suicide. If a participant’s response to question number 9 of the PHQ-9, regarding suicidal/self-harm thoughts, scored ‘moderate or higher’ or if any recent suicide attempt was verified, the system pivoted to provide content geared towards alleviating anxiety and reducing stress rather than proceeding with the standard system. In such a case, a real-time alert was also sent to psychiatrists. Lastly, if sensitive themes frequently surfaced in a participant’s input during the study, their interactions with the system were temporarily halted. Psychiatrists subsequently re-evaluated such participants to assess the viability of their ongoing participation. During our experiment, for the case of P11, mentions of repetitive suicide and self-harm were detected. Consequently, an expert contacted the participant, the experiment was suspended for three days, and after a re-evaluation in an outpatient clinic, we resumed the system use with P11.
Further, to mitigate potential risks from the LLMs’ outputs, we embraced an iterative design methodology. The system’s interactions underwent repeated assessments to ensure it generated safe, non-harmful outputs. In addition, in the first week of each participant’s system use, all interactions between participants and MindfulDiary were observed in real time. To facilitate this process, when a participant started the session, the research team received a notification email. This notification included real-time monitoring links and reports of the survey responses that participants answered before each session. After the first week, user interactions and MindfulDiary were reviewed within a 12-hour window. During the review process, if an interaction contained sensitive content (specifically, terms pre-defined as sensitive by psychiatrists), the psychiatrists on our research team assessed the situation and contacted the affected participants if necessary.
Lastly, given that we were handling the patients’ personal and sensitive data, ensuring the secure protection and management of data was critical. Therefore, during the study, we utilized the Google Firebase authentication service to manage the user authentication process for participants. We were thus able to ensure that only authorized personnel had access to the data, and any attempts at unauthorized access could be promptly detected and managed. After the field study, all data was separated from personal identifiers to maintain anonymity.
6 Results
In this section, we report the results of the field study in four parts: (1) Journaling adherence, (2) Dialogue patterns, (3) Patients’ perspectives on MindfulDiary, and (4) MHPs’ perspectives of MindfulDiary for clinical settings.
6.1 Journaling Adherence
Figure
6 summarizes the daily engagement of participants with MindfulDiary over the course of four weeks. The colored squares denote the days that participants conversed with MindfulDiary (
i.e., days with interaction). Across four weeks, participants submitted 501 journal entries (17.90 entries per participant on average), 0.62 entries on average per day (more than once every two days).
22 out of 28 participants used MindfulDiary more than once every two days. Participants generally engaged with the app at a regular frequency, but we note that their engagement was also affected by the three-day-miss reminder and their visit to the clinic between Week 2 and 4. Each journaling session lasted an average of 438 seconds (around 7 minutes) but with notable individual variability (
SD = 225.97). Each journal dialogue included messages with an average of 105.6 syllable count (
SD = 49.41). Our analysis did not reveal significant differences in either the participants’ input length (
F(1.735, 46.85) = 2.718,
p = .084) or writing time (
F(2.417, 65.25) = 2.549,
p = .076) across the four different time points, as determined by the RM-ANOVA test. This suggests that users mostly retained a steady level of engagement during the four-week study.
6.2 Dialogue Patterns
Participants and MindfulDiary exchanged a total of 4,410 messages (i.e., 2,205 pairs of the AI and participant’s messages) during the field study. Each session consisted of 10.82 messages (SD = 2.70). Most exchanges between the AI and participants were carried on for an exploration of patients’ daily lives and emotions, as well as for casual conversations. In terms of the stage of the conversation, 62% (2,732 messages) of the messages were from Exploration, 30% (1220 messages) for Rapport building, and 6% (282 messages) for Sensitive topic. Only a small amount of messages were accounted for Wrapping up (62 messages) or not selected (14 messages).
To understand the contents that MindfulDiary generated, we delved deep into the content it generated. 72% of the AI messages took the form of questions, aiming to elicit responses about users’ daily experiences and emotions. We identified and categorized the primary strategies that MindfulDiary employed to assist patients’ journaling. There were four strategies employed by the LLM:
Emotional Exploration,
Activity/Behavior Exploration,
In-depth Follow-up & Countermeasures, and
Future Plan Exploration. For a comprehensive breakdown of these strategies, along with their descriptions and exemplar questions, refer to Table
3.
The average length of participants’ responses was 29.42 syllable counts, with a median of 20 (SD = 35.9). This suggests a left-skewed distribution, where many participants gave shorter responses and a smaller number provided considerably longer answers, causing a high variation. The minimum response length was one character, and the maximum was 559 syllable counts. We further conducted a qualitative analysis of these responses, seeking to identify the themes present in users’ interactions with the LLM. This allowed us to understand the scope and topics of the daily records that MindfulDiary collected from the patients.
Participants interacting with MindfulDiary conveyed a range of topics (see Table
4). They described a spectrum of
emotional states, from negative feelings like exhaustion and anxiety to positive sentiments of pride and joy.
Events and activities were recounted, offering insights into their daily routines, such as walking during school times or decreased activity post-vacation. They also shared
thoughts and beliefs, sometimes related to current events, revealing patterns linked to mental health, like feelings of exclusion and loneliness. Regarding
perceived health status, comments spanned from immediate ailments, such as headaches, to long-term health challenges. Distorted perceptions about their body included content on excessive dieting. Specifically, participants frequently discussed medications, revealing not just their physical reactions but also their perceptions and behaviors toward them. Some expressed concerns over the taste, while others mentioned adverse reactions from intake, like discomfort after swallowing multiple pills at once. Lastly, the realm of
relationships & interactions had participants highlighting both the challenges and supports in their interpersonal connections, revealing their significant impact on mental well-being, from conflicts and trust issues to moments of affirmation and encouragement.
6.3 Patients’ Perspectives on MindfulDiary
Overall, participants viewed MindfulDiary as a space where they could open up and share their stories, feeling a sense of empathy from the system. Participants particularly found the dialogue-driven interactions with MindfulDiary useful. One participant, P15, mentioned, “If it was just about recording daily activities or emotions like a regular diary, it might have been less engaging, and I could’ve found it tedious or might not have persisted for long. But this felt like having a conversation with AI, which added an element of fun and kept me engaged in continuous use.” Such a dialogue-driven journaling process aided participants in maintaining consistent records and helped in forming a habit consistent with our user engagement analysis. P7 stated, “I liked chatting with the AI at first, so I kept using it. The more I used it, the more it became a habit.”
6.3.1 Broad Conversational Range: The Versatility in Documenting Diverse Interests.
Our participants appreciated the LLM’s flexibility and naturalness in responding to various utterances, topics, and situations. Such broad conversational capabilities of the LLM provided participants with a space where they could document a variety of subjects tailored to individual interests and preferences. In our study, participants interacted with the LLM on diverse topics ranging from games, webcomics, novels, and movies (see Dialogue
1) to hobbies like Pilates (see Dialogue
2), allowing them to create richer and more personal records. P3 remarked,
“AI systems that I have used in the past could only respond to specific words, but it is amazing how this one can respond to all sorts of things.”6.3.2 Expanding Views: Enriching Entries with Varied Perspectives.
Participants also valued the diverse and new perspectives that LLM-generated responses offered, as those helped participants reflect on their struggles, daily events, and emotions from various angles. Dialogue
3 shows how the system helps the participant to view the challenges of studying from the perspective of the satisfaction felt in gaming. This influence helped participants diverge from ruminating on depressive feelings. P12 mentioned,
“Sometimes when you note down emotions, that’s the only thought that comes to mind. Beyond that, I don’t remember much. Since MindfulDiary uses AI, my thoughts flow more easily, and I like it when it asks about different perspectives or topics.”.
6.3.3 Probing for Depth: Prompt Questions in Detailed Reflection.
MindfulDiary’s question-driven journaling process was also valued by participants as it assisted them with the process of daily reflections and documentation. Compared to their past experiences of journaling, where they had to reflect on their daily life by themselves, participants appreciated that MindfulDiary made the journaling process less daunting. P27 said,
“Because I have to rely solely on my thoughts when I write alone, I sometimes get stuck. But when I was unsure about how to write, the AI helped me. I liked that part.”. The questions posed by MindfulDiary also guided participants in documenting their daily lives in a more detailed manner by asking their thoughts and feelings about a particular event (See Dialogue
4). Such probing allowed for richer, more in-depth entries. P13 shared,
“I used to write diaries on my own and just wrote whatever came to mind. MindfulDiary, however, helped me write in more detail because of the specific questions.”6.3.4 Building Narratives: Structuring Daily Reflections with MindfulDiary.
MindfulDiary’s capabilities, such as generating contextualized follow-up questions and summarizing the conversation, made the process easier for participants who struggled to organize daily thoughts and events underpsychotherapy [
23]. In their past experiences, our participants expressed difficulties in journaling because of disjointed thoughts, a lack of clarity in ideas, or inconsistencies in their stories. However, with the support of the LLM in the MindfulDiary, these challenges were addressed, motivating them to record their daily lives persistently. P3 remarked,
“ I often had trouble putting sentences In the past, I would worry about writing the next part. But with this tool, I just tell the story of my day, and it seamlessly continues and wraps it up, presenting a well-structured diary entry. That’s its biggest advantage." (See Dialogue
5)
6.4 MHPs’ Perspectives on MindfulDiary for Clinical Mental Health Settings
In this section, we describe how MHPs utilized the clinician dashboard and the benefits and drawbacks of the system they reported, drawing on the debriefing interviews with the psychiatrists.
6.4.1 Utilization of MindfulDiary in Clinic.
During the deployment study, psychiatrists reviewed the journal entries from their patients every morning when they reviewed the medical charts of patients whom they would meet on the day. Depending on the severity and the focal concerns of the patient, psychiatrists spent about 5 to 10 minutes per patient reviewing the MindfulDiary data. After checking trends primarily through PHQ-9 in the clinician dashboard, psychiatrists read summaries about events and documented emotions. If there were spikes or drops in the PHQ-9 or events/emotions, they checked the actual dialogues.
6.4.2 Percevied Benefits of MindfulDiary for Enhanced Patient Insight and Empathetic Engagement.
All of the psychiatrists emphasized the critical value of an expert interface based on information recorded in the daily lives of patients. Specifically, E3 highlighted MindfulDiary’s value in that it consistently aids in recording daily entries, allowing them to utilize more detailed patient data during outpatient visits. “Patients, with the support of AI, can logically continue their narratives, ensuring more dialogue than a typical (paper-based) diary. This definitely aids me in my consultations.” (E3). In this section, we further report on how MindfulDiary has been helpful in the clinical practice of psychiatrists.
Enhancing Understanding and Empathy toward Patients. Psychiatrists indicated that MindfulDiary helped them gain a deeper understanding and empathy about their patients. They perceived that MindfulDiary served as a questioner that could elicit more objective and genuine responses from patients. Psychiatrists appreciated that the LLM was able to pose questions that might be sensitive or burdensome for them to ask, such as patients’ negative perceptions of their parents. E4 said: “There are times when it’s challenging to counter a patient’s narrative or offer an opposing perspective. For example, if a patient speaks very negatively about their mother, and we ask, ‘Didn’t she treat you well when you were younger?’, the patient might react aggressively, thinking, ‘Why is the therapist taking my mother’s side?’ However, since the LLM is a machine, such concerns are minimized.”.
Insights from Everyday Perspectives Outside Clinical Visits. Psychiatrists valued that MindfulDiary provided them with an understanding of patients’ conditions that would be difficult to gain during outpatient visits. For instance, E1 appreciated that MindfulDiary provided them with insights into patients’ positive feelings and experiences, which is typically difficult to obtain during clinical consultations. “Usually, when patients come for a consultation, they talk about bad experiences. Few people come to psychiatry to say, ‘I’ve been doing well.’ Even if they have good things to say, they usually don’t bring them up. But I was happy to see that there were many positive statements in these records, like ’I did that and felt good.’ Especially in depression, the presence or absence of positive emotions is crucial. It’s a good sign if they show such positive responses.”. E2 envisioned its potential application to medication management, which is another critical aspect of psychiatric care. He thought these records could be used as a window into understanding how patients react to and perceive medications. For patients undergoing drug therapy, “If the primary treatment method is pills, but they don’t seem to have an effective response or there’s a decline in medication acceptance, I could potentially understand the reasons for it through this diary.” (E2).
Understanding Patient Progress Through Consistent Record-Keeping. Feedback from patients highlighted that interactions with MindfulDiary made it easier for patients to maintain a consistent record, as it mitigated the challenges associated with recording. Psychiatrists perceived that having consistent daily data offered them opportunities to observe trends in a patient’s condition. E2 said: “From our perspective as clinicians, even though we might only see a patient once a month, having access to a record of how they’ve been throughout the month would allow us to track their progress, which is highly beneficial.”. In particular, the ability to examine changes not only through quantitative tools like the PHQ-9 but also using a qualitative approach can offer a comprehensive understanding and shed light on the mechanisms influencing a patient’s mental health.
6.4.3 Perceived Concerns about MindfulDiary.
While MHPs generally appraised the utility of the MindfulDiary positively, they also raised concerns regarding the integration of MindfulDiary into clinical settings.
Significance of Tone and Manner in Patient Data Analysis. Although patient data summarized and extracted in the expert interface effectively aided in understanding the patient, experts thought that the summarized texts would not convey the patient’s tone, pace, and other nuances, which are integral to the Mental Status Examination (MSE) that clinicians utilize. However, MHPs identified the opportunity to perform such analysis from the raw data that patients entered. As the MSE measures objective and quantitative aspects, incorporating such an analysis could make significant improvements in understanding the patient. E1 said, “In the same way as P14, understanding the tone of this patient may also be possible. That’s because we use something called psychiatric MSE, where we observe more than just the patient’s appearance, such as tone, pace, and more. Even a short analysis of one’s linguistic behavior would be great.”
Potential Misuses and Concerns around MindfulDiary. In our field study, one patient participant perceived the MindfulDiary as a channel to convey their intentions and situations to their psychiatrist. Specifically, the participant, P9, talked to their psychiatrist, “Have you seen what I wrote?", which indicated that the patient was actively attempting to share their current state and situation through MindfulDiary. In spite of the fact that such usage did not seem problematic per se, one psychiatrist raised concerns about the possibility that patients with borderline personality disorders might misuse MindfulDiary as a weapon to manipulate others, such as their providers and parents. “In some cases, people self-harm out of genuine distress, but others do it to manipulate others, instilling guilt in them so they’ll do what they want. There are some patients who write about their distress with sincerity, while there are some who exaggerate their distress in order to get attention.” For patients exhibiting symptoms of schizophrenia or delusions, there was a concern that MindfulDiary’s feature of revisiting past entries could act as a feedback loop, developing and amplifying their delusions. E2 said, “This diary lets you revisit and organize your past actions. For schizophrenia patients with delusions or unique beliefs, referencing past writings might reinforce their pre-existing delusions. Reaffirming ’Yes, I’m right’ can be problematic. The LLM’s summaries could exacerbate these delusions if they emphasize distorted content.”
7 Discussion
In this study, we present MindfulDiary, an LLM-driven journal designed to document the daily experiences of psychiatric patients through naturalistic conversations. Here, we reflect on the opportunities presented by LLM-driven journaling for psychiatric patients and discuss considerations for integrating an LLM-driven patient system into the clinical setting.
7.1 Guiding Patient Journaling through Conversations Offering Diverse Perspectives
Our study highlighted the potential of MindfulDiary in clinical settings, mainly where adherence to interventions is important [
62]. Core symptoms of depression, such as loss of energy, difficulty in carrying out mental processes, and feelings of apathy, often contribute to lower adherence to a professional’s advice or intervention [
43]. Clinicians who participated in our FGI also highlighted these challenges in motivating patients to utilize the diary writing app. Our findings demonstrated that MindfulDiary helped mitigate these challenges by transforming the conventional journaling process into engaging conversations. Using MindfulDiary, users were able to engage in conversations with the system by answering prompts and questions, which made them feel the journaling process was more accessible and intriguing. This active participation ensures that the users are not overwhelmed by the task and are guided in documenting their feelings and experiences more richly.
Depression often locks patients into negative and rigid thought patterns [
12]. Such patterns, resistance to change established thought paradigms, can severely limit a patient’s ability to perceive issues from multiple angles, leading to a harsh self-judgment [
61]. Our study highlighted that the varied perspectives offered by LLM-driven chatbots like MindfulDiary could help challenge such fixed viewpoints [
33]. By prompting users to revisit their initial evaluations or suggest alternative viewpoints, these chatbots could help break the cycle of cognitive rigidity. While our research underscores the promising role of LLM-driven chatbots in assisting psychiatric patients’ journaling process, it’s essential to note that these are preliminary findings. More work is needed to substantiate these findings in a clinical context.
7.2 MindfulDiary as a Facilitator for Fostering Patient-Provider Communication
Studies have suggested that sharing the data captured via chatbots with others, such as health professionals and family members, could further serve as an effective mediator that helps convey more truthful information [
52,
56]. For instance, patients consistently displayed deep self-disclosure through chatbots, whether or not they intended to share their inputs with health professionals [
52]. Aligned with prior work on PGHD [
20,
64], MHPs in our study also perceived that MindfulDiary has shed light on patients’ daily events, emotions, and thoughts that might have been difficult to gain through regular clinical visits. This data offered MHPs valuable insights into the patient’s experiences and context.
Building on these findings, we could expand the potential presented by MindfulDiary in patient-provider communication. In the field of personal health informatics, existing research highlights the role of technology, such as photo journaling, in managing conditions like Irritable Bowel Syndrome. This tool not only empowers patients to record their daily experiences more effectively but also fosters enhanced collaboration between patients and healthcare providers [
18,
79]. Such tools serve as vital artifacts in negotiating the boundaries of patient-provider interactions (i.e., boundary negotiating artifacts) [
19].
This work adds a new dimension to this discussion by showing how LLM-assisted journaling lowers barriers to generating health data in daily life and fosters patient understanding. Specifically, we found that through this system, patients and providers can collaboratively reflect on mental health conditions. In the context of the stage-based model of personal informatics, the patient module in our MindfulDiary helps patients reduce the burden of collecting daily data and supports deeper recording. The expert module’s dashboard allows for the combined and transformed processes of diary data, survey data, and quantitative engagement data, supporting MHPs’ integration and reflection. Collaborative data generation and utilization with patients can enable care that reflects the patient’s values and the characteristics of their daily life. These insights serve as a basis for patient-provider collaboration.
However, our study findings underscore the importance of careful consideration in the clinical integration of systems like MindfulDiary. While we did not observe patients exaggerating their conditions or needs, this potential issue was raised as a concern by MHPs. They expressed apprehension about the possibility that sharing journal content with MHPs through MindfulDiary might lead some patients to exaggerate their conditions or needs. This concern highlights the need to consider not only the design of chatbots that facilitate patient disclosure behavior but also the complex dynamics between patients and providers in clinical settings. It is crucial to address these dynamics to ensure the effective and safe use of such technologies in mental health care. The growing prevalence of chatbots in mental health domains emphasizes the need for a holistic approach to their design and implementation. We highlight that engineers and MHPs need to collaborate closely, ensuring that these tools are not only technically sound but also tailored to meet the intricate dynamics of clinical settings [
87].
7.3 Considerations for Integrating LLMs into Clinical Settings
In this section, we discuss the consideration for integrating LLMs into clinical mental health settings, drawing insights from the design and evaluation of MindfulDiary.
Aligning Domain Experts’ Expectations of LLMs. Developing and deploying MindfulDiary, we learned that aligning MHPs’ expectations with the capabilities and limitations of LLMs involves significant challenges. The capability of generative language models to improve mental health is difficult to measure in comparison with AI models in other medical domains, where objective metrics can determine performance. For instance, in medical imaging, AI can be evaluated based on its accuracy in identifying target diseases from MRI scans, using precise numerical percentages of correct identifications [
83]. On the other hand, in the realm of mental health chatbots, gauging success is more nuanced, as it involves subjective interpretations of emotional well-being and psychological improvement, which cannot be easily quantified or compared in the same straightforward manner. This challenge is amplified in mental health, where soft skills like rapport building and emotional observation are important [
30]. The use of LLMs in the mental health field is emerging, but little has been said about evaluating or defining the performance of models that are tailored to mental health. Our iterative evaluation process involving MHPs could inform researchers about how to develop and evaluate LLM-mediated mental health technology. When integrating into the clinical setting, this evaluation is also necessary for anticipating who the system would target and for what purpose it would be used. Hence, we advocate that engineers and researchers should carefully consider how to assist domain experts, who may lack AI expertise, in fully and accurately grasping the role and operation of LLM. It is also crucial for researchers and engineers to collaborate closely with these professionals to ensure the technology aligns with therapeutic needs and best practices [
87].
Tailored LLM Evaluation for Clinical Mental Health Domains. The domain of mental health, which our study addresses, is characterized by the vulnerability of its target user group. The content discussed within this domain is often emotionally charged and sensitive. Therefore, prioritizing user safety becomes even more essential in this domain than in others. Considering the sensitivity of the domain, during our evaluation process, MHPs thoroughly tested the LLM’s output by trying out conversations on various sensitive topics in both implicit and explicit ways, drawing upon their clinical experiences. The contents the MHPs input were much more diverse and wide-ranging than what engineers could generate during the development. Additionally, MHPs showed concern that the hallucinations of the LLM could reinforce or expand the delusions of patients with delusional disorders. We highlight that developing evidence-based tests or benchmark sets to anticipate the behavior of the language models in collaboration with MHPs is critical when leveraging LLMs for clinical mental health settings.
Incorporating Perspectives of MHPs in Testing and Monitoring. Considering the caveats of current LLMs [
47], it is critical to involve MHPs when deploying LLM-driven systems for patients in mental health contexts. While planning the field deployment study of MindfulDiary, we identified specific roles that MHPs could play. In the pre-use phase, MHPs should determine the suitability of users and facilitate the onboarding process with patients. During the mid-use phase, they should closely monitor interactions with the LLM and be prepared to intervene in cases of crises or unexpected use scenarios. Furthermore, they can offer or adjust treatments periodically based on long-term data. Additionally, they should regularly re-evaluate the continued use of the system. While some of these tasks should carefully be designed not to burden MHPs too much, it is important that LLMs do not make autonomous decisions about patients (e.g., diagnosis, prescription, or crisis management) but instead operate under professional oversight.
Providing Safeguards for Hallucinated LLM Generations . Our clinician dashboard provided various summarized information, such as word cloud, aggregating multiple dialogue entries so that the clinician quickly grasps the gist of the dialogues. Although we underwent intensive testing with the LLM-driven data summarizer, the LLM-driven data processing may still suffer from inaccuracies, biases, and misinterpretation [40, 75] of patient sentiments or context, which could adversely affect treatment decisions and patient well-being. To mitigate such drawbacks of LLMs in our study, we provided sufficient guidance to MHPs, cautioning them that the LLM-generated information they receive may be error-prone. However, in real-world settings, MHPs might accept the outputs of LLM without much attention. Therefore, when involving LLM-driven data processing, the system should foster careful reviewing of the content based on the expertise of MHPs. For example, future systems could incorporate features like highlighting in vivo phrases that were directly mentioned by patients and signify key aspects of their experience and feelings. By contrasting the in vivo phrases with the LLM’s original text, the system can encourage MHPs to put more scrutiny on the LLM’s original interpretation, which may contain errors, and the actual inputs spoken by patients. 7.4 Limitations and Future Work
Our recruitment method could impact the generalizability of our findings, as we recruited the patient participants for our field study from a single university hospital. Although we aimed to recruit patients with diverse types and levels of symptoms, our participants are not representative samples of psychiatric patients. They were young (mostly adolescents) and consulted by a fixed number of psychiatrists. While this work is just a first step toward designing an LLM-driven journaling app for psychiatric patients, further investigation is necessary with subjects from various backgrounds. To implement our pipeline, we used OpenAI’s GPT API, which provided the most capable LLM at the time of our study and was accessible via commercial API. As GPT models are continually updated, later models may not yield the same conversational behavior. To generalize the performance of our conversational pipeline design, future work is needed to compare multiple versions of MindfulDiary with different underlying LLMs.