In this article, we described our collaborative, human-centric approach to developing two AI models that can predict, early on, whether a patient undergoing iCBT for depression and anxiety is likely to achieve an RI in their mental health symptoms by the end of treatment. We detailed how user research with iCBT supporters provided key insights into their work and information needs, and how this, coupled with insights from the clinical literature and data availability constraints, enabled us to identify useful AI application scenarios and development targets for this mental healthcare context. To review our choices in pursuing outcome prediction and clarify the potential utility of the proposed models for clinical practice, we reported the findings of design sessions with iCBT supporters that investigated the integration of the achieved AI predictions within existing workflows. This further surfaced important concerns and risks associated with the use of outcome prediction in this context, as well as a set of design sensitivities and requirements for developing appropriate representations of the AI outputs, which resulted in a first UI realization within the SilverCloud product. Next, we expand on some of these learnings and share our reflections on what constitutes a human-centered approach to AI design in this mental healthcare context.
6.1 Empowering not Replacing Clinicians with AI: Towards Human-AI Partnerships in Healthcare
There are many ambitious visions of how AI may drive forward health diagnostics, clinical decision-making, or treatment delivery, including—ultimately—the development of standalone AI systems such as the autonomous delivery of psychotherapy interventions. In such (future) scenarios, AI systems are often positioned as either capable of emulating humans (e.g., conducting health assessments, acting as therapist) or superior to humans, potentially outperforming them through improved data insights or productivity [101]. However, as discussed in recent literature [55, 70], it is unlikely that technology will achieve sufficient technical sophistication to replace human clinicians anytime soon. Thus, we believe that a more realistic, nearer-term, and perhaps more desirable strategy for developing AI applications is to orient design efforts towards the configuration of partnerships in how clinicians and AI insights might come together in healthcare delivery (see also [15]). Referring to the term “augmented intelligence”, Johnson et al. [55] suggest that while current AI does not replace humans, clinicians who use AI will replace those who do not. Miner et al. [70] further formulated four approaches to care provision: (i) human only; (ii) human delivered, AI informed; (iii) AI delivered, human supervised; and (iv) AI only—all of which have different implications for scaling up care or ensuring quality of care. Thus, as a first, tentative step towards introducing AI within an actual mental health service, we chose to focus our work on the sensible integration of AI insights within human-supported care practices. Positioning these data insights as a useful resource for humans [101], we discuss next our specific design goals: (i) to enable iCBT supporters to build on (or extend) their professional expertise and protect their sense of agency; and (ii) to not unnecessarily interfere with the all-important “therapeutic alliance” between clinicians and patients.
6.1.1 Positioning AI-derived Data Insights as Inputs to Human Sense- and Decision-making Processes.
Our user research identified two main ways in which predictions of RI outcomes could assist the work practices of iCBT supporters. They could serve as: (i) a “validator” to help confirm supporter decisions in cases where positive predictions align with their own clinical assessments, potentially boosting supporter confidence; and (ii) a “flag” for negative prediction cases, or cases where predictions were incongruent with supporter assessments, inviting a pause to reflect on and re-evaluate the patient's current situation that can prompt adaptations to existing practices. As such, our AI output does not provide more specific (i.e., diagnostic) information that could assist supporters' understanding of, for example, the patient's mental health state or potential treatment blockers, nor does it provide concrete recommendations for what actions to take or propose to a particular patient. While more advanced AI applications are technically possible and could offer valuable additional insight, there are a number of reasons why we pursued a more general, less directive approach for generating AI insights:
Trading off the Risks and Benefits of Designing Complex System Inferences vs. Simpler Data Insights: Firstly, whilst the delivery of more complex data insights is an exciting prospect, it can be more challenging to achieve sufficiently robust and reliable data models. This is particularly pronounced in mental health due to the general difficulty of establishing what would constitute an optimal (i.e., ground truth) approach to treatment for a specific patient, even amongst health professionals [33]. In other words, while more ambitious algorithmic modeling efforts may promise greater gains, they can also come with an increased risk of false predictions for individual patients [101]. We believe that this presents a key challenge, especially for the design of personalized interventions that seek to increase the relevance and outcome of treatment for a specific individual. In cases where more specific, tailored recommendations fail to deliver on their promise and mismatch the needs of patients or care providers, this can have adverse effects on patient engagement and health, and diminish AI utility and trust. Being mindful that AI systems are rarely, if ever, 100% accurate, we were very deliberate in our choice to explicitly position AI outputs as part of human (expert) assessment and decision-making processes as a mechanism for managing those risks. In doing so, the human, rather than the machine, remains accountable for interpreting each patient's unique circumstances and, in response, determining appropriate actions forward. It also broadens the scope for other, potentially unanticipated use scenarios of RI prediction, and ensures its application to a wide range of patients. Thus, trading off risks and benefits, we consider this “AI informed” approach to human-supported care delivery [70] a more ethical and responsible path towards early introductions of AI insights into mental healthcare contexts.
Designing for Human Expertise and Agency in AI-Informed Work Practices: Secondly, our research investigates how we can empower clinical supporters with AI. Our aim is thus not to reduce the need for supporter input and analytical effort (in favor of the technology), but to explore how AI insights could help maximize the impact of their “human” involvement in patient reviews. For this, it is paramount that supporters do not perceive the provision of AI insights as competing with, or replacing, their professional expertise, as this could unnecessarily undermine them in their role and reduce their willingness to support the development and adoption of AI approaches in their work [100]. Thus, by creating AI outputs that serve merely as a useful “flag” to inform clinical care, supporters remain “in charge” of examining more closely the circumstances and potential reasons for a particular prediction outcome and determining directions forward. The hope is that this can help preserve a sense of agency and purpose in their role, which is important for supporter motivation and job satisfaction. Other research exploring decision-support [52] goes one step further and argues for AI systems to explicitly suggest appropriate next steps within the technology design to help clinicians make the connection between AI output and their healthcare practices. Either way, for HCI research this suggests opportunities for interface design to—implicitly or explicitly—aid supporters in identifying the right subsequent actions, which in our case may involve explicit design decisions to assist supporters in their search for explanatory information (i.e., by encouraging them to look for certain mental health blockers).
Understanding the Impact of Design Choices on Work Practices and Workflow Integration Challenges: Our study findings also identified how specific design choices, such as the frequency of (especially negative) prediction outcomes or their positioning and contrasting with other information (especially comparisons across patients), could add to work pressures, cause demotivation, and reduce supporters' sense that their actions can indeed effect positive change. All this warrants careful consideration in future design and research to study the actual impact of the achieved AI predictions: (i) how AI applications can help care providers make more informed, confident treatment choices for improved patient outcomes; as well as (ii) how integrations of AI outputs within everyday healthcare come to shape clinicians' understanding of their own role, and how their design can help minimize disruptions to clinical expertise and work culture (cf. [95]). All this can help advance learning about how AI technology may best assist healthcare providers in their practices.
6.1.2 Protecting the “Human-ness” in Human-supported, Digital Healthcare Delivery.
While, as described above, there can be many different visions for how AI technology could come to transform (mental) healthcare, we have chosen to focus our efforts on identifying strategies for empowering (rather than replacing) clinicians with AI. Especially in the context of psychotherapy, we further acknowledge the importance of technology not unnecessarily disrupting the interpersonal relationship between patient and care provider; we seek to protect the all-important “human touch”, “genuine sense of care”, and “empathic understanding” that often characterize these relations and are crucial for treatment success [92, 115]. However, as indicated in our initial user study (see [99] and Appendix A for main findings), trying to foster a connection between supporters and patients within a remote, self-administered therapy format that involves the asynchronous sending of online messages can already put into question the authenticity of supporters' identity as “real” humans. To counteract this, our findings describe supporters' active work in carefully crafting their feedback messages to patients such that they convey a “sense of care”: including personable expressions; offering person-specific guidance; communicating that they heard the person's concerns; and ensuring that they respond to these concerns in an “empathic way” by building on their own life and professional experiences.
Integrating AI Insights Sensibly within Interpersonal Dynamics and for Supporting Human Relations: Given the importance of developing a genuine bond between supporter and patient within a computer-mediated setting, we were deliberate in choosing not to take the AI, or possible optimizations to supporter work, down routes towards standardizing or otherwise automating existing processes. Aiming to protect the “human-ness” of supporter communications, we would favor, for example, the personal look and handcrafted feel of their personalized messages that bring forward individual communication styles, over more templated, machine-led communication approaches that may increase efficiency in message production, but at the cost of inviting perceptions of a “robotic, auto-responder system”. We believe that if we move beyond common development goals of “improving productivity” and consider more closely what may constitute a desirable use and integration of AI insight within healthcare from a patient and care provider point-of-view, this can open up many important additional routes for AI application. Using goals of “protecting or nurturing supporter-patient relationships” as an example, future work may explore uses of AI to: (i) help increase patient awareness of the supporter's role and investment in their therapeutic success to foster their bond and associated benefits; and (ii) assist supporters to more closely connect with their patients. In this regard, our feedback-informed approach (RI prediction) is itself intended to give supporters an additional viewpoint on their patients, enabling them to be more responsive to those most in need of additional care, which can aid their therapeutic relationship. Future work may also focus more explicitly on the relational needs of the supporters by generating, for example, data insights that foreground: how their actions came to matter to patients (i.e., highlighting support successes, or what types of actions are most helpful to their patients); what communication styles their patients may respond to best (see work by [24] as an example); or that otherwise expand the ways in which supporters' specific skills and expertise can be leveraged. Such efforts can aid a feeling of “congruence” on the part of the supporter for investing in the patient's treatment success, keeping them engaged and motivated in the process, which is often rooted in an underlying desire to “be helping others”.
AI Acceptance vs. Perceptions of AI Dehumanizing Healthcare: Such considerations of where AI technology might enter interpersonal dynamics and caring relationships, with a view to sustaining or extending human relations and avoiding undermining health professionals in their roles and expertise, may further play a key role in improving acceptance of AI applications within such care contexts. Especially in healthcare, there are increasing concerns about the role that AI might play in “dehumanizing medicine” [95]. Above and beyond already existing trends within health services to “continuously monitor” outcomes and focus on success “metrics”, there are tendencies within AI work to treat individuals as “data points” in algorithmic modeling [20] by transforming a person's individual (mental) health experience into compressed mathematical representations that allow for the identification of large-scale patterns [100]. This is a tension that we also saw in our user research findings, which highlighted concerns that the introduction of a binary prediction could lead to simplified interpretations whereby patients become treated as a “number” and prediction outcomes are simply read as “black and white”. Thus, in dealing with imperfect AI technology, as we will discuss next, it is paramount that we ensure, in design and training, that healthcare providers can maintain a more holistic view of their patients and focus on individualized care.
6.2 Dealing with “Imperfect” Technology in a Time-constrained Context: Implications for Trust in AI
In this article, we reported key concerns raised by iCBT supporters about the integration of AI insights into existing practices. These included the importance of not demoralizing supporters from taking action and of not increasing performance pressures (e.g., by avoiding cross-patient or cross-supporter outcome comparisons), as well as multiple considerations pertaining to the interpretation and use of the prediction outcomes. Specifically, our work brought forward well-known risks and implications related to prediction errors, especially in cases where (more novice) supporters may treat the AI predictions uncritically and over-rely on them, and where such reliance may cause undesired changes to the intensity and nature of patient care.
Moving Beyond Model Explanations: To better manage such risks, which are rooted in clinicians over-trusting the data, prior research in the field of explainable AI (XAI) suggests that providing interpretable explanations of the workings of the model can help cultivate transparency and support assessments of the accuracy of offered predictions, enabling the development of a more appropriate understanding of, and level of trust in, AI outputs [52, 105]. Yet, in time-constrained healthcare contexts, such as our iCBT setting, clinicians expressed their inability to engage with additional information. Instead, they emphasized the importance of the predictions being understandable “at a glance” and that extra information—especially about the origin (or validation) of the model and how outputs are calculated—should only be available “on demand” so as not to distract from those insights most critical to their review and patient care (cf. [111]). This echoes other recent findings on CDS systems [52, 95] that describe how clinicians lack the extra time and mental capacity required to engage with such explanations, which often assume substantial technical expertise and clinician interest in interrogating AI outputs. Thus, rather than a deeper understanding of how the AI insight is generated, clinicians favor an understanding of how they can make effective use of that information within their practice. Furthermore, Hirsch et al. [49] found that the willingness of mental health professionals to trust the AI output was bound up with the perceived “legibility” of the AI results (the extent to which the AI output made sense to the person) rather than the extent to which the results were “statistically accurate”. For time-constrained healthcare contexts, all this suggests the need to identify other ways of establishing trust in the accuracy of AI models [68, 95]. Next, we synthesize and suggest strategies for establishing trust in AI applications for healthcare, and explain the tradeoffs we made in balancing the use of specific trust mechanisms with other requirements posed by the specific design context.
Balancing Sufficient Model Robustness with Clinical Utility: Amongst existing ideas and approaches to aid user trust in AI outputs are proposals to carefully consider when, and when not, to show predictions. In cases where prediction accuracy is lower and systems are more likely to err, Jacobs et al. [52] suggest that predictions should perhaps not be shown at all. Similar decisions were made by Beede et al. [9], who decided that their AI system for detecting diabetic eye disease from retina images should reject poorer-quality images for analysis to reduce the chance of incorrect assessments, even if the model could technically produce a strong prediction. Findings of a user study revealed how this created tensions among nurses, who reported frustration as images that they had taken as part of routine care, whilst human-readable, kept being rejected for AI analysis. Aside from considerations of technical robustness, Yang et al. [111] further proposed only showing an AI prognosis in cases where there is “a meaningful disagreement” between the clinician's assessment of the situation and the AI recommendation, so as to minimize clinician burden; however, identifying those instances of misalignment may prove challenging. In our work, we too deliberated choices about limiting when predictions are shown in practice to try to maximize the robustness, reliability, and clinical usefulness of the offered predictions. This included: prioritizing a very low false positive error rate (over false negatives); only showing predictions after three outcome measures, when they are more robust; and only showing predictions where they are clinically more relevant (i.e., by excluding predictions for patients with starting scores below RI thresholds, or below caseness). However, those restrictions also mean that predictions are not available earlier within treatment, where they could benefit especially those patients at risk of dropping out in the first 2–3 weeks of treatment; this presents a tradeoff between maximizing model robustness and clinical utility.
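To make these gating rules concrete, the sketch below illustrates how such display logic might look in code. It is purely illustrative: the minimum of three outcome measures follows the rule described above, but the caseness cut-offs, the reliable-improvement margins for PHQ-9 and GAD-7, and all function and variable names are assumptions for this example, not the values or code used in the deployed SilverCloud system.

```python
# Hypothetical sketch of the display-gating logic described above.
# Thresholds below are assumptions for illustration only.

from dataclasses import dataclass
from typing import Optional

MIN_MEASURES = 3      # predictions shown only after three outcome measures (as described above)
PHQ9_CASENESS = 10    # assumed caseness cut-off for depression (PHQ-9)
GAD7_CASENESS = 8     # assumed caseness cut-off for anxiety (GAD-7)
PHQ9_RI_DELTA = 6     # assumed minimum starting score for an RI to be possible (PHQ-9)
GAD7_RI_DELTA = 4     # assumed minimum starting score for an RI to be possible (GAD-7)


@dataclass
class PatientScores:
    phq9: list[int]   # weekly PHQ-9 scores, earliest first
    gad7: list[int]   # weekly GAD-7 scores, earliest first


def show_prediction(scores: PatientScores, model_output: Optional[bool]) -> bool:
    """Return True only if an RI prediction should be surfaced to the supporter."""
    # 1. Require enough outcome measures for the prediction to be reasonably robust.
    if min(len(scores.phq9), len(scores.gad7)) < MIN_MEASURES:
        return False

    # 2. Require clinical relevance: starting scores at or above caseness on at
    #    least one measure, and high enough that a reliable improvement is possible.
    above_caseness = scores.phq9[0] >= PHQ9_CASENESS or scores.gad7[0] >= GAD7_CASENESS
    ri_possible = scores.phq9[0] >= PHQ9_RI_DELTA or scores.gad7[0] >= GAD7_RI_DELTA
    if not (above_caseness and ri_possible):
        return False

    # 3. Only surface a prediction if the model produced one at all.
    return model_output is not None


# Example: a patient with three measures, starting above caseness -> prediction shown.
patient = PatientScores(phq9=[14, 12, 10], gad7=[11, 9, 8])
print(show_prediction(patient, model_output=True))                      # True
print(show_prediction(PatientScores(phq9=[4, 3], gad7=[3, 2]), True))   # False
```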
Engaging with Relevant Stakeholders and Demonstrating the Benefits of the AI: For establishing clinician trust in the accuracy of AI model outputs, and thereby supporting the acceptability of innovative technology within healthcare [94], Sendak et al. [95] highlight the importance of engaging with target users in the design and development of the AI model and user interface. As part of these engagements, the authors suggest demonstrating how the AI helps solve important problems for the specific users (beyond technical innovation), and communicating the benefits of the AI application in ways that are directly relevant to those stakeholders. It is indeed through our engagement with iCBT supporters that we were able to develop a deeper understanding of their work practices and how AI could come to benefit them (Sections 3 and 5), and to learn how to design and communicate AI outputs within the intervention (Section 5). We identified a number of additional insights:
1. Calibration and Fit with Existing Mental Models for Appropriately Interpreting Data Insights: In our work, we observed a certain pragmatism in how supporters evaluated issues of trust and the impact of prediction errors. Research by Cai et al. [16] similarly showed participants implicitly or explicitly describing how no AI tool (or person) is perfect. Likewise, when the supporters in our study reflected on the consequences of false predictions, they described the possibility of making mistakes in assessing a patient's situation not as something newly introduced by the AI, but as something that exists in their current work as well. Supporters would instead assess the benefits of having RI predictions available to them as a way to reduce uncertainty and errors in their own judgment. Simultaneously, they were mindful that the AI insights would offer only one piece of information for their clinical assessment, and that this information comes with its own limitations—like any other data tools and measures. In other words, supporters arrived at this more adjusted, pragmatic understanding of the AI-derived data insight through a comparison with other existing data practices and associated risk mitigation strategies. For example, in interpreting patients' clinical scores (PHQ-9 and GAD-7), supporters also described the need for caution, as those scores may not exactly reflect what is going on for the person. While such clinical assessments can provide a numerical indicator and trend of that person's mental health progression that can inform clinical practice, they noted that these scores should not be given too much weight in isolation, and should instead be assessed in the context of other information provided in patient messages and conversations. This feedback suggests that they treat and evaluate the AI output as a piece of information with limitations similar to those of the clinical scores. Nonetheless, our research findings also highlighted the importance of reminding supporters of the need to balance assessments of the AI output against other patient information to reduce data over-reliance. As this can be more complicated in time-constrained contexts, it also emphasizes the need for careful staff training prior to any AI deployment to help ensure appropriate understanding and use (see also [16]).
2. Balancing Human-AI Interactivity and Interrogations for Trust with Costs of Time and Interference: As mentioned above, we considered the AI predictions a simple data insight that would primarily serve as a useful “signal” to aid prioritization of patient cases, but would otherwise not take up additional supporter time (i.e., to review explanations of the model output), so as to respect already tight review schedules. Other works on CDS tools, however, have demonstrated how mechanisms such as visualizing the most important model parameters [68], or offering refinement tools that allow clinicians to fine-tune or otherwise experiment with AI input parameters to alter algorithmic outputs [15], can play a key role in supporting user understanding of the AI and its capabilities, and promote both AI transparency and trust. We too believe that showing, for example, the most predictive feature(s) can potentially add useful insight above and beyond a binary prediction. We also saw in supporters' evaluation of the Population Comparison (Table 1, concept 4, and Appendix C) that additional explanations of the model can positively contribute to assessments of AI output credibility. Nonetheless, we were mindful that the provision of additional information or other interactivity may also lead to supporters reading too much into the AI result and consuming more of their review time—especially in cases where additional data invites more ambiguity in the interpretation of the findings (cf. findings in Section 5.3.1). Worries about going too deep into a thought process, or down a tangential rabbit hole, when reviewing and interacting with AI insights also surfaced in other related work [15]. Thus, aiming to take a first step towards introducing AI into this specific iCBT context, we prioritized a simpler, easy-to-comprehend AI insight that could “flag up” if a patient was at risk of not fully benefiting from treatment, but would otherwise not take supporters too far away from their own thought processes and focus on the patient. With increasing familiarity and understanding of AI use within clinical care, future work will likely expand the scope and types of AI insights.
3. AI Credibility through Trusted Data Sources and Experiences of (Continued) Use: In onboarding iCBT supporters to our design research, we found that explanations of our algorithm's source of ground truth—the volume and type of data that the RI predictions are based on—contributed to their trust in the model outputs. As reported in our findings (Section 5.2.2), supporters would explicitly remark on the fact that the predictions were based on thousands of previous SilverCloud users. The large scale of the data (>46 K patient cases), its direct mapping to the specific application program, and its sourcing from the very company that the supporters work with and trust, impacted perceptions of the model's credibility and shaped supporters' assessment of the resulting AI output as potentially a “more reliable”, “more objective, evidence-based” indicator than their own judgment. In keeping with other recent recommendations for onboarding health professionals to AI systems [16], this suggests the importance of making key design decisions about data collection, source of ground truth, and model objectives transparent to target users, both in prior-use training and in the interface design, to aid transparency and the development of an appropriate mental model of the AI. Future work will also need to assess how trust in the predictions may develop and become calibrated through continued use (cf. [16, 68]). In addition, for any real-world deployment, it is important to put measures in place to continuously oversee and closely monitor the ongoing performance and reliability of the AI in use [95] such that good performance and trust in the outputs are maintained over time.
4. Trust through Clinical Validation and External Approvals: Finally, there are reports in the literature [16, 52, 68, 95, 111] describing the need for formal, internal and/or external, rigorous clinical assessments of AI model validity, or rather utility, often in the form of evidence-based methods (e.g., randomized controlled trials), research publications in prestigious journals, or FDA approval, in order for health professionals to be able to trust AI outputs. Especially for early-stage AI research and development, this highlights the importance of setting and communicating appropriate expectations of what AI models can realistically achieve to date and at various development stages, so as not to prematurely diminish AI credibility [100]. Gradually developing the AI and building up clinicians' understanding of its workings and limitations—including demonstrations of technical validity and clinical utility—will likely also require the formation of longer-term healthcare partnerships.
6.3 Limitations and Future Work
Amongst the many complex sociotechnical challenges involved in paving the way towards the successful development and adoption of AI interventions within real-world (mental) healthcare practices, our research focused on two specific aspects: the identification, and the appropriate design, of a clinically useful AI application.
While our user research brought forward a wide range of possible and perhaps more impactful applications of AI in this context (see Appendix A and [99]), we chose to pursue RI predictions based on patients' frequent reporting of clinical scores. This was the outcome of a rather complex and lengthy design and development process that intersected multiple research fields to create an AI application that is clinically useful, practically feasible, and implementable within existing care. Above, we discussed the various tradeoffs we made to bring together insights from user research, the clinical literature, and the need to achieve robust AI models from the available data with our goal of pursuing a careful, human-centered approach towards AI integration within an actual mental health context. We also acknowledge that our proposed user interface for RI prediction is the result of multiple design tradeoffs and may not present an optimal data representation—especially with regards to concerns about over-trusting or contesting the AI output, and for guiding supporters' next steps towards specific actions. There further remains an open question about our focus on RI as the chosen outcome metric. It could be argued that RI presents too high a hurdle for supporters to achieve, such that improvement alone, falling short of the threshold for a reliable change, could be portrayed as a negative.
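For context, reliable improvement is commonly defined via the Jacobson–Truax reliable change criterion; the formulation below is one widely used version and may differ from the exact criterion and cut-offs applied to PHQ-9 and GAD-7 scores in this setting.

```latex
% Reliable change criterion (Jacobson & Truax); illustrative only, as the exact
% cut-offs used for RI on the PHQ-9 and GAD-7 in this setting may differ.
\[
  \mathrm{RCI} = \frac{x_{\text{post}} - x_{\text{pre}}}{S_{\text{diff}}},
  \qquad
  S_{\text{diff}} = SD_{1}\sqrt{2\,(1 - r)},
\]
% where $SD_{1}$ is the pre-treatment standard deviation of the outcome measure and
% $r$ its test--retest reliability. A drop in symptom scores counts as a reliable
% improvement only if $|\mathrm{RCI}| > 1.96$ in the direction of improvement.
```

Under this criterion, smaller symptom decreases fall below the reliability threshold and would not count as RI, which is precisely the concern raised above that improvement alone could be portrayed as a negative.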
Moreover, so far, our research only included the perspectives of a small number of iCBT supporters, who worked as PWPs at one specific NHS Trust in the UK. Not only do they present a rather homogenous group of low-intensity intervention specialists, their self-selection to engage in our research to investigate innovative AI uses may have also introduced a positivity bias towards the introduction of any such technology. In future work, we suggest a broader engagement with other stakeholders (i.e., different treatment localities) as well as patients.
As a next step towards addressing some of these issues, future research will need to: (i) delve deeper into clinician's experience with, and potential acceptance of, the developed RI prediction models (e.g., how access to the predictions may shape supporter practices, their sense of agency), (ii) investigate the effectiveness of their deployment for improving patient symptoms of depression and anxiety; and (iii) how these outcomes may differ for iCBT supporters with varying levels of expertise (novices vs. more experienced PWPs). To this end, and separate to the work presented here, researchers at SilverCloud Health have designed a large-scale randomized controlled trial (RCT)
7 to deepen understandings of the opportunities and unique challenges for how AI insights could come to support real-world healthcare practices, and benefit (mental) health patients.