1 Introduction
Artificial Intelligence (AI) is increasingly recognized as having important applications in radiology [57, 82, 101, 121]. In particular, the latest advancements in the creation and adaptation of multimodal foundation models (e.g., BioViL(-T) [8, 17], ELIXR [137], MAIRA [58], Med-PaLM M [128]) invite high expectations of how the use of AI may transform clinical practice through efficiency and quality gains [121], and improved overall patient care. By leveraging the rich, multimodal data that particularly characterizes the healthcare domain, advanced AI models can achieve impressive new and improved capabilities. In this work, we focus on the combination of large language models (LLMs) with vision capabilities, in so-called vision-language models (VLMs). In the context of radiology imaging, this modality combination enables tasks such as: automatically generating a radiology report from a medical image (e.g., [57, 58, 148]); using text queries to answer questions about a radiology image (cf. [137]); or detecting errors in a radiology report by comparing its text with the image.
Despite great AI advances in both natural language processing and image-based analysis, translating recent research and development successes into clinical practice remains challenging [32, 44, 97, 100, 121, 130, 132, 145, 149, 150]. Factors hindering successful AI implementation in radiology are wide-ranging and include: skepticism due to inconsistent AI performance; lack of trust in, or overreliance on, AI-generated outputs; and the need for clinical effectiveness trials (cf. [44]). A key underlying factor is uncertainty about the value that AI applications bring to clinical practice. In what has been described as “a race for getting the technology right before exposing human-end users to new promising AI tools” [100], the field of AI has been criticized for its development “in a vacuum” [88], disconnected from well-defined needs of intended users or use contexts [79, 126]. Closing the gap between technical proofs-of-concept and lab experiments, on the one hand, and the successful integration and deployment of AI-enabled systems within routine care, on the other, requires the adoption of human-centered, participatory approaches [98, 125]. This involves engagement with relevant stakeholders throughout AI system development, starting as early as the ideation and problem formulation stages [25, 59, 69, 91, 134, 144].
Within this broader context, we set out to better understand the design space of VLMs in healthcare, specifically in the context of radiology. Radiology imaging workflows involve referring clinicians, who request an imaging test for a patient, and radiologists, who examine the image and describe their findings and clinical impression. The resulting report goes back to the referring clinicians to inform patient care and treatment [71]. Building on the recent advances in AI research, we focused on designing the right thing [22]: What might be clinically relevant use cases for VLMs to enhance radiology imaging workflows for radiologists and clinicians? Would radiologists want to engage with a draft report generated by AI? Would clinicians find it useful to have report findings visually annotated on an image? What questions might radiologists and clinicians ask if they could query a patient X-ray or CT scan?
As a team of human-computer interaction (HCI) researchers, AI researchers, radiologists, and clinicians, we engaged in an iterative design process to explore these questions. We conducted a three-phase study. The first phase involved in-depth discussions and brainstorming sessions within our team to elicit our clinical team members’ domain expertise and to ideate use cases building on VLM capabilities. We discussed how radiologists interpret images and write reports, and how clinicians review these to make patient care decisions. We brainstormed VLM-based interactions using sketches, scenarios, and wireflows to identify what would be useful and acceptable. In the second phase, we selected four specific use cases to detail further as design concepts: Draft Report Generation, Augmented Report Review, Visual Search and Querying, and Patient Imaging History Highlights. In the third phase, we recruited 13 radiologists and clinicians for user feedback sessions probing whether and how these concepts might be useful for clinical practice, and what concerns they might raise.
Overall, participants perceived the VLM concepts as valuable, but articulated many design requirements for them to be usable and acceptable. In particular, they shared expectations around AI performance and workflow integration (e.g., well-defined, tool-based interactions rather than open-ended queries), and a desire for context-specificity.
This paper makes two main contributions. First, we identify and design VLM use cases to support radiology workflows, and offer initial insights into the perceived value of these concepts. Second, we present a reflective account of our design process as a case study of early-phase AI innovation with clinical stakeholders, from brainstorming to prioritization, concept generation, and initial assessment. We discuss the design implications and future research directions for integrating VLM capabilities into radiology, and healthcare more generally.
3 Overview of Radiology Workflows
Radiology workflows unfold across many clinician roles (Figure 1). First, referring clinicians request an imaging study for a patient (e.g., a chest X-ray). Next, radiographers perform patient scans, and radiology coordinators may prioritize and assign patient images to radiologists. Radiologists then examine patient images and document their findings (descriptions of normal or abnormal observations, such as lesions or nodules) and their clinical impression (a summary that synthesizes the findings and suggests possible causes or further tests). Referring clinicians then review the radiology report, and may consult radiologists with further questions or clarifications before making care decisions. In some cases, patient images are brought to multidisciplinary team (MDT) meetings to discuss patient treatment [71].
A radiology report (Figure 6 in the Appendix) typically consists of a Background section that describes the patient information and the clinical question that referring clinicians seek to answer, and Findings and Impression sections that communicate the radiologist’s interpretation [66]. Different imaging modalities have different workflows. For instance, plain (2D) imaging, such as X-rays, is high-volume and fast-paced, taking minutes to review [37]. Complex (3D) imaging, on the other hand, such as CTs and MRIs, takes more time (10–20 minutes) and cognitive effort [37]. Reports are often written in prose (sometimes called a narrative report), while some research calls for structured reporting approaches (e.g., short, bullet-point style sentences) for improved clarity [45]. Reports are usually written using voice dictation, often utilizing templates or draft reports produced by radiology trainees (interns or residents in the US context) in hospital settings.
Depending on the imaging modality and context, clinicians may review images (especially plain images such as X-rays) before a radiology report becomes available. For example, intensive care physicians immediately review X-rays taken to assess whether a feeding tube is inserted correctly [73]. Whether acted upon or not, all images require a radiology report, as it serves as a legal document in a patient’s record [31]. A major challenge within the radiology workflow is the sheer volume of scans, leading to a backlog of unreported images [108]. Wait times for radiology reports can range from a few days to a week [93]. In recent years, the majority of radiology services in the UK and the US have been outsourced to private vendors to reduce costs and wait times [14, 108].
The majority of human-centered AI research on radiology imaging has focused on mechanisms to explain AI outputs to domain experts [5, 27, 28, 97], such as explaining the diagnostic outputs for specific chest X-ray findings (e.g., cardiomegaly) by highlighting which feature changes in the medical image would lead the AI system to give a different diagnosis [5]. Other work has explored AI acceptance or the impact of using AI systems on radiologists’ diagnostic performance [13, 26, 28]. Relatively little work has investigated current radiology workflows or asked radiologists where they need support [97, 132, 136]. Xie et al. present a rare example of an early-phase needfinding and design study, conducting a three-phase design process to explore opportunities for AI-assisted radiology in the context of X-rays [136]. We build on this existing body of work by investigating radiologists’ and clinicians’ current needs and desired futures for VLM-assisted radiology workflows.
5 Phase 1: Brainstorming VLM Use Cases
Our discussions and brainstorming sessions surfaced many challenges, ranging from requesting a patient scan to prioritization, reporting, and assessment. Our team generated many ideas for improvement (some of which are discussed in prior literature [112]), such as detecting redundant scan orders; detecting poor-quality images at the time of scan to reduce rescans; and optimizing image triage and assignment based on patient urgency and provider subspeciality. We provide a broad overview of these challenges and opportunities using a customer journey map of the radiology workflow (see Supplementary Material).
In this section, we detail our insights into VLM-specific use cases, mainly around radiology reporting and report review, as our focus was on probing the potential utility of VLM capabilities to support radiologists and clinicians. Where relevant, we provide direct quotes from the clinical team members who were involved in in-depth discussions (R1F, C1F) and brainstorming sessions (R2F, C2F), denoted with F (formative study) to distinguish clinical team members from the user feedback study participants.
5.1 Use Cases for Draft Report Generation
In considering how VLM capabilities can support radiology image review and reporting, we discussed whether an AI-generated draft report might provide any value. Interestingly, our radiology team members likened these to reports they receive from their trainees: “I would treat it as a draft report coming from my trainee.” (R2F) R2F touched on the difference between draft and preliminary reports, noting that only senior radiology trainees were allowed to make a report ‘prelim’ – which would be available to the clinical team, and would later get ‘amended’ by senior radiologists for any changes.
This insight led to a detailed discussion of how radiologists currently review, edit, and sign draft or preliminary reports. R1F shared that he looked at the indication (why the request was made) and the image first to form his own opinion before looking at the impression, whereas R2F preferred to immediately review the indication and the impression to decide whether she agreed or disagreed. As to how much effort was involved in reviewing and editing these reports, R2F shared: “Junior trainees’ reports will require more work. Depending on how good it is, I might dictate from scratch … Senior trainees, I usually look at [their reports] and sign. I’ll just say ‘I agree’. I’m not going to correct a typo. I might do small edits to say ‘there is also this’ … If I disagree, I will say ‘My interpretation is this…’ I will dictate if it’s a few sentences or type a few words here and there.”
Throughout our discussions, we repeatedly asked: What makes ‘a good AI experience’ in radiology? Elaborating on what makes a radiology report ‘good’, we teased out three aspects: the report is (1) accurate (i.e., findings are correct); (2) complete (i.e., there are no missing findings); and (3) error-free (i.e., the report does not have typos). This led us to further probe the value proposition AI might bring to radiology in the form of improved report quality and reduced reporting time. Radiology team members pointed out that they often prioritize speed over quality; they had to work very quickly due to the large number of images waiting to be reported. A team member asked whether AI-generated findings in the form of bullet points would provide any value if radiologists still had to dictate the report themselves (to reduce the risk of errors). Radiology team members pushed back, noting that the system would not save them time in reporting and would thus provide little value. They recalled instances where the voice recognition system introduced transcription errors, and stressed that they do not want to spend additional time correcting an AI system’s errors: “[recounting an incorrect transcription of ‘abdominal viscera’ as ‘animal viscera’] It was embarrassing. It should be able to correct these, so that I can sign without having to read what I dictated.” (R2F) These discussions hinted at time savings as a key design requirement for clinician acceptance.
Finally, our conversations brought up the question: Should a draft report be shown to clinicians? R2F reflected that this may lead to tensions in terms of responsibility and radiologist acceptance: “There is an issue of responsibility. Radiologists might think they’re out of the loop” (R2F). Both clinicians and radiologists proposed that AI-generated findings could be used for triage and early flagging of critical findings without presenting too much detail. This became one of the central themes of exploration in our later study.
5.2 Use Cases for Visual Search and Querying
When reviewing visual question-answering capabilities, both clinicians and radiologists brought up that they regularly perform web searches to look for similar images or clinical information relevant to the patient case. These included medical databases and clinical guidelines (e.g., nice.org.uk – The National Institute for Health and Care Excellence guidelines), as well as websites that provide peer-reviewed patient cases (e.g., gpnotebook.com, radiopaedia.org, radiologyassistant.nl, uptodate.com). R2F described two scenarios where searching for similar images was helpful. The first involved situations where she would suspect that there was a pattern in the patient image, but could not be sure what anomaly it might be: “I know there is a pattern but I don’t know what it is.” She would use search queries that described the pattern (e.g., glass opacities CT lung) to find similar images to help with diagnostic assessment. The second involved diagnostic uncertainty about the suspected pattern: “I think this is crazy paving, but I haven’t seen crazy paving in a while.” She would search for a certain pattern on trusted websites (e.g., “crazy paving chest ct radiopaedia”) to see examples of that particular pattern to help disambiguate possible interpretations.
Both radiologist and clinician team members described forming search queries from the abnormality and imaging modality to find similar cases with an overview of pathologies listing common causes: “I’ll look at the differential diagnoses [listed] … [which makes me think] I haven’t considered that, but knowing what I know about the patient, yeah that makes sense.” (R2F) We discussed how radiologists might perform visual searches if they had the ability to query a region in a patient image, for instance, drawing a bounding box and typing ‘is this normal or abnormal’ (image query, text query, or combined image-and-text query). R1F shared that a text query might be preferable: “I would prefer text, because if I’m selecting a lump, anything might look like a lump.” R2F, however, preferred querying by region (“If I could snip a region... so that I don’t have to translate that to a text query.”), suggesting variations in search preferences.
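To ground these query variants, the following is a minimal sketch of how region-based visual search might be composed, assuming a VLM whose image and text encoders map into a shared embedding space. The encoder stubs (`embed_region`, `embed_text`), the mean-embedding fusion, and the toy case library are all illustrative assumptions, not a description of any deployed system.

```python
# Minimal sketch of region-based visual search with image, text, or combined
# queries, assuming a hypothetical VLM with shared image/text embeddings.
import numpy as np

EMB_DIM = 512

def embed_region(image: np.ndarray, bbox: tuple) -> np.ndarray:
    """Hypothetical stand-in: crop the bounding box; a real system would
    encode the crop with the VLM's image encoder."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1]
    rng = np.random.default_rng(abs(hash(crop.tobytes())) % (2**32))
    return rng.standard_normal(EMB_DIM)

def embed_text(query: str) -> np.ndarray:
    """Hypothetical stand-in for the VLM's text encoder."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    return rng.standard_normal(EMB_DIM)

def search(library, query_vec: np.ndarray, k: int = 5):
    """Return the k most similar cases by cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(library, key=lambda item: -cos(item[1], query_vec))[:k]

# Image-only ("snip a region", per R2F), text-only (per R1F), or combined:
image = np.zeros((512, 512), dtype=np.uint8)
region_vec = embed_region(image, (100, 100, 200, 200))
text_vec = embed_text("crazy paving chest CT")
combined = (region_vec + text_vec) / 2  # one simple fusion choice among many

library = [("case-001", embed_text("crazy paving CT chest")),
           ("case-002", embed_text("normal chest X-ray"))]
for case_id, _ in search(library, combined, k=2):
    print(case_id)
```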
Our discussions also touched on clinician-radiologist interactions, and the types of questions asked. Clinicians shared that they might ask clarifying questions for less visible findings: “You said in the image [there is this]... Where is it? Is this normal?” (C2F) Both radiologists and clinicians noted that image annotation tools were part of the reporting software, yet were rarely used. Clinicians also sought information on next steps: “Do you think we need to act on this? What [additional] imaging should we order? Who should we call about this?” (C2F) Radiologist team members shared that such clarification interactions can be overwhelming: “Sometimes clinicians want to hear from their favorite radiologists that they’ve built a trust relationship over the years, which can be overwhelming for the radiologist.” (R2F) We discussed that visual annotations and image search capabilities might reduce some of the back and forth.
5.3 Use Cases for Longitudinal Imaging
VLM capabilities enable the comparison of a patient’s prior images for longitudinal assessment, a core practice in radiology reporting [2, 116]. Reflecting on situations where this capability could be useful, R2F spoke of the challenge of tracking the size of nodules over time: “It might look like the size hasn’t changed much [compared to the most recent image], but actually it’s grown 5 millimeters compared to two years ago.” We envisioned that a system could summarize past images and reports to provide key highlights, such as chronic events, operations, and the trajectory of abnormalities.
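As a minimal illustration of R2F’s point, the sketch below compares the current nodule measurement against every prior study rather than only the most recent one, so slow growth is not masked. The measurements are hypothetical; a real system would derive sizes from prior images and reports.

```python
# Minimal sketch of longitudinal nodule tracking: comparing only to the most
# recent prior study can hide slow growth over a longer horizon.
from datetime import date

# Hypothetical measurements: (study date, nodule diameter in mm).
measurements = [
    (date(2021, 3, 1), 6.0),
    (date(2022, 9, 1), 8.0),
    (date(2023, 3, 1), 11.0),
]

current_date, current_size = measurements[-1]
for past_date, past_size in measurements[:-1]:
    delta = current_size - past_size
    years = (current_date - past_date).days / 365.25
    print(f"vs {past_date}: {delta:+.1f} mm over {years:.1f} years")
# A summary would flag the total growth since the earliest comparable study
# (here +5.0 mm over two years), not just the change since the last one.
```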
7 Phase 3: Eliciting User Feedback
In the third phase, we sought feedback from a broader set of clinicians to understand whether, how and when the VLM-assisted radiology imaging concepts might be useful for clinical practice. This section reports participants’ feedback on each design concept, capturing perceived benefits and suggestions for improvement.
7.1 Draft Report Generation
Expectation of near-perfect AI performance: All radiologists expressed that having an AI-generated draft report would be valuable as long as the model performed very well, with high sensitivity and specificity. Describing how AI reporting errors could add burden, one radiologist explained: “If it misses something, I’ve got to say that. If it’s false positive, I either have to click to remove it from the report entirely, or I have to change something.” (R2) To better understand what would be considered good enough AI performance for this use case, we asked: “Out of 10 reports, how many are you willing to correct?” Almost all replied “1 out of 10” (R1, R2, R3) or “5 to 10 out of 100” (R5), suggesting the need for near-perfect performance for AI-generated draft reports to provide real utility. Only one radiologist, a trainee, responded “3 out of 10”, noting that the system could make them more confident even if it did not reduce their workload: “It [would be] getting stuff right enough for me to feel comfortable just to edit the 30% of cases where it’s going to be wrong.” (R4) This suggests potential added benefits for trainee learning.
Accounting for fast-paced practice & high workload: Echoing our initial findings, radiologists noted that their practice is fast-paced and high-volume: “It is literally going as fast as humanly possible. Scrolling through things, looking at image, saying whatever I can, go over the spellchecks. Make sure I didn’t say anything really wrong and then sign and get on the next one.... I just need to get my job done fast. I don’t get paid more for quality.” (R2) Consequently, participants mainly spoke of value as time savings, especially when reading multi-slice images, such as those captured by CT, that take significantly longer to review and report than, for example, X-rays, and images that are outside of their subspecialty (R1, R2, R3, R5): “I might be a seasoned reporter for lung or cardiac, but as every week it happens, we’ll get a neck CT... when you’re not doing it day in day out, it’s extremely difficult. You would love an AI which is at least giving you the salient findings.” (R5) This suggests a draft report may reduce the risk of key clinical observations being missed and could assist with image interpretation confidence. Apart from time savings, participants also mentioned potential benefits in reduced cognitive burden. For simpler X-ray images, R2, for example, mentioned: “I can do [X-rays] in 10 seconds... [but] there’s the cognitive burden. Having to say the words and go through it all is painful.” R4, who was a trainee, reflected that the main benefit of the system would be reducing reporting time rather than the time spent on image interpretation: “Regardless of what the system says, I’m still going to go through my same search patterns for the findings and interpreting those... the only area where it’s going to be saving time is in creating that draft [prose] report because then I don’t have to worry about the wording and if I’ve missed something”.
Preference for short, standardized reporting: Interestingly, when probed on whether short-form sentences could be useful, all radiologists shared that they prefer to work with bullet-point style findings instead of prose text. Several participants highlighted the literature on structured reporting, which is proposed as a solution for improving report quality and consistency [45]:
“The idea of a narrative report happened in 1898 and we’ve not moved on from it. It’s full of hedging, it’s full of weird language that only radiologists use: ‘likely to be’, ‘cannot exclude’. [This is] what we should be moving away from rather than using the technology to reverse engineer the future into what we got.” (R3)
Commenting on how the bullet-list findings in the prototype were presented, R1 reflected: “My reporting style is much more telegraphic. So I’ll say ‘large right pleural effusion’, that’s exactly how I’d phrase. I wouldn’t say ‘there is’ or ‘is seen’ or all those kinds of phrases. I don’t think [they] are helpful, especially for findings.” Similarly, R3 advocated for structured findings for consistency and objectivity: “Rather than saying ‘suspected mild cardiomegaly’, you say ‘heart is enlarged’ or ‘heart enlarged’, which is a statement. It may be right or wrong, but it’s objective.” All of this suggests a preference for concise, accurate, and consistent reporting over the historic use of more ambiguous prose text, something that AI reporting could assist in standardizing.
Favoring prioritized findings & confidence indications to assist image interpretation: Additionally, radiologists described the benefits of having findings structured by their clinical relevance and the system’s confidence in the generated outputs. For example, a system’s capability to compare a current study to a patient’s prior image enables ordering report findings by what is new, what has changed, and what is unchanged, which gives important context to aid image interpretation and subsequent clinical action. For instance, the sudden ‘new’ appearance of a pneumothorax would require urgent clinical attention, whilst a reduction in consolidation in the patient’s chest after a pneumonia diagnosis may suggest that antibiotic treatment is working. Furthermore, all participants (R1, R2, R3, R5) suggested having confidence indications to communicate AI uncertainty: “Rather than using ‘likely to be’, ‘unlikely to be’, ‘possibly’... ‘Likely prostate cancer 4 out of 5’, [which is] more robust and easier to interpret.” (R3) One radiologist suggested displaying the model confidence and ranking findings on this basis: “[Say for a finding] I don’t totally agree, I don’t disagree. But if its confidence is only like 56%, I’m just going to knock that out.” (R2)
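A minimal sketch of one way such an ordering could work is shown below, grouping findings by change status (new first) and then sorting by model confidence. The data structure, field names, and example findings are illustrative assumptions, not part of any evaluated prototype.

```python
# Minimal sketch of a presentation order for draft-report findings:
# group by change status relative to the prior study, then by confidence.
from dataclasses import dataclass

@dataclass
class Finding:
    text: str
    status: str        # "new" | "changed" | "unchanged"
    confidence: float  # model score in [0, 1]; could be shown as "4 out of 5"

STATUS_PRIORITY = {"new": 0, "changed": 1, "unchanged": 2}

def order_findings(findings: list) -> list:
    """New findings first, then changed, then unchanged; within each group,
    highest-confidence findings come first."""
    return sorted(findings, key=lambda f: (STATUS_PRIORITY[f.status], -f.confidence))

findings = [
    Finding("Right pleural effusion, unchanged", "unchanged", 0.91),
    Finding("New left apical pneumothorax", "new", 0.78),
    Finding("Consolidation reduced in right lower lobe", "changed", 0.85),
]
for f in order_findings(findings):
    print(f"[{f.status}] {f.text} (confidence {f.confidence:.0%})")
```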
Impressions present key interpretative work: While short-form, structured reporting was preferred for findings, some radiologists (R1, R3) shared that unstructured prose text is more appropriate for the impression section, which is the “non-objective, doctor bit” (R3): “The main focus of communication between us and the team taking care of the patient is that impression part of the report. So it’s really important to me to have that correctly crafted.” (R1) R5 reflected that the findings could be useful, yet the impression would be more difficult to get right: “We get a lot of [outsourced] reports from teleradiology, which just tell you what the findings are. A clinician will want to know the clinical impression.... Is a report better than no report? I think it is fine if it gets the findings right, even if it doesn’t do all the synthesis clinically.” Given the importance of the impression section and its broader interpretative work, which may draw on additional contextual information, the feedback from our participants suggests that clinicians may want to remain in charge of this task; this positions AI’s role closer to the extraction of relevant findings from an image than to its overall clinical interpretation.
Broadening uses of (prose) draft reports: When asked how an AI-generated draft report should be presented, all radiologists suggested presenting both bullet points and a prose report together, with bullet points serving to assist review and prose serving clinical communication: “I could just get rid of [a bullet point] and it takes it out of the report, that’s great. Because editing at that level is so much easier than editing on the report.” (R2) A few radiologists noted that a patient-facing report could also be generated based on the list of findings (R1, R3), suggesting additional use cases and user groups.
In response to making an AI-generated draft report available to clinicians, all radiologists thought it could be useful for triage purposes, especially in situations where clinicians could escalate cases – as long as it did not look too final: “The subtlety there is that a draft report sounds too final in the health culture. But a ‘prelim’ or a ‘wet read’, that’s a very rough, not final thing. The clinicians would take that information and use their judgement to call the radiologist or wait for the report.” (R2) Alongside legal, regulatory, and other organizational requirements to approve any such AI use, this requires a system design that appropriately communicates and clearly discloses the preliminary nature of AI-generated content.
7.2 Augmented Report Review
Locating image findings & their prioritization by clinical relevance: Exploring how VLM capabilities could be utilized to augment the experience of clinicians reviewing a radiology report, all participants described image annotations as helpful, especially for complex images like CTs. Most clinicians shared that they do not receive training to read CTs: “I look at CT scans, but I’m not trained to look at CT scans. I’m trained to look at X-rays.” (C5) Some (C3, C6, C7) noted that they are comfortable reading CTs mainly within their subspeciality: “[In a brain scan] I would 100% be able to localize where things are. But if it was a report of a liver I would struggle.” (C7) They pointed out that for such multi-slice images, current systems require them to manually navigate to the image slice indicated in the report to view abnormalities. Having “clickable” findings, either on the report itself or in an overview section, that would direct them to the relevant image location was perceived as valuable for saving time and making it easier to differentiate what is in the image: “[Looking at a CT scan that had multiple areas of edema infarction] As a clinician, you’re like, well, this must be the bit that’s bleeding, but this must be the inflamed bit. But they look similar to me.” (C1) Clinicians additionally described several abnormalities that can be difficult to interpret: “Lymph nodes are the thing that people often miss on chest X-rays. Small pneumothoraces are difficult to see. The difference between a pneumothorax and a bullae [is] a common problem with the misreading of chest X-rays.” (C6) As such, they ascribed value to AI image annotations in aiding their understanding of the reported findings. Furthermore, similar to radiologists’ feedback, clinicians reflected that an overview section could highlight the most important and actionable findings: “Report overview would work best if you constrain it to show the top 6 salient features. We can get a lot of information overload if there are 25 of them.” (C7)
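As a minimal sketch, “clickable” findings imply a report representation in which each finding carries a link to its image location; the field names below are hypothetical rather than drawn from any specific PACS or viewer API.

```python
# Minimal sketch of the data needed for "clickable" findings: each report
# finding links to an image location (series, slice, bounding box).
from dataclasses import dataclass

@dataclass
class ImageLocation:
    series_id: str
    slice_index: int  # slice to navigate to in a CT/MRI stack
    bbox: tuple       # (x0, y0, x1, y1) region to highlight

@dataclass
class LinkedFinding:
    text: str
    location: ImageLocation
    salience: int  # rank for the overview; C7 suggested capping at ~6 items

def overview(findings: list, top_k: int = 6) -> list:
    """Top-k most salient findings, to avoid information overload."""
    return sorted(findings, key=lambda f: f.salience)[:top_k]

findings = [
    LinkedFinding("Pulmonary edema", ImageLocation("CT-chest", 42, (120, 80, 220, 160)), salience=2),
    LinkedFinding("Acute bleed", ImageLocation("CT-chest", 57, (60, 90, 130, 150)), salience=1),
]
print([f.text for f in overview(findings)])  # clicking a finding would jump to its slice
```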
Building an appropriate mental model of the AI: When discussing more broadly how AI assistance could feature within workflows, one clinician differentiated, for example, a radiology assistant from a clinical assistant, whereby the former is embedded within the image viewer for radiology-specific tasks, whereas the latter (conceived as answering broader clinical questions) would be expected to sit within the EHR system: “If I’ve got a radiologist at my fingertips, I’d restrict to asking it the kind of questions I might be asking the radiologist. Therefore it belongs in [the radiology] screen, whereas some of the other things like, how should I treat this patient? I think that belongs in the main body of EHR rather than in this radiology reporting system.” (C4) This commentary highlights the importance of workflow integration for building an appropriate mental model of the AI’s likely purpose and capabilities.
Cautioning about chat format & overly complex queries: In response to the AI assistant being embodied as a chatbot, several clinicians (C1, C3, C5, C7) commented that they were unlikely to use an assistant in chat form due to time demands and lack of trust in generated, potentially high-risk responses: “I don’t need a chatbot function where I’m talking and stuff. I haven’t got the time for it.” (C5) Some clinicians raised concerns about responsibility in clinical decision making: “I’m not all of a sudden going to ask ChatGPT ‘What am I going to do with the brain tumor?’ I’m going to ask my friend who’s a specialist of this. There’s a question of responsibility.” (C1) Similarly, when asked what they would not want to use an AI assistant for (whether in chat or any other form), C7 – an oncologist – emphasized that he would not use it as a prognostic tool: “The radiology assistant shouldn’t be used to make predictions. It’s not a radiomic analysis in that sense.” Similarly, a cardiothoracic physician indicated that she would not ask what is unknowable: “You wouldn’t ask things that are impossible to know. Things that are too complicated, like [the patient is] on six other drugs, how are they going to interact in combination? I wouldn’t bother asking, I wouldn’t trust the answer cause it’s too individualized.” (C6) Another concern was the reinforcement of radiology observations that report negative findings. Here, clinicians stressed that they weigh positive findings more than negative ones: “[If someone asks] ‘Can you confirm there really isn’t a small pneumothorax on this?’ Then the answer from the assistant should be ‘No, you can’t’.” (C7) In other words, clinicians cautioned against the use of AI for more ambitious, high-risk VLM use cases involving prognosis, more complex patient cases, or the definite negation of abnormalities, given the higher likelihood of errors and their negative implications for patient care.
Focusing on task- and patient-specific, functional queries: However, clinicians described an array of rather functional, task-specific queries where they could imagine AI assisting by either connecting them to information or extracting it on their behalf. For example, clinicians envisioned the AI assistant performing image-based quantifications such as calculating the cardiothoracic ratio (the maximum diameter of the heart relative to that of the thoracic cavity); Mirels’ score (indicating the risk of bone fracture); sarcopenia index (muscle-fat ratio to track weight loss in cancer patients); and waist-to-hip ratio in CT scans. All of these are currently calculated manually, often using phone apps: “It would be perceived added value if it could be quickly extracted from [an image] read, as you wouldn’t calculate it unless you needed.” (C7) In keeping with these more functional tasks, participants often envisioned AI-assisted interactions in familiar forms, such as tool buttons, alerts, or reminders for specific conditions and workflows; thereby describing expectations of the AI being designed as a workflow tool. One clinician expressed: “I almost would want the prompt ‘Have you thought about this?’” (C5) whilst simultaneously cautioning that such prompts could easily become annoying: “[For guidelines] I want to be able to click [on a finding], guidance, then it searches and brings it up for me. I don’t want pop-up fatigue.” (C5)
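As a minimal sketch of one such quantification, the cardiothoracic ratio divides the maximal horizontal cardiac diameter by the maximal internal thoracic diameter on a frontal chest X-ray, with values above 0.5 conventionally suggesting cardiomegaly. In the sketch below, the widths are assumed to come from an upstream segmentation or landmark-detection step; the numbers are illustrative.

```python
# Minimal sketch of the cardiothoracic ratio (CTR) on a frontal chest X-ray:
# CTR = max horizontal cardiac diameter / max internal thoracic diameter.
# Widths would be measured upstream (e.g., via segmentation or landmarks).

def cardiothoracic_ratio(cardiac_width_mm: float, thoracic_width_mm: float) -> float:
    if thoracic_width_mm <= 0:
        raise ValueError("thoracic width must be positive")
    return cardiac_width_mm / thoracic_width_mm

# Illustrative values; CTR > 0.5 is the conventional cardiomegaly threshold.
ctr = cardiothoracic_ratio(cardiac_width_mm=152.0, thoracic_width_mm=280.0)
print(f"CTR = {ctr:.2f}" + (" (enlarged)" if ctr > 0.5 else ""))
```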
Furthermore, clinicians described how such practical, patient-specific AI functionality could be achieved even more effectively if VLM capabilities were combined with patient EHR data:
“You want it to give you, here’s their allergies, here’s their weight, here’s their renal function, here’s their swallow plan. Do they have a cannula in place? And here’s their other medications that could interact with that medication. If it can pull from the system that type of information, excellent, you’re saving me a huge amount of time.” (C5)
Criticizing much of the more generic information probed in our concept sketch (e.g., clinical features, differential diagnoses), clinicians emphasized the benefits of including additional EHR data to provide patient-context-relevant information: “I don’t need [it to remind me] the 10 common causes of pleural effusion. What will be really helpful is for it to know that actually in this context, hypothyroidism becomes not the 29th thing, but actually upping [that to] your top five you should be considering... because this patient’s got some other clues or signs.” (C3) Similarly, surfacing a patient’s eligibility for clinical trials or specific hospital- or NHS-level guidelines was described as useful (C1, C2, C5, C6, C7), re-emphasizing the need for AI information provision to be adapted to each patient’s specific context.
7.3 Visual Search and Querying
Aiding interpretation via comparison with relevant patient cases: All clinicians and radiologists shared that they perform web searches to find similar images, though not too frequently (e.g., once a week). For this concept, being able to visually search radiology images and reports within the context of their hospital and patient population was valued the most: “Often you look at a CT scan on [internet] and you go ‘my CT scans don’t look anything like that’ [because it was a different generation CT scanner]. So it’s very important to visualize the abnormality in the context of the type of imaging you would see in your center.” (C7) Most clinicians and radiologists wanted to query what is normal, or to query by age and sex: “Recently we had a big debate: What does a 16 year old thymus look like normally?” (C6) An intensive care unit (ICU) clinician also described the difficulty of assessing rare conditions where they overlap with other abnormalities, because such cases are too infrequent and unfamiliar:
“Nasogastric (NG) tubes in the wrong place on a chest X-ray on someone in ICU with pneumonia is even less common [than misplaced NG tubes alone]. So people have to simulate abnormalities in their head and compare the X-ray with their simulation. Showing [cases] similar to your patient would be useful.” (C1)
All this suggests potential benefits of VLM use in retrieving or simulating other patient cases that enable comparative image assessment, whether for rare and complex cases (e.g., querying ‘NG tube’ + ‘pneumonia’) or for normal cases, to assist interpretation. For such uses, participants again positioned the AI system as a tool for extracting, searching, or filtering information rather than as a conversational interface: “I’d have it as a tool that I can work with, and not conversation.” (R1) Describing how they would use queries to refine image search, one clinician added: “To then be able to type in pneumonia for example, and then the other [search results] go away. ‘Just female patients’ or ‘I’m only interested in people over 75’.” (C7)
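A minimal sketch of such query-based refinement over retrieved cases is shown below, following C7’s example of narrowing results by condition, sex, and age; the case records and filter fields are hypothetical.

```python
# Minimal sketch of refining similar-case search results with text filters
# (e.g., "pneumonia", "just female patients", "only people over 75").
from dataclasses import dataclass

@dataclass
class Case:
    report: str
    sex: str
    age: int

def refine(results: list, condition: str = None,
           sex: str = None, min_age: int = None) -> list:
    """Apply only the filters the user specified, narrowing the result set."""
    out = results
    if condition:
        out = [c for c in out if condition.lower() in c.report.lower()]
    if sex:
        out = [c for c in out if c.sex == sex]
    if min_age is not None:
        out = [c for c in out if c.age >= min_age]
    return out

results = [Case("Pneumonia with misplaced NG tube", "F", 81),
           Case("Normal chest X-ray", "M", 40)]
print(refine(results, condition="pneumonia", sex="F", min_age=75))
```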
AI insights to provide reassurance for ‘human’ interpretation: Reflecting on when in their workflow visual search and query capabilities could be useful, some clinicians suggested using them for follow-up questions about the radiology report: “Radiologist might have looked at it, but just not commented on it. I just want the reassurance, is that normal or not? Is it a nodule? Is it a mass? Is it a piece of consolidation? Same goes with head scans. Does this look like quite a full brain? Does the patient have hydrocephalus or not?” (C5) Yet, other clinicians reflected that even with AI functionality to retrieve, for example, similar images, they might still want to ask a radiologist for reassurance: “Would I be reassured if it flashed up a whole load of other people’s chest X-rays and said, this was reported as normal and this was reported as normal, for yours is probably normal. I’m not sure that I would, but maybe.” (C6) Interestingly, none of the participants expected the system to provide an answer; they preferred example patient cases to inform their own decisions: “Here’s a bunch of pictures, you decide. And that’s reasonable, right? I’m not asking some kind of segmentation to then take responsibility for the decisions.” (C1) This suggests a preference for AI use that reassures and aids human image interpretation, rather than AI as an interpretative agent in itself.
7.4 Patient Imaging History Highlights
Reducing laborious information gathering: All radiologists and clinicians highly valued having a summary of a patient’s prior images highlighting key events and chronic conditions. Searching through a patient’s history was a major part of the clinical workflow, and participants recognized the potential for time savings: “Half of my life is kind of spent chasing notes and pre-existing conditions. A sentence or two, just about the radiology, would save me a lot of time.” (C1) Some clinicians (C3, C7) spoke of a time-reward trade-off: “The problem with image interpretation is, how far back do you look when interpreting for change?” (C7) They expressed feelings of guilt as they mostly look through recent reports, but not images, due to lack of time. Radiologists, on the other hand, shared that they take a thorough look at past images, yet expressed a desire for an automated summary: “That is a pretty standard practice already for radiologists, but certainly being able to more easily get at that imaging history is going to be a help.” (R1)
Facilitating relevant patient information access: Probing what would be useful to highlight, participants mainly described the historical status of the patient, such as the baseline lung architecture before a patient had pneumonia. Examples included past operations (e.g., Do they have a collapsed lung?), key events (e.g., When did their pacemaker first appear or their sternotomy wires first go in?), and changes in abnormalities (e.g., New masses, fluid consolidation, rib fractures – are they old or new?). When asked whether a text summary would still be useful in comparison to more multimodal VLM capabilities (e.g., a text summary of key events along with image annotations), most participants commented that linked reports and visual highlights could aid verification: “If you clicked on it [for it to show you annotated images], then you can corroborate.” (C6) However, several participants emphasized that even a text summary would provide an improvement over the current state: “We would willingly ingest that information even if it was a little bit more clunky.” (C7) Finally, a few clinicians (C3, C6) pointed out that, unlike radiologists, they review prior reports in an interface that only presents a list view without images. As such, they thought AI could still be useful if it could at least point them to important reports to guide their navigation to the relevant image: “I have to click on each one individually, wait for it to load.... Even if I had a little red flag next to it saying ‘open this one, this has got money in it’.” (C3) This again highlights the prospective utility of AI in surfacing the clinically most relevant insights, and suggests that utility may already be achieved with simpler AI capabilities.