An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study
BMC Medical Imaging volume 24, Article number: 254 (2024)
Abstract
Background
The impression section integrates key findings of a radiology report but can be subjective and variable. We sought to fine-tune and evaluate an open-source Large Language Model (LLM) in automatically generating impressions from the remainder of a radiology report across different imaging modalities and hospitals.
Methods
In this institutional review board-approved retrospective study, we collated a dataset of CT, US, and MRI radiology reports from the University of California San Francisco Medical Center (UCSFMC) (n = 372,716) and the Zuckerberg San Francisco General (ZSFG) Hospital and Trauma Center (n = 60,049), both under a single institution. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, an automatic natural language evaluation metric that measures word overlap, was used for automated evaluation. A reader study with five cardiothoracic radiologists was performed to more strictly evaluate the model's performance on a specific modality (CT chest exams) against a radiologist subspecialist baseline. We stratified the results of the reader performance study by diagnosis category and original impression length to gauge case complexity.
Results
The LLM achieved ROUGE-L scores of 46.51, 44.2, and 50.96 on UCSFMC and, upon external validation, ROUGE-L scores of 40.74, 37.89, and 24.61 on ZSFG across the CT, MRI, and US modalities respectively, implying a substantial degree of overlap between the model-generated impressions and the impressions written by the subspecialist attending radiologists, but with a degree of degradation upon external validation. In our reader study, the model-generated impressions achieved overall mean scores of 3.56/4, 3.92/4, 3.37/4, 18.29 s, 12.32 words, and 84, while the original impressions written by a subspecialist radiologist achieved overall mean scores of 3.75/4, 3.87/4, 3.54/4, 12.2 s, 5.74 words, and 89 for clinical accuracy, grammatical accuracy, stylistic quality, edit time, edit distance, and ROUGE-L score respectively. The LLM achieved the highest clinical accuracy ratings for acute/emergent findings and on shorter impressions.
Conclusions
An open-source fine-tuned LLM can generate impressions to a satisfactory level of clinical accuracy, grammatical accuracy, and stylistic quality. Our reader performance study demonstrates the potential of large language models in drafting radiology report impressions that can aid in streamlining radiologists’ workflows.
Introduction
Radiology reports synthesize a radiologist’s interpretations which are essential in communicating the current condition of a patient [1]. Radiology reports typically consist of an exam type, clinical history, comparison, technique, radiation dose, findings, and impression section [2]. The impression section is of utmost importance, as it summarizes the key findings of the radiology report and carries the most weight in influencing the clinical decision-making of the consulting physician [3, 4]. As it stands, the process of generating the impression section is not always standardized and can be subjective [5]. Automatically generating impressions can help to ensure that essential findings are not omitted while also keeping the impressions succinct.
Since the Large Language Models (LLMs) ChatGPT and GPT-4 were released in November 2022 and March 2023 respectively, multiple studies have shown how these LLMs could be applied to a variety of radiological tasks such as structured reporting, question answering on a radiology board-style examination, and response to common lung cancer questions [6,7,8]. Closely related to our work, GPT-4 was shown to generate impressions for radiology reports [9].
Given that ChatGPT and GPT-4 are closed-source models available only via web APIs, we believe the crucial next step is to clinically validate the performance of fine-tuned open-source large language models, enhancing the access and replicability that will greatly aid future development in this area. Especially for private clinical datasets, open-source models offer the advantage of eliminating the need to upload sensitive patient data to a cloud service, since they can instead be trained and deployed locally [10].
In this study, our objective was to evaluate the performance of a fine-tuned open-source LLM in generating impressions to summarize radiology reports over multiple imaging modalities and hospitals which would test the model’s capacity to generalize across different settings. We aimed to evaluate the fine-tuned model’s performance through a clinical reader performance study on a specific modality with subspecialty radiologists.
Methods
Datasets and Corpora
The radiology reports in this study were retrospectively collected with the University of California San Francisco’s Institutional Review Board approval and informed consent waiver, following the Helsinki Declaration of 1975, as revised in 2013. All methods were performed in accordance with the relevant guidelines and regulations. We gathered CT, US, and MRI reports from two hospitals under one institutional affiliation. The University of California San Francisco Medical Center (UCSFMC) is an academic tertiary referral center, while the Zuckerberg San Francisco General Hospital (ZSFG) and Trauma Center is a level-1 trauma center and county safety net hospital. A total of 372,716 radiology reports between January 1, 2021 and October 22, 2022 were consecutively and comprehensively sourced from UCSFMC, while a total of 60,049 radiology reports between January 1, 2022 and December 29, 2022 were consecutively and comprehensively sourced from ZSFG. In terms of reporting style, both UCSFMC and ZSFG follow structured reporting. Moreover, both hospitals utilize a system where reports are initially prepared by residents and then reviewed and finalized by attending radiologists, who provide revisions before signing off. As such, all reports reflect the work and approval of the attending radiologist. Table 1 summarizes the demographics of the datasets sourced from UCSFMC and ZSFG.
We excluded all outside hospital imported cases as they did not have associated radiology reports in the system, reports with findings stored in clinical notes, reports that did not separate the findings and impression sections, and reports that shared the same accession numbers. From UCSFMC, a total of 19,436 reports were excluded, leaving 353,280 reports that were used in our study. A total of 102,172, 12,772, and 12,772 patients were assigned for training, validation, and testing respectively, resulting in training, validation, and test datasets of 282,525, 35,631, and 35,124 reports respectively. From ZSFG, a total of 126 reports were excluded, which resulted in an independent test set of 59,923 reports from 27,530 patients (Fig. 1).
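As a rough illustration of the patient-level partitioning described above, the following sketch assigns whole patients to the training, validation, and test sets so that no patient contributes reports to more than one split. The dataframe and its column name (`patient_id`) are hypothetical and are not taken from the study code.

```python
# Sketch of a patient-level train/validation/test split so that no patient's
# reports appear in more than one partition. Column names are illustrative.
import numpy as np
import pandas as pd

def split_by_patient(reports: pd.DataFrame, n_val: int, n_test: int, seed: int = 0):
    """Assign whole patients to train/val/test and return the report subsets."""
    rng = np.random.default_rng(seed)
    patients = reports["patient_id"].unique()
    rng.shuffle(patients)
    test_ids = set(patients[:n_test])
    val_ids = set(patients[n_test:n_test + n_val])
    train_ids = set(patients[n_test + n_val:])
    return (
        reports[reports["patient_id"].isin(train_ids)],
        reports[reports["patient_id"].isin(val_ids)],
        reports[reports["patient_id"].isin(test_ids)],
    )
```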
Model development
We fine-tuned the open-source Text-to-Text Transfer Transformer (T5) large language model to generate impressions [11]. T5 is an instruction-tuned model initially pre-trained on the Colossal Clean Crawled Corpus (C4), a dataset of web pages scraped from the internet [12]. The remainder of each radiology report, excluding the impression, serves as the input text and the impression section serves as the output text; both sequences are tokenized and then fed into the model (Fig. 2). PyTorch (version 2.1.0) and the HuggingFace transformers library (version 4.35.0) were used to implement these methods [13, 14]. We used the AdamW optimizer with a learning rate of 0.0003, a batch size of 4, and gradient accumulation over 32 batches for an effective batch size of 128 [15]. All code is available at https://github.com/bdrad/radiological-report-impression-generation.
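The released repository is the authoritative implementation; as a minimal sketch of the setup described above, the snippet below uses the HuggingFace Seq2SeqTrainer with the reported hyperparameters (AdamW at a learning rate of 3e-4, a per-device batch size of 4, and gradient accumulation over 32 batches). The checkpoint name, column names, sequence lengths, and epoch count are assumptions and may differ from the study code.

```python
# Sketch of fine-tuning a T5-style checkpoint on (report body -> impression)
# pairs with the hyperparameters reported above. Checkpoint, column names,
# sequence lengths, and epoch count are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

MODEL_NAME = "google/flan-t5-large"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Input: the report without its impression; target: the impression section.
    inputs = tokenizer(batch["report_without_impression"],
                       max_length=1024, truncation=True)
    targets = tokenizer(text_target=batch["impression"],
                        max_length=256, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs

# `train_reports` / `val_reports` are hypothetical dataframes with the two
# columns used above (e.g. produced by the split sketched earlier).
train_ds = Dataset.from_pandas(train_reports).map(tokenize, batched=True)
val_ds = Dataset.from_pandas(val_reports).map(tokenize, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="t5-impressions",
    learning_rate=3e-4,                  # AdamW is the Trainer's default optimizer
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,      # effective batch size of 128
    num_train_epochs=3,                  # assumed; not reported in the text
    evaluation_strategy="epoch",
)
trainer = Seq2SeqTrainer(
    model=model, args=args,
    train_dataset=train_ds, eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```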
Automated lexical evaluation metrics
The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, the standard performance metric for automated text summarization, was calculated to evaluate the model's performance in impression generation [16]. ROUGE-1 and ROUGE-2 measure the overlap of unigrams and bigrams, respectively, between the original impression and the generated impression, while ROUGE-L is based on the longest common subsequence, measuring sentence-level structural similarity. A higher ROUGE score indicates a higher-quality summary, with a maximum score of 100. We calculated the ROUGE-1, ROUGE-2, and ROUGE-L scores over the UCSFMC test dataset and the ZSFG independent test dataset.
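As one way to compute these metrics, the sketch below uses Google's rouge-score package and reports F1 on a 0-100 scale; the paper does not state which ROUGE implementation or variant (F1 versus recall) was used, so both choices here are assumptions.

```python
# Sketch of the ROUGE evaluation described above, using the `rouge-score`
# package (an assumed choice of implementation).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_f1(reference: str, generated: str) -> dict:
    """Return ROUGE F1 scores on a 0-100 scale."""
    scores = scorer.score(reference, generated)
    return {name: 100 * s.fmeasure for name, s in scores.items()}

# Example usage with made-up impressions:
# rouge_f1("No acute cardiopulmonary process.",
#          "No acute cardiopulmonary abnormality.")
```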
Clinical reader performance study
We conducted a reader performance study with five board-certified cardiothoracic radiologists with eight, seven, six, eight, and six years of experience (inclusive of residency and fellowship training). The study involved 60 chest CT reports from 60 unique patients sampled from the UCSFMC test dataset. The sample size was determined by the time and resources required for attending radiologists to manually evaluate and edit impressions. Furthermore, a similar sample size was used by Sun et al., who previously conducted a reader study for automatic impression generation based on 50 reports and limited the evaluation to chest X-rays [9]. We focused our reader study on chest CTs to impose a more stringent and granular analysis of the errors of generated impressions against a subspecialist cardiothoracic radiologist baseline.
Forty of the reports were randomly selected to show the generated impression, while the remaining 20 showed the original impression written by the attending thoracic radiologist. This reader performance study structure, involving both model-generated and radiologist-written final impressions, was chosen to better evaluate the LLM, including any potential errors or unexpected behaviors. We note that the CT images were not provided to the radiologists.
Each radiologist was asked to rate the impression in terms of clinical accuracy, grammatical accuracy, and stylistic quality, and could optionally edit the impression. Edit time and edit distance (number of words changed) were recorded to quantitatively measure workflow efficiency. We also calculated the ROUGE scores of the original or generated impression with respect to the radiologist's edits. We note, however, that these scores cannot be directly compared to the previously calculated ROUGE scores: the earlier scores were computed against an independently written original impression, whereas these are measured against a reader's edited version of the displayed impression.
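The paper does not specify exactly how the word-level edit distance was counted; the following is a minimal sketch of one reasonable interpretation, counting every inserted, deleted, or replaced word between the displayed impression and the reader's edited version.

```python
# Sketch of a word-level edit distance ("number of words changed").
# The exact counting rule used in the study is not stated; this is one
# plausible interpretation based on word-level diff opcodes.
from difflib import SequenceMatcher

def word_edit_distance(original: str, edited: str) -> int:
    a, b = original.split(), edited.split()
    ops = SequenceMatcher(a=a, b=b).get_opcodes()
    # Count every word that was inserted, deleted, or replaced.
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in ops if tag != "equal")
```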
We also stratified the complexity of the reports in the reader study according to diagnosis category and the length of the original impression. To determine each study's diagnosis category, a thoracic radiologist with eight years of experience who did not participate in the reader performance study examined the clinical history and original impression of each report and assigned it to one of the following categories: Cancer staging, Acute/emergent findings, Interstitial lung disease, Nodules, Lung transplant, and Aneurysm. For model evaluation, the Interstitial lung disease, Nodules, Lung transplant, and Aneurysm categories were consolidated into a single "Other" category. In terms of impression length, each original impression was classified into one of three categories: Short, Medium, and Long. The reports were sorted by original impression word length, with Short, Medium, and Long corresponding to the bottom 20, middle 20, and top 20 reports respectively.
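A minimal sketch of this length-based stratification, assuming a 60-row table with an `impression` column (a hypothetical layout), could look as follows.

```python
# Sketch of the impression-length stratification: sort the 60 reader-study
# reports by original impression word count and label the bottom, middle,
# and top 20 as Short, Medium, and Long. Assumes exactly 60 rows and an
# "impression" column, both of which are illustrative assumptions.
import pandas as pd

def stratify_by_length(reports: pd.DataFrame) -> pd.DataFrame:
    reports = reports.copy()
    reports["impression_words"] = reports["impression"].str.split().str.len()
    reports = reports.sort_values("impression_words").reset_index(drop=True)
    reports["length_category"] = ["Short"] * 20 + ["Medium"] * 20 + ["Long"] * 20
    return reports
```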
Statistical analysis
A Mann–Whitney U test was used to calculate the P values comparing the ratings for the model-generated impressions and the original impressions written by an attending radiologist in terms of clinical accuracy, grammatical accuracy, stylistic quality, edit time, and edit distance [17]. 95% CIs were generated for the ROUGE scores and reader performance study metrics using bootstrapping with resampling. A multi-rater intraclass correlation was computed to measure inter-rater variability for the ordinal metrics of clinical accuracy, grammatical accuracy, and stylistic quality as applicable [18]. All statistical analysis was conducted in Python 3.10.9 using the NumPy (version 1.26.4), SciPy (version 1.11.1), and Pingouin (version 0.5.4) packages [19,20,21].
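A sketch of these analyses, assuming illustrative variable and column names, could look as follows: a two-sided Mann-Whitney U test via SciPy, percentile-bootstrap confidence intervals via NumPy, and a multi-rater intraclass correlation via Pingouin.

```python
# Sketch of the statistical comparisons described above. Variable and column
# names are illustrative, not taken from the study code.
import numpy as np
import pingouin as pg
from scipy.stats import mannwhitneyu

def compare_ratings(model_scores, radiologist_scores):
    """Two-sided Mann-Whitney U test between the two groups of ratings."""
    return mannwhitneyu(model_scores, radiologist_scores, alternative="two-sided")

def bootstrap_ci(values, stat=np.mean, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a summary statistic."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    boots = [stat(rng.choice(values, size=len(values), replace=True))
             for _ in range(n_boot)]
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def icc(ratings_long):
    """Intraclass correlation over a long-format table with one row per
    (report, reader) pair and columns report_id, reader_id, score."""
    return pg.intraclass_corr(data=ratings_long, targets="report_id",
                              raters="reader_id", ratings="score")
```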
Results
Dataset characteristics
For UCSFMC, we excluded 15,803 reports that were non-reportable because they were outside-hospital studies, 715 reports with findings stored in clinical notes, 2,912 reports that did not separate the findings and impression sections, and 6 reports that shared the same accession numbers. For ZSFG, we excluded 124 reports that did not separate the findings and impression sections and 2 reports that shared the same accession numbers (Fig. 1).
After dataset exclusion, we tabulated the age, sex, imaging modality, patient status (emergency/inpatient/outpatient), stat designation (stat/non-stat), and body part imaged for the UCSFMC training, validation, and test datasets and the ZSFG independent test dataset (Table 1). In addition to the demographics of the 60 CT chest reports used in the reader performance study, Table 2 documents the stratifications by diagnosis category and original impression length used to gauge case complexity.
Automated lexical evaluation metrics
Table 3 depicts the automated lexical metrics achieved by the large language model on both the UCSFMC and ZSFG test datasets. The ROUGE-1, ROUGE-2, and ROUGE-L scores quantify how closely the model-generated impressions adhere to the finalized impressions written by attending radiologists. The large language model achieved a ROUGE-1 score of 53.22 (95% CI: 52.88, 53.62), ROUGE-2 score of 51.26 (95% CI: 50.87, 51.65), and ROUGE-L score of 46.51 (95% CI: 46.13, 46.89) on the CT modality for the UCSFMC test dataset. The model achieved a slightly lower ROUGE-1 score of 46.57 (95% CI: 46.37, 46.79), ROUGE-2 score of 31.87 (95% CI: 31.65, 32.09), and ROUGE-L score of 40.74 (95% CI: 40.52, 40.93) on the CT modality for the ZSFG independent test dataset. We observe a degree of degradation in model quality when externally validated for the CT modality.
The large language model achieved a ROUGE-1 score of 51.26 (95% CI: 50.87, 51.65), ROUGE-2 score of 35.36 (95% CI: 34.91, 35.79), and ROUGE-L score of 44.2 (95% CI: 43.78, 44.65) on the MRI modality for the UCSFMC test dataset. The model achieved a slightly lower ROUGE-1 score of 45.04 (95% CI: 44.59, 45.5), ROUGE-2 score of 29.47 (95% CI: 29, 29.95), and ROUGE-L score of 37.89 (95% CI: 37.43, 38.31) on the MRI modality for the ZSFG independent test dataset. Similarly, we observe a degree of degradation in model quality when externally validated for the MRI modality.
The large language model achieved a ROUGE-1 score of 56.41 (95% CI: 55.89, 56.9), ROUGE-2 score of 41.15 (95% CI: 40.54, 41.76), and ROUGE-L score of 50.96 (95% CI: 50.46, 51.48) on the US modality for the UCSFMC test dataset. The model achieved a lower ROUGE-1 score of 32 (95% CI: 31.75, 32.24), ROUGE-2 score of 13.87 (95% CI: 13.65, 14.08), and ROUGE-L score of 24.61 (95% CI: 24.38, 24.85) on the US modality for the ZSFG independent test dataset. Here, we observe a greater degree of degradation in model quality when externally validated for the US modality.
Clinical reader performance study
The model achieved an overall mean clinical accuracy of 3.56 (3.46, 3.67) out of 4, grammatical accuracy of 3.92 (3.89, 3.96) out of 4, stylistic quality of 3.37 (3.26, 3.47) out of 4, edit time of 18.29 (14.85, 21.98) seconds, and edit distance of 12.32 (9.88, 14.97) words. The radiologist baseline, which was the original cardiothoracic radiologist's impression, achieved an overall mean clinical accuracy of 3.75 (3.61, 3.88) out of 4, grammatical accuracy of 3.87 (3.79, 3.94) out of 4, stylistic quality of 3.54 (3.42, 3.65) out of 4, edit time of 12.2 (8.48, 16.48) seconds, and edit distance of 5.74 (4.06, 7.72) words (Table 4). Moreover, with respect to the edited impressions, the model-generated impressions achieved mean ROUGE-1, ROUGE-2, and ROUGE-L scores of 85 (82.89, 88.22), 81 (77.04, 84.41), and 84 (80.72, 87.13) respectively, while the original impressions written by an attending radiologist achieved mean scores of 89 (85.96, 92.69), 85 (76.90, 89.30), and 89 (84.76, 92.31) respectively (Table 5).
Table 4 also depicts the mean scores of the model-generated and radiologist-written impressions stratified by diagnosis category and original impression length. For reports that contained acute/emergent findings, the LLM achieved its highest clinical accuracy rating of 3.64 (3.45, 3.8) out of 4, whereas the radiologist baseline achieved a clinical accuracy of 3.71 (3.46, 3.91) out of 4. The model slightly underperformed in the "Other" category (Interstitial lung disease, Nodules, and Lung transplant), achieving a clinical accuracy rating of 3.4 (3.16, 3.62) out of 4, while the radiologist baseline achieved a clinical accuracy of 3.86 (3.66, 4) out of 4. In terms of impression length, the LLM performed best in clinical accuracy on shorter impressions, achieving a rating of 3.66 (3.47, 3.81) out of 4 in the Short category, and achieved lower ratings of 3.45 (3.23, 3.63) and 3.58 (3.38, 3.75) out of 4 in the Medium and Long categories respectively.
Multi-rater intraclass correlation scores were calculated to measure the inter-rater reliability of the group of radiologists who participated in the reader performance study. Given the limited variance of the grammatical accuracy metric (σ² = 0.098) compared with clinical accuracy (σ² = 0.58) and stylistic quality (σ² = 0.47), and given the limited ability of the intraclass correlation coefficient to quantify agreement when variance is low, we report intraclass correlations for clinical accuracy and stylistic quality only [18]. The level of agreement among the readers was moderate for both metrics, with ICC scores of 0.67 and 0.57 for clinical accuracy and stylistic quality respectively.
Error analysis
Figure 3 illustrates the model-generated impression that received the lowest average clinical accuracy, along with the remainder of the report and edits from the panel of thoracic radiologist readers. We note the subjectivity involved in assigning a specific interstitial pneumonia pattern and the influence of the attending radiologist's stylistic preferences, including the addition and omission of certain findings.
Figure 4 illustrates the model-generated impression that received the lowest average stylistic quality. We note how the model tends to be verbose, including specific aspects of the findings section, such as the size of a lymph node or the particular series and slice on which a finding is located, that radiologists tend not to include in the impression section. We also note the interplay between stylistic quality and clinical accuracy, wherein the model failed to note whether the findings were non-specific or concerning for metastasis.
Figure 5 enumerates the modifications for every impression that received a rating of 1 out of 4 in terms of clinical accuracy, from both model-generated and radiologist-written impressions. This comprehensive breakdown illustrates a variety of clinical errors from both model-generated and radiologist-written impressions across different diagnosis categories.
Figure 6 illustrates sample cases that compare the ROUGE score across different pairs of impressions. We note that ROUGE scores by definition measure adherence to the reference impression. We observe how ROUGE scores occasionally reflect stylistic quality better than clinical accuracy, underscoring that it is important not to rely on them alone and to conduct reader performance studies to more reliably measure model performance.
Discussion
We have evaluated a fine-tuned open-source large language model's ability to generate impressions from the remainder of a radiology report over multiple imaging modalities and hospitals. On the UCSFMC test dataset, the LLM achieved ROUGE-1, ROUGE-2, and ROUGE-L scores of 53.22, 51.26, and 46.51 on CT reports, 51.26, 35.36, and 44.2 on MRI reports, and 56.41, 41.15, and 50.96 on US reports. We also tested the LLM's performance on the ZSFG independent test set, where it achieved scores of 46.57, 31.87, and 40.74 on CT reports, 45.04, 29.47, and 37.89 on MRI reports, and 32, 13.87, and 24.61 on US reports. For the reader performance study, the model-generated impressions achieved overall mean scores of 3.56/4, 3.92/4, 3.37/4, 18.29 s, and 12.32 words for clinical accuracy, grammatical accuracy, stylistic quality, edit time, and edit distance respectively, while the original subspecialist radiologist impression baseline achieved overall mean scores of 3.75/4, 3.87/4, 3.54/4, 12.2 s, and 5.74 words respectively. Additionally, with respect to the readers' edited impressions, the model-generated impressions achieved ROUGE-1, ROUGE-2, and ROUGE-L scores of 85 (82.89, 88.22), 81 (77.04, 84.41), and 84 (80.72, 87.13) respectively, while the original impressions written by an attending radiologist achieved mean scores of 89 (85.96, 92.69), 85 (76.90, 89.30), and 89 (84.76, 92.31) respectively. The LLM achieved the highest clinical accuracy ratings for acute/emergent findings and on shorter impressions.
The ROUGE score results on the two hospital test datasets demonstrate a substantial overlap between the model-generated impressions and the original impressions written by attending radiologists. These scores may be affected by variability in how radiologists write impressions, but they act as a general gauge of potential model degradation under external validation. We sought to address this limitation in interpreting the ROUGE score by additionally conducting a reader performance study to assess clinically whether a model-written impression, though potentially different from the original radiologist's impression, is of satisfactory quality. With respect to the reader edits, the model-generated impressions achieved a substantially higher set of ROUGE scores, also evidenced by a relatively low edit distance to the revised impressions written by the readers. This set of ROUGE scores demonstrates the potential to have LLMs preliminarily draft impressions that can subsequently be revised and finalized by radiologists. Overall, we note that the ROUGE scores can only be interpreted in relative terms, as the ROUGE scores for the automated lexical metrics measure the overlap of independently written impressions, while the reader study ROUGE scores measure the deviation from radiologists' revisions of an already-written impression.
Our findings demonstrate the need to develop evaluation frameworks in which automated lexical metrics are complemented by a reader performance study for a more comprehensive analysis of the generated impressions. Our reader performance study enables a more granular and comprehensive analysis of the strengths and flaws of the large language model in generating impressions against a thoracic radiologist baseline. Aside from quantitative metrics such as clinical accuracy, grammatical accuracy, and stylistic quality, the reader study also examines impression quality through the radiologists' word-for-word edits and edit times, simulating a workflow that integrates large language models into radiology reporting. For instance, our stratified analysis by diagnosis reveals that the LLM performs best in the cancer staging and acute/emergent diagnosis categories, but slightly underperforms in the Other category, which includes interstitial lung disease cases. In particular, for the impression that received the lowest average rating in terms of clinical accuracy, the radiologist readers noted that a model-generated impression mentioning a UIP pattern instead of an NSIP pattern may adversely affect clinical care [23]. The clinical risks of LLMs have also been explored in other investigations examining the use of LLMs for biomedical applications [24,25,26]. These error cases, though few, demonstrate the necessity of radiologist supervision at this stage if the model were to be integrated into clinical use.
Several studies have previously sought to automatically generate impressions using large language models. For instance, Sun et al. and Ma et al. have examined how to adapt GPT-4 and ChatGPT to generate impressions for radiology reports [9, 22]. We build upon this body of work on automatic impression generation for radiology report summarization and focus on evaluating fine-tuned open-source large language models, which greatly enhances study replicability compared with closed-source models such as ChatGPT and GPT-4. Furthermore, the open-source nature of our study and the full release of the associated code allow for further development in this area, in contrast with the closed-source algorithms currently available in industry.
Our results present a framework for fine-tuning and evaluating an open-source large language model for automatic impression generation. Subsequent work in this area can focus on a prospective clinical validation of LLMs in enhancing the clarity and consistency of radiologist-written impressions, improving communication between physicians and radiologists. One such implementation could involve a hybrid approach in which LLMs draft radiology report impressions that radiologists subsequently revise, with the resulting time savings and cost reductions from the streamlined workflow measured and evaluated.
Our study had several limitations. First, our automated lexical methodology of measuring the adherence of large language model output using the ROUGE score is not directly interpretable and can only be used in relative terms to gauge model performance (e.g., relative to other imaging modalities or hospital datasets). Second, our reader performance study included only sixty cases, owing to the prohibitive cost and intractability of a large-scale reader study involving manual editing and evaluation by subspecialist cardiothoracic radiologists. Our reader study was primarily intended to identify key areas where large language models can provide value in generating impressions; a more comprehensive analysis with a larger sample size and disease category stratification is deferred to future work. Third, only two English-language hospitals were included in the study, which implies that additional evaluation is needed to establish the utility of the model for a broader clinical audience. Fourth, a further methodological limitation is that, given the scope of the study, we were unable to measure time savings in terms of absolute gain. To obtain an unbiased estimate of the time taken for an attending radiologist to write an impression with and without this model, the large language model would need to be directly integrated into the clinical workflow via the dictation software, requiring additional regulatory approval, which we defer to future work.
Conclusions
In conclusion, we have evaluated a fine-tuned open-source large language model's capacity to generate impressions for radiology reports across multiple imaging modalities and hospitals. Our reader performance study demonstrates that LLMs have the potential to greatly improve radiologists' workflow efficiency by drafting preliminary versions of impressions and to contribute to the quality of radiology reports.
Availability of data and materials
The data for this project came from UCSF radiology reports. The data can be shared with researchers under a data use agreement and with the approval of the data committee at UCSF.
Abbreviations
- LLM: Large Language Model
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation
- T5: Text-to-Text Transfer Transformer
References
Hartung MP, Bickle IC, Gaillard F, Kanne JP. How to create a great radiology report. RadioGraphics. 2020;40(6):1658–70. https://doi.org/10.1148/rg.2020200020. Radiological Society of North America.
Hall FM. Language of the Radiology Report. Am J Roentgenol. 2000;175(5):1239–42. https://doi.org/10.2214/ajr.175.5.1751239. American Roentgen Ray Society.
Good practice for radiological reporting. Guidelines from the European Society of Radiology (ESR). Insights Imaging. 2011;2(2):93–6. https://doi.org/10.1007/s13244-011-0066-7.
Gershanik EF, Lacson R, Khorasani R. Critical finding capture in the impression section of radiology reports. AMIA Annu Symp Proc. 2011;2011:465–9.
Brady AP. Error and discrepancy in radiology: inevitable or avoidable? Insights Imaging. 2016;8(1):171–82. https://doi.org/10.1007/s13244-016-0534-1.
Adams LC, Truhn D, Busch F, et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology. 2023;307(4):e230725. https://doi.org/10.1148/radiol.230725. Radiological Society of North America.
Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582. https://doi.org/10.1148/radiol.230582. Radiological Society of North America.
Rahsepar AA, Tavakoli N, Kim GHJ, Hassani C, Abtin F, Bedayat A. How AI responds to common lung cancer questions: ChatGPT versus Google Bard. Radiology. 2023;307(5):e230922. https://doi.org/10.1148/radiol.230922. Radiological Society of North America.
Sun Z, Ong H, Kennedy P, et al. Evaluating GPT4 on impressions generation in radiology reports. Radiology. 2023;307(5):e231259. https://doi.org/10.1148/radiol.231259. Radiological Society of North America.
Mukherjee P, Hou B, Lanfredi RB, Summers RM. Feasibility of using the privacy-preserving large language model vicuna for labeling radiology reports. Radiology. 2023;309(1):e231147. https://doi.org/10.1148/radiol.231147. Radiological Society of North America.
Chung HW, Hou L, Longpre S, et al. Scaling Instruction-finetuned language models. arXiv; 2022. https://doi.org/10.48550/arXiv.2210.11416.
Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(1):140:5485-140:5551.
Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. arXiv; 2019. https://doi.org/10.48550/arXiv.1912.01703.
Wolf T, Debut L, Sanh V, et al. HuggingFace’s transformers: state-of-the-art natural language processing. arXiv; 2020. https://doi.org/10.48550/arXiv.1910.03771.
Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv; 2019. https://doi.org/10.48550/arXiv.1711.05101.
Lin C-Y. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics; 2004. p. 74–81. https://aclanthology.org/W04-1013. Accessed 15 Apr 2023.
Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50–60. https://doi.org/10.1214/aoms/1177730491. Institute of Mathematical Statistics.
Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psychol Rep. 1966;19(1):3–11. https://doi.org/10.2466/pr0.1966.19.1.3.
Virtanen P, Gommers R, Oliphant TE, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–72. https://doi.org/10.1038/s41592-019-0686-2. Nature Publishing Group.
Vallat R. Pingouin: statistics in Python. J Open Source Softw. 2018;3(31):1026. https://doi.org/10.21105/joss.01026.
Harris CR, Millman KJ, van der Walt SJ, et al. Array programming with NumPy. Nature. 2020;585(7825):357–62. https://doi.org/10.1038/s41586-020-2649-2. Nature Publishing Group.
Ma C, Wu Z, Wang J, et al. ImpressionGPT: an iterative optimizing framework for radiology report summarization with ChatGPT. arXiv; 2023. https://doi.org/10.48550/arXiv.2304.08448.
du Bois R, King TE. Challenges in pulmonary fibrosis · 5: The NSIP/UIP debate. Thorax. 2007;62(11):1008–12. https://doi.org/10.1136/thx.2004.031039.
Wornow M, Xu Y, Thapa R, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med. 2023;6(1):1–10. https://doi.org/10.1038/s41746-023-00879-8. Nature Publishing Group.
Li H, Moon JT, Purkayastha S, Celi LA, Trivedi H, Gichoya JW. Ethics of large language models in medicine and medical research. Lancet Digit Health. 2023;5(6):e333–5. https://doi.org/10.1016/S2589-7500(23)00083-3. Elsevier.
Shen Y, Heacock L, Elias J, et al. ChatGPT and other large language models are double-edged swords. Radiology. 2023. https://doi.org/10.1148/radiol.230163. Radiological Society of North America.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
AS and JHS conceived the study, analyzed results, and interpreted findings. AY and JHS advised and supervised study design and evaluation. AS and JHS drafted the manuscript. AS, GC, CS, YJL, MV, SS, JS, JL, AY, JHS contributed substantially to manuscript revision. All authors have carefully read and take responsibility for the integrity of its findings. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The radiology reports in this study were collected retrospectively following the University of California San Francisco’s Institutional Review Board approval (reference #: 303383) and informed consent waiver, following the Helsinki Declaration of 1975 as revised in 2013. All methods were performed in accordance with the relevant guidelines and regulations.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Serapio, A., Chaudhari, G., Savage, C. et al. An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study. BMC Med Imaging 24, 254 (2024). https://doi.org/10.1186/s12880-024-01435-w