Published online Oct 17, 2024.
https://doi.org/10.3348/kjr.2024.0788
Insufficient Transparency in Stochasticity Reporting in Large Language Model Studies for Medical Applications in Leading Medical Journals
Large language models (LLMs) can potentially reshape healthcare [1, 2, 3]. Numerous studies continue to report on the performance of LLMs in medical applications. Unlike conventional artificial intelligence models, such as convolutional neural networks, which produce consistent outputs for given inputs through deterministic operations, LLMs can generate varying responses even when prompted repeatedly with the exact same query. This phenomenon, known as ‘stochasticity,’ results from random elements in the operation of LLMs [4, 5].
Stochasticity-related variability in LLM outputs presents critical challenges for both medical practice and scientific research. Maintaining consistent information is vital in medical practice; therefore, understanding the extent and nature of this stochasticity-related variability is crucial for assessing LLMs for medical applications. Failure to adequately address or report stochasticity can hinder the replicability of research findings, as highlighted in a recent editorial [6]. Moreover, the lack of transparency in reporting stochasticity raises concerns that this characteristic of LLMs could be exploited to selectively present favorable results. Despite these concerns, published studies have often overlooked the issue of stochasticity. To address this gap, we conducted a systematic analysis of the published literature, focusing on the reporting practices of stochasticity in research studies evaluating the performance of LLMs in medical applications.
A systematic literature search was conducted to identify research articles evaluating the performance of LLMs in medical applications, as illustrated in Figure 1. This search was performed using PubMed, covering articles published between November 30, 2022 (the release date of ChatGPT by OpenAI) and June 25, 2024. The search query employed was: “(large language model) OR (chatgpt) OR (gpt-3.5) OR (gpt-4) OR (bard) OR (gemini) OR (claude) OR (chatbot).” To manage the large number of results, we focused on those perceived as high-quality publications by selecting studies from journals ranked in the top deciles according to the 2023 Journal Impact Factor. These journals were indexed in the Science Citation Index Expanded and were among the top 10% in each of the 59 subject categories within the Clinical Medicine group as defined by the Journal Citation Reports. The number of querying attempts for each query, methods for handling multiple results, and reliability analysis across repeated queries were extracted from the eligible articles. The proportion of articles that clearly reported stochasticity-related issues was determined. Additionally, a subgroup analysis was conducted by excluding studies that used a temperature setting of zero, as this setting makes the model essentially deterministic and thus minimizes the need to address stochasticity [4]. An experienced medical librarian initially identified the article candidates. All subsequent steps were carried out independently by two reviewers with expertise in systematic reviews of medical literature. Any disagreements were resolved through consensus.
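To illustrate why a temperature setting of zero renders an LLM essentially deterministic, the following sketch simulates temperature-scaled softmax sampling over token logits. This is a simplified, self-contained illustration of the general sampling mechanism, not the implementation of any particular LLM; the logits and seed are arbitrary.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample a token index from temperature-scaled softmax probabilities.

    As temperature approaches zero, the distribution collapses onto the
    highest-logit token, so repeated queries yield identical output.
    """
    if temperature <= 1e-6:
        # Greedy decoding: always pick the argmax token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(logits) - 1

rng = random.Random(42)
logits = [2.0, 1.5, 0.5]  # arbitrary illustrative logits

# Temperature 0: 100 repeated "queries" all return the same token.
greedy = {sample_token(logits, 0.0, rng) for _ in range(100)}
# Temperature 1: repeated queries can return different tokens (stochastic).
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}
```

In this toy setting, `greedy` contains a single token index while `sampled` typically contains several, mirroring the stochasticity that studies should report when a nonzero temperature is used.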
Fig. 1
PRISMA flow diagram for systematic literature analysis. JCR = Journal Citation Reports
A total of 159 studies were analyzed (Fig. 1; see Supplementary Table 1 for the full list), of which 147 remained after excluding studies with a temperature setting of zero. The reporting of stochasticity-related issues is summarized in Table 1. Only 15.1% of the studies (24/159) clearly reported these stochasticity-related issues, while 84.3% (134/159) failed to disclose the number of query attempts. Additionally, only 12.7% of the studies (20/158, excluding one study that explicitly reported using a single attempt) included a reliability analysis of the results from repeated querying attempts. These results were consistent in the subgroup of 147 studies that did not use a temperature setting of zero.
Table 1
Reporting of stochasticity-related issues in the published papers
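A reliability analysis across repeated querying attempts need not be elaborate. The sketch below computes a simple percent-agreement measure over repeated responses to the same query; this is a minimal illustration of the idea, not the specific statistic (e.g., a kappa coefficient) used by any of the reviewed studies, and the function name and example responses are hypothetical.

```python
from collections import Counter

def repeated_query_agreement(responses):
    """Proportion of repeated-query responses matching the modal answer.

    Returns 1.0 when all attempts agree; lower values indicate
    stochasticity-related variability in the model's output.
    """
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(responses)

# Three repeated attempts at the same query, two of which agree.
partial = repeated_query_agreement(["A", "A", "B"])
# Five attempts that all agree.
perfect = repeated_query_agreement(["yes"] * 5)
```

Reporting such a measure, together with the number of attempts per query, would let readers judge how robust a study's headline accuracy figures are to stochasticity.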
The literature analysis was limited by the use of PubMed alone. Nevertheless, the findings revealed an unequivocal and substantial deficiency in the reporting of stochasticity-related issues in studies on the performance of LLMs in medical applications. As our analysis focused on studies published in leading medical journals, the reporting quality of studies from lower-tier journals might be even more deficient. A compelling need exists to enhance the transparency and thoroughness of stochasticity reporting, particularly through reporting guidelines [6].
Supplement
The Supplement is available with this article at https://doi.org/10.3348/kjr.2024.0788.
Conflicts of Interest: Chong Hyun Suh, an Assistant to the Editor of the Korean Journal of Radiology, was not involved in the editorial evaluation or decision to publish this article. The remaining author has declared no conflicts of interest.
Author Contributions:
Conceptualization: Chong Hyun Suh.
Funding acquisition: Woo Hyun Shim.
Investigation: all authors.
Methodology: all authors.
Supervision: Chong Hyun Suh.
Writing—original draft: Chong Hyun Suh.
Writing—review & editing: Jeho Yi, Woo Hyun Shim, Hwon Heo.
Funding Statement: This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HR20C0026) and a grant from the Asan Institute for Life Sciences, Asan Medical Center, Seoul, Republic of Korea (2024IP0060-1). The funders had no specific roles in this study.
References
- Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med 2023;6:120.
- Kaddour J, Harris J, Mozes M, Bradley H, Raileanu R, McHardy R. Challenges and applications of large language models. arXiv [Preprint]. 2023 [accessed on August 10, 2024]. Available at: https://doi.org/10.48550/arXiv.2307.10169.