Published by De Gruyter March 12, 2024

Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician?

Kazuya Mizuta, Takanobu Hirosawa, Yukinori Harada and Taro Shimizu
From the journal Diagnosis

Abstract

Objectives

The potential of artificial intelligence (AI) chatbots, particularly the fourth-generation chat generative pretrained transformer (ChatGPT-4), to assist with medical diagnosis is an emerging research area. While much attention has been given to generating differential diagnosis lists, it remains unclear how well AI chatbots can evaluate whether the final diagnosis is included in those lists. This short communication aimed to assess the accuracy of ChatGPT-4 in evaluating differential diagnosis lists, compared with physicians' assessments.

Methods

We used ChatGPT-4 to evaluate whether the final diagnosis was included in the top 10 differential diagnosis lists generated from clinical vignettes by physicians, ChatGPT-3, and ChatGPT-4. Eighty-two clinical vignettes were used: 52 complex case reports previously published by the authors' department and 30 mock cases of common diseases created by physicians from the same department. We compared the agreement between ChatGPT-4 and the physicians on whether the final diagnosis was included in each top 10 list using the kappa coefficient.
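
As an illustration of how such a yes/no judgment could be requested from the model, a minimal sketch using the OpenAI Python SDK follows. The prompt wording, model identifier, and helper function are assumptions for illustration, not the authors' actual protocol.

    # Hypothetical sketch: ask a GPT-4 model whether the final diagnosis
    # appears in a top 10 differential list (prompt and model name assumed).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def list_contains_diagnosis(final_diagnosis: str, differentials: list[str]) -> str:
        numbered = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(differentials))
        prompt = (
            f"Final diagnosis: {final_diagnosis}\n"
            f"Top 10 differential diagnoses:\n{numbered}\n"
            "Does the list include the final diagnosis, counting synonyms "
            "and equivalent terms? Answer Yes or No."
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # favor reproducible judgments
        )
        return response.choices[0].message.content.strip()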

Results

Three sets of differential diagnoses were evaluated for each of the 82 cases, resulting in a total of 246 lists. The agreement rate between ChatGPT-4 and physicians was 236 out of 246 (95.9 %), with a kappa coefficient of 0.86, indicating very good agreement.
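
For reference, the kappa statistic for two binary raters can be recomputed as below. Only the totals (236/246 agreement, kappa of 0.86) are reported above; the 2×2 cell split in this sketch is a hypothetical illustration chosen to be consistent with those totals, not the study's actual counts.

    # Cohen's kappa for two binary raters (ChatGPT-4 vs. physicians).
    def cohens_kappa(a: int, b: int, c: int, d: int) -> float:
        """a and d: both raters agree (yes/yes, no/no); b and c: disagreements."""
        n = a + b + c + d
        po = (a + d) / n                          # observed agreement
        pe = ((a + b) / n) * ((a + c) / n) \
             + ((c + d) / n) * ((b + d) / n)      # agreement expected by chance
        return (po - pe) / (1 - pe)

    # Hypothetical split: 197 joint "included", 39 joint "not included",
    # 10 disagreements -> agreement 236/246 (95.9 %), kappa of about 0.86.
    print(round(cohens_kappa(197, 5, 5, 39), 2))  # 0.86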

Conclusions

ChatGPT-4 demonstrated very good agreement with physicians in evaluating whether the final diagnosis was included in the differential diagnosis lists.


Corresponding author: Takanobu Hirosawa, MD, PhD, Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, 880 Kitakobayashi, Mibu-cho, Shimotsuga-gun, Tochigi, 321-0293, Japan, Phone: +81 282 86 1111, Fax: +81 282 86 4775, E-mail:

Acknowledgments

This study was made possible using the resources from the Department of Diagnostic and Generalist Medicine, Dokkyo Medical University.

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Competing interests: The authors state no conflict of interest.

  5. Research funding: None declared.

  6. Data availability: Not applicable.



Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/dx-2024-0027).


Received: 2024-02-09
Accepted: 2024-02-22
Published Online: 2024-03-12

© 2024 Walter de Gruyter GmbH, Berlin/Boston
