Abstract
Chatbot-based multimodal AI holds promise for collecting medical histories and diagnosing ophthalmic diseases using textual and imaging data. This study developed and evaluated the ChatGPT-powered Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) to enable patient self-diagnosis and self-triage. IOMIDS included a text model and three multimodal models (text + slit-lamp, text + smartphone, text + slit-lamp + smartphone). The performance was evaluated through a two-stage cross-sectional study across three medical centers involving 10 subspecialties and 50 diseases. Using 15640 data entries, IOMIDS actively collected and analyzed medical history alongside slit-lamp and/or smartphone images. The text + smartphone model showed the highest diagnostic accuracy (internal: 79.6%, external: 81.1%), while other multimodal models underperformed or matched the text model (internal: 69.6%, external: 72.5%). Moreover, triage accuracy was consistent across models. Multimodal approaches enhanced response quality and reduced misinformation. This proof-of-concept study highlights the potential of chatbot-based multimodal AI for self-diagnosis and self-triage. (The clinical trial was registered on ClinicalTrials.gov on June 26, 2023, under registration number NCT05930444.)
In the medical field, health data is inherently multimodal, encompassing both physical measurements and natural-language narratives1. Ophthalmology, a discipline that relies heavily on multimodal information, requires detailed patient histories and visual examinations2. Consequently, multimodal machine learning is becoming increasingly important for medical diagnostics in ophthalmology. Previous studies on ophthalmic diagnostic models have underscored the substantial potential of image recognition AI in automating tasks that demand clinical expertise3. Recently, chatbot-based multimodal generative AI has emerged as a promising avenue for advancing precision health, integrating health data from both imaging and textual perspectives. However, commonly utilized public models such as GPT-4V and Google's VLM, while demonstrating some diagnostic capabilities, are currently deemed inadequate for clinical decision-making in ophthalmology4,5. In addition, these models cannot yet actively collect patient histories through natural-language human-computer interactions or accurately interpret images acquired with non-specialist equipment such as smartphones. Overcoming these challenges in multimodal AI is crucial and could pave the way for self-diagnosis of ophthalmic conditions in home settings, yielding substantial socioeconomic benefits6. Therefore, there is both a need and potential for further development of multimodal AI models to diagnose and triage ophthalmic diseases.
In the domain of gathering medical histories through human-computer interactions, the introduction of prompt engineering, a streamlined, data- and parameter-efficient technique for aligning large language models with the intricate demands of medical history inquiries, represents a significant advancement7. Expanding on this innovation, we propose an interactive ophthalmic consultation system that utilizes AI chatbots' robust text analysis capabilities to autonomously generate inquiries about a patient's medical history based on their chief complaints. While previous research has not directly utilized chatbots for collecting ophthalmic medical histories, evidence indicates that current AI chatbots can provide precise and detailed responses to ophthalmic queries, such as retina-related multiple-choice questions, myopia-related open-ended questions, and questions about urgency triage8,9,10. Furthermore, these chatbots excel in delivering high-quality natural language responses to medical inquiries, often exhibiting greater empathy than human doctors11. Therefore, we propose that the system should be capable of formulating comprehensive diagnoses and tailored recommendations based on the patient's responses. A significant drawback of previous studies on linguistic models is their reliance on simulated patient histories or perspectives created by researchers, which lack validation in real-world, large-scale clinical settings. This limitation raises uncertainties about the practical applications and autonomous deployment capabilities of these models. Hence, it is imperative to develop and validate an embodied conversational agent in authentic clinical settings, where patients themselves contribute data.
Regarding ophthalmic imaging, previous research has primarily focused on slit-lamp photographs, yielding promising results. For instance, a novel end-to-end fully convolutional network has been developed to diagnose infectious keratitis using corneal photographs12. Recently, exploration into smartphone videos has shown their effectiveness in diagnosing pediatric eye diseases, potentially aiding caregivers in identifying visual impairments in children13. Algorithms using smartphone-acquired photographs have also proven valuable in measuring anterior segment depth, which is particularly useful for screening primary angle closure glaucoma14. Together, these studies underscore the utility of both slit-lamp photographs and smartphone-acquired images in addressing diverse challenges in ophthalmic diagnostics. However, a significant limitation of previous AI imaging and multimodal studies is their narrow focus on single diseases. These pipelines often operate independently within their domains, lacking integration across different fields to enhance overall functionality15. For example, applying a model designed for herpes zoster ophthalmicus to triage a patient with cataracts may yield irrelevant outputs16. Moreover, the reliance on single disease types increases the risk of poor model generalizability due to spectrum bias, a disparity in disease prevalence between the model's development population and its intended application population17. This limitation is particularly problematic for home-based self-diagnosis and self-triage. To address these challenges, a recent study developed and evaluated a novel machine learning system optimized for ophthalmic triage, using data from 9825 patients across 95 conditions18. Thus, multimodal AI capable of diagnosing and triaging multiple ophthalmic diseases is essential for improving care across diverse populations.
Based on the aforementioned context, our study aims to develop an Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS), an embodied conversational agent integrated with the AI chatbot ChatGPT. This system is designed for multimodal diagnosis and triage using eye images captured by slit-lamp or smartphone, alongside medical history. Clinical evaluations will be conducted across three centers, focusing on a comprehensive investigation of 50 prevalent ophthalmic conditions. The primary objective is to assess diagnostic effectiveness, with a secondary focus on triage performance across 10 ophthalmic subspecialties. Our research aims to explore the application of AI in complex clinical settings, incorporating data contributions not only from researchers but also directly from patients, thereby simulating real-world scenarios to ensure the practicality of AI technology in ophthalmic care.
Results
Overview of the study and datasets
We conducted this study at three centers, collecting 15640 data entries from 9825 subjects (4554 male, 5271 female) to develop and evaluate the IOMIDS system (Fig. 1a). Among these, 6551 entries belong to the model development dataset, 912 entries belong to the silent evaluation dataset, and 8177 entries belong to the clinical trial dataset (Supplementary Fig. 1). In detail, we first collected a doctor-patient communication dialog dataset of 450 entries to train the text model through prompt engineering. Next, to assess the diagnostic and triage efficiency of the text model, we collected Dataset A (Table 1), consisting of simulated patient data derived from outpatient records. We then gathered two image datasets (Table 1, Dataset B and Dataset C) for training and validating image diagnostic models, which contain only images and the corresponding image-based diagnostic data. Dataset D, Dataset E and Dataset F (Table 1) were then collected to evaluate image diagnostic model performance and develop a text-image multimodal model. These datasets include both patient medical histories and anterior segment images. Following in silico development of the IOMIDS program, we collected a silent evaluation dataset to compare the diagnostic and triage efficacy among different models (Table 1, Dataset G). The early clinical evaluation consists of internal evaluation (Shanghai center) and external evaluation (Nanjing and Suqian), with 3519 entries from 2292 patients in Shanghai, 2791 entries from 1748 patients in Nanjing, and 1867 entries from 1192 patients in Suqian. Comparison among these centers reveals significant differences in subspecialties, disease classifications, gender, age, and laterality (Supplementary Table 1), suggesting that these factors may influence model performance and should be considered in further analyses.
a Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) is an embodied conversational agent integrated with ChatGPT designed for multimodal diagnosis using eye images and medical history. It comprises a text model and an image model. The text model employs classifiers for chief complaints, along with question and analysis prompts developed from real doctor-patient dialogs. The image model utilizes eye photos taken with a slit-lamp and/or smartphone for image-based diagnosis. These modules combine through diagnostic prompts to create a multimodal model. Patients with eye discomfort can interact with IOMIDS using natural language. This interaction enables IOMIDS to gather patient medical history, guide them in capturing eye lesion photos with a smartphone or uploading slit-lamp images, and ultimately provide disease diagnosis and ophthalmic subspecialty triage information. b Both the text model and the multimodal models follow a similar workflow for text-based modules. After a patient inputs their chief complaint, it is classified by the chief complaint classifier using keywords, triggering relevant question and analysis prompts. The question prompt guides ChatGPT to ask specific questions to gather the patient's medical history. The analysis prompt considers the patient's gender, age, chief complaint, and medical history to generate a preliminary diagnosis. If no image information is provided, IOMIDS provides the preliminary diagnosis along with subspecialty triage and prevention, treatment, and care guidance as the final response. If image information is available, the diagnosis prompt integrates image analysis with the preliminary diagnosis to provide a final diagnosis and corresponding guidance. c The text + image multimodal model is divided into text + slit-lamp, text + smartphone, and text + slit-lamp + smartphone models based on image acquisition methods. For smartphone-captured images, YOLOv7 segments the image to isolate the affected eye, removing other facial information, followed by analysis using a ResNet50-trained diagnostic model. Slit-lamp captured images skip segmentation and are directly analyzed by another ResNet50-trained model. Both diagnostic outputs undergo threshold processing to exclude non-relevant diagnoses. The image information is then integrated with the preliminary diagnosis derived from textual information via the diagnosis prompt to form the multimodal model.
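To make the image branch in panel c concrete, the following is a minimal inference sketch assuming a ResNet50 classifier over four classes (cataract, keratitis, pterygium, others); the YOLOv7 cropping step is stubbed out, and the checkpoint path, placeholder tensors, and class ordering are assumptions rather than the actual IOMIDS implementation.

```python
# Minimal sketch of the IOMIDS image branch: smartphone photos are cropped to the
# affected eye before classification, while slit-lamp photos are classified directly.
# Checkpoint path, class ordering, and the detector call are hypothetical.
import torch
from torchvision import models

CLASSES = ["cataract", "keratitis", "pterygium", "others"]   # assumed ordering

def build_classifier() -> torch.nn.Module:
    model = models.resnet50(weights=None)
    model.fc = torch.nn.Linear(model.fc.in_features, len(CLASSES))
    # model.load_state_dict(torch.load("slit_lamp_resnet50.pt"))  # hypothetical fine-tuned weights
    return model.eval()

def crop_eye_with_yolov7(photo: torch.Tensor) -> torch.Tensor:
    """Placeholder for the YOLOv7 step that isolates the affected eye and
    removes other facial information from a smartphone photo."""
    return photo  # a trained detector would return only the eye region here

@torch.no_grad()
def image_probabilities(photo: torch.Tensor, classifier: torch.nn.Module) -> dict:
    probs = torch.softmax(classifier(photo.unsqueeze(0)), dim=1).squeeze(0)
    return dict(zip(CLASSES, probs.tolist()))

classifier = build_classifier()
smartphone_photo = torch.rand(3, 224, 224)           # placeholder for a preprocessed photo
eye_region = crop_eye_with_yolov7(smartphone_photo)  # slit-lamp photos skip this step
print(image_probabilities(eye_region, classifier))
```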
Development of the IOMIDS system
To develop the text model, we categorized doctor-patient dialogs according to chief complaint themes (Supplementary Table 2). Three researchers independently reviewed the dataset and each selected a set of 90 dialogs for training. Based on these dialogs, we used prompt engineering (Fig. 1b) to develop an embodied conversational agent with ChatGPT. After comparison, the most effective set of 90 dialogs (Supplementary Data 1) was identified, finalizing the text model for further research. These included 11 dialogs on "dry eye", 10 on "itchy eye", 10 on "red eye", 7 on "eye swelling", 10 on "eye pain", 8 on "eye discharge", 5 on "eye masses", 13 on "blurry vision", 6 on "double vision", 6 on "eye injuries or foreign bodies", and 4 on "proptosis". This text model can reliably generate questions related to the chief complaint and provide a final response based on the patient's answers, which includes diagnostic, triage, and other preventive, therapeutic, and care guidance.
After developing the text model, we evaluated its performance using Dataset A (Table 1). The results demonstrated varying diagnostic accuracy across diseases (Fig. 2a). Specifically, the model performed least effectively for primary anterior segment diseases (cataract, keratitis, and pterygium), achieving only 48.7% accuracy (Supplementary Fig. 2a). To identify conditions that did not meet development goals, we analyzed the top 1–3 diseases in each subspecialty. The results showed that the following did not achieve the targets of sensitivity ≥ 90% and specificity ≥ 95% (Fig. 2a): keratitis, pterygium, cataract, glaucoma, and thyroid eye disease. Clinical experience suggests that slit-lamp and smartphone-captured images are valuable for diagnosing cataracts, keratitis, and pterygium. Therefore, development efforts for the image-based diagnostic model would focus on these three conditions.
a Heatmaps of diagnostic (top) and triage (bottom) performance metrics after in silico evaluation of the text model (Dataset A). Metrics are column-normalized from −2 (blue) to 2 (red). Disease types are categorized into six major classifications. The leftmost lollipop chart displays the prevalence of each diagnosis and triage. b Radar charts of disease-specific diagnosis (red) and triage (green) accuracy in Dataset A. The rainbow ring represents six disease classifications. Asterisks indicate significant differences between diagnosis and triage accuracy based on Fisher's exact test. c Bar charts of overall accuracy and disease-specific accuracy for diagnosis (red) and triage (green) after silent evaluation across different models (Dataset G). The line graph below denotes the model used: text model, text + slit-lamp model, text + smartphone model, and text + slit-lamp + smartphone model. d Sankey diagram of Dataset G illustrating the flow of diagnoses across different models for each case. Each line represents a case. PPV, positive predictive value; NPV, negative predictive value; * P < 0.05, ** P < 0.01, *** P < 0.001, **** P < 0.0001.
Beyond diagnosis, the chatbot effectively provided triage information. Statistical analysis revealed high overall triage accuracy (88.3%), significantly outperforming diagnostic accuracy (84.0%; Fig. 2b; Fisher's exact test, P = 0.0337). All subspecialties achieved a negative predictive value ≥ 95%, and all, except optometry (79.7%) and retina (77.6%), achieved a positive predictive value ≥ 85% (Dataset A in Supplementary Data 2). Thus, eight out of ten subspecialties met the predefined developmental targets. Future multimodal model development will focus on enhancing diagnostic capabilities while utilizing the text model's triage prompts without additional refinement.
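Throughout the study, pairs of accuracies such as these are compared with Fisher's exact test on the underlying correct/incorrect counts. The snippet below is a minimal sketch of that comparison; the case count is a hypothetical placeholder, not the actual size of Dataset A.

```python
# Minimal sketch of the accuracy comparisons reported in this study:
# Fisher's exact test on a 2x2 table of correct vs. incorrect outcomes.
from scipy.stats import fisher_exact

n_cases = 500                                # hypothetical number of evaluated entries
correct_triage = round(0.883 * n_cases)      # 88.3% triage accuracy
correct_diagnosis = round(0.840 * n_cases)   # 84.0% diagnostic accuracy

table = [
    [correct_triage, n_cases - correct_triage],
    [correct_diagnosis, n_cases - correct_diagnosis],
]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"OR = {odds_ratio:.2f}, P = {p_value:.4f}")
```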
To develop a multimodal model combining text and images, we first created two image-based diagnostic models based on Dataset B and Dataset C (Table 1), with 80% of the images used for training and 20% for validation. The slit-lamp model achieved disease-specific accuracies of 79.2% for cataract, 87.6% for keratitis, and 98.4% for pterygium (Supplementary Fig. 2b). The smartphone model achieved disease-specific accuracies of 96.2% for cataract, 98.4% for keratitis, and 91.9% for pterygium (Supplementary Fig. 2c). After developing the image diagnostic models, we collected Dataset D, Dataset E and Dataset F (Table 1), which included both imaging results and patient history. Clinical diagnosis requires integrating medical history and eye imaging features, so clinical and image diagnoses may not always align (Supplementary Fig. 3a). To address this, we used image information only to rule out diagnoses. Using image diagnosis as the gold standard, we plotted the receiver operating characteristic (ROC) curves for cataract, keratitis, and pterygium in Dataset D (Supplementary Fig. 3b) and Dataset E (Supplementary Fig. 3c). The threshold >0.363 provided high specificity for all three conditions (cataract 83.5%, keratitis 99.2%, pterygium 96.6%) in Dataset D and was used to develop the text + slit-lamp multimodal model. Similarly, in Dataset E, the threshold >0.315 provided high specificity for all three conditions (cataract 96.8%, keratitis 98.5%, pterygium 95.0%) and was used to develop the text + smartphone multimodal model. In the text + slit-lamp + smartphone multimodal model, we tested two methods to combine the results from slit-lamp and smartphone images. The first method used the union of the diagnoses excluded by each model, while the second used the intersection. Testing on Dataset F showed that the first method achieved significantly higher accuracy (52.2%, Supplementary Fig. 3d) than the second method (31.9%, Supplementary Fig. 3e; Fisher's exact test, P < 0.0001). Therefore, we applied the first method in all subsequent evaluations for the text + slit-lamp + smartphone model.
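The rule-out step and the two combination strategies can be summarized in a short sketch. This is an illustrative reconstruction under one plausible reading of the description above (a condition is excluded when its image score does not exceed the modality-specific threshold); the probability values and function names are hypothetical, not the deployed IOMIDS code.

```python
# Minimal sketch of the image-based rule-out logic in the multimodal models.
# Thresholds follow the paper (slit-lamp 0.363, smartphone 0.315); the
# classifier outputs below are hypothetical.

SLIT_LAMP_THRESHOLD = 0.363
SMARTPHONE_THRESHOLD = 0.315
TARGET_DISEASES = ("cataract", "keratitis", "pterygium")


def ruled_out(probabilities: dict, threshold: float) -> set:
    """Diseases whose image probability does not exceed the threshold."""
    return {d for d in TARGET_DISEASES if probabilities.get(d, 0.0) <= threshold}


# Hypothetical classifier outputs for one eye.
slit_lamp_probs = {"cataract": 0.71, "keratitis": 0.12, "pterygium": 0.05}
smartphone_probs = {"cataract": 0.64, "keratitis": 0.28, "pterygium": 0.09}

excluded_slit = ruled_out(slit_lamp_probs, SLIT_LAMP_THRESHOLD)
excluded_phone = ruled_out(smartphone_probs, SMARTPHONE_THRESHOLD)

# Method 1 (union of exclusions) was retained for the combined model;
# method 2 (intersection) performed significantly worse on Dataset F.
combined_union = excluded_slit | excluded_phone
combined_intersection = excluded_slit & excluded_phone
print(combined_union, combined_intersection)
```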
Using clinical diagnosis as the gold standard, the diagnostic accuracy of all multimodal models significantly improved compared to the text model; specifically, the text + slit-lamp model increased from 32.0% to 65.5% (Fisher's exact test, P < 0.0001), the text + smartphone model increased from 41.6% to 64.2% (Fisher's exact test, P < 0.0001), and the text + slit-lamp + smartphone model increased from 37.4% to 52.2% (Fisher's exact test, P = 0.012). Therefore, we successfully developed four models for the IOMIDS system: the unimodal text model, the text + slit-lamp multimodal model, the text + smartphone multimodal model, and the text + slit-lamp + smartphone multimodal model.
Silent evaluation of diagnostic and triage performance
During the silent evaluation phase, Dataset G was collected to validate the diagnostic and triage performance of the IOMIDS system. Although the diagnostic performance for cataract, keratitis, and pterygium (Dataset G in Supplementary Data 3) did not meet the established clinical goal, significant improvements in diagnostic accuracy were observed for all multimodal models compared to the text model (Fig. 2c). The Sankey diagram revealed that in the text model, 70.8% of cataract cases and 78.3% of pterygium cases were misclassified as "others" (Fig. 2d). In the "others" category, the text + slit-lamp multimodal model correctly identified 88.2% of cataract cases and 63.0% of pterygium cases. The text + smartphone multimodal model performed even better, correctly diagnosing 93.3% of cataract cases and 80.0% of pterygium cases. Meanwhile, the text + slit-lamp + smartphone multimodal model accurately identified 90.5% of cataract cases and 68.2% of pterygium cases in the same category.
Regarding triage accuracy, the overall performance improved with the multimodal models. However, the accuracy for cataract triage notably decreased, dropping from 91.7% to 62.5% in the text + slit-lamp model (Fig. 2c, Fisher's exact test, P = 0.0012), to 58.3% in the text + smartphone model (Fig. 2c, Fisher's exact test, P = 0.0003), and further to 53.4% in the text + slit-lamp + smartphone model (Fig. 2c, Fisher's exact test, P = 0.0001). Moreover, neither the text model nor the three multimodal models met the established clinical goal in any subspecialty (Supplementary Data 2).
We also investigated whether medical histories in the outpatient electronic system alone were sufficient for the text model to achieve accurate diagnostic and triage results. We randomly sampled 104 patients from Dataset G and re-entered their medical dialogs into the text model (Supplementary Fig. 1). For information not recorded in the outpatient history, responses were given as "no information available". The results showed a significant decrease in diagnostic accuracy, dropping from 63.5% to 20.2% (Fisher's exact test, P < 0.0001), while triage accuracy remained relatively unchanged, decreasing only slightly from 72.1% to 70.2% (Fisher's exact test, P = 0.8785). This analysis suggests that while the triage accuracy of the text model is not dependent on dialog completeness, diagnostic accuracy is affected by the completeness of the answers provided. Therefore, thorough responses to AI chatbot queries are crucial in clinical applications.
Evaluation in real clinical settings with trained researchers
The clinical trial involved two parts: researcher-collected data and patient-entered data (Table 2). There was a significant difference in the number of words input and the duration of input between researchers and patients. For researcher-collected data, the average was 38.5 ± 8.2 words and 58.2 ± 13.5 s, while for patient-entered data, the average was 55.5 ± 10.3 words (t-test, P = 0.002) and 128.8 ± 27.1 s (t-test, P < 0.0001). We first assessed the diagnostic performance during the researcher-collected data phase. For the text model across six datasets (Datasets 1–3 and 6–8 in Supplementary Data 3), the following numbers of diseases met the clinical goal for diagnosing ophthalmic diseases: 16 out of 46 diseases (46.4% of all cases) in Dataset 1, 16 out of 32 diseases (16.4% of all cases) in Dataset 2, 18 out of 28 diseases (61.4% of all cases) in Dataset 3, 19 out of 48 diseases (43.9% of all cases) in Dataset 6, 14 out of 28 diseases (35.3% of all cases) in Dataset 7, and 11 out of 42 diseases (33.3% of all cases) in Dataset 8. Thus, less than half of the cases during the researcher-collected data phase met the clinical goal for diagnosis.
Next, we investigated the subspecialty triage accuracy of the text model across various datasets (Datasets 1–3 and 6–8 in Supplementary Data 2). Our findings revealed that during internal validation, the cornea subspecialty achieved the clinical goal for triaging ophthalmic diseases. In external validation, the general outpatient clinic, cornea subspecialty, optometry subspecialty, and glaucoma subspecialty also met these clinical criteria. We further compared the diagnostic and triage outcomes of the text model across six datasets. Data analysis demonstrated that triage accuracy exceeded diagnostic accuracy in most datasets (Supplementary Fig. 4a–c, e–g). Specifically, triage accuracy was 88.7% compared to diagnostic accuracy of 69.3% in Dataset 1 (Fig. 3a; Fisher's exact test, P < 0.0001), 84.1% compared to 62.4% in Dataset 2 (Fisher's exact test, P < 0.0001), 82.5% compared to 75.4% in Dataset 3 (Fisher's exact test, P = 0.3508), 85.7% compared to 68.6% in Dataset 6 (Fig. 3a; Fisher's exact test, P < 0.0001), 80.5% compared to 66.5% in Dataset 7 (Fisher's exact test, P < 0.0001), and 84.5% compared to 65.1% in Dataset 8 (Fisher's exact test, P < 0.0001). This suggests that while the text model may not meet clinical diagnostic needs, it could potentially fulfill clinical triage requirements.
a Radar charts of disease-specific diagnosis (red) and triage (green) accuracy after clinical evaluation of the text model in internal (left, Dataset 1) and external (right, Dataset 6) centers. Asterisks indicate significant differences between diagnosis and triage accuracy based on Fisher's exact test. b Circular stacked bar charts of disease-specific diagnostic accuracy across different models from internal (left, Datasets 2–4) and external (right, Datasets 7–9) evaluations. Solid bars represent the text model, while hollow bars represent multimodal models. Asterisks indicate significant differences in diagnostic accuracy between two models based on Fisher's exact test. c Bar charts of overall accuracy (upper) and accuracy of primary anterior segment diseases (lower) for diagnosis (red) and triage (green) across different models in Datasets 2–5 and Datasets 7–10. The line graphs below denote study centers (internal, external), models used (text, text + slit-lamp, text + smartphone, text + slit-lamp + smartphone), and data provider (researchers, patients). * P < 0.05, ** P < 0.01, *** P < 0.001, **** P < 0.0001.
We then investigated the diagnostic performance of multimodal models in Datasets 2, 3, 7, and 8 (Supplementary Data 3). Both the text + slit-lamp model and the text + smartphone model demonstrated higher overall diagnostic accuracy compared to the text model in internal and external validations, with statistically significant improvements noted for the text + smartphone model in Dataset 8 (Fig. 3c). The clinical goal for diagnosing ophthalmic diseases was achieved by 11 out of 32 diseases (13.8% of all cases) in Dataset 2, 21 out of 28 diseases (70.2% of all cases) in Dataset 3, 11 out of 28 diseases (28.5% of all cases) in Dataset 7, and 15 out of 42 diseases (50.6% of all cases) in Dataset 8. The text + smartphone model outperformed the text model by meeting the clinical goal for diagnosis in more cases and disease types. For some other diseases that did not meet the clinical goal for diagnosis, significant improvements in diagnostic accuracy were also found within the multimodal models (Fig. 3b). Therefore, the multimodal models exhibited better diagnostic performance compared to the text model.
Regarding triage, some datasets of the multimodal models showed a minor decrease in accuracy compared to the text model; however, these differences were not statistically significant (Fig. 3c). Unlike the silent evaluation phase, in clinical applications neither of the two multimodal models demonstrated a notable decline in triage accuracy across different diseases, including cataract (Supplementary Fig. 5). In summary, data collected by researchers indicated that multimodal models outperformed the text model in diagnostic accuracy but were slightly less efficient in triage.
Evaluation in real clinical settings with untrained patients
During the patient-entered data phase, considering the convenience of smartphones, we focused on the text model, the text + smartphone model, and the text + slit-lamp + smartphone model. First, we compared the triage accuracy. Consistent with the researcher-collected data phase, the overall triage accuracy of the multimodal models was slightly lower than that of the text model, but this difference was not statistically significant (Fig. 3c). For subspecialties, in both internal and external validation, the text and multimodal models for the general outpatient clinic and glaucoma met the clinical goals for triaging ophthalmic diseases. Additionally, internal validation showed that the multimodal models met these standards for the cornea, optometry, and retina subspecialties. In external validation, the text model met the standards for cornea and retina, while the multimodal models met the standards for cataract and retina. These results suggest that both the text model and multimodal models meet the triage requirements when patients input their own data.
Next, we compared the diagnostic accuracy of the text model and the multimodal models. Results revealed that in both internal and external validations, all diseases met the specificity criterion of ≥ 95%. In Dataset 4, the text model met the clinical criterion of sensitivity ≥ 75% in 15 out of 42 diseases (40.5% of cases), while the text + smartphone multimodal model met this criterion in 24 out of 42 diseases (78.6% of cases). In Dataset 5, the text model achieved the sensitivity threshold of ≥ 75% in 14 out of 40 diseases (48.3% of cases), while the text + slit-lamp + smartphone multimodal model met this criterion in only 10 out of 40 diseases (35.0% of cases). In Dataset 9, the text model achieved the clinical criterion in 24 out of 43 diseases (57.1%), while the text + smartphone model met the criterion in 28 out of 43 diseases (81.9%). In Dataset 10, the text model achieved the criterion in 25 out of 42 diseases (62.5%), whereas the text + slit-lamp + smartphone model met the criterion in 22 out of 42 diseases (50.8%). This suggests that the text + smartphone model outperforms the text model, while the text + slit-lamp + smartphone model does not. Further statistical analysis confirmed the superiority of the text + smartphone model when comparing its diagnostic accuracy with the text model in both Dataset 4 and Dataset 9 (Fig. 3c). We also conducted an analysis of diagnostic accuracy for individual diseases, identifying significant improvements for certain diseases (Fig. 3b). These findings collectively show that during the patient-entered data phase, the text + smartphone model not only meets triage requirements but also delivers better diagnostic performance than both the text model and the text + slit-lamp + smartphone model.
We further compared the diagnostic and triage accuracy of the text model in Dataset 4 and Dataset 9. Consistent with previous findings, both internal validation (triage: 80.4%, diagnosis: 69.6%; Fisher's exact test, P < 0.0001) and external validation (triage: 84.7%, diagnosis: 72.5%; Fisher's exact test, P < 0.0001) demonstrated significantly higher triage accuracy compared to diagnostic accuracy for the text model (Supplementary Fig. 4d, h). Examining individual diseases, cataract exhibited notably higher triage accuracy than diagnostic accuracy in internal validation (Dataset 4: triage 76.8%, diagnosis 51.2%; Fisher's exact test, P = 0.0011) and external validation (Dataset 9: triage 87.3%, diagnosis 58.2%; Fisher's exact test, P = 0.0011). Interestingly, in Dataset 4, the diagnostic accuracy for myopia (94.0%) was significantly higher (Fisher's exact test, P = 0.0354) than the triage accuracy (80.6%), indicating that the triage accuracy of the text model may not be influenced by diagnostic accuracy. Subsequent regression analysis is necessary to investigate the factors determining triage accuracy.
Due to varying proportions of the disease classifications across the three centers (Supplementary Table 1), we further explored changes in diagnostic and triage accuracy within each classification. Results revealed that, regardless of whether data was researcher-collected or patient-reported, diagnostic accuracy for primary anterior segment diseases (cataract, keratitis, pterygium) was significantly higher in the multimodal model compared to the text model in both internal and external validation (Fig. 3c). Further analysis of cataract, keratitis, and pterygium across Datasets 2, 3, 4, 7, 8, and 9 (Fig. 3b) also showed that, similar to the silent evaluation phase, multimodal model diagnostic accuracy for cataract significantly improved compared to the text model in most datasets. Pterygium and keratitis exhibited some improvement but showed no significant change across most datasets due to sample size limitations. For the other five major disease categories, multimodal model diagnostic accuracy did not consistently improve and even significantly declined in some categories (Supplementary Fig. 6). These findings indicate that the six major disease categories may play crucial roles in influencing the diagnostic performance of the models, underscoring the need for further detailed investigation.
Comparison of diagnostic performance in different models
To further compare the diagnostic accuracy of different models across various datasets, we conducted comparisons within six major disease categories. The results revealed significant differences in diagnostic accuracy among the models across these categories (Fig. 4a). For example, when comparing the text + smartphone model (Datasets 4, 9) to the text model (Datasets 1, 6), both internal and external validations showed higher diagnostic accuracy for the former in primary anterior segment diseases, other anterior segment diseases, and intraorbital diseases and emergency categories compared to the latter (Fig. 4a, b). Interestingly, contrary to previous findings within datasets, comparisons across datasets demonstrated a notable decrease in diagnostic accuracy for the text + slit-lamp model (Dataset 1 vs 2, Dataset 6 vs 7) and the text + slit-lamp + smartphone model (Dataset 4 vs 5, Dataset 9 vs 10) in the categories of other anterior segment diseases and vision disorders in both internal and external validations (Fig. 4a). This suggests that, in addition to the model used and the disease categories, other potential factors may influence the model's diagnostic accuracy.
a Bar charts of diagnostic accuracy calculated for each disease classification across different models from internal (upper, Datasets 1–5) and external (lower, Datasets 6–10) evaluations. The bar colors represent disease classifications. The line graphs below denote study centers, models used, and data providers. b Heatmaps of diagnostic performance metrics after internal (left) and external (right) evaluations of different models. For each heatmap, metrics in the text model and text + smartphone model are normalized together by column, ranging from −2 (blue) to 2 (red). Disease types are classified into six categories and displayed by different colors. c Multivariate logistic regression analysis of diagnostic accuracy for all cases (left) and subgroup analysis for follow-up cases (right) during clinical evaluation. The first category in each factor is used as a reference, and OR values and 95% CIs for other categories are calculated against these references. OR, odds ratio; CI, confidence interval; *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.
We then conducted univariate and multivariate regression analyses to explore factors influencing diagnostic accuracy. Univariate analysis revealed that seven factors (age, laterality, number of visits, disease classification, model, data provider, and words input) significantly influence diagnostic accuracy (Supplementary Table 3). In multivariate analysis, six factors (age, laterality, number of visits, disease classification, model, and words input) remained significant, while the data provider was no longer a critical factor (Fig. 4c). Subgroup analysis of follow-up cases showed that only the model type significantly influenced diagnostic accuracy (Fig. 4c). For first-visit patients, three factors (age, disease classification, and model) were still influential. Further analysis across different age groups within each disease classification revealed that the multimodal models generally outperformed or performed comparably to the text model in most disease categories (Table 3). However, all multimodal models, including the text + slit-lamp model (OR: 0.21 [0.04–0.97]), the text + smartphone model (OR: 0.17 [0.09–0.32]), and the text + slit-lamp + smartphone model (OR: 0.16 [0.03–0.38]), showed limitations in diagnosing visual disorders in patients over 45 years old compared to the text model (Table 3). Additionally, both the text + slit-lamp model (OR: 0.34 [0.20–0.59]) and the text + slit-lamp + smartphone model (OR: 0.67 [0.43–0.89]) were also less effective for diagnosing other anterior segment diseases in this age group. In conclusion, for follow-up cases, both the text + slit-lamp and text + smartphone models are suitable, with a preference for the text + smartphone model. For first-visit patients, the text + smartphone model is recommended, but its diagnostic efficacy for visual disorders in patients over 45 years old (such as presbyopia) may be inferior to that of the text model.
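For reference, odds ratios and 95% confidence intervals of this kind can be obtained from a standard multivariate logistic regression on per-entry correctness. The sketch below assumes a hypothetical flat file and column names; only the set of covariates follows the analysis described above.

```python
# Minimal sketch of a multivariate logistic regression on diagnostic correctness
# (1 = correct, 0 = incorrect). File name and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("clinical_evaluation_entries.csv")  # hypothetical per-entry table

fit = smf.logit(
    "correct ~ C(age_group) + C(laterality) + C(visit_type)"
    " + C(disease_class) + C(model) + C(data_provider) + words_input",
    data=df,
).fit()

# Odds ratios and 95% confidence intervals against each reference category.
odds_ratios = np.exp(fit.params).rename("OR")
conf_int = np.exp(fit.conf_int()).rename(columns={0: "2.5%", 1: "97.5%"})
print(pd.concat([odds_ratios, conf_int], axis=1))
```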
We also performed a regression analysis on triage accuracy. In the univariate logistic regression, the center and data provider significantly influenced triage accuracy. Multivariate regression analysis showed that only the data provider significantly impacted triage accuracy, with patient-entered data significantly improving accuracy (OR: 1.40 [1.25–1.56]). Interestingly, neither model type nor diagnostic accuracy affected triage outcomes. Considering the previous data analysis results from the patient-entered data phase, both the text model and the text + smartphone model are recommended as self-service triage tools for patients in clinical applications. Collectively, among the four models developed in our IOMIDS system, the text + smartphone model is more suitable for patient self-diagnosis and self-triage compared to the other models.
Model interpretability
In subgroup analysis, we identified limitations in the diagnostic accuracy of all multimodal models for patients over 45 years old. The misdiagnosed cases in this age group were further analyzed to interpret these limitations. Both the text + slit-lamp model (Datasets 2, 7) and the text + slit-lamp + smartphone model (Datasets 5, 10) frequently misdiagnosed other anterior segment and visual disorders as cataracts or keratitis. For instance, with the text + slit-lamp + smartphone model, glaucoma (18 cases, 69.2%) and conjunctivitis (22 cases, 38.6%) were often misdiagnosed as keratitis, while presbyopia (6 cases, 54.5%) and visual fatigue (11 cases, 28.9%) were commonly misdiagnosed as cataracts. In contrast, both the text model (Datasets 1–10) and the text + smartphone model (Datasets 3, 4, 8, 9) had relatively low misdiagnosis rates for cataracts (text: 23 cases, 3.5%; text + smartphone: 91 cases, 33.7%) and keratitis (text: 16 cases, 2.4%; text + smartphone: 25 cases, 9.3%). These results suggest that in our IOMIDS system, the inclusion of slit-lamp images, whether in the text + slit-lamp model or the text + slit-lamp + smartphone model, may actually hinder diagnostic accuracy due to the high false-positive rate for cataracts and keratitis.
We then examined whether these misdiagnoses could be justified through image analysis. First, we reviewed the misdiagnosed cataract cases. In the text + slit-lamp model, 30 images (91.0%) were consistent with a cataract diagnosis. However, clinically, they were mainly diagnosed with glaucoma (6 cases, 20.0%) and dry eye syndrome (5 cases, 16.7%). Similarly, in the text + smartphone model, photographs of 80 cases (88.0%) were consistent with a cataract diagnosis. Clinically, these cases were primarily diagnosed with refractive errors (20 cases), retinal diseases (15 cases), and dry eye syndrome (8 cases). We then analyzed the class activation maps of the two multimodal models. Both models showed regions of interest for cataracts near the lens (Supplementary Fig. 7), in accordance with clinical diagnostic principles. Thus, these multimodal models can provide some value for cataract diagnosis based on images but may lead to discrepancies with the final clinical diagnosis.
Next, we analyzed cases misdiagnosed as keratitis by the text + slit-lamp model. The results showed that only one out of 25 cases had an anterior segment photograph consistent with keratitis, indicating a high false-positive rate for keratitis with the text + slit-lamp model. We then conducted a detailed analysis of the class activation maps generated by this model during clinical application. The areas of interest for keratitis were centered around the conjunctiva rather than the corneal lesions (Supplementary Fig. 7a). Thus, the model appears to interpret conjunctival congestion as indicative of keratitis, contributing to the occurrence of false-positive results. In contrast, the text + smartphone model displayed areas of interest for keratitis near the corneal lesions (Supplementary Fig. 7b), which aligns with clinical diagnostic principles. Taken together, future research should focus on refining the text + slit-lamp model for keratitis diagnosis and prioritize optimizing the balance between text-based and image-based information to enhance diagnostic accuracy across both multimodal models.
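Class activation maps of this kind are typically produced with the Grad-CAM recipe on the final convolutional block of a ResNet50 classifier. The following is a minimal sketch of that procedure; the checkpoint, class ordering, and input tensor are hypothetical placeholders rather than the actual IOMIDS artifacts.

```python
# Minimal Grad-CAM sketch for inspecting class activation maps of a ResNet50
# classifier. Checkpoint, class index, and input image are hypothetical.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 4)   # cataract/keratitis/pterygium/others (assumed)
# model.load_state_dict(torch.load("slit_lamp_resnet50.pt"))  # hypothetical fine-tuned weights
model.eval()

store = {}
def save_activation(_, __, output):
    store["activation"] = output
    output.register_hook(lambda grad: store.update(gradient=grad))
model.layer4.register_forward_hook(save_activation)   # last convolutional block

image = torch.rand(1, 3, 224, 224)                     # placeholder for a preprocessed photo
logits = model(image)
logits[0, 1].backward()                                # index 1 = keratitis (assumed ordering)

weights = store["gradient"].mean(dim=(2, 3), keepdim=True)            # channel importance
cam = F.relu((weights * store["activation"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)              # heatmap in [0, 1]
```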
Inter-model variability and inter-expert variability
We further evaluated the diagnostic accuracy of GPT4.0 and the domestic large language model (LLM) Qwen using Datasets 4, 5, 9, and 10. Additionally, we invited three trainees and three junior doctors to independently diagnose these diseases. Since the text + smartphone model performed the best in the IOMIDS system, we compared its diagnostic accuracy with that of the other two LLMs and ophthalmologists with varying levels of experience (Fig. 5a, b). The text + smartphone model (80.0%) outperformed GPT4.0 (71.7%; χ² test, P = 0.033) and showed similar accuracy to the mean performance of trainees (80.6%). Among the three LLMs, Qwen performed the poorest, comparable to the level of a junior doctor. However, all three LLMs fell short of expert-level performance, suggesting there is still potential for improvement.
a Comparison of diagnostic accuracy of IOMIDS (text + smartphone model), GPT4.0, Qwen, expert ophthalmologists, ophthalmology trainees, and unspecialized junior doctors. The dotted lines represent the mean performance of ophthalmologists at different experience levels. b Heatmap of Kappa statistics quantifying agreement between diagnoses provided by AI models and ophthalmologists. c Kernel density plots of user satisfaction rated by researchers (red) and patients (blue) during clinical evaluation. d Example of an interactive chat with IOMIDS (left) and quality evaluation of the chatbot response (right). On the left, the central box displays the patient interaction process with IOMIDS: entering the chief complaint, answering system questions step-by-step, uploading a standard smartphone-captured eye photo, and receiving diagnosis and triage information. The chatbot response includes explanations of the condition and guidance for further medical consultation. The surrounding boxes show a researcher's evaluation of six aspects of the chatbot response. The radar charts on the right illustrate the quality evaluation across six aspects for chatbot responses generated by the text model (red) and the text + image model (blue). The axes for each aspect correspond to different coordinate ranges due to varying rating scales. Asterisks indicate significant differences between two models based on two-sided t-test. ** P < 0.01, *** P < 0.001, **** P < 0.0001.
We then analyzed the agreement between the answers provided by the LLMs and ophthalmologists (Fig. 5b). Agreement among expert ophthalmologists, who served as the gold standard in our study, was generally strong (κ: 0.85–0.95). Agreement among trainee doctors was moderate (κ: 0.69–0.83), as was the agreement among junior doctors (κ: 0.69–0.73). However, the agreement among the three LLMs was weaker (κ: 0.48–0.63). Notably, the text + smartphone model in IOMIDS showed better agreement with experts (κ: 0.72–0.80) compared to the other two LLMs (GPT4.0: 0.55–0.78; Qwen: 0.52–0.75). These results suggest that the text + smartphone model in IOMIDS demonstrates the best alignment with experts among the three LLMs.
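Pairwise agreement of this kind is conventionally quantified with Cohen's kappa on matched diagnosis labels. The sketch below illustrates the computation; the rater names and label lists are hypothetical.

```python
# Minimal sketch of pairwise agreement (Cohen's kappa) between raters'
# diagnosis labels, as summarized in Fig. 5b. Labels below are hypothetical.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

rater_labels = {
    "IOMIDS_text_smartphone": ["cataract", "keratitis", "pterygium", "others"],
    "expert_1": ["cataract", "keratitis", "others", "others"],
    "trainee_1": ["cataract", "conjunctivitis", "pterygium", "others"],
}

for (name_a, labels_a), (name_b, labels_b) in combinations(rater_labels.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")
```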
Evaluation of user satisfaction and response quality
The IOMIDS responses not only contained diagnostic and triage results but also provided guidance on prevention, treatment, care, and follow-up (Fig. 5c). We first analyzed both researcher and patient satisfaction with these responses. Satisfaction was evaluated by researchers during the model development phase and the clinical trial phase, and by patients during the clinical trial phase, regardless of the data collection method. Researchers rated satisfaction significantly higher (4.63 ± 0.92) than patients (3.99 ± 1.46; t-test, P < 0.0001; Fig. 5c). Patient ratings did not differ between researcher-collected (3.98 ± 1.45) and self-entered data (4.02 ± 1.49; t-test, P = 0.3996). Researchers frequently rated chatbot responses as very satisfied (82.5%), whereas patient ratings varied, with 20.2% rating responses as not satisfied (11.7%) or slightly satisfied (8.5%), and 61.9% rating them very satisfied. Further demographic analysis between these patient groups revealed that the former (45.7 ± 23.8 years) were significantly older than the latter (37.8 ± 24.4 years; t-test, P < 0.0001), indicating greater acceptance and positive evaluation of AI chatbots among younger individuals.
Next, we evaluated the response quality of the multimodal models against the text model (Fig. 5d). The multimodal models exhibited significantly higher overall information quality (4.06 ± 0.12 vs. 3.82 ± 0.14; t-test, P = 0.0031) and better understandability (78.2% ± 1.3% vs. 71.1% ± 0.7%; t-test, P < 0.0001) than the text model. Additionally, the multimodal models showed significantly lower misinformation scores (1.02 ± 0.05 vs. 1.23 ± 0.11; t-test, P = 0.0003) compared to the text model. Notably, the empathy score decreased significantly in the multimodal models compared to the text model (3.51 ± 0.63 vs. 4.01 ± 0.56; t-test, P < 0.0001), indicating lower empathy in chatbot responses from multimodal models. There were no significant differences in terms of grade level (readability), with both the text model and the multimodal models being suitable for users at a grade 3 literacy level. These findings suggest that multimodal models generate high-quality chatbot responses with good readability. Future studies may focus on enhancing the empathy of these multimodal models to better suit clinical applications.
Discussion
The Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) is designed to diagnose ophthalmic diseases using multimodal information and provides comprehensive medical advice, including subspecialty triage, prevention, treatment, follow-up, and care. During development, we created four models: a text-based unimodal model, which is an embodied conversational agent integrated with ChatGPT, and three multimodal models that combine medical history information from interactive conversations with eye images for a more thorough analysis. In clinical evaluations, the multimodal models significantly improved diagnostic performance over the text model for anterior segment diseases such as cataract, keratitis, and pterygium in patients aged 45 and older. Previous studies also demonstrated the strength of multimodal models over unimodal models, showing that a multimodal model outperformed an image-only model in identifying pulmonary diseases and predicting adverse clinical outcomes in COVID-19 patients19. Thus, multimodal models are more suitable for analyzing medical information than unimodal models.
Notably, the text + smartphone model in the IOMIDS system demonstrated the highest diagnostic accuracy, outperforming current multimodal LLMs like GPT4.0 and Qwen. However, while this model approaches trainee-level performance, it still falls short of matching the accuracy of expert ophthalmologists. GPT4.0 itself achieved accuracy only slightly higher than junior doctors. Previous studies have similarly indicated that while LLMs show promise in supporting ophthalmic diagnosis and education, they lack the nuanced precision of trained specialists, particularly in complex cases20. For instance, Shemer et al. tested ChatGPT's diagnostic accuracy in a clinical setting and found it lower than that of ophthalmology residents and attending physicians21. Nonetheless, it completed diagnostic tasks significantly faster than human evaluators, highlighting its potential as an efficient adjunct tool. Future research should focus on refining intelligent diagnostic models for challenging and complex cases, with iterative improvements aimed at enhancing diagnostic accuracy and clinical relevance.
Interestingly, the text + smartphone model outperformed the text + slit-lamp model in diagnosing cataract, keratitis, and pterygium in patients under 45 years old. Even though previous studies have shown significant progress in detecting these conditions using smartphone photographs22,23,24, existing evidence does not explain why the text + slit-lamp model should be less efficient than the text + smartphone model. To address this issue, we first thoroughly reviewed the class activation maps of both models. We found that the slit-lamp model often focused on the conjunctival hyperemia region rather than the corneal lesion area in keratitis cases, leading to more false-positive diagnoses. This mismatch between model-identified areas of interest and clinical lesions suggests a flaw in our slit-lamp image analysis12. Additionally, we analyzed the imaging characteristics of the training datasets (Dataset B and Dataset C) for the image-based diagnostic models. Dataset B exhibited a large proportion of the conjunctival region in images, particularly in keratitis cases, which often displayed extensive conjunctival redness. Conversely, Dataset C, comprising smartphone images, showed a smaller proportion of the conjunctival region, which helped reduce bias towards conjunctival hyperemia in keratitis cases. Overall, refining the anterior segment image dataset may enhance the diagnostic accuracy of the text + slit-lamp model.
Notably, the text + smartphone model has demonstrated advantages in diagnosing orbital diseases, even though it was not specifically trained for these conditions. These findings highlight the need and potential for further enhancement of the text + smartphone model in diagnosing both orbital and eyelid diseases. Additionally, there was no significant difference in diagnostic capabilities for retinal diseases between the text + slit-lamp model, the text + smartphone model, or the text + slit-lamp + smartphone model compared to the text-only model, which aligns with our expectations. This suggests that appearance images may not significantly contribute to diagnosing retinal diseases, consistent with clinical practice. Several studies have successfully developed deep learning models for accurately detecting retinal diseases using fundus photos25, optical coherence tomography (OCT) images26, and other eye-related images. Therefore, IOMIDS could benefit from functionalities to upload retinal examination images and enable comprehensive diagnosis, thereby improving the efficiency of diagnosing retinal diseases. Furthermore, we found that relying solely on medical histories from the outpatient electronic system was insufficient for the text model to achieve accurate diagnoses. This suggests that IOMIDS may gather clinically relevant information that doctors often overlook or fail to record in electronic systems. Thus, future system upgrades could involve aiding doctors in conducting preliminary interviews and compiling initial medical histories to reduce their workload.
Regarding subspecialty triage, consistent with prior research, the text model demonstrates markedly superior accuracy in triage compared to diagnosis27. Additionally, we observed an intriguing phenomenon: triage accuracy is influenced not by diagnostic accuracy but by the data collector. Specifically, patients' self-input data resulted in significantly improved triage accuracy compared to data input by researchers. Upon careful analysis of the differences, we found that patient-entered data tends to be more conversational, whereas researcher-entered data tends to use medical terminology and concise expressions. A prior randomized controlled trial (RCT) investigated how different social roles of chatbots influence the chatbot-user relationship, and results suggested that adjusting chatbot roles can enhance users' intentions toward the chatbot28. However, no RCT study is available to investigate how users' language styles influence chatbot results. Based on our study, we propose that if IOMIDS is implemented in home or community settings without researcher involvement, everyday conversational language in self-reports does not necessarily impair its performance. Therefore, IOMIDS may serve as a decision support system for self-triage to enhance healthcare efficiency and provide cost-effectiveness benefits.
Several areas for improvement were identified in our study. First, due to sample size limitations during the model development phase, we were unable to develop a combined image model for slit-lamp and smartphone images. Instead, we integrated the results of the slit-lamp and smartphone models using logical operations, which led to suboptimal performance of the text + slit-lamp + smartphone model. Previous studies involving multiple image modalities have achieved better results29, so developing a dedicated multimodal model for slit-lamp and smartphone images will be necessary in future work. Second, the multimodal models showed lower empathy compared to the text model, possibly because their more objective diagnosis prompts contrast with conversational styles. Future upgrades will adjust the multimodal models' analysis prompts to enhance empathy in chatbot responses. Third, older users reported lower satisfaction with IOMIDS, highlighting the need for improved human-computer interaction for this demographic. In addition, leveraging ChatGPT's robust language capabilities and medical knowledge, we used prompt engineering to optimize for parameter efficiency, cost-effectiveness, and speed in clinical experiments. However, due to limitations in OpenAI's medical capabilities, particularly its inability to pass the Chinese medical licensing exam30, we aim to develop our own large language model based on real Chinese clinical dialogs. This model is expected to enhance diagnostic accuracy and adapt to evolving medical consensus. During our study, we used GPT3.5 instead of GPT4.0 due to token usage constraints. Since GPT4.0 has shown superior responses to ophthalmology-related queries in recent studies31, integrating GPT4.0 into IOMIDS may enhance its clinical performance. It is also worth mentioning that the results of our study may not be applicable to other language environments. Previous studies have shown that GPT responds differently to prompts in various languages, with English appearing to yield better results32. There were also biases in linguistic evaluation, as expert assessments can vary based on language habits, semantic comprehension, and cultural values. Finally, our study represents an early clinical evaluation, and comparative prospective evaluations are necessary before implementing IOMIDS in clinical practice.
Material and methods
Ethics approval
The study was approved by the Institutional Review Board of Fudan Eye & ENT Hospital, the Institutional Review Board of the Affiliated Eye Hospital of Nanjing Medical University, and the Institutional Review Board of Suqian First Hospital. The study was registered on ClinicalTrials.gov (NCT05930444) on June 26, 2023. It was conducted in accordance with the Declaration of Helsinki, with all participants providing written informed consent before their participation.
Study design
This study aims to develop an Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) for diagnosing and triaging ophthalmic diseases (Supplementary Fig. 1). IOMIDS includes four models: a unimodal text model, which is an embodied conversational agent with ChatGPT; a text + slit-lamp multimodal model, which incorporates both text and eye images captured with a slit lamp; a text + smartphone multimodal model, utilizing text along with eye images captured by a smartphone for diagnosis and triage; and a text + slit-lamp + smartphone multimodal model, which combines both image modalities with text to achieve a final diagnosis. Clinical validation of model performance is conducted through a two-stage cross-sectional study, an initial silent evaluation stage followed by an early clinical evaluation stage, as detailed in the protocol article33. Triage covers 10 ophthalmic subspecialties: general outpatient clinic, optometry, strabismus, cornea, cataract, glaucoma, retina, neuro-ophthalmology, orbit, and emergency. Diagnosis involves 50 common ophthalmic diseases (Supplementary Data 4), categorized by lesion location into anterior segment diseases, fundus and optic nerve diseases, intraorbital diseases and emergencies, eyelid diseases, and visual disorders. Notably, because image diagnostic training was conducted for cataract, keratitis, and pterygium during the development of the multimodal models, anterior segment diseases are further classified into two categories: primary anterior segment diseases (including cataract, keratitis, and pterygium) and other anterior segment diseases.
Collecting and formatting doctor-patient communication dialogs
Doctor-patient communication dialogs were collected from the outpatient clinics of Fudan Eye & ENT Hospital, covering the predetermined 50 disease types across 10 subspecialties. After collection, each dialog underwent curation and formatting. Curation involved removing filler words and irrelevant redundant content (e.g., payment methods). Formatting involved structuring each dialog into four standardized parts: (1) chief complaint; (2) series of questions from the doctor; (3) patient's responses to each question; (4) doctor's diagnosis, triage judgment, and information on prevention, treatment, care, and follow-up. Subsequently, researchers (with at least 3 years of clinical experience as attending physicians) carefully reviewed the dialogs for each disease, selecting those in which the doctor's questions were focused on the chief complaint, demonstrated medical reasoning, and contributed to diagnosis and triage. After this careful review, 90 out of 450 dialogs were selected for prompt engineering to train the text model for IOMIDS. Three researchers independently evaluated the dialogs, resulting in three sets of 90 dialogs. To assess the performance of the models trained with these different sets of dialogs, we created two virtual cases for each of the five most prevalent diseases across 10 subspecialties at our research institution, totaling 100 cases. The diagnostic accuracy for each set of prompts was 73%, 68%, and 52%, respectively. Ultimately, the first set of 90 dialogs (Supplementary Data 1) was chosen as the final set of prompts.
Developing a dynamic prompt system for IOMIDS
To build the IOMIDS system, we designed a dynamic prompt system that enhances ChatGPT's role in patient consultations by integrating both textual and image data. This system supports diagnosis and triage based on either single-modal (text) or multimodal (text and image) information. The overall process is illustrated in Fig. 1b, with a detailed explanation provided below:
The system is grounded in a medical inquiry prompt corpus, developed by organizing 90 real-world clinical dialogs into a structured format. Each interview consists of four segments: "Patient's Chief Complaint," "Inquiry Questions," "Patient's Responses," and "Diagnosis and Consultation Recommendations." These clinical interviews are transformed into structured inquiry units, known as "prompts" within the system. When a patient inputs their primary complaint, the system's Chief Complaint Classifier identifies relevant keywords and matches them with corresponding prompts from the corpus. These selected prompts, along with the patient's initial complaint, form a question prompt that guides ChatGPT in asking about the relevant medical history related to the chief complaint.
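To illustrate this retrieval step, the following Python sketch shows one way the keyword matching and question-prompt assembly could be implemented; the corpus structure, function names, and prompt wording are our own illustrative assumptions rather than the production code.

```python
# Minimal sketch of the keyword-based prompt retrieval step (illustrative only).
from dataclasses import dataclass

@dataclass
class InquiryUnit:
    keywords: set[str]      # chief-complaint keywords linked to this unit
    questions: list[str]    # history-taking questions from the curated dialog

def match_prompts(chief_complaint: str, corpus: list[InquiryUnit]) -> list[InquiryUnit]:
    """Return inquiry units whose keywords appear in the chief complaint."""
    return [unit for unit in corpus if any(k in chief_complaint for k in unit.keywords)]

def build_question_prompt(chief_complaint: str, units: list[InquiryUnit]) -> str:
    """Assemble a question prompt instructing ChatGPT to take a focused history."""
    examples = "\n\n".join(
        "Example inquiry:\n" + "\n".join(u.questions) for u in units
    )
    return (
        f"Patient's chief complaint: {chief_complaint}\n"
        f"{examples}\n"
        "Based on the examples above, ask the patient focused follow-up "
        "questions about this chief complaint, one at a time."
    )
```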
After gathering the responses to these inquiries, an analysis prompt is generated. This prompt directs ChatGPT to perform a preliminary diagnosis and triage based on the conversation history. The analysis prompt includes the question prompts from the previous stage, along with all questions and answers exchanged during the consultation. If no appearance-related or anterior segment images are provided by the patient, the system uses only this analysis prompt to generate diagnosis and triage recommendations, which are then communicated back to the patient as the final output of the text model.
For cases that involve multimodal information, we developed an additional diagnosis prompt. This prompt expands on the previous analysis prompt by incorporating key patient information (such as gender, age, and preliminary diagnosis/triage decisions) alongside diagnostic data obtained from slit-lamp or smartphone images. By combining image data with textual information, ChatGPT is able to provide more comprehensive medical recommendations, including diagnosis, triage, and additional advice based on both modalities.
It is important to note that in a single consultation, the question prompt, analysis prompt, and diagnosis prompt are not independent; rather, they are interconnected and progressive. The question prompt is part of the analysis prompt, and the analysis prompt is integrated into the diagnosis prompt.
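The nesting of the three prompts can be summarized in a short sketch; as above, the wording and function signatures are illustrative assumptions, and only the containment relationship (question prompt within analysis prompt within diagnosis prompt) follows the description.

```python
# Illustrative composition of the three interconnected prompts.
def build_analysis_prompt(question_prompt: str, qa_pairs: list[tuple[str, str]]) -> str:
    """The analysis prompt contains the question prompt plus the full Q&A record."""
    dialog = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return (
        question_prompt + "\n\nConsultation record:\n" + dialog +
        "\n\nProvide a preliminary diagnosis and subspecialty triage."
    )

def build_diagnosis_prompt(analysis_prompt: str, sex: str, age: int,
                           preliminary: str, excluded_diagnoses: list[str]) -> str:
    """The diagnosis prompt contains the analysis prompt plus image-derived data."""
    return (
        analysis_prompt +
        f"\n\nPatient: {sex}, {age} years. Preliminary impression: {preliminary}." +
        f"\nImage analysis excludes: {', '.join(excluded_diagnoses) or 'none'}." +
        "\nGive the final diagnosis, triage, and tailored advice."
    )
```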
Collecting and ground-truth labeling of images
Image diagnostic data are essential for the diagnosis prompt, and to obtain these data we developed an image-based diagnostic model. Because there are two common methods for capturing eye photos in clinical settings, both slit-lamp-captured and smartphone-captured eye images were collected to develop the image-based diagnostic model. These images encompass the diseases identified as requiring image diagnosis (specifically cataract, keratitis, and pterygium) through in silico evaluation of the text model (detailed below). For patients with different diagnoses in each eye (e.g., keratitis in one eye and dry eye in the other), each eye was included as an independent data entry. Additionally, slit-lamp and smartphone images of the diseases with the top 1–5 prevalence rates in each subspecialty were collected and categorized as "others" for training, validation, and testing of the image diagnostic model.
For slit-lamp images, the following criteria applied: (1) images must be taken using the slit lamp's diffuse light with no overexposure; (2) both the inner and outer canthi must be visible; (3) image size must be at least 1 MB, with a resolution of no less than 72 pixels/inch. For smartphone images, the following conditions must be met: (1) the eye of interest must be naturally open; (2) images must be captured under indoor lighting, with no overexposure; (3) the shooting distance must be within 1 meter, with focus on the eye region; (4) image size must be at least 1 MB, with a resolution of no less than 72 pixels/inch. Images not meeting these requirements were excluded from the study. Four specialists independently labeled images into four categories (cataract, keratitis, pterygium, and others) based solely on the image. Consensus was reached when three or more specialists agreed on the same diagnosis; images where agreement could not be reached by at least two specialists were excluded.
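As an illustration of the file-level criteria (size and resolution), a minimal pre-screening check might look like the following; the lighting, focus, and canthi criteria were judged by reviewers and are not captured here, and the function name is hypothetical.

```python
# Minimal sketch of automated pre-screening against the stated file criteria
# (size >= 1 MB, resolution >= 72 pixels/inch).
import os
from PIL import Image

def passes_basic_checks(path: str) -> bool:
    if os.path.getsize(path) < 1_000_000:      # require at least 1 MB on disk
        return False
    with Image.open(path) as img:
        dpi = img.info.get("dpi", (72, 72))    # treat missing DPI metadata as 72 dpi
        return min(dpi) >= 72
```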
Developing image-based ophthalmic classification models for multimodal models
We developed two distinct deep learning algorithms using ResNet-50 to process images captured by slit-lamp and smartphone cameras. The first algorithm was designed to detect cataracts, keratitis, and pterygium in an anterior segment image dataset (Dataset B) obtained under a slit lamp. The second algorithm targeted the detection of these conditions in a dataset (Dataset C) consisting of single-eye regions extracted from smartphone images. To address the challenge of non-eye facial regions in smartphone images, we collected an additional 200 images to train and validate an eye-target detection model using YOLOv7. This model was trained to detect eye regions by annotating the eye areas within these images. Of the total images, 80% were randomly assigned to the training set, and the model was trained for 300 epochs using the default learning rate and preprocessing settings specified in the YOLOv7 repository (https://github.com/WongKinYiu/yolov7). The remaining 20% of the images were used as a validation set, achieving a precision of 1.0, recall of 0.98, and mAP@0.5 of 0.991. These images were not reused in any other experiments.
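Once the detector returns an eye bounding box, the single-eye region is cropped before classification. The short sketch below shows this cropping step under the assumption of pixel-coordinate boxes; the margin and function name are illustrative, and the detection itself follows the YOLOv7 repository's standard training and inference scripts.

```python
# Crop the detected single-eye region from a smartphone image before classification.
# Box format is assumed to be pixel coordinates (x1, y1, x2, y2); margin is illustrative.
from PIL import Image

def crop_eye_region(image_path: str, box: tuple[int, int, int, int],
                    margin: float = 0.1) -> Image.Image:
    """Crop the detected eye region with a small margin around the bounding box."""
    img = Image.open(image_path).convert("RGB")
    x1, y1, x2, y2 = box
    dx, dy = int((x2 - x1) * margin), int((y2 - y1) * margin)
    return img.crop((max(0, x1 - dx), max(0, y1 - dy),
                     min(img.width, x2 + dx), min(img.height, y2 + dy)))
```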
For the disease classification network, we created a four-class dataset consisting of cataracts, keratitis, pterygium, and "Other" categories. This dataset includes both anterior segment images and smartphone-captured eye images across all categories. The "Other" class includes normal eye images and images of various other eye conditions. Using a ResNet-50 model pretrained on ImageNet, we fine-tuned it on this four-class dataset for 200 epochs to optimize classification accuracy across both modalities.
During training, each image was resized to 224 × 224 pixels and underwent data augmentation to enhance generalization, including a 0.2 probability of random horizontal flipping, random rotations between -5 and 5 degrees, and a 0.2 probability of automatic contrast adjustment. White balance adjustments were also applied to standardize the images. For validation and testing, images were resized to 224 × 224 pixels, underwent white balance adjustment, and were then input into the model for disease prediction. To improve model robustness and minimize overfitting, we employed five-fold cross-validation: the dataset was divided into five equal parts (20% each), with four parts used for training and one for validation in each fold. The final model was selected based on the highest validation accuracy, without specific constraints on sensitivity or specificity for individual models.
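A hedged reconstruction of this training setup in PyTorch/torchvision is sketched below; the augmentation probabilities, rotation range, input size, class count, and pretrained backbone follow the description above, while the optimizer, learning rate, and the upstream white-balance step are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Augmentations mirroring the description: 224 x 224 resize, p=0.2 horizontal flip,
# rotations within -5 to 5 degrees, p=0.2 automatic contrast adjustment.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.2),
    transforms.RandomRotation(degrees=5),
    transforms.RandomAutocontrast(p=0.2),
    transforms.ToTensor(),
])

# ImageNet-pretrained ResNet-50 with a new 4-way head
# (cataract, keratitis, pterygium, other).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer and lr are assumptions
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader, device="cuda"):
    """One pass over a training fold of the five-fold cross-validation."""
    model.train().to(device)
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
```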
Generating image diagnostic data for multimodal models
Before being input into the diagnosis prompt, image diagnostic data underwent preprocessing. Preliminary experiments revealed that when the data indicated a single diagnosis, the multimodal model might overlook patient demographics and medical history, leading to a direct image-based diagnosis. To address this, we adjusted the image diagnostic results by excluding specific diagnoses.
Specifically, we modified the classification model by removing the final softmax layer and using the scores from the fully connected (fc) layer as outputs for each category. These scores were then rescaled to fall within the range of [-1, 1], providing continuous diagnostic output for each image category. The rescaled scores served as the image model's diagnostic output. We also collected additional datasets, Dataset D (slit-lamp captured images) and Dataset E (smartphone captured images), to ensure the independence of the training, validation, and testing sets. The model was then run on Datasets D and E, generating scores for all four categories across all images, with the image diagnosis serving as the gold standard. Receiver operating characteristic (ROC) curves were plotted for cataracts, keratitis, and pterygium to determine optimal thresholds for maximizing specificity across these diseases. These thresholds were then established as the output standard. For example, if only cataracts exceeded the threshold among the three diseases, the output label from the image diagnostic module would be "cataract".
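The conversion from raw fc-layer scores to module output can be sketched as follows; the min-max rescaling into [-1, 1] and the per-disease thresholds (illustrative placeholder values here, in practice derived from the ROC analysis on Datasets D and E) are assumptions about details not fully specified above.

```python
import torch

DISEASES = ["cataract", "keratitis", "pterygium"]   # the fourth class is "other"
THRESHOLDS = {"cataract": 0.15, "keratitis": 0.10, "pterygium": 0.20}  # illustrative values

@torch.no_grad()
def image_diagnosis(model, image_tensor):
    """Return diseases above threshold and the remaining (excluded) diseases."""
    logits = model(image_tensor.unsqueeze(0)).squeeze(0)   # raw fc scores, no softmax
    lo, hi = logits.min(), logits.max()
    scores = 2 * (logits - lo) / (hi - lo) - 1             # rescale into [-1, 1]
    positive = [d for i, d in enumerate(DISEASES) if scores[i] > THRESHOLDS[d]]
    excluded = [d for d in DISEASES if d not in positive]
    return positive, excluded
```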
To further develop a text + slit-lamp + smartphone model, we collected Dataset F, which includes both slit-lamp and smartphone images for each individual. We tested two methods of combining the results from the slit-lamp and smartphone images. The first method used the union of the diagnoses excluded by each model, while the second used the intersection. For example, if the slit-lamp image excluded cataracts and the smartphone image excluded both cataracts and keratitis, the first method would exclude cataracts and keratitis, while the second would exclude only cataracts. These "excluded diagnoses" were then combined with the user's analysis prompt, key patient information, and preliminary diagnosis to construct the diagnosis prompt, as shown in Fig. 1c. This diagnosis prompt was then sent to ChatGPT, allowing it to generate the final multimodal diagnostic result by integrating both image-based and contextual data.
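The two combination rules reduce to simple set operations, as in the sketch below (the function name is ours); the worked example matches the one given above.

```python
# Combining exclusions from the slit-lamp and smartphone models:
# method 1 takes the union of exclusions, method 2 the intersection.
def combine_exclusions(slit_lamp_excluded: set[str], smartphone_excluded: set[str],
                       method: str = "union") -> set[str]:
    if method == "union":
        return slit_lamp_excluded | smartphone_excluded
    return slit_lamp_excluded & smartphone_excluded

# Example from the text: slit lamp excludes {"cataract"}, smartphone excludes
# {"cataract", "keratitis"} -> union excludes both; intersection excludes cataract only.
```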
In silico evaluation of text model
After developing chief complaint classifiers, question prompt templates, and analysis prompt templates, we integrated these functionalities into a text model and conducted an in silico evaluation using virtual cases (Dataset A). These cases consist of simulated patient data derived from outpatient records. To ensure the cohort's characteristics are representative of real-world clinical settings, we determined the total number of cases per subspecialty based on outpatient volumes over the past 3 years. We randomly selected cases across subspecialties to cover the predefined set of 50 disease types (Supplementary Data 4). These 50 disease types were chosen based on their prevalence rates in each subspecialty and their diagnostic feasibility without sole dependence on physical examinations. Our goal was to gather about 100 cases for general outpatient clinics and the cornea subspecialty, and ~50 cases for other subspecialties.
During the evaluation process, researchers conducted data entry for each case as a new session, ensuring that chatbot responses were generated solely based on that specific case, without any prior input from other cases. Our primary objective was to achieve a sensitivity of ≥90% and a specificity of ≥95% for diagnosing common disease types that ranked in the top 1–3 by outpatient volume over the past three years within each subspecialty. These disease types include dry eye syndrome, allergic conjunctivitis, conjunctivitis, myopia, refractive error, visual fatigue, strabismus, keratitis, pterygium, cataract, glaucoma, vitreous opacity, retinal detachment, ptosis, thyroid eye disease, eyelid mass, chemical ocular trauma, and other eye injuries. For disease types that failed to meet these performance thresholds and had the potential to be diagnosed through imaging, we would develop an image-based diagnostic system. Additionally, regarding triage outcomes, our secondary goal was to achieve a positive predictive value of ≥90% and a negative predictive value of ≥95%. Since predictive values are strongly influenced by prevalence, diseases with prevalence below the 5th percentile threshold were excluded from the secondary outcome analysis.
Silent evaluation and early clinical evaluation of IOMIDS
After developing the text model and the two multimodal models, all three were integrated into IOMIDS and installed on two iPhone 13 Pro devices to conduct a two-stage cross-sectional study comprising silent evaluation and early clinical evaluation. During the silent evaluation, researchers collected patient gender, age, chief complaint, medical history inquiries, slit-lamp captured images, and smartphone captured images without disrupting clinical activities (Dataset G). If researchers encountered a specific chatbot query whose answer could not be found in the electronic medical records, patients were followed up with telephone interviews on the same day as their clinical visit. Based on sample size calculations33, we aimed to collect 25 cases each for cataract, keratitis, and pterygium, along with another 25 cases randomly selected from other diseases. Following data collection, we conducted data analysis; if the data did not meet predefined standards, further sample expansion was considered. These standards were set as follows: the primary outcome aimed to achieve a sensitivity of ≥85% and a specificity of ≥95% for diagnosing cataract, keratitis, and pterygium; the secondary outcome aimed to achieve a positive predictive value of ≥85% and a negative predictive value of ≥95% for subspecialty triage after excluding diseases with a prevalence below the 5th percentile threshold. To further investigate whether the completeness of medical histories would influence the diagnostic and triage performance of the text model, we randomly selected about half of the cases from Dataset G and re-entered the doctor-patient dialogs. For the chatbot queries that would otherwise have been answered via telephone interviews, the researcher uniformly marked them as "no information available". The changes in diagnostic and triage accuracy were subsequently analyzed.
The early clinical evaluation was conducted at Fudan Eye & ENT Hospital for internal validation and at the Affiliated Eye Hospital of Nanjing Medical University and Suqian First Hospital for external validation. Both validations included two settings: data collection by researchers and self-completion by patients. Data collection by researchers was conducted at the internal center from July 21 to August 20, 2023, and at the external centers from August 21 to October 31, 2023. Self-completion by patients took place at the internal center from November 10, 2023, to January 10, 2024, and at the external centers from January 20 to March 10, 2024. During the patient-entered data stage, researchers guided users (patients, parents, and caregivers) through the entire process, which included selecting an appropriate testing environment, accurately entering demographic information and chief complaints, providing detailed responses to chatbot queries, and obtaining high-quality eye photos. Notably, when collecting smartphone images, the IOMIDS system provides guidance throughout the image capture process: notifications are issued for problems such as excessive distance, overexposure, improper focus, misalignment, or an eye that is not open (Supplementary Fig. 8). In both the researcher-collected and patient-entered data phases, the primary goal was to achieve a sensitivity of ≥75% and a specificity of ≥95% for diagnosing ophthalmic diseases, excluding those with a prevalence below the 5th percentile threshold. The secondary goal was to achieve a positive predictive value of ≥80% and a negative predictive value of ≥95% for subspecialty triage.
Comparison of inter-model and model-expert agreement
Five expert ophthalmologists with professor titles, three ophthalmology trainees (residents), and two junior doctors without specialized training participated in the study. The expert ophthalmologists reviewed all cases, and when at least three experts reached a consensus, their diagnostic results were considered the gold standard. The trainees and junior doctors were involved solely in the clinical evaluation phase for Datasets 4-5 and 9-10, independently providing diagnostic results for each case. Additionally, GPT-4.0 and Qwen, a Chinese large language model, both with image-text diagnostic capabilities, were included to generate diagnostic results for the same cases. The diagnostic accuracy and consistency of these large language models were then compared with those of ophthalmologists at different experience levels.
Rating for user satisfaction and response quality
User satisfaction with the entire human-computer interaction experience was evaluated by both patients and researchers using a 1–5 scale (not satisfied, slightly satisfied, moderately satisfied, satisfied, and very satisfied) during the early clinical evaluation stage. Neither the patients nor the researchers were aware of the correctness of the output when assessing satisfaction. Furthermore, 50 chatbot final responses were randomly selected from all datasets generated during both the silent evaluation and the early clinical evaluation. Three independent researchers, blinded to the model types and reference standards, assessed the quality of these responses across six aspects. Overall information quality was assessed using DISCERN (rated from 1 = low to 5 = high). Understandability and actionability were evaluated with the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), scored from 0–100%. Misinformation was rated on a five-point Likert scale (from 1 = none to 5 = high). Empathy was also rated on a five-point scale (not empathetic, slightly empathetic, moderately empathetic, empathetic, and very empathetic). Readability was analyzed using the Chinese Readability Index Explorer (CRIE; http://www.chinesereadability.net/CRIE/?LANG=CHT), which assigns scores corresponding to grade levels: 1–6 points for elementary school (grades 1–6), 7 points for middle school, and 8 points for high school.
Statistical analysis
All data analyses were conducted using Stata/BE (version 17.0). Continuous and ordinal variables were expressed as mean ± standard deviation. Categorical variables were presented as frequency (percentage). For baseline characteristics comparison, a two-sided t-test was used for continuous and ordinal variables, and the Chi-square test or Fisher's exact test was employed for categorical variables, as appropriate. Diagnosis and triage performance metrics, including sensitivity, specificity, accuracy, positive predictive value, negative predictive value, Youden index, and prevalence, were calculated for each dataset using the one-vs.-rest strategy. Diseases with a prevalence below the 5th percentile threshold were excluded from subspecialty parameter calculations. Reference standards and the correctness of the IOMIDS diagnosis and triage were established according to the protocol article33.
Specifically, overall diagnostic accuracy was defined as the proportion of correctly diagnosed cases out of the total cases in each dataset, while overall triage accuracy was the proportion of correctly triaged cases out of the total cases. Similarly, disease-specific diagnostic and triage accuracies were calculated as the proportion of correctly diagnosed or triaged cases per disease. Diseases were categorized into six classifications based on lesion location: primary anterior segment diseases (cataract, keratitis, pterygium), other anterior segment diseases, fundus and optic nerve diseases, intraorbital diseases and emergencies, eyelid diseases, and visual disorders. Diagnostic and triage accuracies were calculated for each category. Fisher's exact test was used to compare the accuracies of different models within each category.
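For readers who prefer code to prose, the one-vs.-rest metric calculation can be written as in the sketch below; the study's analyses were performed in Stata, so this Python restatement is illustrative only.

```python
# One-vs.-rest performance metrics for a single disease label.
def one_vs_rest_metrics(y_true: list[str], y_pred: list[str], disease: str) -> dict:
    tp = sum(t == disease and p == disease for t, p in zip(y_true, y_pred))
    fn = sum(t == disease and p != disease for t, p in zip(y_true, y_pred))
    fp = sum(t != disease and p == disease for t, p in zip(y_true, y_pred))
    tn = sum(t != disease and p != disease for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "accuracy": (tp + tn) / len(y_true),
        "youden": sens + spec - 1,
        "prevalence": (tp + fn) / len(y_true),
    }
```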
Univariate logistic regression was performed to identify potential factors influencing diagnostic and triage accuracy. Factors with P < 0.05 were further analyzed using multivariate logistic regression, with odds ratios (OR) and 95% confidence intervals (CI) calculated. Subgroup analyses were conducted for significant factors (e.g., disease classification) identified in the multivariate analysis. Notably, age was dichotomized at the mean age of all patients during the early clinical evaluation stage for inclusion in the logistic regression. Additionally, ROC curves for cataract, keratitis, and pterygium were generated in Dataset D and Dataset E using image-based diagnosis as the gold standard. The area under the ROC curve (AUC) was calculated for each curve. Agreement between answers provided by doctors and LLMs was quantified by calculating Kappa statistics, interpreted in accordance with McHugh's recommendations20. During the evaluation of user satisfaction and response quality, a two-sided t-test was used to compare different datasets across various metrics. Quality scores from the three independent evaluators were averaged before statistical analysis. P values of <0.05 were considered statistically significant.
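As a hedged Python equivalent of these regression and agreement analyses (the study used Stata/BE), one could write something like the following; the data file and column names are placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("evaluation_cases.csv")  # hypothetical case-level export
df["age_group"] = (df["age"] >= df["age"].mean()).astype(int)  # dichotomize at mean age

# Univariate screen; factors with P < 0.05 move on to the multivariate model.
uni = smf.logit("diagnosis_correct ~ age_group", data=df).fit(disp=False)

# Multivariate model with odds ratios and 95% confidence intervals.
multi = smf.logit("diagnosis_correct ~ age_group + C(disease_class) + C(model)",
                  data=df).fit(disp=False)
odds_ratios = np.exp(multi.params)
conf_int = np.exp(multi.conf_int())

# Doctor-LLM agreement as a Kappa statistic on the assigned diagnosis labels.
kappa = cohen_kappa_score(df["doctor_diagnosis"], df["llm_diagnosis"])
```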
Data availability
The data that support the findings of this study are divided into two groups: published data and restricted data. The authors declare that the published data supporting the main results of this study can be obtained within the paper and its Supplementary Information. For research purposes, a representative image, de-identified by masking the patient's face, is available. For noncommercial use, researchers can contact the corresponding authors for access to the raw data. Due to portrait rights and patient privacy restrictions, restricted data, including raw videos, are not provided to the public.
Code availability
The code for the prompt engineering aspect of our work is embedded within a Java backend development system, making it tightly integrated with our proprietary infrastructure. Due to this integration, we are unable to publicly release this specific portion of the code. However, the image processing component of our system utilizes open-source models, specifically ResNet-50 and YOLOv7, which are readily available on GitHub and other repositories. Readers interested in the image processing methodologies we employed can easily access and utilize these models from the following sources: ResNet-50: https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/resnet.py. YOLOv7: https://github.com/WongKinYiu/yolov7. We encourage readers to explore these repositories for implementation details and further insights into the image processing techniques used in our study.
References
Poon, H. Multimodal generative AI for precision health. NEJM AI https://ai.nejm.org/doi/full/10.1056/AI-S2300233 (2023).
Tan, T. F. et al. Artificial intelligence and digital health in global eye health: opportunities and challenges. Lancet Glob. Health 11, 1432–1443 (2023).
Wagner, S. K. et al. Development and international validation of custom-engineered and code-free deep-learning models for detection of plus disease in retinopathy of prematurity: a retrospective study. Lancet Digit. Health 5, E340–E349 (2023).
Xu, P. S., Chen, X. L., Zhao, Z. W. & Shi, D. L. Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis. Br. J. Ophthalmol. 108, 1384–1389 (2024).
Antaki, F., Chopra, R. & Keane, P. A. Vision-language models for feature detection of macular diseases on optical coherence tomography. JAMA Ophthalmol. 142, 573–576 (2024).
Wu, X. H. et al. Cost-effectiveness and cost-utility of a digital technology-driven hierarchical healthcare screening pattern in China. Nat. Commun. 15, 3650 (2024).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Mihalache, A. et al. Accuracy of an artificial intelligence chatbot's interpretation of clinical ophthalmic images. JAMA Ophthalmol. 142, 321–326 (2024).
Lim, Z. W. et al. Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine 95, 104770 (2023).
Lyons, R. J., Arepalli, S. R., Fromal, O., Choi, J. D. & Jain, N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can. J. Ophthalmol. 59, e301–e308 (2023).
Ayers, J. W. et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 183, 589–596 (2023).
Li, J. H. et al. Class-aware attention network for infectious keratitis diagnosis using corneal photographs. Comput. Biol. Med. 151, 106301 (2022).
Chen, W. B. et al. Early detection of visual impairment in young children using a smartphone-based deep learning system. Nat. Med. 29, 493–503 (2023).
Qian, C. X. et al. Smartphone-acquired anterior segment images for deep learning prediction of anterior chamber depth: a proof-of-concept study. Front. Med. 9, 912214 (2022).
Khan, S. M. et al. A global review of publicly available datasets for ophthalmological imaging: barriers to access, usability, and generalisability. Lancet Digit. Health 3, E51–E66 (2021).
Nath, S., Marie, A., Ellershaw, S., Korot, E. & Keane, P. A. New meaning for NLP: the trials and tribulations of natural language processing with GPT-3 in ophthalmology. Br. J. Ophthalmol. 106, 889–892 (2022).
Dow, E. R. et al. The collaborative community on ophthalmic imaging roadmap for artificial intelligence in age-related macular degeneration. Ophthalmology 129, E43–E59 (2022).
Brandao-de-Resende, C. et al. A machine learning system to optimise triage in an adult ophthalmic emergency department: a model development and validation study. EClinicalMedicine 66, 102331 (2023).
Zhou, H. Y. et al. A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics. Nat. Biomed. Eng. 7, 743–755 (2023).
Thirunavukarasu, A. J. et al. Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study. PLOS Digit. Health 3, e0000341 (2024).
Shemer, A. et al. Diagnostic capabilities of ChatGPT in ophthalmology. Graefes Arch. Clin. Exp. Ophthalmol. 262, 2345–2352 (2024).
Wang, L. et al. Feasibility assessment of infectious keratitis depicted on slit-lamp and smartphone photographs using deep learning. Int. J. Med. Inform. 155, 104583 (2021).
Askarian, B., Ho, P. & Chong, J. W. Detecting cataract using smartphones. IEEE J. Transl. Eng. Health Med. 9, 3800110 (2021).
Liu, Y. et al. Accurate detection and grading of pterygium through smartphone by a fusion training model. Br. J. Ophthalmol. 108, 336–342 (2024).
Dong, L. et al. Artificial intelligence for screening of multiple retinal and optic nerve diseases. JAMA Netw. Open 5, e229960 (2022).
Kang, E. Y.-C. et al. A multimodal imaging-based deep learning model for detecting treatment-requiring retinal vascular diseases: model development and validation study. JMIR Med. Inform. 9, e28868 (2021).
Zandi, R. et al. Exploring diagnostic precision and triage proficiency: a comparative study of GPT-4 and bard in addressing common ophthalmic complaints. Bioeng. Basel 11, 120 (2024).
Nissen, M. et al. The effects of health care chatbot personas with different social roles on the client-chatbot bond and usage intentions: development of a design codebook and web-based study. J. Med. Internet Res. 24, e32630 (2022).
Xiong, J. et al. Multimodal machine learning using visual fields and peripapillary circular OCT scans in detection of glaucomatous optic neuropathy. Ophthalmology 129, 171–180 (2022).
Zong, H. et al. Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med. Educ. 24, 143 (2024).
Pushpanathan, K. et al. Popular large language model chatbots' accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries. iScience 26, 108163 (2023).
Liu, X. C. et al. Uncovering language disparity of ChatGPT on retinal vascular disease classification: cross-sectional study. J. Med. Internet Res. 26, e51926 (2024).
Peng, Z. et al. Development and evaluation of multimodal AI for diagnosis and triage of ophthalmic diseases using ChatGPT and anterior segment images: protocol for a two-stage cross-sectional study. Front. Artif. Intell. 6, 1323924 (2023).
Acknowledgements
We sincerely appreciate the support from Yuhang Jiang (Jiangsu Health Vocational College) for assisting with data collection, Xiang Zeng (Fudan University) for assisting with image formatting, and Shi Yin and Xingxing Wu for assisting with auditing the clinical data. The authors declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by the Shanghai Hospital Development Center (Grant Number SHDC2023CRD013), the National Key R&D Program of China (Grant Number 2018YFA0701700), the National Natural Science Foundation of China (Grant Numbers 82371101, U20A20170, 62271337, 62371328, 62371326), the Zhejiang Key Research and Development Project (Grant Number 2021C03032), and the Natural Science Foundation of Jiangsu Province of China (Grant Numbers BK20211308, BK20231310).
Author information
Contributions
R.M., Q.C., Z.P., J.Q., X.C., and C.Z. conceived and designed the study. J.G., K.X., J.C., R.Z., H.C., X.Z., and J.H. guided the research design. R.M. and Z.P. collected data in the in silico development stage. J.L. and M.Y. collected data during the silent evaluation and early clinical evaluation stages. J.L., M.Y., J.W., W.X., X.L., and H.C. organized image data. R.M., J.Y., Z.P., L.T., P.J., X.L., L.G., H.C., and X.W. audited the clinical data. Q.C., J.L., W.Z., D.X., B.N., J.W., F.S., and X.C. implemented the multimodal system. R.M., Q.C., J.Y., and Z.P. drafted the initial manuscript. L.T., J.L., M.Y., W.S., and J.W. coordinated end-user testing. R.M., Q.C., and Y.Z. conducted statistical analysis. Y.Z., J.Z., B.Q., and Q.J. contributed to data collection and entry. F.S., J.Q., X.C., and C.Z. were responsible for the decision to submit the manuscript. All authors reviewed and revised the manuscript, approved the final version, and had access to all the data in the study.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ma, R., Cheng, Q., Yao, J. et al. Multimodal machine learning enables AI chatbot to diagnose ophthalmic diseases and provide high-quality medical responses. npj Digit. Med. 8, 64 (2025). https://doi.org/10.1038/s41746-025-01461-0