In the medical field, health data is inherently multimodal, encompassing both physical measurements and natural-language narratives1. Ophthalmology, a discipline that relies heavily on multimodal information, requires detailed patient histories and visual examinations2. Consequently, multimodal machine learning is becoming increasingly important for medical diagnostics in ophthalmology. Previous studies on ophthalmic diagnostic models have underscored the substantial potential of image recognition AI in automating tasks that demand clinical expertise3. Recently, chatbot-based multimodal generative AI has emerged as a promising avenue for advancing precision health, integrating health data from both imaging and textual perspectives. However, commonly used public models such as GPT-4V and Google's VLM, while demonstrating some diagnostic capability, are currently deemed inadequate for clinical decision-making in ophthalmology4,5. In addition, these models cannot yet actively collect patient histories through natural-language human-computer interaction or accurately interpret images acquired with non-specialist equipment such as smartphones. Overcoming these challenges in multimodal AI is crucial and could pave the way for self-diagnosis of ophthalmic conditions in home settings, yielding substantial socioeconomic benefits6. Therefore, there is both a need and potential for further development of multimodal AI models to diagnose and triage ophthalmic diseases.

In the domain of gathering medical histories through human-computer interaction, the introduction of prompt engineering, a streamlined, data- and parameter-efficient technique for aligning large language models with the intricate demands of medical history inquiries, represents a significant advancement7. Expanding on this innovation, we propose an interactive ophthalmic consultation system that utilizes AI chatbots' robust text analysis capabilities to autonomously generate inquiries about a patient's medical history based on their chief complaints. While previous research has not directly utilized chatbots for collecting ophthalmic medical histories, evidence indicates that current AI chatbots can provide precise and detailed responses to ophthalmic queries, such as retina-related multiple-choice questions, myopia-related open-ended questions, and questions about urgency triage8,9,10. Furthermore, these chatbots excel at delivering high-quality natural language responses to medical inquiries, often exhibiting greater empathy than human doctors11. Therefore, we propose that the system should be capable of formulating comprehensive diagnoses and tailored recommendations based on the patient's responses. A significant drawback of previous studies on language models is their reliance on simulated patient histories or perspectives created by researchers, which lack validation in real-world, large-scale clinical settings. This limitation raises uncertainties about the practical applications and autonomous deployment capabilities of these models. Hence, it is imperative to develop and validate an embodied conversational agent in authentic clinical settings, where patients themselves contribute data.

Regarding ophthalmic imaging, previous research has primarily focused on slit-lamp photographs, yielding promising results. For instance, a novel end-to-end fully convolutional network has been developed to diagnose infectious keratitis using corneal photographs12. Recently, exploration into smartphone videos has shown their effectiveness in diagnosing pediatric eye diseases, potentially aiding caregivers in identifying visual impairments in children13. Algorithms using smartphone-acquired photographs have also proven valuable in measuring anterior segment depth, which is particularly useful for screening primary angle-closure glaucoma14. Together, these studies underscore the utility of both slit-lamp photographs and smartphone-acquired images in addressing diverse challenges in ophthalmic diagnostics. However, a significant limitation of previous AI imaging and multimodal studies is their narrow focus on single diseases. These pipelines often operate independently within their domains, lacking integration across different fields to enhance overall functionality15. For example, applying a model designed for herpes zoster ophthalmicus to triage a patient with cataracts may yield irrelevant outputs16. Moreover, the reliance on single disease types increases the risk of poor model generalizability due to spectrum bias, a disparity in disease prevalence between the model's development population and its intended application population17. This limitation is particularly problematic for home-based self-diagnosis and self-triage. To address these challenges, a recent study developed and evaluated a novel machine learning system optimized for ophthalmic triage, using data from 9825 patients across 95 conditions18. Thus, multimodal AI capable of diagnosing and triaging multiple ophthalmic diseases is essential for improving care across diverse populations.

Based on the aforementioned context, our study aims to develop an Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS), an embodied conversational agent integrated with the AI chatbot ChatGPT. This system is designed for multimodal diagnosis and triage using eye images captured by slit-lamp or smartphone, alongside medical history. Clinical evaluations will be conducted across three centers, focusing on a comprehensive investigation of 50 prevalent ophthalmic conditions. The primary objective is to assess diagnostic effectiveness, with a secondary focus on triage performance across 10 ophthalmic subspecialties. Our research aims to explore the application of AI in complex clinical settings, incorporating data contributions not only from researchers but also directly from patients, thereby simulating real-world scenarios to ensure the practicality of AI technology in ophthalmic care.

Results

Overview of the study and datasets

We conducted this study at three centers, collecting 15640 data entries from 9825 subjects (4554 male, 5271 female) to develop and evaluate the IOMIDS system (Fig. 1a). Among these, 6551 entries belong to the model development dataset, 912 entries belong to the silent evaluation dataset, and 8177 entries belong to the clinical trial dataset (Supplementary Fig. 1). In detail, we first collected a doctor-patient communication dialog dataset of 450 entries to train the text model through prompt engineering. Next, to assess the diagnostic and triage efficiency of the text model, we collected Dataset A (Table 1), consisting of simulated patient data derived from outpatient records. We then gathered two image datasets (Table 1, Dataset B and Dataset C) for training and validating image diagnostic models, which contain only images and the corresponding image-based diagnostic data. Dataset D, Dataset E and Dataset F (Table 1) were then collected to evaluate image diagnostic model performance and develop a text-image multimodal model. These datasets include both patient medical histories and anterior segment images. Following in silico development of the IOMIDS program, we collected a silent evaluation dataset to compare the diagnostic and triage efficacy among different models (Table 1, Dataset G). The early clinical evaluation consists of internal evaluation (Shanghai center) and external evaluation (Nanjing and Suqian), with 3519 entries from 2292 patients in Shanghai, 2791 entries from 1748 patients in Nanjing, and 1867 entries from 1192 patients in Suqian. Comparison among these centers reveals significant differences in subspecialties, disease classifications, gender, age, and laterality (Supplementary Table 1), suggesting that these factors may influence model performance and should be considered in further analyses.

Fig. 1: Overview of the workflow and functionality of IOMIDS.
figure 1

a Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) is an embodied conversational agent integrated with ChatGPT designed for multimodal diagnosis using eye images and medical history. It comprises a text model and an image model. The text model employs classifiers for chief complaints, along with question and analysis prompts developed from real doctor-patient dialogs. The image model utilizes eye photos taken with a slit-lamp and/or smartphone for image-based diagnosis. These modules combine through diagnostic prompts to create a multimodal model. Patients with eye discomfort can interact with IOMIDS using natural language. This interaction enables IOMIDS to gather patient medical history, guide them in capturing eye lesion photos with a smartphone or uploading slit-lamp images, and ultimately provide disease diagnosis and ophthalmic subspecialty triage information. b Both the text model and the multimodal models follow a similar workflow for text-based modules. After a patient inputs their chief complaint, it is classified by the chief complaint classifier using keywords, triggering relevant question and analysis prompts. The question prompt guides ChatGPT to ask specific questions to gather the patient’s medical history. The analysis prompt considers the patient’s gender, age, chief complaint, and medical history to generate a preliminary diagnosis. If no image information is provided, IOMIDS provides the preliminary diagnosis along with subspecialty triage and prevention, treatment, and care guidance as the final response. If image information is available, the diagnosis prompt integrates image analysis with the preliminary diagnosis to provide a final diagnosis and corresponding guidance. c The text + image multimodal model is divided into text + slit-lamp, text + smartphone, and text + slit-lamp + smartphone models based on image acquisition methods. For smartphone-captured images, YOLOv7 segments the image to isolate the affected eye, removing other facial information, followed by analysis using a ResNet50-trained diagnostic model. Slit-lamp captured images skip segmentation and are directly analyzed by another ResNet50-trained model. Both diagnostic outputs undergo threshold processing to exclude non-relevant diagnoses. The image information is then integrated with the preliminary diagnosis derived from textual information via the diagnosis prompt to form the multimodal model.
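
As a rough illustration of the smartphone branch described in panel (c), the sketch below crops the detected eye region and scores it with a ResNet50 classifier. This is a hypothetical reconstruction, not the authors' code: the detect_eye helper stands in for the trained YOLOv7 detector, and the weight path and label set are assumed placeholders; the threshold-based exclusion of low-scoring classes happens downstream (see Results).

```python
# Rough sketch of the smartphone-image branch in Fig. 1c (illustrative, not the authors' code).
# `detect_eye` stands in for the trained YOLOv7 detector; the weight path and label set are
# placeholders, and threshold-based exclusion of low-scoring classes is applied downstream.
import torch
from torchvision import models, transforms
from PIL import Image

CLASSES = ["cataract", "keratitis", "pterygium", "others"]  # assumed label order

def detect_eye(image: Image.Image) -> tuple[int, int, int, int]:
    """Stand-in for the YOLOv7 eye detector; here it simply returns the full frame."""
    return (0, 0, image.width, image.height)

classifier = models.resnet50()
classifier.fc = torch.nn.Linear(classifier.fc.in_features, len(CLASSES))
classifier.load_state_dict(torch.load("smartphone_resnet50.pt", map_location="cpu"))  # placeholder weights
classifier.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def image_scores(path: str) -> dict[str, float]:
    """Crop the detected eye region (removing other facial information) and score each class."""
    image = Image.open(path).convert("RGB")
    eye = image.crop(detect_eye(image))
    with torch.no_grad():
        probs = torch.softmax(classifier(preprocess(eye).unsqueeze(0)), dim=1)[0]
    return {c: float(p) for c, p in zip(CLASSES, probs)}
```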

Table 1 Summary of the development and silent evaluation datasets used in this study

Development of the IOMIDS system

To develop the text model, we categorized doctor-patient dialogs according to chief complaint themes (Supplementary Table 2). Three researchers independently reviewed the dataset and each selected a set of 90 dialogs for training. Based on these dialogs, we used prompt engineering (Fig. 1b) to develop an embodied conversational agent with ChatGPT. After comparison, the most effective set of 90 dialogs (Supplementary Data 1) was identified, finalizing the text model for further research. These included 11 dialogs on “dry eye”, 10 on “itchy eye”, 10 on “red eye”, 7 on “eye swelling”, 10 on “eye pain”, 8 on “eye discharge”, 5 on “eye masses”, 13 on “blurry vision”, 6 on “double vision”, 6 on “eye injuries or foreign bodies”, and 4 on “proptosis”. This text model can reliably generate questions related to the chief complaint and provide a final response based on the patient’s answers, which includes diagnostic, triage, and other preventive, therapeutic, and care guidance.

After developing the text model, we evaluated its performance using Dataset A (Table 1). The results demonstrated varying diagnostic accuracy across diseases (Fig. 2a). Specifically, the model performed least effectively for primary anterior segment diseases (cataract, keratitis, and pterygium), achieving only 48.7% accuracy (Supplementary Fig. 2a). To identify conditions that did not meet development goals, we analyzed the top 1–3 diseases in each subspecialty. The results showed that the following did not achieve the targets of sensitivity ≥ 90% and specificity ≥ 95% (Fig. 2a): keratitis, pterygium, cataract, glaucoma, and thyroid eye disease. Clinical experience suggests that slit-lamp and smartphone-captured images are valuable for diagnosing cataract, keratitis, and pterygium. Therefore, development efforts for the image-based diagnostic models would focus on these three conditions.

Fig. 2: In silico development and silent evaluation of IOMIDS.
figure 2

a Heatmaps of diagnostic (top) and triage (bottom) performance metrics after in silico evaluation of the text model (Dataset A). Metrics are column-normalized from -2 (blue) to 2 (red). Disease types are categorized into six major classifications. The leftmost lollipop chart displays the prevalence of each diagnosis and triage. b Radar charts of disease-specific diagnosis (red) and triage (green) accuracy in Dataset A. Rainbow ring represents six disease classifications. Asterisks indicate significant differences between diagnosis and triage accuracy based on Fisher’s exact test. c Bar charts of overall accuracy and disease-specific accuracy for diagnosis (red) and triage (green) after silent evaluation across different models (Dataset G). The line graph below denotes the model used: text model, text + slit-lamp model, text + smartphone model, and text + slit-lamp + smartphone model. d Sankey diagram of Dataset G illustrating the flow of diagnoses across different models for each case. Each line represents a case. PPV, positive predictive value; NPV, negative predictive value; * P < 0.05, ** P < 0.01, *** P < 0.001, **** P < 0.0001.

Beyond diagnosis, the chatbot effectively provided triage information. Statistical analysis revealed high overall triage accuracy (88.3%), significantly outperforming diagnostic accuracy (84.0%; Fig. 2b; Fisher’s exact test, P = 0.0337). All subspecialties achieved a negative predictive value ≥ 95%, and all, except optometry (79.7%) and retina (77.6%), achieved a positive predictive value ≥ 85% (Dataset A in Supplementary Data 2). Thus, eight out of ten subspecialties met the predefined developmental targets. Future multimodal model development will focus on enhancing diagnostic capabilities while utilizing the text model’s triage prompts without additional refinement.

To develop a multimodal model combining text and images, we first created two image-based diagnostic models based on Dataset B and Dataset C (Table 1), with 80% of the images used for training and 20% for validation. The slit-lamp model achieved disease-specific accuracies of 79.2% for cataract, 87.6% for keratitis, and 98.4% for pterygium (Supplementary Fig. 2b). The smartphone model achieved disease-specific accuracies of 96.2% for cataract, 98.4% for keratitis, and 91.9% for pterygium (Supplementary Fig. 2c). After developing the image diagnostic models, we collected Dataset D, Dataset E and Dataset F (Table 1), which included both imaging results and patient history. Clinical diagnosis requires integrating medical history and eye imaging features, so clinical and image diagnoses may not always align (Supplementary Fig. 3a). To address this, we used image information only to rule out diagnoses. Using image diagnosis as the gold standard, we plotted the receiver operating characteristic (ROC) curves for cataract, keratitis, and pterygium in Dataset D (Supplementary Fig. 3b) and Dataset E (Supplementary Fig. 3c). The threshold >0.363 provided high specificity for all three conditions (cataract 83.5%, keratitis 99.2%, pterygium 96.6%) in Dataset D and was used to develop the text + slit-lamp multimodal model. Similarly, in Dataset E, the threshold >0.315 provided high specificity for all three conditions (cataract 96.8%, keratitis 98.5%, pterygium 95.0%) and was used to develop the text + smartphone multimodal model. In the text + slit-lamp + smartphone multimodal model, we tested two methods to combine the results from slit-lamp and smartphone images. The first method used the union of the diagnoses excluded by each model, while the second used the intersection. Testing on Dataset F showed that the first method achieved significantly higher accuracy (52.2%, Supplementary Fig. 3d) than the second method (31.9%, Supplementary Fig. 3e; Fisher’s exact test, P < 0.0001). Therefore, we applied the first method in all subsequent evaluations for the text + slit-lamp + smartphone model.
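
The rule-out logic and the two fusion strategies compared on Dataset F can be summarized in the short sketch below. Only the thresholds (0.363 and 0.315) come from the ROC analysis above; the function and variable names are illustrative rather than the original implementation.

```python
# Sketch of the rule-out logic and the two fusion strategies tested on Dataset F
# (illustrative; function and variable names are not from the original code).
SLIT_LAMP_THRESHOLD = 0.363   # chosen from the Dataset D ROC analysis
SMARTPHONE_THRESHOLD = 0.315  # chosen from the Dataset E ROC analysis
TARGETS = {"cataract", "keratitis", "pterygium"}

def excluded_diagnoses(scores: dict[str, float], threshold: float) -> set[str]:
    """Diagnoses whose image score falls at or below the threshold are ruled out."""
    return {d for d in TARGETS if scores.get(d, 0.0) <= threshold}

def combine_exclusions(slit_scores, phone_scores, method="union"):
    """Method 1 (union) performed best on Dataset F; method 2 is the intersection."""
    slit = excluded_diagnoses(slit_scores, SLIT_LAMP_THRESHOLD)
    phone = excluded_diagnoses(phone_scores, SMARTPHONE_THRESHOLD)
    return slit | phone if method == "union" else slit & phone

# Example: the slit-lamp image rules out pterygium only, the smartphone image rules out
# keratitis and pterygium; the union passes {keratitis, pterygium} to the diagnosis prompt
# as diagnoses to exclude.
ruled_out = combine_exclusions(
    {"cataract": 0.70, "keratitis": 0.40, "pterygium": 0.10},
    {"cataract": 0.60, "keratitis": 0.20, "pterygium": 0.05},
)
```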

Using clinical diagnosis as the gold standard, the diagnostic accuracy of all multimodal models significantly improved compared to the text model: the text + slit-lamp model increased from 32.0% to 65.5% (Fisher's exact test, P < 0.0001), the text + smartphone model increased from 41.6% to 64.2% (Fisher's exact test, P < 0.0001), and the text + slit-lamp + smartphone model increased from 37.4% to 52.2% (Fisher's exact test, P = 0.012). Therefore, we successfully developed four models for the IOMIDS system: the unimodal text model, the text + slit-lamp multimodal model, the text + smartphone multimodal model, and the text + slit-lamp + smartphone multimodal model.

Silent evaluation of diagnostic and triage performance

During the silent evaluation phase, Dataset G was collected to validate the diagnostic and triage performance of the IOMIDS system. Although the diagnostic performance for cataract, keratitis, and pterygium (Dataset G in Supplementary Data 3) did not meet the established clinical goal, significant improvements in diagnostic accuracy were observed for all multimodal models compared to the text model (Fig. 2c). The Sankey diagram revealed that in the text model, 70.8% of cataract cases and 78.3% of pterygium cases were misclassified as “others” (Fig. 2d). In the “others” category, the text + slit-lamp multimodal model correctly identified 88.2% of cataract cases and 63.0% of pterygium cases. The text + smartphone multimodal model performed even better, correctly diagnosing 93.3% of cataract cases and 80.0% of pterygium cases. Meanwhile, the text + slit-lamp + smartphone multimodal model accurately identified 90.5% of cataract cases and 68.2% of pterygium cases in the same category.

Regarding triage accuracy, the overall performance improved with the multimodal models. However, the accuracy for cataract triage notably decreased, dropping from 91.7% to 62.5% in the text + slit-lamp model (Fig. 2c, Fisher’s exact test, P = 0.0012), to 58.3% in the text + smartphone model (Fig. 2c, Fisher’s exact test, P = 0.0003), and further to 53.4% in the text + slit-lamp + smartphone model (Fig. 2c, Fisher’s exact test, P = 0.0001). Moreover, neither the text model nor the three multimodal models met the established clinical goal in any subspecialty (Supplementary Data 2).

We also investigated whether the medical histories in the outpatient electronic system alone were sufficient for the text model to achieve accurate diagnostic and triage results. We randomly sampled 104 patients from Dataset G and re-entered their medical dialogs into the text model (Supplementary Fig. 1). For information not recorded in the outpatient history, responses were given as "no information available". The results showed a significant decrease in diagnostic accuracy, dropping from 63.5% to 20.2% (Fisher's exact test, P < 0.0001), while triage accuracy remained relatively unchanged, decreasing only slightly from 72.1% to 70.2% (Fisher's exact test, P = 0.8785). This suggests that while the triage accuracy of the text model does not depend on dialog completeness, diagnostic accuracy is affected by the completeness of the answers provided. Therefore, thorough responses to AI chatbot queries are crucial in clinical applications.
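
For illustration, this 104-case comparison can be checked with a two-by-two Fisher's exact test; the counts below are reconstructed from the reported percentages and are therefore approximate.

```python
# Rough check of the 104-case comparison using Fisher's exact test
# (counts are reconstructed from the reported percentages, so this is an approximation).
from scipy.stats import fisher_exact

n = 104
correct_full = round(0.635 * n)      # ~66 correct diagnoses with complete dialogs
correct_ehr_only = round(0.202 * n)  # ~21 correct diagnoses with outpatient records only

table = [
    [correct_full, n - correct_full],
    [correct_ehr_only, n - correct_ehr_only],
]
odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, P = {p_value:.2e}")  # P falls far below 0.0001
```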

Evaluation in real clinical settings with trained researchers

The clinical trial involved two parts: researcher-collected data and patient-entered data (Table 2). There was a significant difference in the number of words input and the duration of input between researchers and patients. For researcher-collected data, the averages were 38.5 ± 8.2 words and 58.2 ± 13.5 s, while for patient-entered data, the averages were 55.5 ± 10.3 words (t-test, P = 0.002) and 128.8 ± 27.1 s (t-test, P < 0.0001). We first assessed the diagnostic performance during the researcher-collected data phase. For the text model across six datasets (Dataset 1–3, 6–8 in Supplementary Data 3), the number of diseases meeting the clinical goal for diagnosis was as follows: 16 out of 46 diseases (46.4% of all cases) in Dataset 1, 16 out of 32 diseases (16.4% of all cases) in Dataset 2, 18 out of 28 diseases (61.4% of all cases) in Dataset 3, 19 out of 48 diseases (43.9% of all cases) in Dataset 6, 14 out of 28 diseases (35.3% of all cases) in Dataset 7, and 11 out of 42 diseases (33.3% of all cases) in Dataset 8. Thus, less than half of the cases in the researcher-collected data phase met the clinical goal for diagnosis.

Table 2 Summary of the clinical datasets used in this study

Next, we investigated the subspecialty triage accuracy of the text model across various datasets (Dataset 1–3, 6–8 in Supplementary Data 2). Our findings revealed that during internal validation, the cornea subspecialty achieved the clinical goal for triaging ophthalmic diseases. In external validation, the general outpatient clinic, cornea subspecialty, optometry subspecialty, and glaucoma subspecialty also met these clinical criteria. We further compared the diagnostic and triage outcomes of the text model across six datasets. Data analysis demonstrated that triage accuracy exceeded diagnostic accuracy in most datasets (Supplementary Fig. 4a–c, e–g). Specifically, triage accuracy was 88.7% compared to diagnostic accuracy of 69.3% in Dataset 1 (Fig. 3a; Fisher’s exact test, P < 0.0001), 84.1% compared to 62.4% in Dataset 2 (Fisher’s exact test, P < 0.0001), 82.5% compared to 75.4% in Dataset 3 (Fisher’s exact test, P = 0.3508), 85.7% compared to 68.6% in Dataset 6 (Fig. 3a; Fisher’s exact test, P < 0.0001), 80.5% compared to 66.5% in Dataset 7 (Fisher’s exact test, P < 0.0001), and 84.5% compared to 65.1% in Dataset 8 (Fisher’s exact test, P < 0.0001). This suggests that while the text model may not meet clinical diagnostic needs, it could potentially fulfill clinical triage requirements.

Fig. 3: Internal and external evaluation of IOMIDS performance on diagnosis and triage.
figure 3

a Radar charts of disease-specific diagnosis (red) and triage (green) accuracy after clinical evaluation of the text model in internal (left, Dataset 1) and external (right, Dataset 6) centers. Asterisks indicate significant differences between diagnosis and triage accuracy based on Fisher’s exact test. b Circular stacked bar charts of disease-specific diagnostic accuracy across different models from internal (left, Dataset 2–4) and external (right, Dataset 7–9) evaluations. Solid bars represent the text model, while hollow bars represent multimodal models. Asterisks indicate significant differences in diagnostic accuracy between two models based on Fisher’s exact test. c Bar charts of overall accuracy (upper) and accuracy of primary anterior segment diseases (lower) for diagnosis (red) and triage (green) across different models in Dataset 2–5 and Dataset 7–10. The line graphs below denote study centers (internal, external), models used (text, text + slit-lamp, text + smartphone, text + slit-lamp + smartphone), and data provider (researchers, patients). * P < 0.05, ** P < 0.01, *** P < 0.001, **** P < 0.0001.

We then investigated the diagnostic performance of multimodal models in Dataset 2, 3, 7, and 8 (Supplementary Data 3). Both the text + slit-lamp model and text + smartphone model demonstrated higher overall diagnostic accuracy compared to the text model in internal and external validations, with statistically significant improvements noted for the text + smartphone model in Dataset 8 (Fig. 3c). The clinical goal for diagnosing ophthalmic diseases was achieved by 11 out of 32 diseases (13.8% of all cases) in Dataset 2, 21 out of 28 diseases (70.2% of all cases) in Dataset 3, 11 out of 28 diseases (28.5% of all cases) in Dataset 7, and 15 out of 42 diseases (50.6% of all cases) in Dataset 8. The text + smartphone model outperformed the text model by meeting the clinical goal for diagnosis in more cases and disease types. For some other diseases that did not meet the clinical goal for diagnosis, significant improvements in diagnostic accuracy were also found within the multimodal models (Fig. 3b). Therefore, the multimodal model exhibited better diagnostic performance compared to the text model.

Regarding triage, some datasets of the multimodal models showed a minor decrease in accuracy compared to the text model; however, these differences were not statistically significant (Fig. 3c). Unlike the silent evaluation phase, in clinical applications neither of the two multimodal models demonstrated a notable decline in triage accuracy across different diseases, including cataract (Supplementary Fig. 5). In summary, data collected by researchers indicated that the multimodal models outperformed the text model in diagnostic accuracy but were slightly less efficient in triage.

Evaluation in real clinical settings with untrained patients

During the patient-entered data phase, considering the convenience of smartphones, we focused on the text model, the text + smartphone model, and the text + slit-lamp + smartphone model. First, we compared the triage accuracy. Consistent with the researcher-collected data phase, the overall triage accuracy of the multimodal models was slightly lower than that of the text model, but this difference was not statistically significant (Fig. 3c). For subspecialties, in both internal and external validation, the text and multimodal models for the general outpatient clinic and glaucoma met the clinical goals for triaging ophthalmic diseases. Additionally, internal validation showed that the multimodal models met these standards for the cornea, optometry, and retina subspecialties. In external validation, the text model met the standards for cornea and retina, while the multimodal models met the standards for cataract and retina. These results suggest that both the text model and the multimodal models meet the triage requirements when patients input their own data.

Next, we compared the diagnostic accuracy of the text model and the multimodal models. Results revealed that in both internal and external validations, all diseases met the specificity criterion of ≥ 95%. In Dataset 4, the text model met the clinical criterion of sensitivity ≥ 75% in 15 out of 42 diseases (40.5% of cases), while the text + smartphone multimodal model met this criterion in 24 out of 42 diseases (78.6% of cases). In Dataset 5, the text model achieved the sensitivity threshold of ≥ 75% in 14 out of 40 diseases (48.3% of cases), while the text + slit-lamp + smartphone multimodal model met this criterion in only 10 out of 40 diseases (35.0% of cases). In Dataset 9, the text model achieved the clinical criterion in 24 out of 43 diseases (57.1%), while the text + smartphone model met the criterion in 28 out of 43 diseases (81.9%). In Dataset 10, the text model achieved the criterion in 25 out of 42 diseases (62.5%), whereas the text + slit-lamp + smartphone model met the criterion in 22 out of 42 diseases (50.8%). This suggests that the text + smartphone model outperforms the text model, while the text + slit-lamp + smartphone model does not. Further statistical analysis confirmed the superiority of text + smartphone model when comparing its diagnostic accuracy with the text model in both Dataset 4 and Dataset 9 (Fig. 3c). We also conducted an analysis of diagnostic accuracy for individual diseases, identifying significant improvements for certain diseases (Fig. 3b). These findings collectively show that during the patient-entered data phase, the text + smartphone model not only meets triage requirements but also delivers better diagnostic performance than both the text model and the text + slit-lamp + smartphone model.

We further compared the diagnostic and triage accuracy of the text model in Dataset 4 and Dataset 9. Consistent with previous findings, both internal validation (triage: 80.4%, diagnosis: 69.6%; Fisher’s exact test, P < 0.0001) and external validation (triage: 84.7%, diagnosis: 72.5%; Fisher’s exact test, P < 0.0001) demonstrated significantly higher triage accuracy compared to diagnostic accuracy for the text model (Supplementary Fig. 4d, h). Examining individual diseases, cataract exhibited notably higher triage accuracy than diagnostic accuracy in internal validation (Dataset 4: triage 76.8%, diagnosis 51.2%; Fisher’s exact test, P = 0.0011) and external validation (Dataset 9: triage 87.3%, diagnosis 58.2%; Fisher’s exact test, P = 0.0011). Interestingly, in Dataset 4, the diagnostic accuracy for myopia (94.0%) was significantly higher (Fisher’s exact test, P = 0.0354) than the triage accuracy (80.6%), indicating that the triage accuracy of the text model may not be influenced by diagnostic accuracy. Subsequent regression analysis is necessary to investigate the factors determining triage accuracy.

Due to varying proportions of the disease classifications across the three centers (Supplementary Table 1), we further explored changes in diagnostic and triage accuracy within each classification. Results revealed that, regardless of whether data was researcher-collected or patient-reported, diagnostic accuracy for primary anterior segment diseases (cataract, keratitis, pterygium) was significantly higher in the multimodal model compared to the text model in both internal and external validation (Fig. 3c). Further analysis of cataract, keratitis, and pterygium across Datasets 2, 3, 4, 7, 8, and 9 (Fig. 3b) also showed that, similar to the silent evaluation phase, multimodal model diagnostic accuracy for cataract significantly improved compared to the text model in most datasets. Pterygium and keratitis exhibited some improvement but showed no significant change across most datasets due to sample size limitations. For the other five major disease categories, multimodal model diagnostic accuracy did not consistently improve and even significantly declined in some categories (Supplementary Fig. 6). These findings indicate that the six major disease categories may play crucial roles in influencing the diagnostic performance of the models, underscoring the need for further detailed investigation.

Comparison of diagnostic performance in different models

To further compare the diagnostic accuracy of different models across various datasets, we conducted comparisons within six major disease categories. The results revealed significant differences in diagnostic accuracy among the models across these categories (Fig. 4a). For example, when comparing the text + smartphone model (Datasets 4, 9) to the text model (Datasets 1, 6), both internal and external validations showed higher diagnostic accuracy for the former in primary anterior segment diseases, other anterior segment diseases, and intraorbital diseases and emergency categories compared to the latter (Fig. 4a, b). Interestingly, contrary to previous findings within datasets, comparisons across datasets demonstrated a notable decrease in diagnostic accuracy for the text + slit-lamp model (Dataset 1 vs 2, Dataset 6 vs 7) and the text + slit-lamp + smartphone model (Dataset 4 vs 5, Dataset 9 vs 10) in the categories of other anterior segment diseases and vision disorders in both internal and external validations (Fig. 4a). This suggests that, in addition to the model used and the disease categories, other potential factors may influence the model’s diagnostic accuracy.

Fig. 4: Comparison of diagnostic performance across different models.
figure 4

a Bar charts of diagnostic accuracy calculated for each disease classification across different models from internal (upper, Dataset 1–5) and external (lower, Dataset 6–10) evaluations. The bar colors represent disease classifications. The line graphs below denote study centers, models used, and data providers. b Heatmaps of diagnostic performance metrics after internal (left) and external (right) evaluations of different models. For each heatmap, metrics in the text model and text + smartphone model are normalized together by column, ranging from -2 (blue) to 2 (red). Disease types are classified into six categories and displayed by different colors. c Multivariate logistic regression analysis of diagnostic accuracy for all cases (left) and subgroup analysis for follow-up cases (right) during clinical evaluation. The first category in each factor is used as a reference, and OR values and 95% CIs for other categories are calculated against these references. OR, odds ratio; CI, confidence interval; *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.

We then conducted univariate and multivariate regression analyses to explore factors influencing diagnostic accuracy. Univariate analysis revealed that seven factors (age, laterality, number of visits, disease classification, model, data provider, and words input) significantly influence diagnostic accuracy (Supplementary Table 3). In multivariate analysis, six factors (age, laterality, number of visits, disease classification, model, and words input) remained significant, while the data provider was no longer a critical factor (Fig. 4c). Subgroup analysis of follow-up cases showed that only the model type significantly influenced diagnostic accuracy (Fig. 4c). For first-visit patients, three factors (age, disease classification, and model) were still influential. Further analysis across different age groups within each disease classification revealed that the multimodal models generally outperformed or performed comparably to the text model in most disease categories (Table 3). However, all multimodal models, including the text + slit-lamp model (OR: 0.21 [0.04–0.97]), the text + smartphone model (OR: 0.17 [0.09–0.32]), and the text + slit-lamp + smartphone model (OR: 0.16 [0.03–0.38]), showed limitations in diagnosing visual disorders in patients over 45 years old compared to the text model (Table 3). Additionally, both the text + slit-lamp model (OR: 0.34 [0.20–0.59]) and the text + slit-lamp + smartphone model (OR: 0.67 [0.43–0.89]) were also less effective for diagnosing other anterior segment diseases in this age group. In conclusion, for follow-up cases, both text + slit-lamp and text + smartphone models are suitable, with a preference for the text + smartphone model. For first-visit patients, the text + smartphone model is recommended, but its diagnostic efficacy for visual disorders in patients over 45 years old (such as presbyopia) may be inferior to that of the text model.
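
A minimal sketch of this regression analysis, assuming a per-case table with one row per diagnostic attempt, is shown below; the column names, file path, and grouping of continuous variables are illustrative rather than the original analysis code.

```python
# Minimal sketch of the multivariate logistic regression behind Fig. 4c
# (column names and file path are hypothetical, not the original analysis code).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

cases = pd.read_csv("clinical_cases.csv")  # assumed table: one row per diagnostic attempt

# Each factor is treated as categorical; the first level serves as the reference,
# so exponentiated coefficients are odds ratios against that reference.
fit = smf.logit(
    "diagnosis_correct ~ C(age_group) + C(laterality) + C(visit_number)"
    " + C(disease_class) + C(model_type) + C(data_provider) + C(words_input_group)",
    data=cases,
).fit()

ci = np.exp(fit.conf_int())
summary = pd.DataFrame({"OR": np.exp(fit.params), "CI_low": ci[0], "CI_high": ci[1], "P": fit.pvalues})
print(summary.round(3))

# Subgroup analyses (e.g., follow-up cases only) refit a reduced model on the subset:
followup_fit = smf.logit(
    "diagnosis_correct ~ C(age_group) + C(disease_class) + C(model_type)",
    data=cases[cases["visit_number"] == "follow-up"],
).fit()
```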

Table 3 Subgroup analysis of diagnostic accuracy with multivariate logistic regression for newly diagnosed patients

We also performed a regression analysis on triage accuracy. In the univariate logistic regression, the center and the data provider significantly influenced triage accuracy. Multivariate regression analysis showed that only the data provider significantly impacted triage accuracy, with patient-entered data significantly improving accuracy (OR: 1.40 [1.25–1.56]). Interestingly, neither model type nor diagnostic accuracy affected triage outcomes. Considering the earlier results from the patient-entered data phase, both the text model and the text + smartphone model are recommended as self-service triage tools for patients in clinical applications. Collectively, among the four models developed in our IOMIDS system, the text + smartphone model is the most suitable for patient self-diagnosis and self-triage.

Model interpretability

In subgroup analysis, we identified limitations in the diagnostic accuracy for all multimodal models for patients over 45 years old. The misdiagnosed cases in this age group were further analyzed to interpret the limitations. Both the text + slit-lamp model (Datasets 2, 7) and the text + slit-lamp + smartphone model (Datasets 5, 10) frequently misdiagnosed other anterior segment and visual disorders as cataracts or keratitis. For instance, with the text + slit-lamp + smartphone model, glaucoma (18 cases, 69.2%) and conjunctivitis (22 cases, 38.6%) were often misdiagnosed as keratitis, while presbyopia (6 cases, 54.5%) and visual fatigue (11 cases, 28.9%) were commonly misdiagnosed as cataracts. In contrast, both the text model (Datasets 1–10) and the text + smartphone model (Datasets 3, 4, 8, 9) had relatively low misdiagnosis rates for cataracts (text: 23 cases, 3.5%; text + smartphone: 91 cases, 33.7%) and keratitis (text: 16 cases, 2.4%; text + smartphone: 25 cases, 9.3%). These results suggest that in our IOMIDS system, the inclusion of slit-lamp images, whether in the text + slit-lamp model or the text + slit-lamp + smartphone model, may actually hinder diagnostic accuracy due to the high false positive rate for cataracts and keratitis.

We then examined whether these misdiagnoses could be justified through image analysis. First, we reviewed the misdiagnosed cataract cases. In the text + slit-lamp model, 30 images (91.0%) were consistent with a cataract diagnosis. However, clinically, they were mainly diagnosed with glaucoma (6 cases, 20.0%) and dry eye syndrome (5 cases, 16.7%). Similarly, in the text + smartphone model, photographs of 80 cases (88.0%) were consistent with a cataract diagnosis. Clinically, these cases were primarily diagnosed with refractive errors (20 cases), retinal diseases (15 cases), and dry eye syndrome (8 cases). We then analyzed the class activation maps of the two multimodal models. Both models showed regions of interest for cataracts near the lens (Supplementary Fig. 7), in accordance with clinical diagnostic principles. Thus, these multimodal models can provide some value for cataract diagnosis based on images but may lead to discrepancies with the final clinical diagnosis.

Next, we analyzed cases misdiagnosed as keratitis by the text + slit-lamp model. The results showed that only one out of 25 cases had an anterior segment photograph consistent with keratitis, indicating a high false-positive rate for keratitis with the text + slit-lamp model. We then conducted a detailed analysis of the class activation maps generated by this model during clinical application. The areas of interest for keratitis were centered around the conjunctiva rather than the corneal lesions (Supplementary Fig. 7a). Thus, the model appears to interpret conjunctival congestion as indicative of keratitis, contributing to the occurrence of false-positive results. In contrast, the text + smartphone model displayed areas of interest for keratitis near the corneal lesions (Supplementary Fig. 7b), which aligns with clinical diagnostic principles. Taken together, future research should focus on refining the text + slit-lamp model for keratitis diagnosis and prioritize optimizing the balance between text-based and image-based information to enhance diagnostic accuracy across both multimodal models.
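
As an illustration of how such class activation maps can be generated for a ResNet50 classifier, the sketch below implements a basic Grad-CAM over the last residual block. The study does not specify its exact interpretability implementation, so this code is an assumption offered only to make the idea concrete.

```python
# Sketch of producing class activation maps for a ResNet50 image model with Grad-CAM
# (an assumed approach; the original implementation details are not specified).
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Return a heatmap over the input showing regions driving the target class score."""
    activations, gradients = [], []
    layer = model.layer4[-1]  # last residual block of ResNet50
    h1 = layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        score = model(image.unsqueeze(0))[0, target_class]
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    weights = gradients[0].mean(dim=(2, 3), keepdim=True)         # pooled gradients per channel
    cam = F.relu((weights * activations[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()                   # normalized to [0, 1]

# Overlaying this heatmap on the anterior segment photo shows whether the "keratitis"
# class attends to the corneal lesion or, as observed for the slit-lamp model,
# to conjunctival congestion.
```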

Inter-model variability and inter-expert variability

We further evaluated the diagnostic accuracy of GPT4.0 and the domestic large language model (LLM) Qwen using Datasets 4, 5, 9, and 10. Additionally, we invited three trainees and three junior doctors to independently diagnose these diseases. Since the text + smartphone model performed the best in the IOMIDS system, we compared its diagnostic accuracy with that of the other two LLMs and ophthalmologists with varying levels of experience (Fig. 5a-b). The text + smartphone model (80.0%) outperformed GPT4.0 (71.7%, χ² test, P = 0.033) and showed similar accuracy to the mean performance of trainees (80.6%). Among the three LLMs, Qwen performed the poorest, comparable to the level of a junior doctor. However, all three LLMs fell short of expert-level performance, suggesting there is still potential for improvement.

Fig. 5: Assessment of model-expert agreement and the quality of chatbot responses.
figure 5

a Comparison of diagnostic accuracy of IOMIDS (text + smartphone model), GPT4.0, Qwen, expert ophthalmologists, ophthalmology trainees, and unspecialized junior doctors. The dotted lines represent the mean performance of ophthalmologists at different experience levels. b Heatmap of Kappa statistics quantifying agreement between diagnoses provided by AI models and ophthalmologists. c Kernel density plots of user satisfaction rated by researchers (red) and patients (blue) during clinical evaluation. d Example of an interactive chat with IOMIDS (left) and quality evaluation of the chatbot response (right). On the left, the central box displays the patient interaction process with IOMIDS: entering chief complaint, answering system questions step-by-step, uploading a standard smartphone-captured eye photo, and receiving diagnosis and triage information. The chatbot response includes explanations of the condition and guidance for further medical consultation. The surrounding boxes show a researcher’s evaluation of six aspects of the chatbot response. The radar charts on the right illustrate the quality evaluation across six aspects for chatbot responses generated by the text model (red) and the text + image model (blue). The axes for each aspect correspond to different coordinate ranges due to varying rating scales. Asterisks indicate significant differences between two models based on two-sided t-test. ** P < 0.01, *** P < 0.001, **** P < 0.0001.

We then analyzed the agreement between the answers provided by the LLMs and ophthalmologists (Fig. 5b). Agreement among expert ophthalmologists, who served as the gold standard in our study, was generally strong (κ: 0.85–0.95). Agreement among trainee doctors was moderate (κ: 0.69–0.83), as was the agreement among junior doctors (κ: 0.69–0.73). However, the agreement among the three LLMs was weaker (κ: 0.48–0.63). Notably, the text + smartphone model in IOMIDS showed better agreement with experts (κ: 0.72–0.80) compared to the other two LLMs (GPT4.0: 0.55–0.78; Qwen: 0.52–0.75). These results suggest that the text + smartphone model in IOMIDS demonstrates the best alignment with experts among the three LLMs.
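
For reference, pairwise agreement of this kind can be computed with Cohen's kappa, as sketched below; the rater names and example diagnoses are placeholders, not study data.

```python
# Pairwise Cohen's kappa between raters, as summarized in Fig. 5b
# (rater names and diagnoses are illustrative placeholders, not study data).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Each rater's diagnosis per case, encoded as disease labels.
ratings = {
    "IOMIDS_text_smartphone": ["cataract", "keratitis", "myopia", "pterygium"],
    "expert_1": ["cataract", "keratitis", "myopia", "pterygium"],
    "trainee_1": ["cataract", "conjunctivitis", "myopia", "pterygium"],
}

for a, b in combinations(ratings, 2):
    kappa = cohen_kappa_score(ratings[a], ratings[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")
```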

Evaluation of user satisfaction and response quality

The IOMIDS responses not only contained diagnostic and triage results but also provided guidance on prevention, treatment, care, and follow-up (Fig. 5c). We first analyzed both researcher and patient satisfaction with these responses. Satisfaction was evaluated by researchers during the model development and clinical trial phases, and by patients during the clinical trial phase regardless of the data collection method. Researchers gave significantly higher satisfaction scores (4.63 ± 0.92) than patients (3.99 ± 1.46; t-test, P < 0.0001; Fig. 5c). Patient ratings did not differ between researcher-collected (3.98 ± 1.45) and self-entered data (4.02 ± 1.49; t-test, P = 0.3996). Researchers most often rated chatbot responses as "very satisfied" (82.5%), whereas patient ratings varied: 20.2% of patients rated responses as "not satisfied" (11.7%) or "slightly satisfied" (8.5%), while 61.9% rated them "very satisfied". Further demographic analysis revealed that patients giving low ratings (45.7 ± 23.8 years) were significantly older than those rating responses "very satisfied" (37.8 ± 24.4 years; t-test, P < 0.0001), indicating greater acceptance and more positive evaluation of AI chatbots among younger individuals.

Next, we evaluated the response quality of the multimodal models and the text model (Fig. 5d). The multimodal models exhibited significantly higher overall information quality (4.06 ± 0.12 vs. 3.82 ± 0.14; t-test, P = 0.0031) and better understandability (78.2% ± 1.3% vs. 71.1% ± 0.7%; t-test, P < 0.0001) than the text model. Additionally, the multimodal models showed significantly lower misinformation scores (1.02 ± 0.05 vs. 1.23 ± 0.11; t-test, P = 0.0003). Notably, the empathy score decreased significantly in the multimodal models compared to the text model (3.51 ± 0.63 vs. 4.01 ± 0.56; t-test, P < 0.0001), indicating lower empathy in chatbot responses from the multimodal models. There were no significant differences in grade level (readability), with both the text model and the multimodal models suitable for users at a grade 3 literacy level. These findings suggest that the multimodal models generate high-quality chatbot responses with good readability. Future studies may focus on enhancing the empathy of these multimodal models to better suit clinical applications.

Discussion

The Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) is designed to diagnose ophthalmic diseases using multimodal information and to provide comprehensive medical advice, including subspecialty triage, prevention, treatment, follow-up, and care. During development, we created four models: a text-based unimodal model, which is an embodied conversational agent integrated with ChatGPT, and three multimodal models that combine medical history information from interactive conversations with eye images for a more thorough analysis. In clinical evaluations, the multimodal models significantly improved diagnostic performance over the text model for anterior segment diseases such as cataract, keratitis, and pterygium in patients aged 45 and older. Previous studies have also demonstrated the strength of multimodal models over unimodal models, showing that a multimodal model outperformed an image-only model in identifying pulmonary diseases and predicting adverse clinical outcomes in COVID-19 patients19. Thus, multimodal models are more suitable for analyzing medical information than unimodal models.

Notably, the text + smartphone model in the IOMIDS system demonstrated the highest diagnostic accuracy, outperforming current multimodal LLMs like GPT4.0 and Qwen. However, while this model approaches trainee-level performance, it still falls short of matching the accuracy of expert ophthalmologists. GPT4.0 itself achieved accuracy only slightly higher than junior doctors. Previous studies have similarly indicated that while LLMs show promise in supporting ophthalmic diagnosis and education, they lack the nuanced precision of trained specialists, particularly in complex cases20. For instance, Shemer et al. tested ChatGPT’s diagnostic accuracy in a clinical setting and found it lower than that of ophthalmology residents and attending physicians21. Nonetheless, it completed diagnostic tasks significantly faster than human evaluators, highlighting its potential as an efficient adjunct tool. Future research should focus on refining intelligent diagnostic models for challenging and complex cases, with iterative improvements aimed at enhancing diagnostic accuracy and clinical relevance.

Interestingly, the text + smartphone model outperformed the text + slit-lamp model in diagnosing cataract, keratitis, and pterygium in patients under 45 years old. Even though previous studies have shown significant progress in detecting these conditions using smartphone photographs22,23,24, there is little prior evidence to explain why the text + slit-lamp model would be less efficient than the text + smartphone model. To address this question, we first thoroughly reviewed the class activation maps of both models. We found that the slit-lamp model often focused on the region of conjunctival hyperemia rather than the corneal lesion area in keratitis cases, leading to more false-positive diagnoses. This mismatch between model-identified areas of interest and clinical lesions suggests a flaw in our slit-lamp image analysis12. Additionally, we analyzed the imaging characteristics of the training datasets (Dataset B and Dataset C) for the image-based diagnostic models. Dataset B exhibited a large proportion of the conjunctival region in its images, particularly in keratitis cases, which often displayed extensive conjunctival redness. Conversely, Dataset C, comprising smartphone images, showed a smaller proportion of the conjunctival region, which helped reduce the bias toward conjunctival hyperemia in keratitis cases. Overall, refining the anterior segment image dataset may enhance the diagnostic accuracy of the text + slit-lamp model.

Notably, the text + smartphone model has demonstrated advantages in diagnosing orbital diseases, even though it was not specifically trained for these conditions. These findings highlight the need and potential for further enhancement of the text + smartphone model in diagnosing both orbital and eyelid diseases. Additionally, there was no significant difference in diagnostic capabilities for retinal diseases between the text + slit-lamp model, the text + smartphone model, or the text + slit-lamp + smartphone model compared to the text-only model, which aligns with our expectations. This suggests that appearance images may not significantly contribute to diagnosing retinal diseases, consistent with clinical practice. Several studies have successfully developed deep learning models for accurately detecting retinal diseases using fundus photos25, optical coherence tomography (OCT) images26, and other eye-related images. Therefore, IOMIDS could benefit from functionalities to upload retinal examination images and enable comprehensive diagnosis, thereby improving the efficiency of diagnosing retinal diseases. Furthermore, we found that relying solely on medical histories from the outpatient electronic system was insufficient for the text model to achieve accurate diagnoses. This suggests that IOMIDS may gather clinically relevant information that doctors often overlook or fail to record in electronic systems. Thus, future system upgrades could involve aiding doctors in conducting preliminary interviews and compiling initial medical histories to reduce their workload.

Regarding subspecialty triage, consistent with prior research, the text model demonstrates markedly superior accuracy in triage compared to diagnosis27. Additionally, we observed an intriguing phenomenon: triage accuracy is influenced not by diagnostic accuracy but by the data collector. Specifically, patients’ self-input data resulted in significantly improved triage accuracy compared to data input by researchers. Upon careful analysis of the differences, we found that patient-entered data tends to be more conversational, whereas researcher-entered data tends to use medical terminology and concise expressions. A prior randomized controlled trial (RCT) investigated how different social roles of chatbots influence the chatbot-user relationship, and results suggested that adjusting chatbot roles can enhance users’ intentions toward the chatbot28. However, no RCT study is available to investigate how users’ language styles influence chatbot results. Based on our study, we propose that if IOMIDS is implemented in home or community settings without researcher involvement, everyday conversational language in self-reports does not necessarily impair its performance. Therefore, IOMIDS may serve as a decision support system for self-triage to enhance healthcare efficiency and provide cost-effectiveness benefits.

Several areas for improvement were identified in our study. First, due to sample size limitations during the model development phase, we were unable to develop a combined image model for slit-lamp and smartphone images. Instead, we integrated the results of the slit-lamp and smartphone models using logical operations, which led to suboptimal performance of the text + slit-lamp + smartphone model. In fact, previous studies involving multiple image modalities have achieved better results29. Therefore, it will be necessary to develop a dedicated multimodal model for slit-lamp and smartphone images in future work. Second, the multimodal models showed lower empathy than the text model, possibly because their more objective diagnosis prompts contrast with conversational styles. Future upgrades will adjust the multimodal models' analysis prompts to enhance empathy in chatbot responses. Third, older users reported lower satisfaction with IOMIDS, highlighting the need for improved human-computer interaction for this demographic. Fourth, leveraging ChatGPT's robust language capabilities and medical knowledge, we used prompt engineering to optimize for parameter efficiency, cost-effectiveness, and speed in clinical experiments. However, due to limitations in OpenAI's medical capabilities, particularly its inability to pass the Chinese medical licensing exam30, we aim to develop our own large language model based on real Chinese clinical dialogs. This model is expected to enhance diagnostic accuracy and adapt to evolving medical consensus. In addition, we used GPT3.5 instead of GPT4.0 due to token usage constraints. Since GPT4.0 has shown superior responses to ophthalmology-related queries in recent studies31, integrating GPT4.0 into IOMIDS may enhance its clinical performance. It is also worth noting that our results may not be applicable to other language environments. Previous studies have shown that GPT responds differently to prompts in various languages, with English appearing to yield better results32. There were also biases in linguistic evaluation, as expert assessments can vary with language habits, semantic comprehension, and cultural values. Finally, our study represents an early clinical evaluation, and comparative prospective evaluations are necessary before implementing IOMIDS in clinical practice.

Material and methods

Ethics approval

The study was approved by the Institutional Review Board of Fudan Eye & ENT Hospital, the Institutional Review Board of the Affiliated Eye Hospital of Nanjing Medical University, and the Institutional Review Board of Suqian First Hospital. The study was registered on ClinicalTrials.gov (NCT05930444) on June 26, 2023. It was conducted in accordance with the Declaration of Helsinki, with all participants providing written informed consent before their participation.

Study design

This study aims to develop an Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) for diagnosing and triaging ophthalmic diseases (Supplementary Fig. 1). IOMIDS includes four models: a unimodal text model, which is an embodied conversational agent built with ChatGPT; a text + slit-lamp multimodal model, which incorporates both text and eye images captured by slit-lamp equipment; a text + smartphone multimodal model, which uses text along with eye images captured by a smartphone for diagnosis and triage; and a text + slit-lamp + smartphone multimodal model, which combines both image modalities with text to reach a final diagnosis. Clinical validation of the models' performance is conducted through a two-stage cross-sectional study, an initial silent evaluation stage followed by an early clinical evaluation stage, as detailed in the protocol article33. Triage covers 10 ophthalmic subspecialties: general outpatient clinic, optometry, strabismus, cornea, cataract, glaucoma, retina, neuro-ophthalmology, orbit, and emergency. Diagnosis involves 50 common ophthalmic diseases (Supplementary Data 4), categorized by lesion location into anterior segment diseases, fundus and optic nerve diseases, intraorbital diseases and emergencies, eyelid diseases, and visual disorders. Notably, because image diagnostic training was conducted for cataract, keratitis, and pterygium during the development of the multimodal models, anterior segment diseases are further classified into two categories: primary anterior segment diseases (cataract, keratitis, and pterygium) and other anterior segment diseases.

Collecting and formatting doctor-patient communication dialogs

Doctor-patient communication dialogs were collected from the outpatient clinics of Fudan Eye & ENT Hospital, covering the predetermined 50 disease types across 10 subspecialties. After collection, each dialog underwent curation and formatting. Curation involved removing filler words and irrelevant redundant content (e.g., payment methods). Formatting involved structuring each dialog into four standardized parts: (1) chief complaint; (2) the doctor's series of questions; (3) the patient's responses to each question; and (4) the doctor's diagnosis, triage judgment, and information on prevention, treatment, care, and follow-up. Subsequently, researchers (each with at least 3 years of clinical experience as attending physicians) carefully reviewed the dialogs for each disease, selecting those in which the doctor's questions were focused on the chief complaint, demonstrated medical reasoning, and contributed to diagnosis and triage. After this review, 90 out of 450 dialogs were selected for prompt engineering to train the text model for IOMIDS. Three researchers independently evaluated the dialogs, resulting in three sets of 90 dialogs. To assess the performance of the models trained with these different sets, we created two virtual cases for each of the five most prevalent diseases across 10 subspecialties at our research institution, totaling 100 cases. The diagnostic accuracy for each set of prompts was 73%, 68%, and 52%, respectively. Ultimately, the first set of 90 dialogs (Supplementary Data 1) was chosen as the final set of prompts.
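
For concreteness, a single curated dialog could be stored in this four-part structure; the field names and wording below are hypothetical and are not taken from Supplementary Data 1.

```python
# Illustrative example of the four-part structure used for each curated dialog
# (field names and wording are placeholders, not actual Supplementary Data 1 entries).
dialog_entry = {
    "chief_complaint": "My right eye has been red for three days.",
    "doctor_questions": [
        "Do you have eye pain or only redness?",
        "Is there any discharge, and what does it look like?",
        "Have you worn contact lenses recently or injured the eye?",
    ],
    "patient_responses": [
        "Mild foreign-body sensation, no severe pain.",
        "Some watery discharge in the morning.",
        "No contact lenses and no injury.",
    ],
    "doctor_conclusion": {
        "diagnosis": "acute conjunctivitis",
        "triage": "general outpatient clinic",
        "advice": "Keep the eye clean, avoid rubbing, and return if vision worsens.",
    },
}
```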

Developing a dynamic prompt system for IOMIDS

To build the IOMIDS system, we designed a dynamic prompt system that enhances ChatGPT’s role in patient consultations by integrating both textual and image data. This system supports diagnosis and triage based on either single-modal (text) or multi-modal (text and image) information. The overall process is illustrated in Fig. 1b, with a detailed explanation provided below:

The system is grounded in a medical inquiry prompt corpus, developed by organizing 90 real-world clinical dialogs into a structured format. Each interview consists of four segments: “Patient’s Chief Complaint,” “Inquiry Questions,” “Patient’s Responses,” and “Diagnosis and Consultation Recommendations.” These clinical interviews are transformed into structured inquiry units, known as “prompts” within the system. When a patient inputs their primary complaint, the system’s Chief Complaint Classifier identifies relevant keywords and matches them with corresponding prompts from the corpus. These selected prompts, along with the patient’s initial complaint, form a question prompt that guides ChatGPT in asking about the relevant medical history related to the chief complaint.
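As a rough illustration of this matching step, the sketch below pairs keywords with prompt units and assembles a question prompt; the keyword list, corpus contents, and prompt wording are assumptions, not the system’s actual implementation.

```python
# Hypothetical keyword-to-prompt mapping; the real corpus entries come from the 90 curated dialogs.
CORPUS = {
    "blurred vision": "Inquiry unit covering onset, laterality, pain, and glare for blurred vision ...",
    "red eye": "Inquiry unit covering discharge, itching, contact lens use, and trauma for red eye ...",
    "eye pain": "Inquiry unit covering severity, headache, nausea, and vision change for eye pain ...",
}

def select_prompts(chief_complaint: str) -> list[str]:
    """Return every prompt unit whose keyword appears in the patient's chief complaint."""
    text = chief_complaint.lower()
    return [unit for keyword, unit in CORPUS.items() if keyword in text]

def build_question_prompt(chief_complaint: str) -> str:
    """Combine the matched prompt units with the complaint to guide ChatGPT's history-taking."""
    selected = "\n".join(select_prompts(chief_complaint))
    return (
        f"Patient's chief complaint: {chief_complaint}\n"
        f"Reference inquiry units:\n{selected}\n"
        "Ask focused questions about the medical history relevant to this complaint."
    )
```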

After gathering the responses to these inquiries, an analysis prompt is generated. This prompt directs ChatGPT to perform a preliminary diagnosis and triage based on the conversation history. The analysis prompt includes the question prompts from the previous stage, along with all questions and answers exchanged during the consultation. If no appearance-related or anterior segment images are provided by the patient, the system uses only this analysis prompt to generate diagnosis and triage recommendations, which are then communicated back to the patient as the final output of the text model.

For cases that involve multi-modal information, we developed an additional diagnosis prompt. This prompt expands on the previous analysis prompt by incorporating key patient information—such as gender, age, and preliminary diagnosis/triage decisions—alongside diagnostic data obtained from slit-lamp or smartphone images. By combining image data with textual information, ChatGPT is able to provide more comprehensive medical recommendations, including diagnosis, triage, and additional advice based on both modalities.

It is important to note that in a single consultation, the question prompt, analysis prompt, and diagnosis prompt are not independent; rather, they are interconnected and progressive. The question prompt is part of the analysis prompt, and the analysis prompt is integrated into the diagnosis prompt.
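A minimal sketch of this progressive structure is shown below, assuming simple string concatenation; the exact wording of the prompts used in IOMIDS is not reproduced here.

```python
def build_analysis_prompt(question_prompt: str, qa_pairs: list[tuple[str, str]]) -> str:
    """The analysis prompt embeds the question prompt plus every question and answer exchanged."""
    history = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return (
        f"{question_prompt}\n\nConsultation history:\n{history}\n"
        "Provide a preliminary diagnosis and subspecialty triage based on this history."
    )

def build_diagnosis_prompt(analysis_prompt: str, sex: str, age: int,
                           preliminary: str, image_findings: str) -> str:
    """The diagnosis prompt embeds the analysis prompt plus key patient information and image results."""
    return (
        f"{analysis_prompt}\n\nPatient: {sex}, {age} years old.\n"
        f"Preliminary text-based impression: {preliminary}\n"
        f"Image diagnostic data: {image_findings}\n"
        "Integrate all of the above and give the final diagnosis, triage, and advice."
    )
```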

Collecting and ground-truth labeling of images

Image diagnostic data are crucial for the diagnosis prompts, and obtaining these data required developing an image-based diagnostic model. Because eye photographs are commonly captured in clinical settings with either a slit lamp or a smartphone, both slit-lamp and smartphone images were collected for model development. These images cover the diseases identified as requiring image diagnosis (specifically cataract, keratitis, and pterygium) through the in silico evaluation of the text model (detailed below). For patients with different diagnoses in each eye (e.g., keratitis in one eye and dry eye in the other), each eye was included as an independent data entry. Additionally, slit-lamp and smartphone images of the five most prevalent diseases in each subspecialty were collected and categorized as “others” for training, validation, and testing of the image diagnostic model.

Slit-lamp images had to meet the following criteria: (1) images were taken using the slit lamp’s diffuse light with no overexposure; (2) both the inner and outer canthi were visible; (3) the file size was at least 1 MB, with a resolution of no less than 72 pixels/inch. Smartphone images had to meet the following conditions: (1) the eye of interest was naturally open; (2) images were captured under indoor lighting with no overexposure; (3) the shooting distance was within 1 meter, with the focus on the eye region; (4) the file size was at least 1 MB, with a resolution of no less than 72 pixels/inch. Images not meeting these requirements were excluded from the study. Four specialists independently labeled each image into one of four categories (cataract, keratitis, pterygium, and others) based solely on the image. Consensus was reached when three or more specialists agreed on the same diagnosis; images for which agreement could not be reached by at least two specialists were excluded.
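The file-level criteria and the consensus rule lend themselves to simple automation; the sketch below is one plausible implementation (criteria such as canthi visibility and exposure still require visual review), assuming resolution is read from image DPI metadata.

```python
from collections import Counter
from pathlib import Path
from PIL import Image

MIN_BYTES = 1 * 1024 * 1024   # file size of at least 1 MB
MIN_DPI = 72                  # resolution of no less than 72 pixels/inch

def passes_file_checks(path: str) -> bool:
    """Check the shared file-level criteria (size and resolution metadata)."""
    if Path(path).stat().st_size < MIN_BYTES:
        return False
    with Image.open(path) as img:
        dpi = img.info.get("dpi", (0, 0))[0]  # DPI metadata may be absent in some files
    return dpi >= MIN_DPI

def consensus_label(labels: list[str]) -> str | None:
    """Return the diagnosis agreed on by three or more of the four specialists, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 3 else None
```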

Developing image-based ophthalmic classification models for multimodal models

We developed two distinct deep learning algorithms using ResNet-50 to process images captured by slit-lamp and smartphone cameras. The first algorithm was designed to detect cataract, keratitis, and pterygium in an anterior segment image dataset (Dataset B) obtained under a slit lamp. The second algorithm targeted the detection of these conditions in a dataset (Dataset C) consisting of single-eye regions extracted from smartphone images. To address the challenge of non-eye facial regions in smartphone images, we collected an additional 200 images and annotated the eye areas within them to train and validate an eye-target detection model based on YOLOv7. Of these images, 80% were randomly assigned to the training set, and the model was trained for 300 epochs using the default learning rate and preprocessing settings specified in the YOLOv7 repository (https://github.com/WongKinYiu/yolov7). The remaining 20% of the images served as a validation set, on which the model achieved a precision of 1.0, a recall of 0.98, and an mAP@0.5 of 0.991. These images were not reused in any other experiments.
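Downstream of the detector, single-eye regions were cropped from the smartphone photographs; the sketch below assumes YOLO-style pixel-coordinate boxes with confidences, since the exact inference interface used in the study is not described.

```python
from PIL import Image

def crop_eye_regions(image_path: str,
                     detections: list[tuple[float, float, float, float, float]],
                     conf_threshold: float = 0.5) -> list[Image.Image]:
    """Crop single-eye regions from a smartphone photograph.

    `detections` is assumed to be a list of (x1, y1, x2, y2, confidence) boxes in pixel
    coordinates produced by the trained YOLOv7 eye detector; the confidence cut-off is illustrative.
    """
    image = Image.open(image_path).convert("RGB")
    return [
        image.crop((int(x1), int(y1), int(x2), int(y2)))
        for x1, y1, x2, y2, conf in detections
        if conf >= conf_threshold
    ]
```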

For the disease classification network, we created a four-class dataset consisting of cataract, keratitis, pterygium, and “Other” categories. This dataset includes both anterior segment images and smartphone-captured eye images across all categories. The “Other” class comprises normal eye images and images of various other eye conditions. We fine-tuned a ResNet-50 model pretrained on ImageNet on this four-class dataset for 200 epochs to optimize classification accuracy across both modalities.
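A minimal sketch of this classifier setup with torchvision is shown below; the optimizer and learning rate are assumptions, as the paper does not report them.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # cataract, keratitis, pterygium, other

# ResNet-50 pretrained on ImageNet, with the final fully connected layer replaced.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # hyperparameters assumed
# The model is then fine-tuned for 200 epochs on the combined slit-lamp and smartphone dataset
# (training loop omitted).
```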

During training, each image was resized to 224 × 224 pixels and underwent data augmentation to enhance generalization, including random horizontal flipping with a probability of 0.2, random rotations between −5 and 5 degrees, and automatic contrast adjustment with a probability of 0.2. White balance adjustments were also applied to standardize the images. For validation and testing, images were resized to 224 × 224 pixels, underwent white balance adjustment, and were then input into the model for disease prediction. To improve model robustness and minimize overfitting, we employed five-fold cross-validation: the dataset was divided into five equal parts (20% each), with four parts used for training and one for validation in each fold. The final model was selected based on the highest validation accuracy, without specific constraints on sensitivity or specificity for individual models.
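The augmentation pipeline described above maps directly onto standard torchvision transforms, as in the sketch below; the white balance step is a custom operation and is only indicated by a comment.

```python
from torchvision import transforms
from sklearn.model_selection import KFold

# Training-time augmentation; white balance adjustment would be applied as a separate custom step.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.2),
    transforms.RandomRotation(degrees=5),   # random rotation between -5 and 5 degrees
    transforms.RandomAutocontrast(p=0.2),
    transforms.ToTensor(),
])

# Validation/testing: resize (plus white balance) only, no random augmentation.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Five-fold cross-validation: four parts for training, one for validation in each fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
```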

Generating image diagnostic data for multimodal models

Before being input into the diagnosis prompt, image diagnostic data underwent preprocessing. Preliminary experiments revealed that when the data indicated a single diagnosis, the multimodal model might overlook patient demographics and medical history, leading to a direct image-based diagnosis. To address this, we adjusted the image diagnostic results by excluding specific diagnoses.

Specifically, we modified the classification model by removing the final softmax layer and using the scores from the fully connected (fc) layer as outputs for each category. These scores were rescaled to the range [−1, 1] and served as the continuous diagnostic output of the image model for each category. We also collected additional datasets, Dataset D (slit-lamp images) and Dataset E (smartphone images), to ensure the independence of the training, validation, and testing sets. The model was then run on Datasets D and E, generating scores for all four categories across all images, with the specialists’ image-based diagnosis serving as the gold standard. Receiver operating characteristic (ROC) curves were plotted for cataract, keratitis, and pterygium to determine the optimal thresholds that maximized specificity for each disease. These thresholds defined the output rule of the image diagnostic module. For example, if only the cataract score exceeded its threshold among the three diseases, the module’s output label would be “cataract”.
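The paper does not specify the rescaling function or the threshold values, so the sketch below assumes per-image min-max rescaling and placeholder thresholds to illustrate how the module turns fc-layer scores into output labels and excluded diagnoses.

```python
import numpy as np

CLASS_INDEX = {"cataract": 0, "keratitis": 1, "pterygium": 2}        # index 3 is "other"
THRESHOLDS = {"cataract": 0.3, "keratitis": 0.4, "pterygium": 0.2}   # placeholder ROC-derived cut-offs

def rescale_scores(fc_scores: np.ndarray) -> np.ndarray:
    """Rescale raw fc-layer scores to [-1, 1]; per-image min-max scaling is assumed here."""
    lo, hi = fc_scores.min(), fc_scores.max()
    return 2.0 * (fc_scores - lo) / (hi - lo) - 1.0

def image_module_output(fc_scores: np.ndarray) -> tuple[list[str], set[str]]:
    """Return the labels exceeding their thresholds and the diagnoses treated as excluded."""
    scaled = rescale_scores(fc_scores)
    positive = [d for d, i in CLASS_INDEX.items() if scaled[i] > THRESHOLDS[d]]
    excluded = {d for d, i in CLASS_INDEX.items() if scaled[i] <= THRESHOLDS[d]}
    return positive, excluded
```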

To further develop a text + slit-lamp + smartphone model, we collected Dataset F, which includes both slit-lamp and smartphone images for each individual. We tested two methods to combine the results from the slit-lamp and smartphone images. The first method used the union of diagnoses excluded by each model, while the second method used the intersection. For example, if the slit-lamp image excluded cataracts and the smartphone image excluded both cataracts and keratitis, the first method would exclude cataracts and keratitis, while the second method would exclude only cataracts. These “excluded diagnoses” were then combined with the user’s analysis prompt, key patient information, and preliminary diagnosis to construct the diagnosis prompt, as shown in Fig. 1c. This diagnosis prompt was then sent to ChatGPT, allowing it to generate the final multimodal diagnostic result by integrating both image-based and contextual data.
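The two combination rules reduce to a set union and a set intersection of the excluded diagnoses, as in the brief sketch below, which reproduces the example from the text.

```python
def combine_exclusions(slitlamp_excluded: set[str], smartphone_excluded: set[str],
                       method: str = "union") -> set[str]:
    """Combine the diagnoses excluded by the slit-lamp and smartphone image models."""
    if method == "union":
        return slitlamp_excluded | smartphone_excluded       # first method
    return slitlamp_excluded & smartphone_excluded           # second method ("intersection")

combine_exclusions({"cataract"}, {"cataract", "keratitis"}, "union")         # {'cataract', 'keratitis'}
combine_exclusions({"cataract"}, {"cataract", "keratitis"}, "intersection")  # {'cataract'}
```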

In silico evaluation of text model

After developing the chief complaint classifiers, question prompt templates, and analysis prompt templates, we integrated these functionalities into a text model and conducted an in silico evaluation using virtual cases (Dataset A). These cases consist of simulated patient data derived from outpatient records. To ensure that the cohort’s characteristics were representative of real-world clinical settings, we determined the total number of cases per subspecialty based on outpatient volumes over the past 3 years. Cases were randomly selected across subspecialties to cover the predefined set of 50 disease types (Supplementary Data 4), which were chosen based on their prevalence in each subspecialty and their diagnostic feasibility without sole reliance on physical examination. Our goal was to gather approximately 100 cases for the general outpatient clinic and the cornea subspecialty, and approximately 50 cases for the other subspecialties.

During the evaluation, researchers entered each case as a new session, ensuring that chatbot responses were generated solely from that specific case, without any prior input from other cases. Our primary objective was to achieve a sensitivity of ≥ 90% and a specificity of ≥ 95% for diagnosing the common disease types that ranked in the top three by outpatient volume over the past three years within each subspecialty. These disease types include dry eye syndrome, allergic conjunctivitis, conjunctivitis, myopia, refractive error, visual fatigue, strabismus, keratitis, pterygium, cataract, glaucoma, vitreous opacity, retinal detachment, ptosis, thyroid eye disease, eyelid mass, chemical ocular trauma, and other eye injuries. For disease types that failed to meet these performance thresholds but could potentially be diagnosed through imaging, we planned to develop an image-based diagnostic system. Regarding triage outcomes, our secondary goal was to achieve a positive predictive value of ≥ 90% and a negative predictive value of ≥ 95%. Because predictive values are strongly influenced by prevalence, diseases with a prevalence below the 5th percentile threshold were excluded from the secondary outcome analysis.

Silent evaluation and early clinical evaluation of IOMIDS

After the text model and the two multimodal models were developed, all three were integrated into IOMIDS and installed on two iPhone 13 Pro devices for a two-stage cross-sectional study comprising silent evaluation and early clinical evaluation. During the silent evaluation, researchers collected patient gender, age, chief complaint, medical history inquiries, slit-lamp images, and smartphone images without disrupting clinical activities (Dataset G). If researchers encountered a chatbot query whose answer could not be found in the electronic medical records, the patient was followed up with a telephone interview on the same day as the clinical visit. Based on sample size calculations33, we aimed to collect 25 cases each for cataract, keratitis, and pterygium, along with 25 cases randomly selected from other diseases. After data collection, we analyzed the data; if they did not meet predefined standards, further sample expansion was considered. These standards were as follows: the primary outcome aimed for a sensitivity of ≥ 85% and a specificity of ≥ 95% for diagnosing cataract, keratitis, and pterygium; the secondary outcome aimed for a positive predictive value of ≥ 85% and a negative predictive value of ≥ 95% for subspecialty triage after excluding diseases with a prevalence below the 5th percentile threshold. To further investigate whether the completeness of medical histories influenced the diagnostic and triage performance of the text model, we randomly selected about half of the cases from Dataset G and re-entered the doctor-patient dialogs; chatbot queries that would otherwise have been answered through telephone interviews were uniformly marked as “no information available”. The changes in diagnostic and triage accuracy were then analyzed.

The early clinical evaluation was conducted at Fudan Eye & ENT Hospital for internal validation and at the Affiliated Eye Hospital of Nanjing Medical University and Suqian First Hospital for external validation. Both validations included two settings: data collection by researchers and self-completion by patients. Data collection by researchers was conducted at the internal center from July 21 to August 20, 2023, and at the external centers from August 21 to October 31, 2023. Self-completion by patients took place at the internal center from November 10, 2023, to January 10, 2024, and at the external centers from January 20 to March 10, 2024. During the patient-entered data stage, researchers guided users (patients, parents, and caregivers) through the entire process, including selecting an appropriate testing environment, accurately entering demographic information and chief complaints, providing detailed responses to chatbot queries, and obtaining high-quality eye photos. Notably, when smartphone images are collected with the IOMIDS system, the system provides guidance throughout image capture, issuing notifications for problems such as excessive distance, overexposure, improper focus, misalignment, or a closed eye (Supplementary Fig. 8). In both the researcher-collected and patient-entered data phases, the primary goal was to achieve a sensitivity of ≥ 75% and a specificity of ≥ 95% for diagnosing ophthalmic diseases, excluding those with a prevalence below the 5th percentile threshold. The secondary goal was to achieve a positive predictive value of ≥ 80% and a negative predictive value of ≥ 95% for subspecialty triage.

Comparison of inter-model and model-expert agreement

Five expert ophthalmologists with professorial titles, three ophthalmology trainees (residents), and two junior doctors without specialized training participated in the study. The expert ophthalmologists reviewed all cases, and when at least three experts reached a consensus, their diagnostic results were considered the gold standard. The trainees and junior doctors were involved solely in the clinical evaluation phase for Datasets 4–5 and 9–10, independently providing diagnostic results for each case. Additionally, GPT-4.0 and Qwen, a Chinese large language model, both with image-text diagnostic capabilities, were used to generate diagnostic results for the same cases. The diagnostic accuracy and consistency of these large language models were then compared with those of ophthalmologists at different experience levels.

Rating for user satisfaction and response quality

User satisfaction with the entire human-computer interaction experience was evaluated by both patients and researchers using a 1–5 scale (not satisfied, slightly satisfied, moderately satisfied, satisfied, and very satisfied) during the early clinical evaluation stage. Neither the patients nor the researchers were aware of the correctness of the output when assessing satisfaction. Furthermore, 50 chatbot final responses were randomly selected from all datasets generated during the silent evaluation and early clinical evaluation. Three independent researchers, blinded to the model types and reference standards, assessed the quality of these responses across six aspects. Overall information quality was assessed with DISCERN (rated from 1 = low to 5 = high). Understandability and actionability were evaluated with the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), scored from 0–100%. Misinformation was rated on a five-point Likert scale (from 1 = none to 5 = high). Empathy was rated on a five-point scale (not empathetic, slightly empathetic, moderately empathetic, empathetic, and very empathetic). Readability was analyzed using the Chinese Readability Index Explorer (CRIE; http://www.chinesereadability.net/CRIE/?LANG=CHT), which assigns scores corresponding to grade levels: 1–6 points for elementary school (grades 1–6), 7 points for middle school, and 8 points for high school.

Statistical analysis

All data analyses were conducted using Stata/BE (version 17.0). Continuous and ordinal variables were expressed as mean ± standard deviation. Categorical variables were presented as frequency (percentage). For baseline characteristics comparison, a two-sided t-test was used for continuous and ordinal variables, and the Chi-square test or Fisher’s exact test was employed for categorical variables, as appropriate. Diagnosis and triage performance metrics, including sensitivity, specificity, accuracy, positive predictive values, negative predictive values, Youden index, and prevalence, were calculated for each dataset using the one-vs.-rest strategy. Diseases with a prevalence below the 5th percentile threshold were excluded from subspecialty parameter calculations. Reference standards and the correctness of the IOMIDS diagnosis and triage were established according to the protocol article33.
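For reference, the one-vs.-rest metrics for a single disease can be computed as in the sketch below (edge cases with zero denominators are not handled); this is an illustration, not the Stata code used in the study.

```python
import numpy as np

def one_vs_rest_metrics(y_true: list[str], y_pred: list[str], disease: str) -> dict:
    """One-vs.-rest diagnostic metrics for a single disease within one dataset."""
    t = np.array([y == disease for y in y_true])   # reference standard: disease vs. rest
    p = np.array([y == disease for y in y_pred])   # model output: disease vs. rest
    tp, fp = int(np.sum(t & p)), int(np.sum(~t & p))
    fn, tn = int(np.sum(t & ~p)), int(np.sum(~t & ~p))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "youden": sensitivity + specificity - 1,
        "prevalence": (tp + fn) / len(t),
        "accuracy": (tp + tn) / len(t),
    }
```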

Specifically, overall diagnostic accuracy was defined as the proportion of correctly diagnosed cases out of the total cases in each dataset, while overall triage accuracy was the proportion of correctly triaged cases out of the total cases. Similarly, disease-specific diagnostic and triage accuracies were calculated as the proportion of correctly diagnosed or triaged cases per disease. Diseases were categorized into six classifications based on lesion location: primary anterior segment diseases (cataract, keratitis, pterygium), other anterior segment diseases, fundus and optic nerve diseases, intraorbital diseases and emergencies, eyelid diseases, and visual disorders. Diagnostic and triage accuracies were calculated for each category. Fisher’s exact test was used to compare accuracies of different models within each category.

Univariate logistic regression was performed to identify potential factors influencing diagnostic and triage accuracy. Factors with P < 0.05 were further analyzed using multivariate logistic regression, with odds ratios (OR) and 95% confidence intervals (CI) calculated. Subgroup analyses were conducted for significant factors (e.g., disease classification) identified in the multivariate analysis. Notably, age was dichotomized using the mean age of all patients during the early clinical evaluation stage for inclusion in the logistic regression. Additionally, ROC curves for cataract, keratitis, and pterygium were generated in Dataset D and Dataset E using image-based diagnosis as the gold standard. The area under the ROC curve (AUC) was calculated for each curve. Agreement between answers provided by doctors and LLMs was quantified through calculation of Kappa statistics, interpreted in accordance with McHugh’s recommendations20. During the evaluation of user satisfaction and response quality, a two-sided t-test was used to compare different datasets across various metrics. Quality scores from three independent evaluators were averaged before statistical analysis. In this study, P values of <0.05 were considered statistically significant.
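As an illustration of the agreement analysis, Cohen’s kappa between doctors’ and LLMs’ answers can be computed and banded according to McHugh’s cut-offs, as in the sketch below (again, the study itself used Stata).

```python
from sklearn.metrics import cohen_kappa_score

def doctor_llm_agreement(doctor_labels: list[str], llm_labels: list[str]) -> tuple[float, str]:
    """Cohen's kappa between doctor and LLM answers, interpreted per McHugh's recommendations."""
    kappa = cohen_kappa_score(doctor_labels, llm_labels)
    bands = [(0.20, "none"), (0.39, "minimal"), (0.59, "weak"),
             (0.79, "moderate"), (0.90, "strong")]
    for upper, level in bands:
        if kappa <= upper:
            return kappa, level
    return kappa, "almost perfect"
```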