Abstract
Artificial intelligence (AI) has become a crucial element of modern technology, especially in the healthcare sector. This is apparent in the continuous development of large language models (LLMs), which are used in various domains, including medicine. However, applying LLMs in the medical domain requires an evaluation platform that can determine their suitability and guide future development efforts. To that end, this study develops a comprehensive Multi-Criteria Decision Making (MCDM) approach designed specifically to evaluate medical LLMs. The success of AI, and of LLMs in particular, in healthcare depends on efficacy, safety, and ethical compliance, so a robust evaluation framework is essential for integrating these models into medical contexts. This study proposes the Fuzzy-Weighted Zero-InConsistency (FWZIC) method, extended to the p, q-quasirung orthopair fuzzy set (p, q-QROFS), for weighting the evaluation criteria. By incorporating fuzzy logic principles, the extension handles the uncertainty inherent in medical decision-making and accommodates the imprecise and multifaceted nature of real-world medical data and criteria. The MultiAttributive Ideal-Real Comparative Analysis (MAIRCA) method is then used to assess the medical LLMs in the case study. The results show that the “Medical Relation Extraction” criterion, together with its sub-criteria, received a slightly higher weight (0.504) than “Clinical Concept Extraction” (0.495). Among the six LLM alternatives evaluated, (\(A4\)) “GatorTron S 10B” ranked first, whereas (\(A1\)) “GatorTron 90B” ranked sixth. The implications of this study extend beyond academic discourse, directly affecting healthcare practices and patient outcomes. The proposed framework can help healthcare professionals make more informed decisions about adopting and using LLMs in medical settings.
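As a rough illustration of how a MAIRCA ranking proceeds once criterion weights are available, the sketch below follows the commonly described crisp form of the method: equal prior preference for each alternative, a theoretical rating per criterion, a real rating scaled by min-max position, and a total gap per alternative, where the smallest gap ranks first. It is not the authors' implementation. The two weights are the top-level values reported in the abstract; the alternative names (X, Y, Z), their scores, and the function name `mairca_rank` are hypothetical, and the fuzzy p, q-QROFS weighting step is not shown.

```python
def mairca_rank(matrix, weights, benefit):
    """Crisp MAIRCA sketch: return (total gaps, indices ordered best to worst)."""
    m = len(matrix)                 # number of alternatives
    n = len(weights)                # number of criteria
    prior = 1.0 / m                 # equal preference for every alternative
    columns = list(zip(*matrix))    # per-criterion score columns
    gaps = []
    for row in matrix:
        total_gap = 0.0
        for j in range(n):
            lo, hi = min(columns[j]), max(columns[j])
            theoretical = prior * weights[j]
            if hi == lo:
                share = 1.0         # all alternatives tie on this criterion
            elif benefit[j]:
                share = (row[j] - lo) / (hi - lo)   # higher is better
            else:
                share = (hi - row[j]) / (hi - lo)   # lower is better
            real = theoretical * share
            total_gap += theoretical - real
        gaps.append(total_gap)
    # The alternative with the smallest total gap is closest to the ideal.
    order = sorted(range(m), key=lambda i: gaps[i])
    return gaps, order


# Weights reported in the abstract for the two top-level criteria:
# Medical Relation Extraction (0.504), Clinical Concept Extraction (0.495).
weights = [0.504, 0.495]
benefit = [True, True]              # assumption: both are higher-is-better

# Hypothetical scores for three made-up alternatives (not from the study).
names = ["X", "Y", "Z"]
scores = [
    [0.90, 0.80],
    [0.85, 0.88],
    [0.80, 0.84],
]

gaps, order = mairca_rank(scores, weights, benefit)
for rank, i in enumerate(order, start=1):
    print(rank, names[i], round(gaps[i], 4))
```

With these made-up scores, the alternative that is strong on both criteria ends up with the smallest gap, which is the behaviour the method is meant to capture; the study's actual ranking depends on its own decision matrix and sub-criterion weights.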
Data Availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work was supported by Tenaga Nasional Berhad (TNB) and UNITEN through the BOLD Refresh Postdoctoral Fellowships under the project code of J510050002-IC-6 BOLDREFRESH2025-Centre of Excellence.
Author information
Contributions
A.H. Alamoodi, Omar Zughoul, and Dianese David: Writing - Original Draft Preparation; Salem Garfan and O.S. Albahri: Conceptualization, Methodology; Dragan Pamucar: Conceptualization; A.S. Albahri: Project Administration; Salman Yussof: Manuscript Revision; Iman Mohamad Sharaf: Writing - Review & Editing.
Ethics declarations
Ethical Approval
No ethical approval is required for this study.
Ethics approval and consent to participate
All authors of the manuscript consent to participate.
Competing Interests
The authors declare no competing interests.
About this article
Cite this article
Alamoodi, A.H., Zughoul, O., David, D. et al. A Novel Evaluation Framework for Medical LLMs: Combining Fuzzy Logic and MCDM for Medical Relation and Clinical Concept Extraction. J Med Syst 48, 81 (2024). https://doi.org/10.1007/s10916-024-02090-y
DOI: https://doi.org/10.1007/s10916-024-02090-y