Abstract
The increasing adoption of large language models (LLMs) in healthcare presents both opportunities and challenges. While LLM-powered applications are being utilized for various medical tasks, concerns persist regarding their accuracy and reliability, particularly when not specifically trained on medical data. Using open-source models without proper fine-tuning for medical applications can lead to inaccurate or potentially harmful advice, underscoring the need for domain-specific adaptation. Therefore, this study addresses these issues by developing PharmaLLM, a fine-tuned version of the open-source Llama 2 model, designed to provide accurate medicine prescription information. PharmaLLM incorporates a multi-modal input/output mechanism, supporting both text and speech, to enhance accessibility. The fine-tuning process utilized LoRA (Low-Rank Adaptation) with a rank of 16 for parameter-efficient fine-tuning. The learning rate was maintained at 2e-4 for stable adjustments, and a batch size of 12 was chosen to balance computational efficiency and learning effectiveness. The system demonstrated strong performance metrics, achieving 87% accuracy, 92.16% F1 score, 94% sensitivity, 66% specificity, and 90% precision. A usability study involving 33 participants was conducted to evaluate the system using the Chatbot Usability Questionnaire, focusing on error handling, response generation, navigation, and personality. Results from the questionnaire indicated that participants found the system easy to navigate and the responses useful and relevant. PharmaLLM aims to facilitate improved patient-physician interactions, particularly in areas with limited healthcare resources and low literacy rates. This research contributes to the advancement of medical informatics by offering a reliable, accessible web-based tool that benefits both patients and healthcare providers.
1 Introduction
Human nature prioritizes ease and comfort in daily tasks, and health is a constant concern. Chatbots enable communication between computers and humans by simulating conversation. A shortage of experts, high costs, and restricted access to resources are some of the issues facing the healthcare sector [1]. With chatbots offering quick, easy, and affordable services, artificial intelligence (AI) has transformed several industries, including healthcare. Among the most notable advances in AI are large language models (LLMs), which can comprehend and produce human-like text [1].
Chatbots are computer programs that use natural language processing and machine learning (ML) to hold textual or spoken conversations that resemble those of a human [2]. These systems are useful for communicating with users, clients, and patients because they can handle a wide range of subjects, vocabulary, and skill levels [2]. Medical chatbots that use LLMs represent a significant advancement in digital health because they can process and interpret natural language inputs with high accuracy and context awareness, making them more effective and versatile. Conventional chatbots, by contrast, frequently rely on prewritten scripts and small databases and therefore cannot sustain dynamic and complex conversations [3].
The emergence of LLMs such as Llama and the Generative Pretrained Transformer-3 (GPT-3) has enabled medical chatbots to comprehend natural language with high accuracy and context awareness, transforming digital health capabilities. LLMs are deep learning models with billions of parameters, trained on massive text corpora and applied through natural language processing (NLP). In 2017, Google introduced the Transformer architecture for machine translation, which went on to achieve state-of-the-art performance in numerous NLP tasks [4]. Numerous transformer-based LLMs have since been created and released, including Llama from Meta AI, the Pathways Language Model (PaLM), GPT-3, Bidirectional Encoder Representations from Transformers (BERT), and GPT-4.
Advances in LLMs and NLP have transformed chatbots, dramatically improving their language comprehension skills [5]. This paper presents the training and evaluation of an LLM-powered medical chatbot. The goal is to shed light on the transformative potential of LLMs in medical chatbots and to pave the way for their broader adoption in improving healthcare delivery and patient outcomes. Our system can process text-to-text, text-to-speech, and speech-to-text.
The rest of the paper is organized as follows: Sect. 2 reviews the literature on medical chatbots, Sect. 3 describes the materials and methods used to train the chatbot, Sect. 4 presents the evaluation results, Sect. 5 discusses our findings, and Sect. 6 concludes the paper.
2 Literature Review
AI chatbots are conversational agents that communicate with users through text, audio, and visual media to replicate human interaction [6]. As one of several AI applications, these virtual conversational agents enable users to communicate with AI-powered computer systems through spoken or textual means, and they have recently been adopted in a variety of industries, including business, retail, and healthcare [7]. Table 1 summarizes some of the existing medical chatbots for handling medical queries; each of these works is briefly described below.
Comendador et al. [8] presented Pharmabot, a pediatric generic medicine consultant chatbot developed to prescribe, recommend, and offer information on generic drugs for children. The system was a stand-alone desktop application implemented in Visual C# with MS Access, using NLP and ML techniques. It could help parents and researchers choose the right generic medicine for patients. Future work aimed to expand the chatbot's capabilities and convert it into a web-based application.
Following these initial steps, another medical chatbot was introduced by Dharwadkar et al. [9]. Their chatbot used ML, NLP, and the Google API for voice-to-text and text-to-voice conversion. A Support Vector Machine (SVM) was applied to disease symptoms to predict diseases, and the related answers were displayed in an Android app. The system was trained on multiple datasets, including Cleveland, Hungarian, Switzerland, and Long Beach VA. However, the system focused mainly on heart disease and could therefore handle only a limited range of conditions. In future work, they intended to incorporate voice and face recognition tools to mimic a counselor and interact with patients at a deeper level.
In the following year, Gajendra Prasad K. C. and colleagues [10] proposed a chatbot that combined the Apriori algorithm for general responses with a sequence-to-sequence model for medical assistance, using NLP and ML to predict diseases from symptoms and thereby support self-diagnosis, early detection, and better treatment. They used the Google API for voice-to-text and text-to-voice conversion. However, challenges such as a lack of accurate medical datasets, the time-consuming seq2seq model, and offline APIs hindered their progress. They aimed to improve disease prediction accuracy by adding features and using smart wearables.
As chatbots evolved from menu- or button-based to contextual systems, machine learning and AI techniques were increasingly used to store and process training models. Kandpal and colleagues [11] discussed the workings of such a model, its applications, and related work in the healthcare industry, integrating natural language processing with deep learning. They combined TensorFlow, TFLearn, NLTK (Natural Language Toolkit), and NumPy for healthcare assistance. Their paper covered the packages used, code workflow, data input, training, and output, along with industry applications of chatbots and common challenges. They pointed out limitations of chatbots in terms of complex interfaces, time-consuming scenarios, high installation costs, weak decision-making, and limited memory and processing, and aimed to integrate more languages along with more diverse menus, images, and videos.
Progressing further, Vamsi and colleagues proposed a new method for creating a conversational healthcare agent using deep neural networks [12]. They trained their DNN model on an open-source Kaggle dataset; the system was text-to-text and English-only. The activation function for the input and hidden layers was ReLU, SoftMax was used at the output, and the GUI was built in Tkinter (Python). They tried different optimizers, including SGD, AdaGrad, AdaDelta, RMSProp, and Adam, with SGD giving 100% accuracy. However, limitations included accuracy, a lack of empathy, and privacy concerns, leaving room to explore enhanced methods and more advanced neural networks such as RNNs, deep CNNs, LSTMs, and deep auto-encoders in the future.
In a subsequent development, Shinde et al. presented a chatbot for healthcare systems using artificial intelligence [13]. Their project aimed to save users time by reducing healthcare consultation time. It used N-grams, TF-IDF, and cosine similarity to extract keywords from user queries, improving security and effectiveness through user protection and an improved query interface. SVM was used for classification, the Porter algorithm was applied to discard unwanted words, and a knowledge base stored the question-answer pairs. One limitation was that the system was trained only for primary diseases; future work intended to expand it with more features.
With further advancements, a multilingual healthcare chatbot application was proposed by Badlani et al. [14], using a Multi-Layer Perceptron (MLP) and ML to diagnose diseases from user symptoms and respond to queries using tokenization, stemming, TF-IDF, and cosine similarity. The system supported English, Hindi, and Gujarati, along with speech-to-text and voice communication; conversion was done using libraries such as Googletrans, gTTS, SpeechRecognition, and Playsound. Classification algorithms including Random Forest, K-Nearest Neighbors, SVM, Decision Trees, and Multinomial Naive Bayes (MNB) were compared, and Random Forest was chosen as the core classifier owing to the highest accuracy of 98.43%, a precision of 0.9774, and an F1-score of 0.9781. The main limitation was a lack of data, and future work intended to add more languages, deep learning algorithms, and natural language generation.
Progressing further, Hsu and his team developed a Machine Learning-based Chatbot Framework (MLCF) using Spark cluster technology to create intelligent chatbots [15]. The framework was used to build a Medical and Health Information Platform (MHIP) on the LINE messaging app, supporting Taiwanese Chinese and using NLP. The MLCF uses medical articles with symptoms and etiology as training data, enabling a multi-class machine learning model, and the MHIP could diagnose and recommend treatments based on user interaction. Among the evaluated classifiers, random forest achieved the best prediction accuracy and decision trees performed better than logistic regression, while the YARN cluster mode was found to be more effective than standalone mode. The system, however, supported only Taiwanese Chinese. In future research, they planned to improve Rasa NLU's ability to automatically generate JSON training data and to integrate emotion detection into the chatbot.
Advancing the research, Chakraborty and colleagues discussed the potential of chatbots in the medical sector to combat infectious diseases [16]. They proposed an AI chatbot interaction and prediction model using a deep feed-forward multilayer perceptron, built with TensorFlow for natural language processing and trained on a COVID-19 dataset in JSON format. Among Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), and decision tree models, LSTM achieved the minimum loss of 0.1232 and the highest accuracy of 94.32%. Their study highlighted the functionalities and applications of medical chatbots, including their ease of use and potential for preventing COVID-19, and they aimed to continue improving the system.
Taking this a step further, a pipeline-based architecture for a chatbot application was developed by Rajani et al. [17]. It consisted of three modules working together to generate accurate predictions and was designed for healthcare service bots supporting both natural language conversation and medical diagnosis, with Naïve Bayes used for classification. A Seq2Seq model handled general conversation, while decision trees or recommendation systems handled the medical side; the system was trained on the Cornell Movie-Dialogs Corpus for general conversation and the Disease Symptom Prediction dataset from Kaggle for the medical component. Each module was evaluated individually using metrics such as F1 score and accuracy: the Naïve Bayes classifier achieved 96.67% accuracy and a 97% F1 score, the generic Seq2Seq chatbot module achieved 25% accuracy, and the decision tree classifier reached 97.54% accuracy for disease prediction.
Subsequently, the healthcare chatbot system developed by Dohare and colleagues [18] aimed to diagnose illnesses and provide basic information before the user contacts a doctor. The system could perform health checkups and suggest doctors based on user symptoms, and it allowed users to converse with a user-friendly bot through text or voice. Session histories were saved, allowing users to review interactions and prior medical history. The chatbot offered medical help for common diseases and was usable by anyone able to type in English. However, the paper lacked discussion of the accuracy, training, performance, or efficiency of the system. The authors mentioned recommending nearby specialists and booking appointments as future goals.
As the research evolved, Rao et al. described a medical chatbot trained on a personalized health assistant corpus to provide quick answers to frequently asked medical questions [19]. The solution was based on BERT, a transformer-based model incorporating Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The chatbot answered questions about symptoms, safety measures, medication dosages, and other topics while facilitating text chat with users, and treatment recommendations were made according to the severity of the user's ailment. The system was compared against previous models on different datasets, and the BERT language model outperformed them, attaining 96.5% accuracy, 93% precision, a 93% F1-score, and 94% recall. Only 30 major illnesses could be consulted via the web application, with video consultations planned as an addition.
Continuing this trajectory, the methodology proposed by Arun Babu et al. focused on developing a BERT-based medical chatbot to improve healthcare communication and accessibility [20]. The chatbot addressed traditional challenges such as inaccurate responses to jargon and the inability to offer personalized feedback. It achieved 98% accuracy and precision, along with exceptional disease prediction capability reflected in a 97% AUC-ROC score. A recall of 96% ensured comprehensive coverage, minimizing the risk of overlooking potential diagnoses, and an F1 score of 98% showed its proficiency in delivering accurate and personalized healthcare information. However, challenges included computational demands, potential biases in training data, and the need for continuous learning to adapt to evolving healthcare scenarios.
The literature analysis in Table 1 points out that the development of health chatbots faces several challenges: LLMs have not been fully exploited in these systems, and the underlying models have generally not been fine-tuned on medical-query datasets. Moreover, input-output processing has typically been limited to text-to-text or speech-to-text. The proposed architecture aims to close these gaps by presenting an LLM-based health chatbot with more than two processing modes. In the proposed methodology, GPT-4 is used for data preprocessing, Low-Rank Adaptation (LoRA) for efficient fine-tuning, and Tiny Llama as the underlying model for handling medical queries.
Our contributions center on a methodology that uses an LLM to provide multilingual text-to-text, text-to-speech (TTS), and speech-to-text (STT) processing for a medical chatbot able to respond to users' health-related queries. The contributions are as follows:
-
The suggested architecture develops the chatbot using Tiny Llama in an effort to increase its effectiveness and interactive capabilities.
-
The architecture presented uses LoRA on the Kaggle dataset "EDA | 11,000 Medicines" to fine-tune the model for better performance (EDA is short for Exploratory Data Analysis).
-
The system employs text-to-speech synthesis to produce natural, contextually appropriate spoken output and speech-to-text processing to capture spoken queries accurately, ensuring reliable communication and interaction for medical and health-related queries.
3 Methodology
The developed system is an LLM-based medical framework that allows users to communicate through a user-friendly interface. Being a multilingual, multi-input system, the framework uses ML and NLP techniques such as NLTK, TTS, and STT to handle multilingual text, speech synthesis, and speech recognition. The system was trained on the EDA | 11,000 Medicines dataset from Kaggle, which contains information such as medicine name, composition, uses, side effects, image URLs, manufacturers, and user satisfaction reviews categorized as excellent, average, or poor [21]. The proposed methodology employs Tiny Llama [22], a scaled-down version of Llama 2 with roughly 1.1B parameters, as its LLM. The model was fine-tuned with carefully chosen parameters: the learning rate keeps model modifications gradual and steady, the batch size balances learning efficacy and computational burden, the number of epochs gives the model a consistent amount of time to adapt, and the trainer is run with PEFT (Parameter-Efficient Fine-Tuning). The presented system was developed to answer users' medical and health-related queries. The architecture of the devised methodology and the framework are depicted in Fig. 1 and discussed below.
In the developed framework, the input can be either text or speech, and the system uses LoRA for parameter-efficient fine-tuning. To explain LoRA, its architecture is illustrated in Fig. 2 and its mathematical formulation is given as follows.
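Restating the standard LoRA formulation here for reference (in the notation of the original LoRA paper), a pre-trained weight matrix W_0 is kept frozen and its update ΔW is constrained to a low-rank product of two much smaller matrices:

W = W_0 + \Delta W = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k) \quad (1)

so that the forward pass becomes

h = W_0 x + \frac{\alpha}{r} B A x \quad (2)

Only A and B are trained; with the rank r = 16 used in this work, the trainable parameters amount to a small fraction of the full ~1.1B-parameter model.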
To train the model, A is initialized with a random Gaussian and B with zeros, so that the update BA is zero at the start of training and introduces no initial bias. The LoRA term is scaled by α/r, which reduces the need to retune hyperparameters when r is varied. In practice these steps do not need to be implemented manually, as the PEFT library already provides them.
As Fig. 1 depicts, the output is delivered through the application used by the user. For text processing, the system uses NLTK tools and Googletrans; for speech synthesis, TTS techniques; and for speech recognition, STT techniques. NLTK is a leading platform for building Python programs that work with human language data, providing a suite of text-processing libraries for tokenization, parsing, classification, and more. TTS converts written text into spoken words, as in virtual assistants, whereas STT converts spoken language into text, as in voice typing. The orchestration architecture employs pipelines to manage the processing steps and RESTful APIs for interactions between components (text-to-text, TTS, STT).
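As an illustration of this speech front end, the following minimal sketch uses the SpeechRecognition and gTTS packages for STT and TTS respectively. These particular libraries and the file names are assumptions made for illustration, since the text above specifies only that TTS and STT techniques are employed; a translation step (e.g., Googletrans) can likewise be inserted before inference to normalize non-English queries.

# Minimal sketch of the speech front end: STT for incoming queries, TTS for replies.
# Library choices (SpeechRecognition, gTTS) and file names are illustrative assumptions.
import speech_recognition as sr
from gtts import gTTS

def speech_to_text(wav_path: str) -> str:
    """Transcribe a spoken user query into text for the chatbot."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)   # cloud speech-to-text service

def text_to_speech(answer: str, out_path: str = "reply.mp3") -> str:
    """Render the chatbot's textual answer as spoken audio."""
    gTTS(text=answer, lang="en").save(out_path)
    return out_path

if __name__ == "__main__":
    query = speech_to_text("user_query.wav")    # e.g. "What is paracetamol used for?"
    # ...the query is passed to the fine-tuned model, and its reply is voiced back:
    text_to_speech("Paracetamol is commonly used to relieve pain and fever.")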
The dataset, the fine-tuning procedure, the Tiny Llama model, and the user interface are discussed in detail in the following sub-sections.
3.1 Dataset
The Kaggle medicine dataset EDA | 11,000 Medicines [21] contains data on 11,000 medicines, of which 7515 preprocessed records were used for fine-tuning and evaluating the proposed model. For each medicine, the dataset provides the active ingredients in the formulation, the medical conditions and ailments for which the medicine is recommended, its potential side effects, image URLs for viewing the product, the manufacturer, and user satisfaction reviews in the categories Excellent, Average, and Poor. The dataset was originally intended for exploratory data analysis of medicine characteristics and reviews. A snapshot of the dataset is shown in Fig. 3.
3.2 Pre-processing
GPT-4 was employed in the proposed framework for pre-processing. To prepare the data for fine-tuning, the dataset, originally in CSV format, had to be converted to text. The Kaggle EDA dataset comprises around 11,000 medicines; details for each include its name, composition, usage, potential side effects, manufacturer, picture URLs that allow users to identify medications, and user satisfaction ratings classified as excellent, average, or poor. During pre-processing, the image URL and manufacturer columns were dropped as irrelevant to the project, and drugs with poor reviews, specifically those with ratings below 70%, were removed to ensure the reliability and safety of PharmaLLM's recommendations; this threshold excludes drugs with consistently negative feedback or adverse effects, thereby enhancing the quality of the recommendations. After filtering, 7515 medicines met the selection criteria, and their records were converted from CSV rows to sentence format with the aid of GPT-4. An example of the generated sentence format is shown in Fig. 4. Following conversion, the data for these 7515 medicines was ready for fine-tuning with the suggested architecture.
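The row-to-sentence conversion in this work was performed with GPT-4; the sketch below illustrates the same filtering and flattening steps with pandas and a simple hand-written template instead. The column names, the rating rule, and the file names are assumptions based on the dataset description rather than the exact code used.

# Minimal pre-processing sketch: drop irrelevant columns, filter poorly rated drugs,
# and flatten each remaining CSV row into a training sentence.
# Column names, the rating rule, and file names are illustrative assumptions.
import pandas as pd

df = pd.read_csv("medicine_dataset.csv")                    # ~11,000 medicines

# Image URLs and manufacturers are not needed for prescription queries.
df = df.drop(columns=["Image URL", "Manufacturer"], errors="ignore")

# Keep medicines whose positive (excellent + average) review share is at least 70%.
df = df[df["Excellent Review %"] + df["Average Review %"] >= 70]

def row_to_sentence(row: pd.Series) -> str:
    """Turn one medicine record into the sentence format used for fine-tuning."""
    return (f"{row['Medicine Name']} contains {row['Composition']}. "
            f"It is used for {row['Uses']}. "
            f"Possible side effects include {row['Side_effects']}.")

with open("pharma_corpus.txt", "w", encoding="utf-8") as corpus:
    for _, row in df.iterrows():
        corpus.write(row_to_sentence(row) + "\n")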
3.3 Selection of Large Language Models (LLMs) and Architecture
There are plenty of open-source large language models available today [23]. After evaluating different models through their documentation and performance benchmarks, we decided to use a variant of Llama, an open-source model that outperforms GPT-3 [24] and has demonstrated effectiveness, adaptability, operational efficiency, and suitability for limited-resource settings [25]. Llama is a series of large language models developed by Meta AI, with sizes ranging from 7B to 65B parameters, that is competitive with much larger LLMs; for example, Llama-13B outperforms GPT-3 on most benchmarks despite being 10× smaller [24].
Tiny Llama [22], an open-source variant of Llama, is integrated as the language model in the proposed system. It is a smaller, more efficient member of the Llama series, well suited to applications with limited resources, since its design provides strong natural language capabilities with reduced computational and memory requirements. This efficiency and low resource footprint were important considerations given our resource-constrained training device, whose details are given in Sect. 4. The model follows the transformer architecture [4] in the same way as Llama 2 [24], using pre-normalization, the SwiGLU activation function [26], Rotary Positional Embeddings (RoPE) [27], and Grouped Query Attention (GQA) [28] to improve training performance. The model also provides a basis for understanding the influence of existing LLMs on different facets of real-world applications by enabling a thorough examination of their strengths and weaknesses.
Following the transformer architecture of Llama 2, the proposed framework applies Root Mean Square Norm (RMSNorm) [29] pre-normalization to the intermediate transformer layers as well as the output layer. The solution uses the scaled-down Llama 2 architecture, Tiny Llama, with 22 layers, 6 heads, an embedding dimension of 2048, and an intermediate dimension of 5632, giving roughly 1.1B parameters. Furthermore, the system uses optimization techniques, namely fused layer norm, fused SwiGLU, and FlashAttention, for speed-up, achieving a throughput of 24,000 tokens per second on an A100-40G GPU.
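For clarity, the sketch below shows RMSNorm as it is commonly implemented in Llama-style models; this is a generic PyTorch illustration rather than the authors' code. Unlike LayerNorm, RMSNorm normalizes by the root mean square of the activations without subtracting the mean.

# Generic PyTorch sketch of RMSNorm pre-normalization as used in Llama-style models.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse root-mean-square over the last dimension, then apply the gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

x = torch.randn(1, 4, 2048)                # (batch, sequence, embedding dimension of 2048)
print(RMSNorm(2048)(x).shape)              # torch.Size([1, 4, 2048])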
In the proposed methodology, the model was fine-tuned with the primary goal of improving its ability to yield relevant and accurate responses to patient queries. The fine-tuning configuration included a learning rate of 2e-4, ensuring gradual and stable adjustments; a batch size of 12, balancing computational demand and learning effectiveness; 3 epochs, providing a uniform duration for adaptation; and PEFT enabled.
The algorithm applies PEFT with the given hyperparameters (learning rate, batch size, and number of epochs) for supervised fine-tuning (SFT) of the pre-trained Tiny Llama model. The approach entails loading the model and optimizer, enabling PEFT if configured, and iterating over a shuffled training dataset for the preset number of epochs. In each iteration the model performs a forward pass, calculates the loss, backpropagates to obtain gradients, and then takes an optimizer step before resetting the gradients. At the end of each epoch a checkpoint is stored and the model's performance is evaluated on a validation dataset, reporting the validation loss. Lastly, the trained model is saved.
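A minimal sketch of this fine-tuning loop, using the Hugging Face transformers, peft, and datasets libraries, is given below. The learning rate, batch size, number of epochs, and LoRA rank match the values reported above; the checkpoint name, lora_alpha, target modules, sequence length, and file paths are assumptions made for illustration, and exact argument names may vary across library versions.

# Minimal LoRA / PEFT supervised fine-tuning sketch for the sentence-formatted corpus.
# Hyperparameters mirror the paper; checkpoint and file names are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"          # assumed public Tiny Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(model_id),
    LoraConfig(r=16, lora_alpha=32,                      # rank 16 as reported; alpha assumed
               target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

dataset = load_dataset("text", data_files="pharma_corpus.txt")["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pharmallm-lora",
        learning_rate=2e-4,                              # gradual, stable updates
        per_device_train_batch_size=12,                  # batch size used in this work
        num_train_epochs=3,                              # uniform adaptation time
        save_strategy="epoch",                           # checkpoint at the end of each epoch
        logging_steps=50,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("pharmallm-lora")                     # store the trained adapter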
3.4 User Interface
The developed system uses React for the front end, while the back end was developed with Flask. The system is designed to be accessible from the web as well as through an application. It provides a user-friendly chat interface (Fig. 5) through which users can input either text or speech; the user query is fed to the chatbot, which answers accordingly. Snapshots of some questions and the bot's responses are shown below.
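The back end can expose the fine-tuned model to the React front end through a simple REST endpoint. The sketch below is a minimal Flask illustration, assuming the fine-tuned weights have been merged and saved as a standard checkpoint; the route name, port, and paths are chosen for illustration rather than taken from the actual code.

# Minimal Flask back-end sketch serving the fine-tuned model to the chat front end.
# Route name, port, and checkpoint path are illustrative assumptions.
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
chat = pipeline("text-generation", model="pharmallm-merged")   # merged fine-tuned checkpoint

@app.route("/api/query", methods=["POST"])
def answer_query():
    question = request.get_json().get("query", "")
    reply = chat(question, max_new_tokens=128)[0]["generated_text"]
    return jsonify({"answer": reply})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)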
4 Results
This section covers the results and evaluation of our model. The model ran smoothly on a system equipped with a 64-bit operating system, 32 GB of RAM, a 10th-generation Intel Core i9 processor, and 8 GB of VRAM. Figure 5 shows the progress of training epochs over time, with the epoch number increasing as training proceeds. Figure 6 depicts the learning rate used during training, which may change according to a predefined schedule or adaptive adjustment. The training loss, which measures the model's performance during training, is likewise plotted over time.
To assess user satisfaction, the Chatbot Usability Questionnaire (CUQ) (Fig. 7) was used to evaluate the suggested model as a healthcare chatbot [30]. More specifically, the proposed solution was analyzed through both qualitative and quantitative methods. Based on the chatbot UX principles offered by the ALMA chatbot tools, the CUQ evaluates the chatbot's personality, navigation, responses, onboarding, error handling, and other aspects. It is intended as a chatbot-specific alternative to the System Usability Scale (SUS) and consists of 16 statements, each rated on a scale of 1 to 5, from "strongly disagree" to "strongly agree." The CUQ statements are listed in Fig. 7.
Thirty-three students participated in our survey: five medical evaluators from the Services Institute of Medical Sciences (SIMS), while the rest were from various majors at UET Lahore. All participants were given clear disclaimers emphasizing that the chatbot should not be used as a substitute for professional medical consultation. Both our model and the medical students were given the same user queries to respond to. Through the user-friendly UI we supplied, users could enter their questions and the chatbot answered them through the same interface. The model's answers were concealed from the medical students throughout the survey, and the accuracy of both sets of responses was evaluated against real data. Screenshots of a user's question and the chatbot's responses are shown in Fig. 8.
Figure 8 illustrates PharmaLLM's detailed replies, including suggestions for various medications and their side effects. The users posed many health-related questions, and PharmaLLM provided good responses. The CUQ ratings of the 33 participants, on a scale of 1 to 5, are also presented as graphs to lend credibility to the proposed health chatbot. To minimize inattentive replies, the CUQ alternates positive and negative statements. For instance, statement 13 asks whether the chatbot coped well with any errors or mistakes (positive statement), whereas statement 14 asks whether the chatbot appeared unable to handle any errors (the negative counterpart of statement 13). Such contrasting statements help evaluate the chatbot's effectiveness more reliably. The subsequent graphs cover the chatbot's personality, navigation, responses, onboarding, and error handling, with graphs for both the positive and negative statements so that the feedback on one can be validated against the other.
The graph in Fig. 9a shows that, according to participant ratings, PharmaLLM handled errors and mistakes efficiently, with scores primarily between 4 and 5. Figure 9b depicts participant feedback on the assertion that the chatbot appeared unable to handle errors, with scores mostly between 1 and 2. Figure 10 shows similar graphs based on participants' responses, indicating PharmaLLM's user-friendly interface.
Furthermore, the graphs in Fig. 11a, b show that participants found PharmaLLM simple to use and experienced no difficulty or confusion when using the chatbot. Participants also found PharmaLLM's responses useful, appropriate, informative, and relevant, as shown by the graphs in Fig. 11c, d.
Qualitative data was also collected through an additional comments section in the survey. The participants' qualitative feedback on the medicine prescriber system was generally positive, with many highlighting its strengths and offering thoughtful suggestions for improvement. One user mentioned that the system's response relevance was excellent, noting how it provided suitable medicine options based on the symptoms entered, which shows that the system was able to understand individual needs. Another participant praised how intuitive the system was, describing it as easy to navigate, and appreciated the clean, straightforward user interface that made entering information easy. There were also constructive comments; for example, one participant suggested adding interactive elements such as progress indicators to make the experience more engaging.
PharmaLLM's performance was measured by overall accuracy, precision, recall, and F1 score. Accuracy is calculated using (3). These metrics are widely used in machine learning, particularly for classification and natural language processing tasks. Precision and recall, given in (4) and (5), are crucial in the medical domain, where false positives (penalized by precision) and false negatives (penalized by recall) can have significant consequences. The F1 score, given in (6), offers a balanced measure that combines precision and recall into a single metric reflecting the model's overall effectiveness. PharmaLLM's accuracy was 87.69%; its precision in identifying accurate medicine information was 90.38%, indicating high reliability; its recall was 94.0%, indicating effectiveness in retrieving pertinent information from the dataset; and its F1 score was 92.16%, highlighting balanced performance in both areas. The different measures of PharmaLLM are presented in Fig. 12.
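Equations (3)-(6) are the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (3)

\text{Precision} = \frac{TP}{TP + FP} \quad (4)

\text{Recall} = \frac{TP}{TP + FN} \quad (5)

F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (6)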
The confusion matrix (Fig. 13) depicts the proportions of predictions that are true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Studying this confusion matrix to identify where the system excels and where it falls short can guide future enhancements to PharmaLLM's performance and reliability.
Here, TP represents medication information that was correctly identified and supplied; TN represents irrelevant information correctly identified as unrelated to the query; FP represents irrelevant information incorrectly identified as relevant; and FN represents relevant information incorrectly identified as irrelevant and therefore missed. PharmaLLM achieved a TP rate of 72.30%, TN of 15.38%, FP of 4.61%, and FN of 7.69%.
An extensive error analysis was carried out to pinpoint shortcomings and areas for improvement. The most prevalent mistakes in the model's output involve negative queries, i.e., questions about what a drug is not suitable for. For instance, for a drug X developed for cancer treatment, the dataset would need statements such as "X is not suitable for the treatment of headache, epilepsy, and {some other diseases}"; adding such negative examples to the fine-tuning dataset could significantly reduce these errors, but would also greatly expand the training dataset and raise the computational overhead of fine-tuning. The outcomes demonstrate that while the inclusion of LLMs did enhance the medical chatbot's accuracy and precision, further improvement is still possible. A lack of resources is one of our work's limitations, but our research suggests these issues can be resolved. Our ultimate objective is to enhance the chatbot's functionality and precision so that it can serve as a reliable medical chatbot.
5 Discussions
LLM research has produced a number of trustworthy open-source frameworks, including Llama [24, 31] and BERT [32]. Computational power and datasets are crucial for building high-quality domain-specific LLMs, yet, particularly in the healthcare domain, the availability of high-quality free medical training datasets is limited.
This paper describes an LLM-based system built on the open-source Tiny Llama model and designed to meet patients' information needs for medical aid. According to the survey, stakeholders including physicians and patients expressed satisfaction with its usefulness. Our Llama-based medical chatbot was popular with both evaluators and users, since it was able to answer inquiries and deliver credible responses, enhancing confidence. It also improved communication and could reduce the frequency of doctor or hospital visits, saving patients time. Notably, its broad linguistic support makes it easy to use even for people with weak reading or technological skills, and medical professionals praised the approach for improving communication between professionals and patients.
The model was fine-tuned specifically for medical purposes on the Kaggle medicine dataset. The novelty of the proposed work includes its evaluation through both qualitative and quantitative analysis: the CUQ, rather than the System Usability Scale (SUS), was used to evaluate the usability of our medical chatbot, alongside human evaluators. Moreover, our model is multilingual and supports multiple input and output modalities. Furthermore, the evaluation metrics, including precision, F1 score, and overall performance, align well with current literature such as OncoGPT [33] while differing from Conversational Health Agents [34]. OncoGPT is trained on oncology-related conversations using the LLaMA-7B model, with a precision of 0.68, recall of 0.60, and an F1 score of 0.63. In contrast, Conversational Health Agents is a framework for generating personalized responses to users' healthcare queries; it allows developers to incorporate external resources such as data sources, knowledge bases, and analytical models into their language-model-based solutions and utilizes the OpenAI API for query generation. However, that system is not specifically trained on healthcare-related conversations and requires an OpenAI API key for access, which may limit its accessibility to a broader audience.
The system proposed in this study is trained on medical data and is designed to be resource-efficient by utilizing Tiny LLaMA. This approach has resulted in impressive performance metrics: sensitivity of 0.94, precision of 0.90, and an F1 score of 0.92. These results underscore the system’s effectiveness in handling medical queries while maintaining efficient resource usage, making it more accessible and practical for deployment in resource-constrained environments.
Our initial findings show that the fine-tuned LLM answered patient questions in a way that was both understandable and useful. However, a closer look at these responses revealed several pieces of incorrect information with varying clinical implications for patient safety. This limits the practicality of deploying such LLMs as patient-facing medical chatbots in healthcare settings, even after fine-tuning with domain-specific knowledge, unless further research is done to evaluate and improve their performance. Our model did assist in a variety of ways, but it was still limited by difficulties such as a lack of more relevant datasets and hardware resources. Despite the small dataset and limited resources, we intend to address these issues in the future and increase the reliability and performance of our model. We are also working to minimize the risk of incorrect advice, with plans to integrate expert-reviewed content in future versions.
Although PharmaLLM has not yet been deployed in real settings, it is crucial to address the ethical considerations that will guide its, or any medical LLM's, future use, given the sensitivity of healthcare data and advice. Data privacy is of utmost importance; our design ensures that user interactions will not be stored or misused, protecting personal information. We also recognize the potential risks of incorrect advice, which could significantly affect users' health. While our user study showed high scores for response relevance, we are committed to ongoing improvements. Future work will involve a thorough model evaluation process, regular updates informed by expert feedback, and adherence to best practices in medical advice to ensure the system operates responsibly and effectively. These measures will help mitigate risks and support the ethical use of PharmaLLM in real-world settings.

PharmaLLM can also be integrated into existing healthcare systems after several strategic considerations to ensure seamless functionality and compliance with healthcare standards. PharmaLLM can be connected to electronic health record (EHR) systems, hospital management software, or telemedicine platforms through standardized APIs, facilitating the exchange of information while maintaining data privacy and security. To enhance compatibility with various EHR systems, Fast Healthcare Interoperability Resources (FHIR) standards can be utilized, allowing PharmaLLM to function effectively across diverse healthcare environments. By addressing these factors, PharmaLLM can be effectively deployed to enhance healthcare delivery in real-world settings.
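To make the FHIR-based integration path concrete, the hypothetical sketch below shows how a PharmaLLM back end could query an EHR's standard FHIR REST interface for a patient's active medication orders. PharmaLLM has not been integrated with any EHR; the server URL, patient identifier, and resource choice are placeholders chosen purely for illustration.

# Hypothetical sketch: querying an EHR via the standard FHIR REST API to give the
# chatbot context about a patient's active medications. URLs and ids are placeholders.
import requests

FHIR_BASE = "https://fhir.example-hospital.org"      # placeholder FHIR server
patient_id = "12345"                                 # placeholder patient identifier

response = requests.get(
    f"{FHIR_BASE}/MedicationRequest",
    params={"patient": patient_id, "status": "active"},
    headers={"Accept": "application/fhir+json"},
    timeout=10,
)
bundle = response.json()

# Extract the human-readable medication names from the returned FHIR Bundle.
medications = [
    entry["resource"]["medicationCodeableConcept"]["text"]
    for entry in bundle.get("entry", [])
    if "medicationCodeableConcept" in entry.get("resource", {})
]
print(medications)   # e.g. ["Metformin 500 mg tablet", ...]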
6 Conclusion
Our primary contribution is to combine LLMs with multi-modal input and output processing, with an emphasis on medical queries. We fine-tuned Tiny Llama on a medicine-related Kaggle dataset covering 11,000 medications' names, dosages, side effects, compositions, and more; additional data sources could be added in the future to enrich the field-specific dataset. Notably, when fine-tuned on the current dataset, our model performed better at answering patients' medical inquiries.
Further enhancements are believed to be necessary to improve the current model, such as applying Reinforcement Learning from Human Feedback (RLHF). Training the model with more advanced frameworks, such as Llama 2, might also improve overall quality and consistency. The model achieved accuracy, F1 score, recall, and precision of 87.69%, 92.16%, 94.0%, and 90.38%, respectively.
Overall, our approach has considerable potential for a wide range of real-world applications, including expert advice and instructional services. Its principal application might be as a smart advising service on health platforms, answering queries swiftly and efficiently while saving medical personnel time and effort. This allows patients and physicians to interact more readily, which is especially useful in locations with limited healthcare resources and low literacy rates. Our technique benefits both patients and healthcare providers, advancing medical research in the process. Future studies could also involve connecting the system via API to open-source medical news websites, such as NIH and similar platforms for continuous chatbot knowledge update. Additionally, we plan to conduct broader testing in diverse clinical settings to further validate the chatbot’s effectiveness and adaptability across various healthcare environments.
Code and Data Availability
Code and data are available here.
References
Al Nazi Z, Peng W, Large Language models in healthcare and medical domain: a review, Informatics 2024, 11(3): 57; 2024, https://doi.org/10.3390/INFORMATICS11030057.
Adamopoulou E, Moussiades L. Chatbots: History, technology, and applications. Mach Learn Appl. 2020;2: 100006. https://doi.org/10.1016/J.MLWA.2020.100006.
Jadhav P, Samnani A, Alachiya A, Shah V, Selvam A. Intelligent Chatbot. International Journal of Advanced Research in Science, Communication and Technology, pp. 679–683, May 2022. https://doi.org/10.48175/IJARSCT-3996.
Vaswani A et al. Attention Is All You Need, Adv Neural Inf Process Syst, vol. 2017-December, pp. 5999–6009, Jun. 2017, Accessed: Oct. 11, 2024. [Online]. https://arxiv.org/abs/1706.03762v7
Borsci S, Schmettow M, Malizia A, Chamberlain A, Frank Van Der Velde, A confirmatory factorial analysis of the Chatbot Usability Scale: a multilanguage validation, Pers Ubiquitous Comput 1: 3, https://doi.org/10.1007/s00779-022-01690-0.
Oh YJ, Zhang J, Fang ML, Fukuoka Y. A systematic review of artificial intelligence chatbots for promoting physical activity, healthy diet, and weight loss. Int J Behav Nutr Phys Act. 2021;18(1):1–25. https://doi.org/10.1186/S12966-021-01224-6/TABLES/4.
Chang IC, Shih YS, Kuo KM. Why would you use medical chatbots? interview and survey. Int J Med Inform. 2022;165: 104827. https://doi.org/10.1016/J.IJMEDINF.2022.104827.
Eleonor Comendador BV, Michael Francisco BB, Medenilla JS, Mae Nacion ST, Bryle Serac TE Pharmabot: a pediatric generic medicine consultant chatbot, https://doi.org/10.12720/joace.3.2.137-140.
Dharwadkar R, Deshpande NA. A Medical ChatBot. Int J Comput Trends Technol. 2018;60(1):41–5. https://doi.org/10.14445/22312803/IJCTT-V60P106.
Prasad GK Satvik Ranjan C Assistant Professor UG Student and T. Ankit Vivek Kumar, A Personalized Medical Assistant Chatbot: MediBot, IJSTE-Int J Sci Technol Eng 5, 2019, Accessed: Jun. 27, 2024. [Online]. www.ijste.org
Kandpal P, Jasnani K, Raut R, Bhorge S Contextual chatbot for healthcare purposes (using deep learning), Proceedings of the World Conference on Smart Trends in Systems, Security and Sustainability, WS4 2020, pp. 625–634, 2020, 10.1109/WORLDS450073.2020.9210351.
Vamsi GK, Rasool A, Hajela G Chatbot: a deep neural network based human to machine conversation model, 2020 11th International Conference on Computing, Communication and Networking Technologies, ICCCNT 2020, Jul. 2020, https://doi.org/10.1109/ICCCNT49239.2020.9225395.
Shinde NV, Akhade A, Bagad P, Bhavsar H, Wagh SK, Kamble A, Healthcare Chatbot System using Artificial Intelligence, Proceedings of the 5th International Conference on Trends in Electronics and Informatics, ICOEI 2021, pp. 1174–1181, Jun. 2021, https://doi.org/10.1109/ICOEI51242.2021.9452902.
Badlani S, Aditya T, Dave M, Chaudhari S, Multilingual healthcare chatbot using machine learning, 2021 2nd International Conference for Emerging Technology, INCET 2021, May 2021, https://doi.org/10.1109/INCET51464.2021.9456304.
Hsu IC, De Yu J. A medical Chatbot using machine learning and natural language understanding. Multimed Tools Appl. 2022;81(17):23777–99. https://doi.org/10.1007/S11042-022-12820-4/TABLES/4.
Chakraborty S, et al. An ai-based medical chatbot model for infectious disease prediction. IEEE Access. 2022;10:128469–83. https://doi.org/10.1109/ACCESS.2022.3227208.
Rajani G, Ruparel K Deep learning based chatbot architecture for medical diagnosis and treatment recommendation, Proceedings of 3rd International Conference on Advanced Computing Technologies and Applications, ICACTA 2023, 2023, https://doi.org/10.1109/ICACTA58201.2023.10392702.
Dohare P, Johri S, Priya S, Singh S, Upadhyay S, Good Fellow : A Healthcare Chatbot System, 2023 14th International Conference on Computing Communication and Networking Technologies, ICCCNT 2023, 2023, https://doi.org/10.1109/ICCCNT56998.2023.10308009.
Rao DR, Thottempudi K, Surla BK, Satapathy A, Design and evaluation of a medical chatbot built on BERT language model for remote health assistance, 2024 3rd International Conference for Innovation in Technology, INOCON 2024, 2024, https://doi.org/10.1109/INOCON60754.2024.10511493.
Babu A, Boddu SB. BERT-based medical Chatbot: enhancing healthcare communication through natural language understanding. Exploratory Res Clin Soc Pharmacy. 2024;13: 100419. https://doi.org/10.1016/J.RCSOP.2024.100419.
EDA | 11,000 Medicines. Accessed: Jun. 27, 2024. [Online]. https://www.kaggle.com/code/sasakitetsuya/eda-11-000-medicines
Zhang P, Zeng G, Wang T, Lu W TinyLlama: an open-source small language model, Jan. 2024, Accessed: Oct. 11, 2024. [Online]. https://arxiv.org/abs/2401.02385v2
Souza JD, A. Intelligence, A review of transformer models, https://doi.org/10.48366/r640001.
Touvron H et al. LLaMA: Open and Efficient Foundation Language Models, 2023, Accessed: Oct. 11, 2024. [Online]. https://arxiv.org/abs/2302.13971v1
Fang Tan T et al. Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4, Feb. 2024, Accessed: Jun. 27, 2024. [Online]. https://arxiv.org/abs/2402.10083v1
Shazeer N. GLU Variants Improve Transformer, Feb. 2020, Accessed: Jun. 27, 2024. [Online]. https://arxiv.org/abs/2002.05202v1
Su J, Ahmed M, Lu Y, Pan S, Bo W, Liu Y. RoFormer: enhanced transformer with rotary position embedding. Neurocomputing. 2024;568: 127063. https://doi.org/10.1016/J.NEUCOM.2023.127063.
Ainslie J, Lee-Thorp J, de Jong M, Zemlyanskiy Y, Lebrón F, Sanghai S GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings, pp. 4895–4901, May 2023, https://doi.org/10.18653/v1/2023.emnlp-main.298.
Zhang B, Sennrich R Root mean square layer normalization, Adv Neural Inf Process Syst, 32, Oct. 2019, Accessed: Jun. 27, 2024. [Online]. https://arxiv.org/abs/1910.07467v1
Holmes S, Moorhead A, Bond R, Zheng H, Coates V, McTear M Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces?, ECCE 2019 - Proceedings of the 31st European Conference on Cognitive Ergonomics: “‘Design for Cognition,’” pp. 207–214, Sep. 2019, https://doi.org/10.1145/3335082.3335094.
Zheng L et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Jun. 2023, Accessed: Jun. 27, 2024. [Online]. https://arxiv.org/abs/2306.05685v4
Devlin J, Chang MW, Lee K, Toutanova K BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, vol. 1, pp. 4171–4186, Oct. 2018, Accessed: Jun. 27, 2024. [Online]. https://arxiv.org/abs/1810.04805v2
Jia F et al. OncoGPT: A Medical Conversational Model Tailored with Oncology Domain Expertise on a Large Language Model Meta-AI (LLaMA), Feb. 2024, Accessed: Jun. 27, 2024. [Online]. https://arxiv.org/abs/2402.16810v1
Abbasian M, Azimi I, Rahmani AM, Jain R, Conversational Health Agents: A Personalized LLM-Powered Agent Framework, Oct. 2023, Accessed: Jun. 27, 2024. [Online]. https://arxiv.org/abs/2310.02374v4
Funding
This research received no external funding.
Author information
Contributions
AA: methodology, software engineering, and programming; ZN: conceptualization and analysis; A. Azam and ZN: writing—original draft preparation; MUGK: writing—review and editing, and supervision. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of Interest
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Azam, A., Naz, Z. & Khan, M.U.G. PharmaLLM: A Medicine Prescriber Chatbot Exploiting Open-Source Large Language Models. Hum-Cent Intell Syst 4, 527–544 (2024). https://doi.org/10.1007/s44230-024-00085-z
DOI: https://doi.org/10.1007/s44230-024-00085-z