LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction

Aishik Nagar¹, Viktor Schlegel^*2,3, Thanh-Tung Nguyen,
Hao Li³, Yuping Wu³, Kuluhan Binici⁴, Stefan Winkler⁴

Abstract

Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in this paper, we systematically benchmark LLM performance in Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to the performance, particularly the impact of LLMs’ task knowledge and reasoning capabilities, their (parametric) domain knowledge, and the addition of external knowledge. To this end, we evaluate various open LLMs, including BioMistral and Llama-2 models, on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency based reasoning as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counter-intuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the limitations in the current application of CoT, self-consistency and RAG in the biomedical domain. Our findings suggest that advanced prompting methods developed for knowledge- or reasoning-intensive tasks, such as CoT or RAG, are not easily portable to biomedical tasks where precise structured outputs are required. This highlights the need for more effective integration of external knowledge and reasoning mechanisms in LLMs to enhance their performance in real-world biomedical applications.

Introduction

The success of Large Language Models (LLMs) promises to reshape the landscape of AI healthcare applications, especially for scenarios relying on Question Answering (Budler, Gosak, and Stiglic 2023; Subramanian et al. 2024), summarisation (Schlegel et al. 2023) and extracting insights from unstructured patient-generated health data (Li et al. 2023). While considerable progress has been made in leveraging LLMs for tasks requiring free-text outputs, much of the focus has been on optimizing the parametric knowledge, that is, the information stored in the model’s weights and learned during training. Recent works explore methods such as fine-tuning on task-specific data and in-context learning (ICL), and report significant improvements in model performance.

However, these approaches primarily enhance the models’ internal knowledge representation. As such, they rely on readily available data for the structured tasks at hand, be it in the form of training sets for task-specific fine-tuning (Abburi et al. 2023), or for selecting good-quality representative few-shot examples for ICL (Zhang et al. 2024; Gutierrez et al. 2022). In the biomedical domain, such resources for structured prediction tasks are typically not available, as requirements might arise ad hoc, for example when researchers need to process a set of medical records to find patients satisfying inclusion criteria for a clinical trial (e.g., whether they are a smoker) (Jullien et al. 2023). But even for well-established tasks, such as medication name extraction, for which resources exist (Wei et al. 2020), these resources often prove to be insufficient in a practical context, due to the domain shift between public resources and internal hospital data (Hadi et al. 2023). Therefore, relying solely on training-set-dependent improvements to the parametric knowledge of LLMs as the driver of performance for structured prediction tasks is often infeasible, and approaches need to be able to perform well in zero-shot scenarios. Despite this, the literature currently lacks a systematic investigation of other crucial aspects of knowledge utilization.

In order to address this research gap, we first postulate that the performance of LLMs in medical reasoning and information extraction tasks in a “true” zero-shot setting (by “true zero-shot” we refer to the scenario where no examples are available to solve the task and no information beyond the labels and their semantically meaningful names is made available to the model (Lampert, Nickisch, and Harmeling 2014)) hinges on three distinct categories of knowledge:

• Parametric Knowledge: The inherent knowledge embedded within the model’s parameters.
• Task Knowledge: The model’s ability to reason about the specific task, including understanding relevant labels and the context of the task.
• External Knowledge: Additional information and context retrieved to supplement the model’s understanding and decision-making process.

Research evaluating these aspects specifically in the medical domain (Nori et al. 2023; Subramanian et al. 2024) is active, but these works have mostly focused on knowledge-intensive prerequisite tasks, such as Multiple-Choice Question Answering. While useful to evaluate the medical knowledge of LLMs, they do not address the question of whether LLMs can succeed on tasks that are more reflective of real applications, such as medical text classification or information extraction. As such, it is necessary to evaluate whether advancements derived from methods that enhance performance, such as (zero-shot) Chain-of-Thought (CoT) reasoning (Wei et al. 2022; Wang and Zhou 2024), self-consistency (Wang et al. 2022) and Retrieval-Augmented Generation (RAG) (Li et al. 2024), carry over to such structured prediction tasks.

Moreover, these studies often employ large, commercial models like ChatGPT (Biswas 2023) or GPT-4 (OpenAI 2023), which present significant challenges in real-world applications due to their computational cost and the privacy concerns associated with sending sensitive data to third-party APIs. Furthermore, there is a growing concern regarding the reliability of LLMs in medical applications, as even the most powerful models are prone to generating hallucinations, compromising the truthfulness of the outputs. Although constrained generation has shown promise in mitigating these issues, its application in medical information extraction tasks has been limited.

Thus, there are three problems that currently inhibit our understanding of the capabilities of LLMs on structured prediction tasks in the medical domain, and, as a consequence, their improvement: (i) Existing approaches to structured prediction tasks in the medical domain typically enhance parametric knowledge and rely on the availability of training sets, which might not be realistic; (ii) “True zero-shot” studies and methods to improve performance in such settings are mostly carried out on surrogate tasks such as Question Answering, and whether they can be adapted to structured prediction tasks is unknown; (iii) Advancements are typically reported on large-scale, proprietary LLMs, which might be unusable due to privacy concerns and the inaccessibility of logits for constrained decoding.

In this paper, we aim to address these gaps by systematically benchmarking the performance of LLMs in medical classification and NER tasks as a representative selection of structured prediction tasks. We focus on assessing the impact of task knowledge and external knowledge while maintaining the parametric knowledge at a reasonable yet static level. Our approach involves exploring a range of techniques, including CoT reasoning, Retrieval-Augmented Generation (RAG), and constrained generation, which have not been extensively applied in these settings. By providing a comprehensive evaluation of these methods, we seek to offer new insights into the practical deployment of LLMs in the medical domain, highlighting both the challenges and potential solutions.

To summarise, this paper makes the following novel contributions: First, to our knowledge, we present the first comprehensive benchmark for LLMs in medical classification and Named Entity Recognition (NER) tasks, providing a systematic evaluation of their information extraction performance in these critical structured prediction tasks within the medical domain. Second, we investigate the impact of various knowledge enhancement techniques, including Chain of Thought (CoT) reasoning, Self-Consistency, Retrieval-Augmented Generation (RAG), and constrained generation, which have not been extensively explored in medical information extraction settings. Notably, we demonstrate that parametric knowledge capacity, i.e., model size, is a primary and often sole driver of performance in zero-shot settings, offering insights into the limitations and potential of current LLM architectures.

Related Work

We briefly survey the existing benchmarking literature in the medical domain, outlining the lack of studies focusing on structured prediction tasks. Furthermore, we cover recent prompting techniques that were proposed to elicit reasoning in LLMs and augment their domain knowledge, either by better tapping into their parametric knowledge or by explicitly providing them with relevant external context. Notably, we omit approaches that rely on the existence of training sets, such as few-shot prompting (Wang et al. 2023) or model fine-tuning, as one of the key challenges in the medical domain is the lack of annotated task data, due to privacy concerns over sharing medical records. Instead, as outlined in the introduction, we focus on “true” zero-shot capabilities of LLMs.

Existing LLM Benchmarks: With the rising popularity of LLMs, many works evaluated their performance in the biomedical and clinical domains. These works typically focus on evaluating domain knowledge by means of Question Answering (Singhal et al. 2023; Harris 2023; Subramanian et al. 2024), or focus directly on possible application scenarios, such as summarisation (Li et al. 2023; Yim et al. 2023) or clinical coding (Kaur, Ginige, and Obst 2023). Many works combine these two directions in an effort to provide more comprehensive benchmarks (Srivastava et al. 2024; Xiong et al. 2024; Feng et al. 2024; Chen et al. 2020; Manes et al. 2024). However, many of these works overlook the wealth of existing literature and plethora of available resources for traditional structured prediction tasks in the biomedical domain, such as document classification, entity recognition and linking, and event and relation extraction (e.g., Pyysalo et al. (2007; 2012) to name a few). Fries et al. (2022) have provided a comprehensive and unified collection of these resources; however, their work prioritises reportage of the resource collection over benchmarking results. Their preliminary evaluations suggest that their evaluated pre-LLM era models barely surpass the random guess baseline in the zero-shot setting. We build upon their work by providing a detailed analysis of the extent to which approaches to enhance reasoning and knowledge in LLMs help to challenge this status quo.

Reasoning- and Knowledge-enhancing approaches: Current work attempts to improve the performance of LLMs from different knowledge utilization perspectives. One of the obvious methods is full-parameter domain-specific pre-training (Xie et al. 2024). For example, Chen et al. (2023) propose the largest medical foundation model, trained on both biomedical and clinical data, with up to 70B parameters. Bolton et al. (2024), on the other hand, argue that larger LLMs are computationally expensive to run and propose a 2.7B LLM specific to biomedical NLP tasks. When fine-tuned, this relatively small model competes with larger LLMs. In our study, we compare domain-generalist models with those adapted to the medical domain. Since full-parameter tuning is costly, many works focus on domain knowledge adaptation by pre-training (Shi et al. 2024; Song et al. 2024) or instruction tuning (Willard and Louf 2023) with adapters. Training-free approaches encompass chain-of-thought (CoT) (Wei et al. 2022; Jeong et al. 2024) and self-consistency (Wang et al. 2022). Concerned with the lack of grounding resulting in hallucination, recent work introduces RAG methods (Li et al. 2024; Wang et al. 2024b; Yu et al. 2023; Munnangi et al. 2024; Wang et al. 2024a; Soong et al. 2023). However, most of these efforts have focused on performance in a particular knowledge paradigm and have lacked a systematic assessment of how these approaches affect performance on structured prediction, which we address in our study.

Methodology

Our methodology is designed to answer the following two research questions: “How well do LLMs perform on structured prediction tasks?” and “To what extent can approaches that enhance task and external knowledge improve their performance?” To answer the first research question, we benchmark a representative sample of LLMs on a large collection of biomedical text classification and NER datasets. More specifically, we choose Medical Text Classification and NER as representative structured prediction tasks. We focus on the “true” zero-shot setting since, as discussed before, this allows us to establish the level of the models’ original parametric knowledge. This setting also more closely reflects real-world application scenarios, because annotated training data for such tasks in the biomedical domain is usually not available due to the ad hoc nature of task requirements and the privacy constraints of medical records. Thus, improving parametric knowledge is often infeasible in practice. To answer the second question, we compare the models’ zero-shot performance to various methods that aim to enhance task knowledge and external knowledge, while keeping the parametric knowledge static.

Datasets

Since we evaluate different prompting techniques, we restrict the choice of tasks to those where the number of possible labels is small enough to fit in the evaluated LLMs’ context window. We restrict the number of labels to at most ten and the mean length of the input documents to at most 2048 tokens. This leaves us with 14 different classification datasets from the BigBio collection. (For the GAD dataset, we only select 1 fold out of the 10 available, as the folds feature the same task for different data, unlike other datasets. We also skipped the Chinese subset of meddialog as we had difficulties loading the dataset.) For the NER task, we sample 12 datasets from the pool of those that satisfy the criteria. The resulting dataset sample features four non-English datasets and six non-public classification datasets, which allows us to investigate whether LLMs perform better on minority languages or on data that is less likely to be found in public pre-training corpora. We run the evaluation on the official test-set split where available; otherwise we consider the full dataset. For datasets with more than 500 instances, we sample 500 random but fixed instances to speed up the experiments. Overall, our selection spans English and non-English source data, publicly available and private datasets, and various domains such as scientific papers, medical notes and social media. The overview of the datasets follows below, with full details to be found in the technical appendix.
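
The selection and subsampling criteria described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the `DatasetInfo` record, the function names and the seed value are hypothetical, and the label and token statistics are assumed to be precomputed.

```python
import random
from dataclasses import dataclass

MAX_LABELS = 10          # label-count limit stated in the text
MAX_MEAN_TOKENS = 2048   # mean input length limit stated in the text
SAMPLE_SIZE = 500        # evaluation subsample size stated in the text


@dataclass
class DatasetInfo:
    """Hypothetical summary record for one candidate dataset."""
    name: str
    num_labels: int
    mean_tokens: float


def is_eligible(info: DatasetInfo) -> bool:
    """Keep datasets with at most ten labels and a mean length of at most 2048 tokens."""
    return info.num_labels <= MAX_LABELS and info.mean_tokens <= MAX_MEAN_TOKENS


def fixed_sample(instances: list, seed: int = 0) -> list:
    """Return all instances if there are at most 500, else a fixed random sample of 500."""
    if len(instances) <= SAMPLE_SIZE:
        return list(instances)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible (seed value is assumed)
    return rng.sample(instances, SAMPLE_SIZE)
```

Seeding a private `random.Random` instance, rather than the global generator, keeps the sample identical across runs and across techniques, which is what "random but fixed" requires.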

Classification.

The datasets used for classification tasks include both single-label and multi-label datasets, covering a wide range of biomedical and clinical domains. For single-label classification, the GAD dataset focuses on identifying associations between genes and diseases (Bravo et al. 2015), while the GEO dataset is concerned with classifying microarray, transcriptomics, and single-cell experiments from the Gene Expression Omnibus (GEO) database (Elucidata 2022). The MedDialog dataset aims to classify dialogue snippets as either being said by a doctor or a patient (Chen et al. 2020). Furthermore, the CZIDrsm dataset has several subsets, including one for classifying research articles based on aspects of disease research (CZIBase), and others for identifying whether a paper describes substantive research into Quality of Life (CZIQoL) or is a natural history study (CZINatHist).

In multi-label classification, the LitCovid dataset is used for the classification of COVID-19-related articles (Chen et al. 2021). The CAS and ESSAI datasets are used to identify negation and uncertainty in clinical cases from French-speaking countries (Grabar, Claveau, and Dalloux 2018). The NTCIR13 datasets include subsets for disease classification of tweets in Japanese (*-Ja), English (*-En), and Chinese (*-Zh) (Iso et al. 2017). Additionally, the PsyTAR dataset is used for sentence classification of various drug-related effects, such as Adverse Drug Reactions (ADR) and Withdrawal Symptoms (WDs) (Zolnoori et al. 2019), while the SciCite dataset is used for citation intent classification based on the context within computer science and biomedical domains (Cohan et al. 2019).

NER.

The datasets for Named Entity Recognition (NER) tasks are similarly divided into entity recognition (single entity type) and classification (multiple entity types). In the single-type category, the GeneTag dataset is used for gene/protein NER, with two annotation versions: the original GeneTag-G and the corrected GeneTag-C (Tanabe et al. 2005). Additionally, the GENIA-PPI dataset focuses on protein-protein interactions or gene regulatory relations within the GENIA corpus, capturing primarily static relations (Pyysalo et al. 2009; Hoehndorf et al. 2010; Ohta et al. 2010).

The multiple-type NER datasets encompass various complex biomedical tasks. The AnEm dataset targets anatomical entity recognition (Ohta et al. 2012), while the BioInfer dataset focuses on recognizing proteins, genes, and RNA entities (Pyysalo et al. 2007). The Genia-EE dataset is used for the GENIA Event corpus (Kim et al. 2009), and the BioNLP11-REL dataset is employed for extracting part-of relations between genes/proteins and associated entities (Pyysalo, Ohta, and Tsujii 2011). Furthermore, the BioNLP-13-CG dataset is used for Cancer Genetics (CG) information extraction, focusing on recognizing events represented as structured n-ary associations of given physical entities (Pyysalo, Ohta, and Ananiadou 2013). The BioNLP-13-GRO dataset aims to populate the Gene Regulation Ontology with events and relations (Kim et al. 2013), and the BioNLP-13-PC dataset is used for the automatic extraction of biomolecular reactions from text (Ohta et al. 2013). Lastly, the PICO dataset deals with recognizing (P)articipants, (I)nterventions, and (O)utcomes (Nye et al. 2018), and the MLEE dataset is used for event extraction related to angiogenesis (Pyysalo et al. 2012).

Models

For our experiments, we employed two instruction-tuned variants of the Llama-2 model, 7B and 70B (Touvron et al. 2023), alongside the BioMistral-7B model (Labrak et al. 2024), which was further pre-trained on the biomedical domain. Since we make use of constrained generation to produce model outputs and guide the model’s decoding process, we restrict the evaluation to open-source models, as this process is not possible for proprietary models such as GPT-4.

Techniques

Standard prompting was used as a baseline for both the Classification and the NER tasks. Chain-of-thought reasoning (Wei et al. 2022) has been shown to improve performance, particularly in QA and logical reasoning tasks. Thus, we also ran experiments with chain-of-thought reasoning to measure its impact on model performance. For the NER task, we adapted a more guided, two-stage approach (Shen et al. 2021) to implement a novel chain-of-thought reasoning approach. Here, the first stage involves inducing a generic entity name from a dataset’s known entity labels (e.g., “Bodypart” for the NER labels describing different body parts) and then labelling the input document with that generic entity type. In the second stage, all entities labelled in this way are further disambiguated with their respective fine-grained dataset NER labels. Retrieval-Augmented Generation (Lewis et al. 2020) has been established as an effective technique to improve model performance by introducing relevant non-parametric knowledge to models and thus grounding the generated outputs in factual information. Xiong et al. (2024) conducted a systematic study of RAG on medical QA, and we incorporate their findings into our study. We used PubMed abstracts (Sanyal, Bhowmick, and Das 2021) and Wikipedia articles as knowledge corpora, because Xiong et al.’s (2024) experiments found that using PubMed improved performance over non-RAG techniques, while using Wikipedia reduced performance in medical QA tasks. Our goal was to evaluate whether the same holds true for structured prediction tasks as well. For the RAG module, we made use of FAISS (Douze et al. 2024; Johnson, Douze, and Jégou 2019), which allows retrieval of the most similar documents based on semantic similarity, where we used the all-MiniLM-L6-v2 sentence transformers (Reimers and Gurevych 2019) model for embedding input documents and corpora.
For each experiment, the number of retrieved documents was set to the maximum number of documents that could be included without exceeding the token limit of the model.
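
The retrieval step and its token budget can be sketched as follows. This is a simplified stand-in rather than the pipeline used in the experiments: plain Python lists replace the all-MiniLM-L6-v2 embeddings and the FAISS index, whitespace splitting replaces the model tokenizer, and the function names and the greedy packing strategy are assumptions, since the exact budgeting procedure is not specified in the text.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for the model tokenizer (assumption)."""
    return len(text.split())


def retrieve_within_budget(query_vec, corpus, token_budget):
    """Rank (text, vector) pairs by similarity and keep the top ones that fit.

    Stops at the first ranked document that would exceed the budget, so the
    result is always a prefix of the similarity ranking.
    """
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, doc[1]), reverse=True)
    selected, used = [], 0
    for text, _ in ranked:
        cost = count_tokens(text)
        if used + cost > token_budget:
            break
        selected.append(text)
        used += cost
    return selected
```

In practice, `token_budget` would be the context window minus the tokens taken by the prompt template, the input document and the space reserved for generation.
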
Self-consistency, proposed by Wang et al. (2022), improves chain-of-thought reasoning of LLMs by sampling multiple reasoning paths for a given problem, followed by a majority vote for the final answer. We also conduct a set of experiments employing self-consistency to investigate whether such improvements can be observed on structured prediction tasks in the medical domain as well. For classification tasks, self-consistency was employed to generate multiple reasoning chains for the given problem, followed by answer extraction from each reasoning chain and majority voting to select the final answer. For NER tasks, since we follow the two-stage approach, self-consistency was employed in both stages. Multiple general entity labels were generated in the first stage, and entities were extracted for each such label. In the second stage, self-consistency was again used for the entity selection phase as well as the entity label determination step. Majority voting was used for the final label or class selection in each case (Xie et al. 2023).
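
The voting step for classification can be sketched as follows. This is a minimal illustration, assuming that answer extraction has already reduced each sampled reasoning chain to a single label; `sample_answer` is a hypothetical callable standing in for one sampled generation, and the default number of samples is an assumption, as the text does not state it.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_answer: Callable[[], str], num_samples: int = 5) -> str:
    """Draw several sampled answers and return the most frequent one.

    `sample_answer` is a hypothetical callable that runs one sampled
    chain-of-thought generation and returns the extracted final label.
    Ties are resolved in favour of the answer that was seen first.
    """
    votes = Counter(sample_answer() for _ in range(num_samples))
    return votes.most_common(1)[0][0]
```

The same vote could be applied per span and per label in the two NER stages, at the cost of one set of samples for each stage.
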
Constrained decoding in LLMs (Willard and Louf 2023) was used to ensure structured information extraction and text generation. This allowed us to evaluate the LLMs for the task at hand without the added variability due to the aleatoric uncertainties brought about by the probabilistic language generation fundamental to the architectures of the models. More specifically, for classification tasks, we ensured the presence of at least one label in the generated outputs. For NER, we restricted generation to spans occurring in the text in the first step, and in the second step, for each of the spans, we restricted generation to one of the possible labels. This is also one of the reasons why we opted against evaluating API-based closed-source LLMs, as in our initial experiments the hallucinations in generated outputs created problems with reliably parsing the structured outputs. (The other reason is their lack of transparency with regard to training data, which violates our “true” zero-shot setting.)
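
The idea behind constrained decoding can be illustrated with a toy character-level decoder. This is a sketch under strong simplifying assumptions and not the decoding library used in the experiments: `score_next` is a hypothetical stand-in for the language model, real systems mask token logits rather than characters, and `allowed` must be a non-empty list of strings.

```python
from typing import Callable, Dict, List


def constrained_decode(score_next: Callable[[str], Dict[str, float]],
                       allowed: List[str]) -> str:
    """Greedy character-level decoding restricted to the strings in `allowed`.

    `score_next(prefix)` maps the text generated so far to scores for
    candidate next characters. At each step only characters that extend the
    prefix towards at least one allowed string are considered, so the output
    is always a member of `allowed`. Decoding stops as soon as the prefix
    equals an allowed string.
    """
    prefix = ""
    while prefix not in allowed:
        # Characters that keep the prefix compatible with some allowed string.
        valid = {s[len(prefix)] for s in allowed
                 if s.startswith(prefix) and len(s) > len(prefix)}
        scores = score_next(prefix)
        # Unscored characters get -inf, so scored valid characters win.
        prefix += max(sorted(valid), key=lambda ch: scores.get(ch, float("-inf")))
    return prefix
```

For classification, `allowed` would hold the label names; for the first NER step, it could hold candidate spans taken from the input document, which guarantees that every extracted span occurs in the text.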

We refer to chain of thought as CoT, Self-consistency as SC, RAG as RAG-{P|W} for PubMed and Wikipedia corpora, respectively, and to standard prompting as Vanilla.

Evaluation Results

Overview of results

Figure 1: Best-performing Standard Prompting method for BioMistral 7B, Llama-70B and Llama-7B for all classification tasks.

Model	Technique	CLS F1	NER F1-S	NER F1-L
BioMistral-7B	Vanilla	36.5	3.3	2.2
	CoT	31.3	1.5	1.3
	SC-CoT	20.5	0.8	0.4
	CoT-RAG-P	14.7	1.6	1.2
	CoT-RAG-W	15.5	1.3	1.0
	SC-CoT-RAG-P	19.2	0.5	0.4
	SC-CoT-RAG-W	21.6	0.4	0.3
Llama-2-70B	Vanilla	40.3	8.6	5.8
	CoT	35.9	10.3	7.3
	SC-CoT	28.0	9.1	5.4
	CoT-RAG-P	16.5	9.9	7.1
	CoT-RAG-W	15.7	10.6	7.2
	SC-CoT-RAG-P	27.2	9.0	5.4
	SC-CoT-RAG-W	26.6	9.1	5.3
Llama-2-7B	Vanilla	34.9	6.5	5.2
	CoT	30.6	4.9	2.5
	SC-CoT	24.6	5.1	3.0
	CoT-RAG-P	14.3	4.6	2.3
	CoT-RAG-W	14.5	4.2	1.7
	SC-CoT-RAG-P	25.5	5.7	2.9
	SC-CoT-RAG-W	11.1	5.6	3.2

Table 1: Performance of each model and technique combination across Classification (CLS) and NER datasets. For classification, we report Micro-F1, and for NER we report both Span-Identification Micro-F1 performance (F1-S) and full Micro-F1 performance (F1-L), which additionally requires recognizing the correct entity types.
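
The two NER scores reported in Table 1 differ only in what counts as a match. The sketch below shows one way to compute them, assuming exact span matching and entities represented as `(start, end, label)` triples per document; both the representation and the function names are illustrative assumptions rather than the evaluation code behind the table.

```python
def micro_f1(gold: list[set], pred: list[set]) -> float:
    """Micro-averaged F1 over documents: pool TP/FP/FN counts, then compute F1."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def span_only(entities: list[set]) -> list[set]:
    """Drop the label from (start, end, label) triples, keeping (start, end)."""
    return [{(s, e) for s, e, _ in doc} for doc in entities]


# One document with two gold entities and three predictions.
gold = [{(0, 4, "Gene"), (10, 17, "Disease")}]
pred = [{(0, 4, "Protein"), (10, 17, "Disease"), (20, 25, "Gene")}]

f1_l = micro_f1(gold, pred)                        # span and label must match
f1_s = micro_f1(span_only(gold), span_only(pred))  # span match is enough
```

In this toy example the mislabelled span at `(0, 4)` counts towards F1-S but not towards F1-L, which is why F1-L can never exceed F1-S.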

Reasoning and knowledge enhancing techniques seem to not improve performance. Figure 1 and Figure 2 compare the results of the best-performing techniques for each model for classification and NER, respectively. As seen in Table 1, perhaps counter-intuitively, Standard Prompting consistently achieves the highest average F1 scores across all models for classification tasks, with BioMistral-7B obtaining 36.48%, Llama-2-70B-Chat-AWQ achieving 40.34%, and Llama-2-7b-chat-hf scoring 34.92%. This result indicates that for structured prediction tasks, more complex reasoning techniques such as Chain of Thought (CoT) Prompting or Retrieval-Augmented Generation (RAG) do not outperform simpler approaches like Standard Prompting. For NER tasks, the results present a more nuanced picture compared to the classification tasks. While Standard Prompting remains effective, there is a noticeable shift in performance across different models and datasets. Notably, the scores are significantly lower than typical F1 scores in biomedical NER benchmarks. For instance, the NCBI disease corpus (Doğan, Leaman, and Lu 2014; Krallinger et al. 2015) and the CHEMDNER dataset usually yield higher performances with specialized models or extensive pre-training. State-of-the-art models on these benchmarks can achieve Span F1 scores up to 0.90 for the NCBI disease corpus (Kocaman and Talby 2021; Zhou et al. 2023). However, similar to our findings, NER scores in a true zero-shot setting have been reported to be markedly low, even for the general domain (Shen et al. 2021) and when supplying label descriptions (Picco et al. 2024).

A possible reason for the poor performance might be that these approaches have been tailored towards, and shown to work well on, knowledge- and reasoning-intensive tasks, such as Question Answering (Nori et al. 2023) or Mathematical Reasoning (Wang and Zhou 2024; Wang et al. 2022; Li et al. 2024). Meanwhile, more narrowly defined tasks like information extraction or classification require the understanding of specific task semantics over generic reasoning capabilities. They seem not to require broad knowledge of the kind found in biomedical paper abstracts or Wikipedia articles, but rather the application of domain knowledge to specific and highly contextualized tasks, contained within the input document and task description. Models need to be able to handle highly specialized vocabulary, including jargon, acronyms, and synonyms that can vary widely between subfields (Kim et al. 2007; Zheng, Yu et al. 2018; Jiang and Xu 2024). There is a fundamental requirement for context-dependent resolution of ambiguity and polysemy, as well as for handling nuances and variability in the syntax and expression of biomedical concepts. This ability is often developed through specialized pre-training or domain-specific enhancements, which the LLMs have not been able to capture. These challenges necessitate models that not only have robust general NER capabilities but also an intricate understanding of biomedical context, which can vary for different subtasks within the domain.

Figure 2: Best-performing Standard Prompting method for BioMistral 7B, Llama-70B and Llama-7B for all NER tasks.

Scale drives improvements. In line with previous observations, we find that the 70B model shows a considerable improvement (5.4% for classification, 2.2% for NER Span F1) over the 7B model. The most significant difference in performance between the Llama 7B and 70B models is observed when using Self-Consistency with Chain of Thought and RAG (Wikipedia), where the 70B model outperforms the 7B model by 15.45% on classification and on NER tasks. This suggests that the larger model is significantly better at leveraging external knowledge when combined with self-consistency and chain-of-thought prompting. Methods like Chain-of-Thought Prompting and Self-Consistency with Chain-of-Thought and RAG involve complex reasoning and knowledge integration processes (Wei et al. 2022). The larger model’s increased capacity might be particularly advantageous in handling these complexities, resulting in a more significant performance gap compared to simpler techniques. This is further demonstrated by the fact that Llama 70B improves performance by 10.91% when Self-Consistency is added to Wikipedia-based RAG, indicating that, for the larger model, self-consistency helps combat the drop in performance caused by adding potentially irrelevant external information. Unlike in classification tasks, where Standard Prompting was universally superior, NER performance does not degrade as much when using advanced prompting techniques, particularly with larger models like Llama-2-70B, likely due to the general lack of epistemic confidence in the answers in the first place.
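
The gaps quoted in this paragraph can be checked against the rounded entries of Table 1. The sketch below copies only the cells needed for that check; the dictionary layout and the `gap` helper are illustrative. With the rounded values, the classification gaps come out as 5.4, 15.5 and 10.9 points, matching the reported 5.4%, 15.45% and 10.91%, while the NER span gap comes out as 2.1 rather than 2.2, presumably because the text was computed from unrounded scores (an assumption).

```python
# Classification F1 and NER span F1 (F1-S) copied from Table 1 (rounded values).
TABLE_1 = {
    ("Llama-2-70B", "Vanilla"): (40.3, 8.6),
    ("Llama-2-70B", "CoT-RAG-W"): (15.7, 10.6),
    ("Llama-2-70B", "SC-CoT-RAG-W"): (26.6, 9.1),
    ("Llama-2-7B", "Vanilla"): (34.9, 6.5),
    ("Llama-2-7B", "SC-CoT-RAG-W"): (11.1, 5.6),
}


def gap(model_a, tech_a, model_b, tech_b, column=0):
    """Difference in points between two Table 1 cells (column 0: CLS F1, 1: F1-S)."""
    diff = TABLE_1[(model_a, tech_a)][column] - TABLE_1[(model_b, tech_b)][column]
    return round(diff, 1)


scale_cls = gap("Llama-2-70B", "Vanilla", "Llama-2-7B", "Vanilla")               # 70B vs 7B, CLS
scale_ner = gap("Llama-2-70B", "Vanilla", "Llama-2-7B", "Vanilla", column=1)     # 70B vs 7B, F1-S
sc_rag_w = gap("Llama-2-70B", "SC-CoT-RAG-W", "Llama-2-7B", "SC-CoT-RAG-W")      # largest CLS gap
sc_effect = gap("Llama-2-70B", "SC-CoT-RAG-W", "Llama-2-70B", "CoT-RAG-W")       # adding SC to RAG-W
```

Note that these differences are absolute differences in F1 points, not relative improvements.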

Figure 3: Breakdown of the Micro-F1 performance of each technique on all classification datasets, compared against the random guess baseline.

Figure 4: Breakdown of each technique and the random guess baseline on all NER datasets as measured by the Micro-F1 scores. A prediction is counted as correct when both the span and its assigned label are found in the ground truth.

Figure 5: Comparison of performance of each model in single label vs multi label datasets. Random baseline for single class classification is 0.415 and multi class classification is 0.215.

Detailed Comparison of Prompting Techniques

The use of CoT and Self-Consistency is not helpful if there is a lack of parametric knowledge about the task. For BioMistral-7B, using Self-Consistency CoT prompting leads to the biggest reduction of about 16% for classification tasks. One possible reason is that the domain-specific pre-training equips the model to better follow the instructions directly without needing additional reasoning structures, which seem detrimental. Similar to the RAG case, self-consistency seems to not consistently improve performance for NER. While Self-Consistency aims to improve the reliability of Chain-of-Thought prompting by generating multiple reasoning paths and selecting the most consistent one, it might introduce additional complexity, leading to errors or inconsistencies. This is especially true if the model’s answers have low confidence scores due to insufficient parametric knowledge, which prevents the model from reliably solving these problems and would explain the observed performance drop. For NER tasks, the combination of Chain of Thought (CoT) and Self-Consistency prompting with RAG (Wikipedia) shows the most substantial performance difference between the 70B and 7B models. This suggests that larger models are more adept at leveraging external knowledge and complex reasoning strategies for entity recognition tasks if there is a lack of parametric knowledge.

RAG does not help information extraction. The quality and relevance of the retrieved information can significantly impact performance, as seen from the fact that there is an average drop of 16.91% when using RAG with the PubMed corpus and 16.47% when using RAG with the Wikipedia corpus, compared to the best-performing technique for classification. While incorporating external knowledge through RAG can be generally beneficial for QA-based tasks (Xiong et al. 2024), where retrieving facts relevant to the given question supplies the model with useful knowledge, it is not as straightforward in classification and information extraction tasks. This is especially important in the given task setting, where the model could be confused by the presence of irrelevant retrieved information, which adds a layer of complexity to extracting the information needed for the task.

SC helps models filter out irrelevant noise in the case of RAG, but does not help CoT. While Self-Consistency aims to improve the reliability of Chain-of-Thought prompting by generating multiple reasoning paths and selecting the most consistent one, it is fundamentally dependent on the model’s epistemic certainty (Yadkori et al. 2024; Liu et al. 2024). This hinders performance if the model’s answers have low confidence scores due to insufficient parametric knowledge, which prevents the model from solving these problems reliably and would explain the observed performance drop. For BioMistral-7B, using Self-Consistency CoT prompting leads to the biggest reduction, of about 16%, for classification tasks. One possible reason is that domain-specific pre-training equips the model to follow instructions directly, without needing additional reasoning structures, which seem detrimental. Similar to the RAG case, Self-Consistency does not consistently improve performance for NER. The combination of CoT and Self-Consistency prompting with RAG (Wikipedia) shows the most substantial performance difference between the 70B and 7B models. This suggests that larger models are more adept at leveraging external knowledge and complex reasoning strategies for entity recognition tasks to compensate for the lack of epistemic certainty.
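The dependence of Self-Consistency on the model’s certainty can be made concrete with a minimal sketch of majority voting over sampled reasoning paths. This is an illustration under stated assumptions, not the implementation used in the experiments: `sample_answer` is a hypothetical callable standing in for one CoT generation followed by answer parsing, and the agreement rate is only a rough proxy for certainty.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistent_answer(
    sample_answer: Callable[[str], str],
    prompt: str,
    n_paths: int = 5,
) -> Tuple[str, float]:
    """Sample several reasoning paths and return the majority answer.

    `sample_answer` is a hypothetical callable (an assumption of this
    sketch): it runs one stochastic CoT generation for `prompt` and
    returns only the final, parsed answer string.
    """
    answers: List[str] = [sample_answer(prompt) for _ in range(n_paths)]
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    # Share of paths agreeing with the winner: a rough certainty proxy.
    return best, votes / n_paths
```

When the sampled answers are spread almost evenly over the label set, the agreement rate is low and the vote is close to arbitrary, which is consistent with the observation that Self-Consistency does not help when parametric knowledge is insufficient.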

Detailed Per-dataset analysis

Models perform significantly better on public datasets. Models perform significantly better on public datasets (average accuracy of 30%) than on private datasets (average accuracy of 12%). This might hint at possible data leakage during pre-training or instruction-tuning, as publicly available datasets are more likely to be included in a web crawl or a dedicated instruction-tuning dataset. It suggests that model performance on ‘unseen’ (yet publicly available) tasks could be a result of unintentional data leakage rather than a by-product of reasoning or generalisation.

Multilingual Performance is not Scale Dependent. As shown in Figure 1, smaller models can match or even outperform larger models on Chinese and Japanese datasets but not on English datasets. This may be due to the heavy reliance on large English corpora during training, with limited exposure to medical contexts in other languages. This forces models to generalize compressed language representations to specialized domains, where overfitting on sparse languages may hinder larger models’ performance.

LLMs struggle on high-complexity tasks. As seen in Figure 5, LLMs seem to struggle to outperform random baselines for both single- and multi-class classification tasks. However, Figure 3 paints a more nuanced picture: the guessing baseline remains unbeaten on only two of 14 datasets, which drags down the average performance significantly.
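For orientation, a guessing baseline for single-label classification can be sketched as follows. This is an assumption-laden illustration: the exact construction of the random baseline is not specified in this section, and the function names below are hypothetical. The sketch assumes uniform guessing over the label set, for which the expected accuracy is one over the number of labels.

```python
import random
from typing import Sequence


def expected_uniform_accuracy(labels: Sequence[str]) -> float:
    """Expected accuracy when guessing uniformly over the distinct labels."""
    return 1.0 / len(set(labels))


def simulated_uniform_accuracy(
    gold: Sequence[str], labels: Sequence[str], seed: int = 42
) -> float:
    """Empirical accuracy of one run of uniform random guessing."""
    rng = random.Random(seed)  # local generator, so the run is reproducible
    hits = sum(rng.choice(labels) == y for y in gold)
    return hits / len(gold)
```

Under this assumption, a dataset with many labels has a very low guessing baseline, while a binary dataset has a baseline of 50%, so averaged comparisons against the baseline can be dominated by a few datasets.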

Figures 3 and 4 show that Llama-2 70B demonstrates good performance on low-complexity tasks such as disease and symptom classification (CZIBase, NTCIR13-En) and medium-complexity tasks like gene expression classification (GEO). However, the model is challenged by higher-complexity problems, such as the BioNLP13-CG and GENIA-EE datasets. Specifically, on datasets that demand nuanced understanding and interpretation, such as the extraction of participants and outcomes from abstracts and gene ontology population (PICO, BioNLP13-GRO), performance is low. When incorporating RAG (Retrieval-Augmented Generation) techniques, performance fluctuates across datasets. While results improve on some datasets, RAG does not universally benefit the model’s ability to accurately extract and classify biomedical information.

Conclusion

We provide a comprehensive benchmark and analysis of LLMs in Medical Classification and Named Entity Recognition tasks, revealing several key insights that have significant implications for the field. We carry out a critical investigation of broad claims regarding LLM capabilities by replicating them in various contexts, domains and datasets. We find that models suffer from fundamental drawbacks in generalizability, which hinder their performance in structured information extraction tasks on domain-specific problems. As a result, standard prompting outperforms more advanced methods across both tasks. Our findings underscore the paramount importance of parametric knowledge capacity in zero-shot settings, regardless of the advanced techniques used to add external knowledge or augment model reasoning.

Acknowledgements

This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme. The authors thank Abhinav Ramesh Kashyap, Andy T. Liu and Vijay Prakash Dwivedi for their comments and useful feedback during the work. The authors further acknowledge and are thankful for the use of the Imperial College Research Computing Service (DOI: http://doi.org/10.14469/hpc/2232), and the Computational Shared Facility at The University of Manchester.

References

Abburi et al. (2023) Abburi, H.; Suesserman, M.; Pudota, N.; Veeramani, B.; Bowen, E.; and Bhattacharya, S. 2023. Generative AI Text Classification using Ensemble LLM Approaches. In IberLEF@SEPLN, volume 3496 of CEUR Workshop Proceedings. CEUR-WS.org.
Biswas (2023) Biswas, S. S. 2023. Role of chat gpt in public health. Annals of biomedical engineering, 51(5): 868–869.
Bolton et al. (2024) Bolton, E.; Venigalla, A.; Yasunaga, M.; Hall, D.; Xiong, B.; Lee, T.; Daneshjou, R.; Frankle, J.; Liang, P.; Carbin, M.; et al. 2024. Biomedlm: A 2.7 b parameter language model trained on biomedical text. arXiv preprint arXiv:2403.18421.
Bravo et al. (2015) Bravo, À.; Piñero, J.; Queralt-Rosinach, N.; Rautschka, M.; and Furlong, L. I. 2015. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics, 16(1).
Budler, Gosak, and Stiglic (2023) Budler, L. C.; Gosak, L.; and Stiglic, G. 2023. Review of artificial intelligence-based question-answering systems in healthcare. WIREs Data Mining Knowl. Discov., 13(2).
Chen et al. (2021) Chen, Q.; Allot, A.; Leaman, R.; Doğan, R. I.; and Lu, Z. 2021. Overview of the BioCreative VII LitCovid Track: multi-label topic classification for COVID-19 literature annotation. In Proceedings of the seventh BioCreative challenge evaluation workshop.
Chen et al. (2020) Chen, S.; Ju, Z.; Dong, X.; Fang, H.; Wang, S.; Yang, Y.; Zeng, J.; Zhang, R.; Zhang, R.; Zhou, M.; Zhu, P.; and Xie, P. 2020. MedDialog: A Large-scale Medical Dialogue Dataset. CoRR, abs/2004.03329.
Chen et al. (2023) Chen, Z.; Cano, A. H.; Romanou, A.; Bonnet, A.; Matoba, K.; Salvi, F.; Pagliardini, M.; Fan, S.; Köpf, A.; Mohtashami, A.; et al. 2023. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.
Cohan et al. (2019) Cohan, A.; Ammar, W.; van Zuylen, M.; and Cady, F. 2019. Structural Scaffolds for Citation Intent Classification in Scientific Publications. In Conference of the North American Chapter of the Association for Computational Linguistics.
Doğan, Leaman, and Lu (2014) Doğan, R. I.; Leaman, R.; and Lu, Z. 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of biomedical informatics, 47: 1–10.
Douze et al. (2024) Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.-E.; Lomeli, M.; Hosseini, L.; and Jégou, H. 2024. The Faiss library. arXiv preprint.
Elucidata (2022) Elucidata, I. 2022. GEOKhoj v1. https://github.com/ElucidataInc/GEOKhoj-datasets/tree/main/geokhoj_v1.
Feng et al. (2024) Feng, H.; Ronzano, F.; LaFleur, J.; Garber, M.; de Oliveira, R.; Rough, K.; Roth, K.; Nanavati, J.; Zine El Abidine, K.; and Mack, C. 2024. Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark. medRxiv, 2024–05.
Fries et al. (2022) Fries, J.; Weber, L.; Seelam, N.; Altay, G.; Datta, D.; Garda, S.; Kang, S.; Su, R.; Kusa, W.; Cahyawijaya, S.; et al. 2022. Bigbio: A framework for data-centric biomedical natural language processing. Advances in Neural Information Processing Systems, 35: 25792–25806.
Grabar, Claveau, and Dalloux (2018) Grabar, N.; Claveau, V.; and Dalloux, C. 2018. CAS: French Corpus with Clinical Cases. In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, 122–128. Brussels, Belgium: Association for Computational Linguistics.
Gutierrez et al. (2022) Gutierrez, B. J.; McNeal, N.; Washington, C.; Chen, Y.; Li, L.; Sun, H.; and Su, Y. 2022. Thinking about GPT-3 In-Context Learning for Biomedical IE? Think Again. In EMNLP (Findings), 4497–4512. Association for Computational Linguistics.
Hadi et al. (2023) Hadi, M. U.; Qureshi, R.; Shah, A.; Irfan, M.; Zafar, A.; Shaikh, M. B.; Akhtar, N.; Wu, J.; Mirjalili, S.; et al. 2023. A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints.
Harris (2023) Harris, E. 2023. Large language models answer medical questions accurately, but can’t match clinicians’ knowledge. JAMA.
Hoehndorf et al. (2010) Hoehndorf, R.; Ngonga Ngomo, A.-C.; Pyysalo, S.; Ohta, T.; Oellrich, A.; and Rebholz-Schuhmann, D. 2010. Applying ontology design patterns to the implementation of relations in GENIA. In Proceedings of the Fourth International Symposium for Semantic Mining in Biomedicine.
Iso et al. (2017) Iso, H.; Ruiz, C.; Murayama, T.; Taguchi, K.; Takeuchi, R.; Yamamoto, H.; Wakamiya, S.; and Aramaki, E. 2017. NTCIR13 MedWeb Task: multi-label classification of tweets using an ensemble of neural networks. In NTCIR.
Jeong et al. (2024) Jeong, M.; Hwang, H.; Yoon, C.; Lee, T.; and Kang, J. 2024. OLAPH: Improving Factuality in Biomedical Long-form Question Answering. arXiv preprint arXiv:2405.12701.
Jiang and Xu (2024) Jiang, C.; and Xu, W. 2024. MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain. arXiv preprint arXiv:2405.02144.
Johnson, Douze, and Jégou (2019) Johnson, J.; Douze, M.; and Jégou, H. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3): 535–547.
Jullien et al. (2023) Jullien, M.; Valentino, M.; Frost, H.; O’Regan, P.; Landers, D.; and Freitas, A. 2023. NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial Reports. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 16745–16764. Stroudsburg, PA, USA: Association for Computational Linguistics.
Kaur, Ginige, and Obst (2023) Kaur, R.; Ginige, J. A.; and Obst, O. 2023. AI-based ICD coding and classification approaches using discharge summaries: A systematic literature review. Expert Syst. Appl., 213(Part): 118997.
Kim et al. (2007) Kim, H.; Goryachev, S.; Rosemblat, G.; Browne, A.; Keselman, A.; and Zeng-Treitler, Q. 2007. Beyond surface characteristics: a new health text-specific readability measurement. In AMIA Annual Symposium Proceedings, volume 2007, 418. American Medical Informatics Association.
Kim et al. (2009) Kim, J.-D.; Ohta, T.; Pyysalo, S.; Kano, Y.; and Tsujii, J. 2009. Overview of BioNLP’09 Shared Task on Event Extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, 1–9. Boulder, Colorado: Association for Computational Linguistics.
Kim et al. (2013) Kim, J.-j.; Han, X.; Lee, V.; and Rebholz-Schuhmann, D. 2013. GRO Task: Populating the Gene Regulation Ontology with events and relations. In Proceedings of the BioNLP Shared Task 2013 Workshop, 50–57. Sofia, Bulgaria: Association for Computational Linguistics.
Kocaman and Talby (2021) Kocaman, V.; and Talby, D. 2021. Biomedical named entity recognition at scale. In Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part I, 635–646. Springer.
Krallinger et al. (2015) Krallinger, M.; Leitner, F.; Rabal, O.; Vazquez, M.; Oyarzabal, J.; and Valencia, A. 2015. CHEMDNER: The drugs and chemical names extraction challenge. Journal of cheminformatics, 7: 1–11.
Labrak et al. (2024) Labrak, Y.; Bazoge, A.; Morin, E.; Gourraud, P.-A.; Rouvier, M.; and Dufour, R. 2024. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373.
Lampert, Nickisch, and Harmeling (2014) Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2014. Attribute-Based Classification for Zero-Shot Visual Object Categorization. IEEE Trans. Pattern Anal. Mach. Intell., 36(3): 453–465.
Lewis et al. (2020) Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33: 9459–9474.
Li et al. (2023) Li, H.; Wu, Y.; Schlegel, V.; Batista-Navarro, R.; Nguyen, T.; Kashyap, A. R.; Zeng, X.; Beck, D.; Winkler, S.; and Nenadic, G. 2023. Team: PULSAR at ProbSum 2023: PULSAR: Pre-training with Extracted Healthcare Terms for Summarising Patients’ Problems and Data Augmentation with Black-box Large Language Models. In BioNLP@ACL, 503–509. Association for Computational Linguistics.
Li et al. (2024) Li, M.; Zhou, H.; Yang, H.; and Zhang, R. 2024. RT: a Retrieving and Chain-of-Thought framework for few-shot medical named entity recognition. Journal of the American Medical Informatics Association, ocae095.
Liu et al. (2024) Liu, L.; Pan, Y.; Li, X.; and Chen, G. 2024. Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach. arXiv preprint arXiv:2404.15993.
Manes et al. (2024) Manes, I.; Ronn, N.; Cohen, D.; Ber, R. I.; Horowitz-Kugler, Z.; and Stanovsky, G. 2024. K-QA: A Real-World Medical Q&A Benchmark. CoRR, abs/2401.14493.
Munnangi et al. (2024) Munnangi, M.; Feldman, S.; Wallace, B. C.; Amir, S.; Hope, T.; and Naik, A. 2024. On-the-fly Definition Augmentation of LLMs for Biomedical NER. arXiv preprint arXiv:2404.00152.
Nori et al. (2023) Nori, H.; Lee, Y. T.; Zhang, S.; Carignan, D.; Edgar, R.; Fusi, N.; King, N.; Larson, J.; Li, Y.; Liu, W.; Luo, R.; McKinney, S. M.; Ness, R. O.; Poon, H.; Qin, T.; Usuyama, N.; White, C.; and Horvitz, E. 2023. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. CoRR, abs/2311.16452.
Nye et al. (2018) Nye, B.; Li, J. J.; Patel, R.; Yang, Y.; Marshall, I.; Nenkova, A.; and Wallace, B. 2018. A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 197–207. Melbourne, Australia: Association for Computational Linguistics.
Ohta et al. (2010) Ohta, T.; Pyysalo, S.; Kim, J.-D.; and Tsujii, J. 2010. A reevaluation of biomedical named entity - term relations. Journal of bioinformatics and computational biology, 8: 917–28.
Ohta et al. (2013) Ohta, T.; Pyysalo, S.; Rak, R.; Rowley, A.; Chun, H.-W.; Jung, S.-J.; Choi, S.-P.; Ananiadou, S.; and Tsujii, J. 2013. Overview of the Pathway Curation (PC) task of BioNLP Shared Task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, 67–75. Sofia, Bulgaria: Association for Computational Linguistics.
Ohta et al. (2012) Ohta, T.; Pyysalo, S.; Tsujii, J.; and Ananiadou, S. 2012. Open-domain Anatomical Entity Mention Detection. In Proceedings of the Workshop on Detecting Structure in Scholarly Discourse, volume W12-43. Association for Computational Linguistics.
OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. CoRR, abs/2303.08774.
Picco et al. (2024) Picco, G.; Fuchs, L.; Galindo, M. M.; Purpura, A.; López, V.; and Lam, H. T. 2024. Description Boosting for Zero-Shot Entity and Relation Classification. CoRR, abs/2406.02245.
Pyysalo et al. (2007) Pyysalo, S.; Ginter, F.; Heimonen, J.; Björne, J.; Boberg, J.; Järvinen, J.; and Salakoski, T. 2007. BioInfer: a corpus for information extraction in the biomedical domain. BMC bioinformatics, 8(1): 1–24.
Pyysalo, Ohta, and Ananiadou (2013) Pyysalo, S.; Ohta, T.; and Ananiadou, S. 2013. Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, 58–66. Sofia, Bulgaria: Association for Computational Linguistics.
Pyysalo et al. (2009) Pyysalo, S.; Ohta, T.; Kim, J.-D.; and Tsujii, J. 2009. Static Relations: a Piece in the Biomedical Information Extraction Puzzle. In Proceedings of the BioNLP 2009 Workshop, 1–9. Boulder, Colorado: Association for Computational Linguistics.
Pyysalo et al. (2012) Pyysalo, S.; Ohta, T.; Miwa, M.; Cho, H.-C.; Tsujii, J.; and Ananiadou, S. 2012. Event extraction across multiple levels of biological organization. Bioinformatics, 28(18): i575–i581.
Pyysalo, Ohta, and Tsujii (2011) Pyysalo, S.; Ohta, T.; and Tsujii, J. 2011. Overview of the Entity Relations (REL) Supporting Task of BioNLP Shared Task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, BioNLP Shared Task ’11, 83–88. USA: Association for Computational Linguistics. ISBN 9781937284091.
Reimers and Gurevych (2019) Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP/IJCNLP (1), 3980–3990. Association for Computational Linguistics.
Sanyal, Bhowmick, and Das (2021) Sanyal, D. K.; Bhowmick, P. K.; and Das, P. P. 2021. A review of author name disambiguation techniques for the PubMed bibliographic database. J. Inf. Sci., 47(2).
Schlegel et al. (2023) Schlegel, V.; Li, H.; Wu, Y.; Subramanian, A.; Nguyen, T.; Kashyap, A. R.; Beck, D.; Zeng, X.; Batista-Navarro, R. T.; Winkler, S.; and Nenadic, G. 2023. PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records. In CLEF (Working Notes), volume 3497 of CEUR Workshop Proceedings, 1668–1679. CEUR-WS.org.
Shen et al. (2021) Shen, Y.; Ma, X.; Tan, Z.; Zhang, S.; Wang, W.; and Lu, W. 2021. Locate and label: A two-stage identifier for nested named entity recognition. arXiv preprint arXiv:2105.06804.
Shi et al. (2024) Shi, W.; Xu, R.; Zhuang, Y.; Yu, Y.; Wu, H.; Yang, C.; and Wang, M. D. 2024. MedAdapter: Efficient Test-Time Adaptation of Large Language Models towards Medical Reasoning. arXiv:2405.03000.
Singhal et al. (2023) Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S. S.; Wei, J.; Chung, H. W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. 2023. Large language models encode clinical knowledge. Nature, 620(7972): 172–180.
Song et al. (2024) Song, Y.; Zhang, J.; Tian, Z.; Yang, Y.; Huang, M.; and Li, D. 2024. LLM-based privacy data augmentation guided by knowledge distillation with a distribution tutor for medical text classification. arXiv preprint arXiv:2402.16515.
Soong et al. (2023) Soong, D.; Sridhar, S.; Si, H.; Wagner, J.-S.; Sá, A. C. C.; Yu, C. Y.; Karagoz, K.; Guan, M.; Hamadeh, H.; and Higgs, B. W. 2023. Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model. arXiv preprint arXiv:2305.17116.
Srivastava et al. (2024) Srivastava, S.; PV, A.; Menon, S.; Sukumar, A.; Philipose, A.; Prince, S.; Thomas, S.; et al. 2024. Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap. arXiv preprint arXiv:2402.19450.
Subramanian et al. (2024) Subramanian, A.; Schlegel, V.; Ramesh Kashyap, A.; Nguyen, T.-T.; Dwivedi, V. P.; and Winkler, S. 2024. M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering. In Ku, L.-W.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics ACL 2024, 4002–4042. Bangkok, Thailand and virtual meeting: Association for Computational Linguistics.
Tanabe et al. (2005) Tanabe, L.; Xie, N.; Thom, L. H.; Matten, W.; and Wilbur, W. J. 2005. GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics, 6.
Touvron et al. (2023) Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wang et al. (2024a) Wang, J.; Yang, Z.; Yao, Z.; and Yu, H. 2024a. Jmlr: Joint medical llm and retrieval training for enhancing reasoning and professional question answering capability. arXiv preprint arXiv:2402.17887.
Wang et al. (2022) Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; and Zhou, D. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
Wang and Zhou (2024) Wang, X.; and Zhou, D. 2024. Chain-of-thought reasoning without prompting. arXiv preprint arXiv:2402.10200.
Wang et al. (2023) Wang, Y.; Peng, X.; Shen, T.; Clarke, A.; Schlegel, C.; Martin, P.; and Long, G. 2023. Soft Prompt Transfer for Zero-Shot and Few-Shot Learning in EHR Understanding. In ADMA (3), volume 14178 of Lecture Notes in Computer Science, 18–32. Springer.
Wang et al. (2024b) Wang, Z.; Liu, A.; Lin, H.; Li, J.; Ma, X.; and Liang, Y. 2024b. Rat: Retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation. arXiv preprint arXiv:2403.05313.
Wei et al. (2022) Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824–24837.
Wei et al. (2020) Wei, Q.; Ji, Z.; Li, Z.; Du, J.; Wang, J.; Xu, J.; Xiang, Y.; Tiryaki, F.; Wu, S.; Zhang, Y.; Tao, C.; and Xu, H. 2020. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. J. Am. Medical Informatics Assoc., 27(1): 13–21.
Willard and Louf (2023) Willard, B. T.; and Louf, R. 2023. Efficient guided generation for large language models. arXiv e-prints, arXiv–2307.
Xie et al. (2024) Xie, Q.; Chen, Q.; Chen, A.; Peng, C.; Hu, Y.; Lin, F.; Peng, X.; Huang, J.; Zhang, J.; Keloth, V.; et al. 2024. Me llama: Foundation large language models for medical applications. arXiv preprint arXiv:2402.12749.
Xie et al. (2023) Xie, T.; Li, Q.; Zhang, J.; Zhang, Y.; Liu, Z.; and Wang, H. 2023. Empirical study of zero-shot ner with chatgpt. arXiv preprint arXiv:2310.10035.
Xiong et al. (2024) Xiong, G.; Jin, Q.; Lu, Z.; and Zhang, A. 2024. Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178.
Yadkori et al. (2024) Yadkori, Y. A.; Kuzborskij, I.; György, A.; and Szepesvári, C. 2024. To Believe or Not to Believe Your LLM. arXiv preprint arXiv:2406.02543.
Yim et al. (2023) Yim, W.; Fu, Y.; Abacha, A. B.; Snider, N.; Lin, T.; and Yetisgen, M. 2023. ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. CoRR, abs/2306.02022.
Yu et al. (2023) Yu, G.; Liu, L.; Jiang, H.; Shi, S.; and Ao, X. 2023. Retrieval-augmented few-shot text classification. In Findings of the Association for Computational Linguistics: EMNLP 2023, 6721–6735.
Zhang et al. (2024) Zhang, M.; Wang, B.; Fei, H.; and Zhang, M. 2024. In-Context Learning for Few-Shot Nested Named Entity Recognition. In ICASSP, 10026–10030. IEEE.
Zheng, Yu et al. (2018) Zheng, J.; Yu, H.; et al. 2018. Assessing the readability of medical documents: a ranking approach. JMIR medical informatics, 6(1): e8611.
Zhou et al. (2023) Zhou, W.; Zhang, S.; Gu, Y.; Chen, M.; and Poon, H. 2023. Universalner: Targeted distillation from large language models for open named entity recognition. arXiv preprint arXiv:2308.03279.
Zolnoori et al. (2019) Zolnoori, M.; Fung, K. W.; Patrick, T. B.; Fontelo, P.; Kharrazi, H.; Faiola, A.; Wu, Y. S. S.; Eldredge, C. E.; Luo, J.; Conway, M.; Zhu, J.; Park, S. K.; Xu, K.; Moayyed, H.; and Goudarzvand, S. 2019. A systematic approach for developing a corpus of patient reported adverse drug events: A case study for SSRI and SNRI medications. Journal of Biomedical Informatics, 90.

Appendix A: Datasets

Tables 2 and 3 list the Hugging Face dataset cards and citations for each classification and NER dataset used in the paper, respectively.
For datasets considered private, we assume that models have not been trained on these datasets due to their restricted access, which requires Data Use Agreements (DUAs) and other permissions. Consequently, the likelihood of these datasets being included in common web crawls is low.
We have signed all the relevant Data Use Agreements (DUAs) and strictly adhere to their provisions. We do not redistribute the data and advise those wishing to reproduce experiments involving private datasets to consult the corresponding Hugging Face dataset cards for guidance on obtaining the necessary data.

Appendix B: Compute Details

1. Hardware used (GPU/CPU): We used a mix of different shared computational facilities with NVIDIA A100-SXM4-80GB, RTX6000 with 24GB and L40S with 48GB. Debian OS was used for all the compute servers.
2. Memory: The machines used had between 256 GB and 1 TB of memory.
3. Software and libraries used: The environment can be reproduced from the environment.yaml file in the supplementary material.
4. Model details: The models used have been described in detail in the main paper submission under the Models subsection of the Methodology section.
5. Random seed: A random seed of 42 was used for all random sampling purposes.
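As a minimal sketch of how a fixed seed makes sampling repeatable, the snippet below draws a subset of examples with Python’s standard library. Only the seed value 42 comes from the compute details above; the helper name `sample_examples` and the use of the `random` module are assumptions for illustration, and the actual experiments may seed other libraries as well.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

SEED = 42  # seed value stated in the compute details


def sample_examples(pool: Iterable[T], k: int, seed: int = SEED) -> List[T]:
    """Draw k items without replacement using a locally seeded generator.

    Hypothetical helper: a dedicated Random instance avoids depending on
    (or disturbing) the global random state.
    """
    rng = random.Random(seed)
    return rng.sample(list(pool), k)
```

Calling `sample_examples(range(100), 5)` twice returns the same five items, which is the property a fixed seed is meant to provide.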

Dataset Name	HuggingFace Card	Citation
GAD	bigbio/gad	(Bravo et al. 2015)
GEO	bigbio/geokhoj_v1	(Elucidata 2022)
MedDialog	bigbio/meddialog	(Chen et al. 2020)
CZIBase	bigbio/czi_drsm
CZIQoL	bigbio/czi_drsm
CZINatHist	bigbio/czi_drsm
LitCovid	bigbio/bc7_litcovid	(Chen et al. 2021)
CAS	bigbio/cas	(Grabar, Claveau, and Dalloux 2018)
ESSAI	bigbio/essai	(Grabar, Claveau, and Dalloux 2018)
NTCIR13-Ja	bigbio/ntcir_13_medweb	(Iso et al. 2017)
NTCIR13-En	bigbio/ntcir_13_medweb	(Iso et al. 2017)
NTCIR13-Zh	bigbio/ntcir_13_medweb	(Iso et al. 2017)
PsyTAR	bigbio/psytar	(Zolnoori et al. 2019)
SciCite	bigbio/scicite	(Cohan et al. 2019)

Table 2: Datasets used for classification tasks.

Dataset Name	HuggingFace Card	Citation
GeneTag-G	bigbio/genetag	(Tanabe et al. 2005)
GeneTag-C	bigbio/genetag	(Tanabe et al. 2005)
GENIA-PPI	bigbio/genia_relation_corpus	(Pyysalo et al. 2009; Hoehndorf et al. 2010; Ohta et al. 2010)
AnEm	bigbio/an_em	(Ohta et al. 2012)
BioInfer	bigbio/bioinfer	(Pyysalo et al. 2007)
Genia-EE	bigbio/bionlp_shared_task_2009	(Kim et al. 2009)
BioNLP11-REL	bigbio/bionlp_st_2011_rel	(Pyysalo, Ohta, and Tsujii 2011)
BioNLP-13-CG	bigbio/bionlp_st_2013_cg	(Pyysalo, Ohta, and Ananiadou 2013)
BioNLP-13-GRO	bigbio/bionlp_st_2013_gro	(Kim et al. 2013)
BioNLP-13-PC	bigbio/bionlp_st_2013_pc	(Ohta et al. 2013)
PICO	bigbio/ebm_pico	(Nye et al. 2018)
MLEE	bigbio/mlee	(Pyysalo et al. 2012)

Table 3: Datasets used for NER tasks.