Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings

Andrea Posada, Daniel Rueckert, Felix Meissen, and Philip Müller

Abstract

Since the Transformer architecture emerged, language model development has grown rapidly, driven by the models' promising potential. Releasing these models into production requires properly understanding their behavior, particularly in sensitive domains like medicine. Despite this need, the medical literature still lacks practical assessments of pre-trained language models, which are especially valuable in settings where only consumer-grade computational resources are available. To address this gap, we have conducted a comprehensive survey of language models in the medical field and evaluated a subset of these for medical text classification and conditional text generation. The subset includes $53$ models with $110$ million to $13$ billion parameters, spanning the three Transformer-based model families and the general, biomedical, and clinical knowledge domains. Different approaches are employed for text classification, including zero-shot learning, which adapts a model to the task without the need to train it. These approaches are helpful in our target settings, where many users of language models find themselves. The results reveal remarkable performance across the tasks and datasets evaluated, underscoring the potential of certain models to contain medical knowledge, even without domain specialization. This study thus advocates for further exploration of model applications in medical contexts, particularly in computational resource-constrained settings, to benefit a wide range of users. The code is available at https://github.com/anpoc/Language-models-in-medicine.

1 Introduction

Natural language processing (NLP) holds great promise in the medical field. The medical community has recently shown substantial interest in leveraging state-of-the-art language models to address various medical challenges [1, 2]. In particular, generative large language models (LLMs) have showcased emergent abilities beyond their original training objectives, such as text summarization and question answering [3]. These newfound abilities have enabled LLMs to perform tasks of significant clinical importance, including passing medical examinations, summarizing clinical and radiological reports, as well as medical dialogues, extracting drug names from medical notes, responding to patient inquiries, and writing medical histories and physical assessments [2, 4].

The versatility of language models can be attributed to a convergence of factors [2, 4, 5]. The first factor is their ability to learn valuable patterns within large amounts of unlabeled data via self-supervision. The second factor revolves around the Transformer architecture [6] and its suitability for efficient parallel processing on modern computing hardware. Lastly, the third factor encompasses the crucial process of fine-tuning language models to align their responses with human expectations through instruction tuning.

Integration of language models in medical settings is becoming a reality as partnerships between developers and healthcare systems continue to grow [7]. The potential benefits are significant, as these models can derive broadly applicable representations from extensive medical corpora at scale and encapsulate clinical knowledge [8]. Nevertheless, it is essential to recognize that our understanding of the behavior of both small pre-trained and large language models remains incomplete [4]. Deploying these models also carries risks, such as the generation of inaccurate results, a phenomenon known as hallucination, and the potential amplification of existing biases [1, 4]. The implementation of language models in sensitive fields, such as healthcare, should therefore be approached with the utmost care [5].

The computing and energy resources that language models require for their development and operation are another critical limiting factor, especially for LLMs. The standard computing resources available in hospitals are consumer-grade, on which it is currently infeasible to run models with hundreds of billions of parameters. Such resource-constrained settings, i.e., those with consumer-grade computing resources, are faced not only by healthcare agents and institutions but also by research groups.

When large language models do not represent a cost-effective or viable solution, smaller pre-trained language models can be an alternative. LLMs, albeit more massive, have similar architectures and pre-training tasks to smaller pre-trained language models [9]. With the same computing budget, a smaller model trained with more high-quality data can perform better than its larger counterparts, which tend to be undertrained [10]. Using curated scientific and biomedical corpora in pre-trained language models has also been effective for discriminative and generative language modeling [11]. Furthermore, these smaller models align with the crucial imperative of environmental sustainability and open up possibilities for organizations to develop applications that can run directly on commodity hardware and small devices rather than relying on cloud-based services [12]. Language models in resource-constrained settings thereby address practical challenges and have great potential in local computing.

To further understand the performance of language models in clinical scenarios with limited computational resources, we conducted a comprehensive evaluation focusing on the classification and conditional generation of medical texts in open-source models. The datasets employed enable the assessment of general and radiology-specific medical knowledge. In total, $53$ models are tested, ranging from $110$ M to $13$ B parameters, spanning all Transformer-based model families and knowledge domains from general to clinical. For conditional text generation, solely decoder-only models are used. The approaches adopted for text classification, together with prompt engineering, allow for improved model performance without the need for training or fine-tuning. An analysis of the impact of the prompts on performance is also included. To the best of our knowledge, this is the first work to evaluate such a large number of small pre-trained language models for medical tasks.

2 Preliminaries

The evolution of natural language processing can be condensed into four major groups of models: (1) statistical models, (2) neural language models, (3) pre-trained language models, and (4) large language models [9]. Each of these groups represents a paradigm shift in natural language modeling and has contributed significantly to the conception of language models as we know them today.

The first transition, from statistical to neural language models, entailed a shift from word prediction based on minimal local context to probabilistic evaluation of word sequences using neural networks. This transition also introduced the representation of words as low-dimensional continuous embeddings based on their contextual usage (distributional semantics). The second transition, from neural to pre-trained language models, involved turning from developing task-specific models to pre-training and fine-tuning methodologies. The third transition to large language models moved the focus from discriminative AI to generative AI, from model-centric to data-centric approaches, and from fine-tuning to prompt engineering and prompt tuning [9, 13, 14]. These advances have paved the way for more sophisticated language models with broader applications and improved capabilities.

2.1 Pre-trained language models

[Figure 1, panel (a): Encoder-only models]

The emergence of pre-trained language models represented a paradigm shift, driving research toward designing more efficient architectures and refining pre-training strategies. These pre-trained models have been commonly adapted or specialized to downstream tasks via fine-tuning, which involves transferring knowledge by further training a model on new data. They have demonstrated significant advantages in language understanding and in performance across various tasks [13, 9].

ELMo is one of the earliest attempts at pre-trained language models [15]. This model was developed to capture context-aware word representations by pre-training a bidirectional Long Short-Term Memory (biLSTM) network and fine-tuning it for subsequent downstream tasks. Later, the Transformer architecture was introduced, revolutionizing the NLP field by offering highly parallelizable structures and self-attention mechanisms. The Transformer [6] follows the encoder-decoder archetype, from which three families of models arose: (1) BERT-family or encoder-only models, (2) GPT-family or decoder-only models, and (3) text-to-text or encoder-decoder models. Fig. 1 shows graphical representations of these families.

2.1.1 Encoder-only models

Encoder-only models, exemplified by BERT [16], are based on masked language modeling (MLM), where parts of the input are masked to encourage the model to reconstruct the original sequence, leveraging contextual information bidirectionally. These models can be stated as $v_{1:n}\rightarrow\phi(v_{1:n})$. In particular, their contextual embeddings have proven highly effective as general-purpose semantic features, significantly boosting performance in discriminative NLP tasks.

2.1.2 Decoder-only models

Decoder-only models focus on autoregressive language modeling, i.e., predicting the next token in a sequence based on previous tokens. These models produce contextual embeddings and a distribution over the subsequent token $v_{i+1}$, which can be stated as $v_{1:i}\rightarrow\phi(v_{1:i}),\mathbb{P}(v_{i+1}|v_{1:i})$. However, the contextual embeddings they generate depend solely on the left context. Most research efforts are currently directed toward decoder-only models due to their exceptional performance in conditional generation tasks and their demonstrated emergent capabilities.
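For reference, the next-token distributions in the notation above combine into a distribution over whole sequences through the chain rule of probability. This is a standard identity, stated here as a reading aid rather than as a result of this study; $v_{1:0}$ denotes the empty context:

$$\mathbb{P}(v_{1:n})=\prod_{i=0}^{n-1}\mathbb{P}(v_{i+1}\mid v_{1:i})$$

Conditional generation then amounts to repeatedly sampling or selecting $v_{i+1}$ from $\mathbb{P}(v_{i+1}\mid v_{1:i})$ and appending it to the context.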

2.1.3 Encoder-decoder models

Text-to-text models, or encoder-decoder models, are trained to learn the correspondence between a pair of texts and can be stated as $v_{1:n}\rightarrow\phi(v_{1:n}),\mathbb{P}(w_{1:m}|\phi(v_{1:n}))$. These models combine bidirectional contextual embeddings with the capability to generate output sequences, making them versatile in various text-to-text tasks without requiring additional heads for fine-tuning. Moreover, since a broad spectrum of language tasks can be cast in a text-to-text format, these models can potentially be used for a wide range of applications.

2.2 Large language models

Scaling of language models has often resulted in improved model capabilities in various tasks [17, 10, 18, 19, 20, 21, 22], including those requiring specialized scientific knowledge and reasoning [23]. Research by Kaplan et al. [24] revealed that there is an empirical power-law relationship between the language model performance, in terms of cross-entropy loss, and the model size, dataset size, and amount of compute used for training. It was further found that architectural details, such as network width or depth, had minimal effects on performance. Scaling laws have further been studied by Hoffmann et al. [10] and Bahri et al. [25].

Following these empirical results, several studies have trained progressively larger language models of up to hundreds of billions of parameters, such as GPT-3 [18], PaLM [20], Galactica [26], LLaMA [27, 28], Claude [29], Gemini 1.5 [30], and Mistral [31]. Among these, GPT-3 and ChatGPT can be considered the precursors of the large language models, the name by which these large-scale language models are known [13, 9]. GPT-4, a successor of GPT-3, stands out for its exceptional performance, often matching or surpassing human performance on a variety of tasks [11, 32, 33], even in specialized domains [34]. Extensive evaluations have been conducted on GPT-4 [23, 35, 36, 37], even exploring the path toward Artificial General Intelligence (AGI) [32].

LLMs can be adapted to different tasks via prompt engineering, which, unlike fine-tuning, does not require retraining the model and updating its weights. These prompting techniques have led to observing unexpected emergent capabilities in LLMs, demonstrating the potential to address a wide range of complex tasks and exhibit apparent reasoning abilities [38, 3, 8, 14, 39, 18, 22, 40, 41, 42]. In the medical field, for example, Chain of Thought (CoT) has been used for explainability [43] and in-context learning to mitigate the need for costly medical annotations [13]. Numerous studies have even highlighted the competence of large language models as implicit knowledge bases [8, 23, 26, 44].

In-context learning techniques, such as zero-shot and few-shot learning, have also proven to be remarkably effective on instruction-tuned models and models to which reinforcement learning techniques have been applied [22, 39, 8, 45]. Zero-shot learning consists of asking a trained model to complete a task without providing explicit examples of this task, whereas in few-shot learning, some examples are provided. Nonetheless, prompting techniques are not exclusive to LLMs but are also applicable to smaller pre-trained language models, especially encoder-decoder and decoder-only models.
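The difference between the two settings can be made concrete with a small prompt builder. The sketch below is illustrative only: the function name `build_prompt`, the "Text:/Label:" layout, and the example texts are assumptions, not the prompts used in this study.

```python
from typing import Sequence, Tuple


def build_prompt(task: str, query: str,
                 examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    blocks = [task]
    # Demonstrations are only present in the few-shot case.
    for text, label in examples:
        blocks.append(f"Text: {text}\nLabel: {label}")
    # The query is left unanswered so that the model completes the label.
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


task = "Classify the medical specialty of the text."
query = "Chest x-ray shows a right pleural effusion."

zero_shot = build_prompt(task, query)
few_shot = build_prompt(
    task, query,
    examples=[("Patient reports blurred vision.", "Ophthalmology")],
)
```

In both cases the model weights stay untouched; only the text given to the model changes.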

Despite their advantages, LLMs also have limitations. Their high computational resource requirements and associated computational challenges represent a major limitation of these models. For example, conducting studies to better understand the behavior of LLMs and assess important criteria, such as faithfulness and biases, can be costly and time-consuming. The detection of biases and hallucinations, i.e., the generation of inaccurate results, is crucial in sensitive domains such as medicine.

Due to the significance of the computing limitation, alternatives such as model quantization [46] have been introduced. Quantization is a technique that reduces the computational and memory costs of model inference by representing its weights and activations with low-precision data types, such as 8-bit integers, instead of the usual 32-bit floating point. In natural language processing, this technique is currently being extensively studied, with [47, 48, 49, 50, 51, 52] being some examples in the literature.
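As a minimal sketch of the idea, the snippet below applies symmetric per-tensor (absmax) quantization to a small weight matrix. This is one simple scheme chosen for illustration; it is not necessarily the scheme used in the cited works, and the function names are hypothetical.

```python
import numpy as np


def quantize_int8(w: np.ndarray):
    """Map float32 values to int8 using a single absmax scale (illustrative scheme)."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:  # all-zero tensor: any scale works
        scale = 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# int8 storage is four times smaller than float32 storage.
compression = weights.nbytes / q.nbytes
# The rounding error per weight is bounded by half a quantization step.
max_error = float(np.abs(weights - restored).max())
```

The trade-off is visible directly: memory shrinks by a factor of four, while each weight is perturbed by at most half a quantization step.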

Recommendations on the optimal use of computational resources have also been proposed. Chinchilla’s scaling law [10], one of these recommendations, states that the optimal model size and the number of tokens for training a language model should scale equally for compute-optimal training under a given computational budget. In [10], it is further shown that current large language models are significantly undertrained due to the recent focus on scaling language models while keeping the amount of training data constant. A smaller model trained with more high-quality data can thus achieve better performance than its larger counterparts with the same computing budget.
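The equal-scaling rule can be illustrated with a back-of-the-envelope calculation. The sketch below relies on two commonly used approximations that are not stated in this paper and should be read as assumptions: a training cost of $C\approx 6ND$ FLOPs for $N$ parameters and $D$ tokens, and a ratio of roughly $20$ training tokens per parameter.

```python
import math


def compute_optimal_allocation(flops: float, tokens_per_param: float = 20.0):
    """Split a training budget into model size N and token count D.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N (both approximations).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget of a 70 B-parameter model trained on 1.4 T tokens under C ~= 6ND.
budget = 6.0 * 70e9 * 1.4e12

n, d = compute_optimal_allocation(budget)
n4, d4 = compute_optimal_allocation(4.0 * budget)

# Quadrupling the budget doubles both the model size and the token count.
growth = (n4 / n, d4 / d)
```

Under these assumptions, model size and training data both grow with the square root of the compute budget, which is what "scale equally" means in practice.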

2.3 Language models in the biomedical/clinical context

Broadly speaking, language models used in specialized domains are (i) models trained solely on target domain data, (ii) models pre-trained on a general domain corpus and adapted with tuning strategies, and (iii) models pre-trained on a specialized domain corpus, with or without tuning strategies. Examples of tuning strategies are fine-tuning with target domain data and prompt engineering. Pre-training in (ii) can also be domain-adaptive continual pre-training (i.e., pre-training on a specialized domain corpus after pre-training on a general domain corpus) or mixed-domain pre-training (i.e., pre-training on a mix of general and specialized domain corpora simultaneously).

GPT-4 is an example of a general domain language model that has been studied in medical applications. Research has ranged from its utility as a medical chatbot [53] and in medical competency exams [33] to its applications in radiology [54, 11, 55, 56, 57, 58, 59], among others [60, 61, 62, 63]. Nevertheless, the models studied in the medical context are mostly domain-specific, either biomedical or clinical. These models include pre-trained models such as BioBERT [64], SciBERT [65], BioMedBERT [66], BioMegatron [67], ScholarBERT [68], BioGPT [69], and ClinicalBERT [70], as well as large language models such as Galactica [26], MedAlpaca [71], PMC-LLaMA [72], Med-PaLM 2 [73], GatorTron [74], GatorTronGPT [75], and ClinicalGPT [76].

Domain-specific models usually contain general domain data within the pre-training data, with exceptions such as BioMedBERT, Galactica, GatorTron, and GatorTronGPT. For large language models, instruction fine-tuning is the most common tuning technique, as in MedAlpaca, Med-PaLM 2, GatorTronGPT, and ClinicalGPT. Reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF) have also been adopted, although less frequently, with HuatuoGPT [77] being an example. Recent research also indicates a multimodal trend that supports various types of healthcare data, including electronic health records (EHR), medical images, and medical sequence signals. Examples of these developments include LLaVAMed [78], MedAGI [79], OphGLM [80], Visual Med-Alpaca [81], MedFlamingo [82], and CheXzero [83].

3 Related Work

Comparative studies investigating language models are crucial to advance our understanding of them, shed light on their functionalities, and pinpoint their constraints. Despite previous research, a notable gap persists in the literature due to, among other causes, the current pace of development in NLP. This gap is particularly significant in fields that require heightened sensitivity, such as medicine, where a thorough understanding of models is imperative [45]. Existing research in medicine is mainly focused on specific tasks, datasets, or models [5, 39, 14, 11, 8]. Moreover, most of the discursive and practical assessments focus on LLMs, as can be seen below. To the best of our knowledge, there is no practical assessment in the clinical context that includes a large number of pre-trained models, covers all Transformer-based model families, and targets settings where only consumer-grade computing resources are available.

The work by He et al. [13] stands out among existing descriptive studies, comprehensively addressing the capabilities, limitations, development, and integration of language models in healthcare. The language models in scope are pre-trained and large language models. The development process is explained in detail, covering aspects such as training data, methodologies, and optimization strategies. Concerns related to the integration of LLMs into healthcare are also investigated, such as fairness, accountability, transparency, and ethics.

Zhou et al. [84] also provide a comprehensive overview of the development and deployment of LLMs in medicine, together with the challenges and opportunities these models face. Their study is both discursive and practical, which is one of its highlights. The authors detail the principles of existing medical LLMs, comprising basic model structures, number of parameters, and data sources and scales used for model development. They also compare the performance of different LLMs across various medical tasks, including against state-of-the-art lightweight models.

Continuing with practical reviews, Soni et al. [85] assessed the cost-effectiveness of pre-training and fine-tuning in BERT, BioBERT, Clinical BERT, and XLNet for medical question answering tasks. Their results indicate that BERT-based models exhibit superior performance when fine-tuned with mixed datasets (i.e., general and clinical domain data), highlighting a gap in well-generalizable medical QA datasets. The results also suggest that initial fine-tuning on general domain datasets, such as SQuAD, before doing it on clinical datasets can enhance performance. Prompting techniques were not included in their evaluations.

In a similar vein, Jahan et al. [45] studied the impact of data size for fine-tuning and that of prompts in zero-shot learning on model performance. Four large language models are evaluated on six benchmark biomedical text processing tasks across $26$ datasets. Zero-shot LLMs outperform state-of-the-art fine-tuned models, such as BioBERT, BioGPT, and BioBART, when fine-tuning data is scarce. As the amount of fine-tuning data increases, so does the performance of these state-of-the-art fine-tuned models, surpassing zero-shot LLMs. The study also highlights LLMs’ sensitivity to prompts, as variations in these led to significant differences in outcomes. No single LLM consistently excelled across all datasets and tasks. The authors advocate the training of biomedical LLMs on domain-specific corpora while recognizing LLMs’ potential for biomedical applications that lack large annotated data.

Lehman et al. [86] further explored whether LLMs trained primarily on general web text are suitable for highly specialized, safety-critical domains such as medicine, or if domain-specific models are a better alternative. A total of $12$ language models, ranging from $220$ million to $175$ billion parameters, are evaluated on three clinical tasks. As part of the experiments, T5 models were trained from scratch using MIMIC-III and MIMIC-IV clinical notes to investigate the efficiency of clinical tokens. Their findings suggest that relatively small, specialized clinical models significantly outperform all in-context learning approaches, even when fine-tuned on limited annotated data. Neither the models’ ability to handle long texts nor decoder-only and instruction-tuned models are accounted for in their work.

Lastly, Li et al. [87] focus on pre-trained language models for long clinical text. A core limitation of Transformer-based models is their substantial memory consumption, leading to performance degradation in long clinical texts. To overcome this limitation, the authors pre-trained Longformer and BigBird, two long-sequence Transformers, on a large-scale clinical corpus, extending the maximum input length from $512$ to $4\,096$ tokens. These models consistently and significantly outperformed ClinicalBERT and other short-sequence Transformers across ten tasks. According to the results, long-sequence Transformers enriched with clinical knowledge are thus capable of learning long-term dependencies in long clinical texts. Their evaluations consider solely encoder-only models and no generative tasks.

4 Methodology

A series of experiments on medical text classification and conditional text generation are carried out to better understand the behavior of language models under resource-constrained settings, i.e., settings with consumer-grade computing resources. In total, $53$ language models are evaluated, whose size ranges from $110$ million to $13$ billion parameters. The selection of these models spans the general, biomedical, and clinical knowledge domains and includes the three families of Transformer-based models. Moreover, only open-source models with at most $13$ B parameters are considered. Details on the selected models are found in Table 1 and Appendix B.

All experiments are performed using a Quadro RTX 8000 GPU and CUDA version 12.2. To guarantee that the selected models align with consumer-grade computing resources, models with more than $8$ billion parameters (i.e., OpenLLaMA 13B, Flan-T5-XXL, T5-V1.1-11B, and T0++) are run with float16 precision. By halving the floating-point precision, these $11$ and $13$ billion parameter models remain viable in computational resource-constrained settings.
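The effect of halving the precision can be estimated with simple arithmetic on the memory needed to hold the weights alone. The sketch below is a rough estimate under stated assumptions: it ignores activations and other runtime overhead, and the GPU memory of $48$ GB is an assumed figure for the card named above rather than a value given in this paper.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the model weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


GPU_MEMORY_GB = 48.0  # assumption: memory of the GPU used, not stated in the text

footprints = {}
for name, n_params in [("11 B", 11e9), ("13 B", 13e9)]:
    fp32 = weight_memory_gb(n_params, 4)  # float32: 4 bytes per parameter
    fp16 = weight_memory_gb(n_params, 2)  # float16: 2 bytes per parameter
    footprints[name] = (fp32, fp16)
    print(f"{name}: float32 = {fp32:.0f} GB, float16 = {fp16:.0f} GB, "
          f"float16 fits = {fp16 < GPU_MEMORY_GB}")
```

Under these assumptions, the float32 weights of a $13$ B model alone would exceed the assumed GPU memory, whereas the float16 weights leave room for activations.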

The three families of Transformer-based models are considered for the text classification task via different approaches (described in Section 4.1.2), whereas solely decoder-only models are used for the conditional text generation task. Transcriptions, MIMIC-CXR, and MS-CXR have been chosen as evaluation datasets. Transcriptions covers a broad spectrum of medical specialties, allowing a general assessment of medical knowledge. MIMIC-CXR and its labeled version, MS-CXR, enable testing focused on radiology, one of the most promising fields for AI integration, narrowing the evaluation to specialized medical knowledge.

Table 1: The models used in this study are categorized by their type, domain, and size. Each model is presented with its number of parameters and may have one or more superscripts. Superscripts are 0: model used for contextual embedding similarity, 1: model used for natural language inference (NLI), 2: model used for multiple-choice questions, 3: model used for text generation, †: instruction-tuned model, ‡: cross-encoder model.

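Size classes: Small (S), Medium (M), Large (L), XL, XXL.

| Type | Domain | Size class | ID | Model | Size |
|---|---|---|---|---|---|
| Encoder-only | General | S | m00 | BERT$_{\texttt{BASE}}$ ⁰ [16] | 110 M |
| Encoder-only | General | M | m01 | BERT$_{\texttt{LARGE}}$ ⁰ [16] | 340 M |
| Encoder-only | General | S | m11 | NLI-DeBERTa$_{\texttt{base}}$ ‡¹ [88] | 100 M |
| Encoder-only | General | M | m12 | RoBERTa$_{\texttt{LARGE}}$-MNLI ‡¹ [89] | 355 M |
| Encoder-only | Biomedical | S | m02 | BiomedBERT (abstracts + full text) ⁰ [66] | 110 M |
| Encoder-only | Biomedical | S | m03 | BiomedBERT (abstracts only) ⁰ [66] | 110 M |
| Encoder-only | Biomedical | M | m04 | BiomedBERT-large (abstracts only) ⁰ [66] | 340 M |
| Encoder-only | Biomedical | S | m05 | SciBERT ⁰ [65] | 110 M |
| Encoder-only | Biomedical | S | m06 | SapBERT ⁰ [90] | 110 M |
| Encoder-only | Biomedical | S | m07 | BioLORD-STAMB2-v1 ⁰ [91] | 110 M |
| Encoder-only | Biomedical | S | m08 | BioLORD-STAMB2-v1-STS2 ⁰ [91] | 110 M |
| Encoder-only | Biomedical | S | m09 | BioLORD-PMB ⁰ [91] | 110 M |
| Encoder-only | Clinical | S | m10 | Bio+Clinical BERT ⁰ [70] | 110 M |
| Encoder-decoder | General | M | m13 | BART Large-MNLI ¹ [94] | 407 M |
| Encoder-decoder | General | S | m14 | T5-V1.1-Base ² [92, 93] | 220 M |
| Encoder-decoder | General | L | m15 | T5-V1.1-Large ² [92, 93] | 770 M |
| Encoder-decoder | General | XL | m16 | T5-V1.1-3B ² [92, 93] | 3.0 B |
| Encoder-decoder | General | XXL | m17 | T5-V1.1-11B ² [92, 93] | 11.0 B |
| Encoder-decoder | General | S | m18 | Flan-T5-Base †² [17] | 220 M |
| Encoder-decoder | General | L | m19 | Flan-T5-Large †² [17] | 770 M |
| Encoder-decoder | General | XL | m20 | Flan-T5-XL †² [17] | 3.0 B |
| Encoder-decoder | General | XXL | m21 | Flan-T5-XXL †² [17] | 11.0 B |
| Encoder-decoder | General | XL | m22 | T0 3B †² [95] | 3.0 B |
| Encoder-decoder | General | XXL | m23 | T0++ †² [95] | 11.0 B |
| Encoder-decoder | Biomedical | - | - | - | - |
| Encoder-decoder | Clinical | S | m24 | ClinicalT5-base ² [96] | 220 M |
| Encoder-decoder | Clinical | L | m25 | ClinicalT5-large ² [96] | 700 M |
| Decoder-only | General | M | m26 | GPT-2 Medium ³ [19] | 355 M |
| Decoder-only | General | L | m27 | GPT-2 Large ³ [19] | 774 M |
| Decoder-only | General | XL | m28 | GPT-2 XL ³ [19] | 1.5 B |
| Decoder-only | General | XXL | m29 | Palmyra Base 5B ²³ [97] | 5.0 B |
| Decoder-only | General | XXL | m30 | Camel 5B †² [99] | 5.0 B |
| Decoder-only | General | XXL | m31 | GPT-J 6B ²³ [100] | 6.0 B |
| Decoder-only | General | XXL | m32 | Instruct GPT-J †² [101] | 6.0 B |
| Decoder-only | General | XXL | m33 | Falcon-7B ²³ [102] | 7.0 B |
| Decoder-only | General | XXL | m34 | Falcon-7B-Instruct †² [102] | 7.0 B |
| Decoder-only | General | XXL | m35 | MPT-7B ²³ [103] | 7.0 B |
| Decoder-only | General | XXL | m36 | MPT-7B-Instruct †² [103] | 7.0 B |
| Decoder-only | General | XXL | m37 | LLaMA-7B ²³ [27] | 7.0 B |
| Decoder-only | General | XXL | m38 | LLaMA 2-7B ²³ [28] | 7.0 B |
| Decoder-only | General | XXL | m39 | Alpaca 7B †² [104] | 7.0 B |
| Decoder-only | General | XXL | m40 | LLaMA 2-CHAT-7B †² [28] | 7.0 B |
| Decoder-only | General | XL | m41 | OpenLLaMA 3B ³ [98] | 3.0 B |
| Decoder-only | General | XL | m42 | OpenLLaMA 3Bv2 ³ [98] | 3.0 B |
| Decoder-only | General | XXL | m43 | OpenLLaMA 7B ³ [98] | 7.0 B |
| Decoder-only | General | XXL | m44 | OpenLLaMA 7Bv2 ³ [98] | 7.0 B |
| Decoder-only | General | XXL | m45 | OpenLLaMA 13B ³ [98] | 13.0 B |
| Decoder-only | Biomedical / Scientific | M | m46 | GPT-2-PubMed Medium ³ [105] | 355 M |
| Decoder-only | Biomedical / Scientific | L | m47 | GPT-2-PubMed Large ³ [105] | 774 M |
| Decoder-only | Biomedical / Scientific | M | m48 | BioGPT ³ [69] | 347 M |
| Decoder-only | Biomedical / Scientific | XL | m49 | BioGPT-Large ³ [69] | 1.5 B |
| Decoder-only | Biomedical / Scientific | XL | m50 | Galactica 1.3B ³ [26] | 1.3 B |
| Decoder-only | Biomedical / Scientific | XXL | m51 | Galactica 6.7B ³ [26] | 6.7 B |
| Decoder-only | Clinical | XXL | m52 | MedAlpaca 7b †² [71] | 7.0 B |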

4.1 Text classification

Text classification is addressed using the Transcriptions and MS-CXR datasets and three different approaches: (i) contextual embedding similarity, (ii) natural language inference (NLI), and (iii) multiple-choice question answering (MCQA). The contextual embedding similarity approach is intended for encoder-only models, the NLI approach for encoder-only and encoder-decoder models pre-trained for NLI, and the MCQA approach for encoder-decoder and decoder-only models.

Model tuning is implemented through zero-shot learning. To analyze the impact of prompting on text classification performance, different prompts are applied during inference. These prompts, grouped into two sets, are defined according to the classification approach. The first set of prompts is used for contextual embedding similarity and NLI. Since neither of these approaches requires a prompt to work, the no-prompt case is also included in the analysis. The second set of prompts is used for MCQA, an approach that needs a prompt to work. Prompts from the second set are defined based on those most commonly used when instruction-tuning models for multiple-choice question answering tasks.

Let $x\in X$ be a text sample and $y\in Y$ be a class, not necessarily corresponding to $x$ . A prompt from the first prompt set, $p\in P_{1}$ , is defined as a function of a prompt template and a label. For example, $p_{1}(y)=$ “This is an example of $y$ ”. The set $P_{1}$ is only applied to the classes. Meanwhile, a prompt from the second prompt set, $p\in P_{2}$ , combines a prompt template (consisting of the prompt structure and a question), a text sample, and the classes. For example, $p(x,Y)=$ “You are a doctor and have the following information about a patient from a chest x-ray: $x$ . What is the diagnosis? $Y$ . (”. In this example, the prompt template consists of the question “What is the diagnosis?” and the prompt structure, which is the rest of the text. Prompts are presented in detail in Appendix C.
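The two prompt sets can be sketched as plain string functions. The templates below reuse the two examples given above; the function names and the way the classes $Y$ are rendered as lettered options are assumptions made for illustration, and the actual prompts are those listed in Appendix C.

```python
import string
from typing import Sequence


def p1(label: str, template: str = "This is an example of {label}") -> str:
    """First prompt set: a template applied to a class label only."""
    return template.format(label=label)


def p2(sample: str, classes: Sequence[str]) -> str:
    """Second prompt set: structure + question + sample + classes.

    Rendering the classes as "(A) ... (B) ..." is an assumed format.
    """
    options = " ".join(
        f"({letter}) {name}"
        for letter, name in zip(string.ascii_uppercase, classes)
    )
    return (
        "You are a doctor and have the following information about a patient "
        f"from a chest x-ray: {sample}. What is the diagnosis? {options}. ("
    )


class_prompt = p1("Pneumonia")
mcqa_prompt = p2("Right lower lobe opacity", ["Pneumonia", "Atelectasis"])
```

The trailing "(" in the second prompt mirrors the example above and nudges the model to answer with an option identifier.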

4.1.1 Datasets

The datasets evaluated in text classification are Transcriptions and MS-CXR. Each of these datasets is introduced below. Their preprocessing and characterization details are given in Appendix A.

Transcriptions is a multi-label collection of electronic health records (EHRs) covering many medical specialties. Preprocessing is applied to the data, removing null entries, organizing the EHR format, and selecting the final set of labels. After preprocessing, $2\,074$ samples and $29$ classes are available. Performance is measured by the AUC score since the dataset is multi-label.

Due to the length of some EHRs, certain token vectors exceed the maximum input length allowed by some models. To cope with this limit, the input sequence is processed using a non-overlapping sliding window method [87], as detailed in Section 4.1.2.

MS-CXR is a multi-class dataset composed of X-ray report sections, each accompanied by annotations made by a radiologist [106, 107, 108]. There are $718$ unique samples representing eight well-distributed classes. Preprocessing of this dataset consists of removing samples with missing information and duplicates. Contrary to Transcriptions, no sample exceeds the maximum allowed input length for any of the models. Performance is measured by accuracy, F1-score, precision, and recall in their macro-averaged version to ensure a comprehensive assessment.
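Macro-averaging computes each metric per class and then takes the unweighted mean over classes, so every class counts equally regardless of its frequency. The following is a minimal sketch with toy labels; the function name is hypothetical, and accuracy is computed here as the plain fraction of correct predictions, which is an assumption about the exact definition used.

```python
from typing import Dict, Hashable, Sequence


def macro_scores(y_true: Sequence[Hashable],
                 y_pred: Sequence[Hashable]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    pairs = list(zip(y_true, y_pred))
    labels = set(y_true) | set(y_pred)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three hypothetical classes.
scores = macro_scores(["a", "a", "b", "c"], ["a", "b", "b", "c"])
```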

4.1.2 Approaches

Text classification is performed through (i) contextual embedding similarity, (ii) natural language inference, and (iii) multiple-choice question answering.

Contextual embedding similarity is grounded in the cosine similarity between the contextual embeddings of the sample text and the classes. The contextual or sentence embedding is determined by three distinct pooling strategies: CLS-token embedding, average token-level embedding pooling, and maximum token-level embedding pooling.

For this approach, encoder-only models are employed, with a total of $11$ models evaluated. These models have a maximum input token size of $512$ tokens. Therefore, the samples’ token vectors that exceed this limit are processed with the non-overlapping sliding window method. The fragments are aggregated according to the pooling strategy, as follows.

• CLS pooling: The contextual embedding of each fragment from a sample is computed as its CLS-token output embedding. These embeddings are then aggregated using the element-wise average to obtain the contextual embedding representing the sample.

• Maximum pooling: The contextual embedding of each fragment from a sample is computed by applying the element-wise maximum at the token level over the output embeddings. These embeddings are then aggregated, again using the element-wise maximum, to obtain the contextual embedding representing the sample.

• Average pooling: The contextual embedding of each fragment from a sample is computed by applying the element-wise average at the token level over the output embeddings. These embeddings are then aggregated using the element-wise weighted average to obtain the contextual embedding representing the sample. The weight of each fragment is its number of non-padding tokens.
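The windowing and the three aggregation rules above can be sketched with NumPy as follows. The encoder is replaced by random arrays, so this only illustrates the aggregation logic; the function names are hypothetical, each fragment is assumed to contain non-padding tokens only with the CLS token in row 0, and the handling of special tokens when splitting is omitted.

```python
import numpy as np


def split_windows(token_ids, max_len=512):
    """Non-overlapping sliding window over a token sequence."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


def pool_sample(fragments, strategy):
    """Aggregate per-fragment token embeddings into one sample embedding.

    fragments: list of arrays of shape (n_tokens, dim), one per window.
    """
    if strategy == "cls":
        # CLS embedding per fragment, then element-wise average.
        return np.mean([f[0] for f in fragments], axis=0)
    if strategy == "max":
        # Token-level maximum per fragment, then maximum across fragments.
        return np.max([f.max(axis=0) for f in fragments], axis=0)
    if strategy == "mean":
        # Token-level average per fragment, weighted by fragment length.
        weights = [len(f) for f in fragments]
        return np.average([f.mean(axis=0) for f in fragments],
                          axis=0, weights=weights)
    raise ValueError(f"unknown pooling strategy: {strategy}")


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


windows = split_windows(list(range(1200)), max_len=512)

rng = np.random.default_rng(0)
frags = [rng.normal(size=(3, 4)), rng.normal(size=(5, 4))]  # stand-in encoder outputs
class_embedding = rng.normal(size=4)                        # stand-in class embedding
score = cosine_similarity(pool_sample(frags, "mean"), class_embedding)
```

Weighting the fragment averages by their token counts makes the result equal to the average over all tokens of the sample, so a short final fragment does not dominate.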

Natural language inference is the task of determining whether a hypothesis is true (entailment), false (contradiction), or indeterminate (neutral) given a premise. When applied for text classification, the premise represents a test sample, and the hypothesis represents the classes. For multi-class datasets, the predicted label is calculated from the entailment logits of each hypothesized class. For multi-label datasets, the entailment and contradiction logits are transformed into binary probabilities, which indicate whether or not a particular hypothesized class is predicted. This could be viewed as having $n$ binary text classifiers, where $n$ is the number of classes.

This approach employs encoder-only (cross-encoder) and encoder-decoder models. These models have a lower maximum input token size than some of the test samples. Therefore, the token vectors of these particular samples are processed with the non-overlapping sliding window method. They are divided into fragments, whose scores are calculated individually, and then these scores are averaged to get the score of the whole sample.
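A minimal sketch of how NLI outputs could be turned into predictions is given below. It assumes that the multi-class label is the class with the highest entailment logit and that the binary probability for multi-label datasets comes from a softmax over the contradiction and entailment logits only. Both are common conventions, but they are assumptions here, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_multiclass(entailment_logits: Dict[str, float]) -> str:
    """Pick the class whose hypothesis receives the highest entailment logit."""
    return max(entailment_logits, key=entailment_logits.get)


def multilabel_probability(entailment_logit: float,
                           contradiction_logit: float) -> float:
    """Probability that a class applies, ignoring the neutral logit."""
    return softmax([contradiction_logit, entailment_logit])[1]


def aggregate_fragments(fragment_scores: List[float]) -> float:
    """Average the scores of a long sample's fragments."""
    return sum(fragment_scores) / len(fragment_scores)
```

For a multi-label sample, `multilabel_probability` would be evaluated once per hypothesized class, which matches the view of having $n$ independent binary classifiers.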

Multiple-choice question answering enables generative models, i.e., encoder-decoder and decoder-only models, to perform text classification. A total of $27$ models are assessed in this approach, including both pre-trained models and their instruction-tuned versions.

As multiple-choice question answering is not intended for an extensive number of choices, the Transcriptions dataset is evaluated using a reduced version with eleven classes instead of the 29 available. These eleven classes consist of the ten most frequent labels plus an “Other” class. The number of samples evaluated is not affected. Additionally, the models’ logit space has been constrained to align with the response options of a multiple-choice scenario and, thereby, allow for automated evaluation. The token identifiers associated with the feasible response options are determined and used to filter the logit space.
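The label reduction and the logit filtering described above can be illustrated with the following sketch. It assumes one token identifier per response option and a plain list of next-token logits. The function names and the toy values in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def reduce_labels(labels: List[str], top_k: int = 10,
                  other: str = "Other") -> List[str]:
    """Keep the top_k most frequent labels and map the rest to `other`."""
    keep = {label for label, _ in Counter(labels).most_common(top_k)}
    return [label if label in keep else other for label in labels]


def constrained_choice(logits: List[float],
                       option_token_ids: Dict[str, int]) -> str:
    """Return the option whose token has the highest logit.

    Tokens that do not correspond to a response option are ignored, so the
    model can only answer with one of the feasible options.
    """
    return max(option_token_ids, key=lambda opt: logits[option_token_ids[opt]])
```

For example, with `logits = [0.1, 5.0, 2.0, 3.0, 9.0]` and options `{"A": 1, "B": 2, "C": 3}`, the function returns `"A"` even though token 4 has the highest unconstrained logit.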

4.2 Conditional text generation task

Conditional text generation is assessed with the MIMIC-CXR dataset, using perplexity as the performance evaluation metric. Perplexity (PPL) is a measure of uncertainty on the value of a sample from a discrete probability distribution. Let $X=(x_{0},x_{1},\ldots,x_{T})$ be a tokenized sequence, then

$$\textnormal{PPL}(X)=\exp\left\{-\frac{1}{T}\sum_{t=1}^{T}\log p_{\theta}(x_{t}\mid x_{<t})\right\}$$

where $\log p_{\theta}(x_{t}\mid x_{<t})$ is the log-likelihood of the $t$-th token conditioned on the preceding tokens $x_{<t}$.
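As a worked illustration of this definition, the sketch below computes perplexity from a list of per-token log-likelihoods. How those log-likelihoods are obtained from a model is outside the scope of the example, and the function name is hypothetical.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """Exponential of the negative mean log-likelihood of the scored tokens.

    token_log_probs[t-1] holds log p(x_t | x_<t) for t = 1..T (natural log).
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that assigns probability $0.25$ to every scored token has a perplexity of $4$, i.e., it is as uncertain as a uniform choice among four tokens.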

Decoder-only models are employed for evaluation, with a total of $20$ models considered. Among these models is Galactica, whose tokenizer lacks special tokens. As a consequence, two scenarios are analyzed: the inclusion and the non-inclusion of the beginning-of-sequence (BOS) token. The BOS token is a special token typically used by generative models to indicate the start of a text. In the first scenario, this token is included during tokenization, and perplexity is calculated from the first token in the texts. When a model’s tokenizer has no predefined BOS token, as with Falcon-7B, the BOS token is defined as the tokenizer’s first special token. In the second scenario, the BOS token is excluded, and perplexity is calculated from the text’s second token.

4.2.1 Dataset

The dataset evaluated in conditional text generation is MIMIC-CXR, introduced below. Details on its preprocessing and characterization are in Appendix A.

MIMIC-CXR is an X-ray reports dataset [109, 110, 108]. Relevant sections of these reports are extracted using the code provided by Johnson et al. [111, 112]. Subsequently, null and duplicate samples are removed, with the resulting dataset having $57\,711$ samples. None of these samples exceeds the maximum input size allowed for the proposed models.

5 Results and Discussion

The main findings are outlined below. For comparability, AUC scores reported in this section correspond to evaluating the eleven class-reduced version of Transcriptions (see section 4.1.2). In addition, to ensure the robustness of the results, bootstrapping with $1\,000$ iterations is applied to each experiment. Supplementary results are found in Appendix D.
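The bootstrapping procedure can be sketched as follows, assuming that each iteration resamples the evaluated samples with replacement and recomputes the metric. The resampling unit, the fixed seed, and the use of accuracy as the example metric are assumptions made for illustration.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Share of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap(y_true: Sequence, y_pred: Sequence,
              metric: Callable[[Sequence, Sequence], float],
              n_iterations: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of a metric over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iterations):
        # Draw n sample indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    return statistics.mean(scores), statistics.stdev(scores)
```

The returned pair corresponds to the "mean (standard deviation)" format used when reporting scores.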

5.1 Text classification analysis

Table 2: Highest-performing models for text classification per approach and metric. The scores presented correspond to the mean and, in parentheses, the standard deviation over $1\,000$ bootstrap iterations. Approaches are encoded as follows: CES stands for contextual embedding similarity, NLI for natural language inference, and MCQA for multiple-choice question answering.

Dataset	Metric	CES				NLI			MCQA
Dataset	Metric	Model	Score	Prompt	Pooling	Model	Score	Prompt	Model	Score
MS-CXR	Accuracy	BioLORD-STAMB2-v1-STS2	$69.68~{}(1.70)$	x	Avg.	RoBERTa ${}_{\texttt{LARGE}}$ -MNLI	$76.49~{}(1.59)$	x	T0++	$\mathbf{81.74~{}(1.45)}$
	F1-score	BioLORD-STAMB2-v1-STS2	$69.24~{}(1.67)$		Avg.	RoBERTa ${}_{\texttt{LARGE}}$ -MNLI	$78.15~{}(1.44)$	x	T0++	$\mathbf{83.86~{}(1.24)}$
	Precision	BioLORD-PMB	$83.34~{}(1.11)$		CLS	RoBERTa ${}_{\texttt{LARGE}}$ -MNLI	$80.72~{}(1.42)$	x	Alpaca 7B	$\mathbf{85.83~{}(0.95)}$
	Recall	BioLORD-STAMB2-v1-STS2	$72.62~{}(1.34)$		Avg.	RoBERTa ${}_{\texttt{LARGE}}$ -MNLI	$82.27~{}(1.33)$	x	T0++	$\mathbf{89.22~{}(0.82)}$
Transcriptions	AUC score	BioLORD-STAMB2-v1-STS2	$89.03~{}(0.31)$	x	Avg.	BART Large-MNLI	$80.75~{}(0.46)$	x	Flan-T5-XXL	$\mathbf{92.37~{}(0.26)}$

The highest F1 and AUC scores are achieved with the largest instruction-tuned T5 models, i.e., Flan-T5 (m19-m21) and T0 (m22-m23), as shown in Table 2 and Fig. 2. Some of these scores are above $80\%$ in the F1-score and $90\%$ in the AUC score. These instruction-tuned T5 models stand out among all models considered, ranking within the top 10 highest-performing models in both datasets. Nevertheless, the optimal choice of models may vary when considering precision as the target metric, where Alpaca (m39) and LLaMA 2-CHAT-7B (m40) demonstrate high competence.

Conversely, the lowest F1 and AUC scores are paradoxically obtained with the base T5 models (m14-m17), as evidenced in Fig. 2. These models, along with their clinically fine-tuned versions (m24-m25), rank within the top 10 lowest-performing models on both datasets. Moreover, base and clinically fine-tuned T5 models account for $100\%$ (1/1) and $75\%$ (6/8) of the models that underperform a random evaluator on the Transcriptions and MS-CXR datasets, respectively.

Delving into each approach, BioLORD models (m07-m09) are consistently the best choice for the contextual embedding similarity approach. The MS-CXR dataset is relatively more complex than the Transcriptions dataset for these models, as reflected by their ranking in performance: 11th versus 3rd place, respectively. BART Large-MNLI (m13) represents the best overall model for the NLI approach. For both datasets, BART Large-MNLI is included in the top 10 highest-performing models, while RoBERTa ${}_{\texttt{LARGE}}$ -MNLI (m12) only does so for the MS-CXR dataset. For the multiple-choice QA approach, instruction-tuned models stand out, including the instruction-tuned LLaMA models (m39-m40) in addition to the aforementioned instruction-tuned T5 models. Notably, LLaMA 2-CHAT-7B (m40) is within the top 10 highest-performing models in both datasets. These highest performers per approach consistently show results indicative of clinical knowledge or clinical notions.

The results of instruction-tuned T5 models support the feasibility of representing discriminative tasks as generative ones by framing them as instructions. These results also underline that generative tasks are not exclusive to decoder-only models, and text-to-text models may be a promising architecture to explore further. For example, the instruction-tuned versions of T5 with 3B parameters (m20, m22) outperform decoder-only models almost three times larger on both evaluated datasets.

Model size – More parameters alone do not always translate into better results

The experiments yield findings questioning the claim that larger models consistently deliver superior performance. The performance of the models as a function of the logarithm of their size is depicted in Fig. 3. Testing for monotonic relationships via Spearman’s correlation is only reported for the multiple-choice QA approach due to the lack of diversity in sizes or number of samples. There is insufficient evidence to conclude that the Spearman’s correlation between size and performance is statistically significant in either dataset.

The trend of performance improvement with increasing size is almost nonexistent in the contextual embedding similarity approach. As seen in Fig. 2, for instance, models such as SapBERT (m06) and BioLORD models (m07-m09), which excel in this approach, outperform models even three times larger on both datasets. Within the same model family, the changes in performance associated with increasing the number of parameters are inconclusive. BERT ${}_{\texttt{LARGE}}$ (m01) marginally outperforms BERT ${}_{\texttt{BASE}}$ (m00) on the Transcriptions dataset, whereas the opposite is observed in all metrics on the MS-CXR dataset. BiomedBERT-large (abstracts only) (m04) surpasses, albeit marginally, BioMedBERT (abstracts only) (m03) on both evaluated datasets, except in precision. Furthermore, performance gains are evidenced when more training data is used, as shown by comparing BioMedBERT (abstracts only) and BiomedBERT (abstracts + full text) (m02).

Similarly, the effect of increasing the size on performance is not sufficiently clear or strong in the multiple-choice question answering approach. While positive Spearman’s correlations are obtained, there is insufficient evidence to deem them statistically significant. None of the p-values is below $0.05$, so the null hypothesis that the two variables have no ordinal correlation cannot be rejected. Within T5 models, the effect is minimal or inconsistent when considering their non-instruction-tuned versions (m14-m17, m24-m25). Within the instruction-tuned T5 models (m18-m23), a consistent positive effect of size on performance is observed for both Flan-T5 and T0 models on Transcriptions and only for the latter on MS-CXR.
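To make the test concrete, the sketch below computes Spearman’s correlation as the Pearson correlation of ranks and estimates a two-sided p-value with a permutation test. The permutation test is a stand-in chosen to keep the example dependency-free and is not necessarily the procedure used to obtain the reported p-values. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (inputs must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x: Sequence[float], y: Sequence[float],
                        n_permutations: int = 10000, seed: int = 0) -> float:
    """Two-sided p-value for the null hypothesis of no ordinal correlation."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

In this setting, `x` would hold the logarithm of model sizes and `y` the corresponding scores. A p-value of $0.05$ or more means the null hypothesis is not rejected.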

On the other hand, the results in NLI align with the expectation that larger models lead to better performance. However, more models are needed to draw a solid conclusion. To obtain a lower bound on the performance gap between NLI-DeBERTa ${}_{\texttt{base}}$ (m11) and the largest models evaluated, the highest value obtained by the former among the evaluated prompts is compared with the lowest value obtained by the latter. Considering all metrics, these differences range from $36.32$ to $51.59$ on MS-CXR, so the largest models always lead to a performance improvement. Reaching the same conclusion on the Transcriptions dataset is not straightforward, given the results obtained for RoBERTa ${}_{\texttt{LARGE}}$ -MNLI (m12), as depicted in Fig. 2. This model’s performance is closer to the performance of NLI-DeBERTa ${}_{\texttt{base}}$ than to that of BART Large-MNLI (m13).

Altogether, the results do not provide sufficient evidence that only increasing the model size, in number of parameters, leads to an improvement in performance, whether comparing different or the same models. Although model size may be a relevant factor in determining performance, it is hypothesized that training data and objectives are more decisive in small pre-trained language models. This hypothesis aligns with findings in [10] and [12]. Expanding the sample size and diversity could be essential to validate these observations, considering a minimum of 30 or 35 models per approach.

Model domain – More than a specialized domain; model architecture, training data, and training objective

Current medical datasets remain relatively small compared to those of the general domain, covering only a tiny region of the medical knowledge space [84]. Domain specialization of models using only one of these datasets in question may limit their generalization ability [45].

The effectiveness of domain specialization in improving performance is not evident in the contextual embedding similarity approach, as displayed in Fig. 2. The domain-specific models considered in this approach are Bio+Clinical BERT (m10), BiomedBERT models (m02-m04), and SciBERT (m05). Bio+Clinical BERT achieves lower scores than expected, positioning around the middle of the performance ranking for this approach. Similarly, some of the BiomedBERT models are outperformed by BERT ${}_{\texttt{BASE}}$ (m00) and BERT ${}_{\texttt{LARGE}}$ (m01), their general domain counterparts. These findings, present in both datasets, challenge the superiority of domain-specific models over general domain ones in the task being evaluated via contextual embedding similarity.

Evidence supporting the effectiveness of domain specialization exists in the multiple-choice question answering approach, but it is still limited and unclear. The models to be compared are T5 models (m14-m15) versus their clinically specialized versions (m24-m25), and Alpaca (m39) versus MedAlpaca (m52). Differences between ClinicalT5 and T5 models are $5.75$ and $-4.11$ in AUC scores and $5.24$ and $0.00$ in F1-scores. Similarly, differences between MedAlpaca and Alpaca are $-1.86$ in AUC scores and $26.97$ in F1-scores. Given these values, it cannot be clearly stated that domain specialization positively impacts performance.

Considering the insights discussed and the remarkable performance of BioLORD (m07-m09) models, SapBERT (m06), Flan-T5 (m18-m21) models, and T0 (m22-m23) models in their respective approaches, the training data, training objectives, and model architectures are possibly critical in determining model generalization. Continual pre-training for named entity recognition or medical entity linkage using contrastive learning on UMLS data is likely one of the factors for the success of SapBERT and BioLORD models. Likewise, employing instruction-tuned text-to-text models represents a compelling approach to achieving high performance in multiple-choice QA. Due to the impossibility of concluding on the NLI approach, expanding the analysis to incorporate domain-specialized NLI models in biomedical and clinical domains could be valuable.

Prompting and instruction-tuning key to model performance

One of the central points of the study is to analyze the influence of prompting on the models and text classification approaches under investigation. Prompting impact is quantified as the difference in performance resulting from the prompt usage, with positive values indicating an improvement, in contextual embedding similarity and NLI. In multiple-choice QA, this impact is calculated as the variation in performance, expressed in standard deviations, when different instructions are used. The resulting distributions are shown in Fig. 4.
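The two impact measures can be written down as in the sketch below. The use of the sample standard deviation for the multiple-choice QA case is an assumption, and the function names are hypothetical.

```python
import statistics
from typing import Sequence


def prompt_impact(score_with_prompt: float, score_without_prompt: float) -> float:
    """Difference in score due to the prompt; positive means improvement."""
    return score_with_prompt - score_without_prompt


def instruction_variability(scores_per_instruction: Sequence[float]) -> float:
    """Spread of a model's scores across different instructions."""
    return statistics.stdev(scores_per_instruction)
```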

Using a prompt does not always confer benefits in contextual embedding similarity, as reflected by Fig. 4. On Transcriptions, the average impact on the AUC score is $-2.25$ points, with values ranging from $-9.32$ to $5.91$. Using any of the proposed prompts improves performance for $45.45\%$ of the model + pooling strategy combinations. In contrast, none of these prompts led to AUC score improvements for BioLORD-PMB, BiomedBERT models, BERT ${}_{\texttt{BASE}}$, and SciBERT. On MS-CXR, the impact of the prompt on performance is more positive on average, albeit with more variability. The average impact on the F1-score is $1.30$ points, with values ranging from $-25.43$ to $40.40$. Similar values are reported on accuracy, precision, and recall. Employing any of the proposed prompts is beneficial for $69.70\%$ to $84.85\%$ of the model + pooling strategy combinations, depending on the metric. The performance of BioMedBERT (abstracts only) and BiomedBERT-large (abstracts only) is enhanced with any of the prompts, whereas the performance of the BioLORD models and Bio+Clinical BERT is hindered.

More consistent benefits are observed in the NLI approach than in contextual embedding similarity. On Transcriptions, any of the proposed prompts yields performance improvements, with larger models profiting the most. The average impact on the AUC score is $8.03$ for BART Large-MNLI, $2.15$ for NLI-DeBERTa ${}_{\texttt{base}}$, and $4.87$ for RoBERTa ${}_{\texttt{LARGE}}$ -MNLI. On MS-CXR, using a prompt only sometimes results in gains, particularly for NLI-DeBERTa ${}_{\texttt{base}}$. For this model, the average impact on the F1-score is $-3.42$, while for BART Large-MNLI and RoBERTa ${}_{\texttt{LARGE}}$ -MNLI it is $2.18$ and $2.78$, respectively. Moreover, for NLI-DeBERTa ${}_{\texttt{base}}$, positive prompt impacts are only observed on precision. In both datasets, certain prompts have a high positive impact whereas others do not, mostly independent of the model.

Similarly, prompt importance is also evident in the multiple-choice question answering approach, given its significant observed influence on model performance. The proportion of models performing better than a random evaluator (AUC score of $50\%$) on Transcriptions increases from $52\%$ to $96\%$ with appropriate prompts. Likewise, the proportion of models performing better than a random evaluator (F1-score of $12.5\%$) on MS-CXR rises from $25\%$ to $85\%$. Prompting importance is thus highlighted not only by the high performance achieved but also by the brittleness of the models. The latter is reflected by the variability in Fig. 4, and further supported by Figs. 17 and 18 in Appendix D. Between datasets, the highest sensitivity to the prompt is found when evaluating Transcriptions, such that, with certain prompts, the instruction-tuned models yield similar results to their base counterparts. Overall, no single prompt works universally well for all models.

Regarding instruction tuning, instruction-tuned models generally outperform their non-instruction-tuned counterparts. The instruction-tuned T5 versions, whether T0 or Flan-T5, in any size considered, exhibit superior performance to their base counterparts. Instruction tuning also improves performance consistently for the LLaMA models, whereas this is not always the case for other generative models: MPT and GPT-J are exceptions on the Transcriptions dataset and Falcon on the MS-CXR dataset. Overall, this tuning technique represents a gain, with an average increase of $21.45$ points in the AUC score and $43.55$ points in the F1-score.

In summary, the results endorse the crucial role of the prompt and its wording in the model’s performance, with both positive and negative effects observed. Consequently, we advocate using prompts and advanced prompting techniques to guide the model toward better results. This process should not be limited to a single prompt due to the observed and well-known phenomenon of prompt brittleness [13]. Regarding instruction tuning, this technique proves to be beneficial for the models. More details on the prompt impact can be found in Figs. 19 and 20 in Appendix D.

5.2 Conditional text generation analysis

LLaMA models (m38-m39) stand out as the ones with the highest predictive capacity among the models evaluated. Particularly, LLaMA 2-7B (m38) is the highest performer, with a mean perplexity of $9.12$ when including the BOS token and $8.21$ when not. LLaMA models are also notable for the low standard deviation of their mean, with approximate values of $0.05$ and $0.13$ depending on the BOS token usage. These standard deviations indicate higher confidence in the estimated value of the mean.

Conversely, BioGPT models (m48-m49) have the greatest difficulty in modeling the dataset. BioGPT (m48), the lowest performer, presents a mean perplexity of $80.34$ when including the BOS token and $38.70$ when not. The variability of the mean of these models is among the highest observed, with approximate standard deviations of $3.15$ and $0.44$ depending on the BOS token usage. These results are paradoxical considering that BioGPT is domain-specific while LLaMA 2-7B is not.

Similarly to previous findings for text classification, domain specialization does not necessarily imply surpassing general domain models. For the medium-size domain-specific models, it is observed that BioGPT (m48) does not outperform any of the general domain models, while GPT-2-PubMed Medium (m46) does. For the large size models, domain specialization proves beneficial; whereas for the XL and XXL sizes, neither Galactica (m50-m51) nor BioGPT-Large (m49) clearly outperforms general domain models. Consequently, the only specialized models that prove advantageous are GPT-2-PubMed (m46-m47).

On the other hand, increasing the model size contributes to improved performances, regardless of whether or not the BOS token is included. A slight performance improvement is also observed for the second versions (m42, m44) versus the first versions (m41, m43) of OpenLLaMA. This improvement is on average $1.58$ and $1.95$ points on the perplexity for the 3B and 7B parameter versions, respectively. Considering that the difference between these versions of OpenLLaMA is the dataset used for pre-training, the results obtained for conditional text generation do not contradict those for text classification.

Furthermore, the standard deviations of perplexity reveal the presence of exceptionally challenging samples for the models, that is, outliers, which is visually depicted in Fig. 21 in Appendix D. Moderate outliers, above quantile $0.75$ by $1.5$ times the IQR, represent between $7\%$ and $11\%$ of the data, with BioGPT models having the highest percentages. Extreme outliers, above quantile $0.75$ by three times the IQR, make up between $4\%$ and $7\%$ of the data, with most models exhibiting percentages around $4\%$ and $5\%$ .
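The outlier criteria above can be expressed with the following sketch, which computes the share of perplexity values lying more than $1.5$ and more than $3$ times the IQR above the third quartile. The quartile interpolation method is an assumption, since several conventions exist, and the function name is hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def outlier_fractions(values: Sequence[float]) -> Tuple[float, float]:
    """Shares of values above Q3 + 1.5 * IQR and above Q3 + 3 * IQR.

    Note that, as defined here, the first share also includes the values
    counted in the second one.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    moderate = sum(v > q3 + 1.5 * iqr for v in values) / len(values)
    extreme = sum(v > q3 + 3.0 * iqr for v in values) / len(values)
    return moderate, extreme
```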

Groups of generative models – LLaMA and GPT-2

Two procedures are carried out to determine whether the models exhibit similar perplexity behavior and identify potential clusters among them. The first procedure involves calculating the correlations between the models. Spearman’s and Pearson correlations are considered, assessing monotonic and linear relationships, respectively. The second procedure consists of dimensionality reduction via UMAP, followed by hierarchical clustering, represented by dendrograms in Fig. 6. Both procedures reveal the existence of two main groups of models: the GPT-2 and the LLaMA models.

In general, all models are positively correlated, indicating that most samples have a similar relative difficulty for these models. BioGPT models (m48-m49) are the only exception to this. Looking further at the Pearson correlations, clustering patterns are present, where groups such as the LLaMA, the OpenLLaMA, and the GPT-2 models are identified. Although these clusters are somewhat expected, some unexpected associations are also evident, such as between Falcon-7B and MPT-7B and between Palmyra Base 5B and GPT-J 6B. Moreover, linear relationships between the LLaMA and OpenLLaMA models weaken, interestingly, when the BOS token is used, indicating more pronounced performance disparities. Possibly, training data plays a role, as it is essentially their main difference [98].

6 Conclusion

This study comprehensively explores small pre-trained language models with varying sizes, architectural families, and domains. These models, $53$ in total, are tested for two fundamental medical natural language processing tasks: text classification and conditional text generation. The size of the models ranges from $110$ million to $13$ billion parameters, which is relatively small compared to recent language models but suitable for consumer-grade computing resources. Our findings have significant implications, particularly for researchers and organizations operating under computational resource-constrained settings.

For the text classification task, three distinct approaches are explored: contextual embedding similarity, natural language inference, and multiple-choice question answering. BioLORD and SapBERT models have demonstrated remarkable performance in text classification via contextual embedding similarity. Similarly, the instruction-tuned versions of T5, Flan-T5 and T0, followed by the instruction-tuned versions of LLaMA, have exhibited outstanding results in the multiple-choice question answering approach. Flan-T5 and T0 are remarkably good in both general medical and radiology-specific knowledge assessments. To fully understand NLI models’ potential, further exploration of this approach is needed, particularly in specialized domains.

A common thread running through our findings is the significance of the prompt in improving text classification performance across different datasets and approaches. This significance extends beyond performance gains: prompts present a viable alternative to the resource-intensive processes of training and fine-tuning language models, which are often associated with substantial financial and environmental costs. Effective prompt engineering is also essential to mitigate prompt brittleness, ensuring more robust and reliable outcomes. As prompt brittleness is evidenced during the study, and given its importance, further exploration in this line of research is recommended.

Medical datasets often remain relatively small and cover only a small region of the medical knowledge space [84], so domain-specific models specialized using these datasets might see their generalization ability hindered. This practice could explain, to some extent, the results obtained. The results also suggest that the architecture, training data, and training objectives are crucial in determining the model’s generalization abilities, possibly outweighing the relevance of model size as a single variable.

For the conditional text generation task, LLaMA models stand out due to their low perplexities with minimal variation. Two groups of models are also identified based on the perplexities obtained in MIMIC-CXR: a group consisting of GPT-2 models and another of LLaMA models. Further research is needed to identify and understand the outliers within these results, as they could hold important insights.

In conclusion, this research highlights the critical role of prompts in language model inference and reaffirms the effectiveness of instruction-tuned generative models in addressing downstream tasks. It also underscores the relevance of model architecture, training data, and training objectives, potentially even more so than model size alone, to a model’s generalization capacity. We advocate for further investigations into topics such as model calibration, i.e., how certain the model is about its output, prompt engineering and tuning, and performance concerning issues like hallucinations and biases, among others. Such studies can lead to more effective and ethical applications of language models in healthcare. Extensions to include quantized models and more medical NLP tasks will be considered in further research. Quantization is an interesting and promising approach to making LLMs viable on consumer-grade computing resources.

References

[1] W. F. Wiggins and A. S. Tejani, “On the Opportunities and Risks of Foundation Models for Natural Language Processing in Radiology,” Radiology: Artificial Intelligence, vol. 4, no. 4, p. e220119, Jul. 2022.
[2] N. H. Shah, D. Entwistle, and M. A. Pfeffer, “Creation and Adoption of Large Language Models in Medicine,” JAMA, vol. 330, no. 9, pp. 866–869, Sep. 2023.
[3] J. Wei et al., “Emergent Abilities of Large Language Models,” Transactions on Machine Learning Research, 2022.
[4] J. A. Omiye, H. Gui, S. J. Rezaei, J. Zou, and R. Daneshjou, “Large language models in medicine: The potentials and pitfalls: A narrative review,” Ann. Intern. Med., vol. 177, no. 2, pp. 210–220, Feb. 2024.
[5] V. Liévin, C. E. Hother, A. G. Motzfeldt, and O. Winther, “Can large language models reason about medical questions?” Patterns, vol. 5, no. 3, p. 100943, 2024.
[6] A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, I. Guyon et al., Eds., vol. 30. Curran Associates, Inc., 2017.
[7] G. Kuling, B. Curpen, and A. L. Martel, “BI-RADS BERT and Using Section Segmentation to Understand Radiology Reports,” Journal of Imaging, vol. 8, no. 5, p. 131, 2022.
[8] K. Singhal et al., “Large language models encode clinical knowledge,” Nature, vol. 620, no. 7972, pp. 172–180, Aug. 2023.
[9] W. X. Zhao et al., “A Survey of Large Language Models,” 2023, arXiv:2303.18223 [cs.CL].
[10] J. Hoffmann et al., “An empirical analysis of compute-optimal large language model training,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 30 016–30 030.
[11] Q. Liu et al., “Exploring the Boundaries of GPT-4 in Radiology,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Dec. 2023, pp. 14 414–14 445.
[12] M. Abdin et al., “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone,” 2024, arXiv:2404.14219 [cs.CL].
[13] K. He et al., “A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics,” 2024, arXiv:2310.05694 [cs.CL].
[14] L. Tang et al., “Evaluating large language models on medical evidence summarization,” npj Digital Medicine, vol. 6, no. 1, p. 158, Aug. 2023.
[15] M. E. Peters et al., “Deep Contextualized Word Representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, Jun. 2018, pp. 2227–2237.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
[17] H. W. Chung et al., “Scaling Instruction-Finetuned Language Models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024.
[18] T. Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901.
[19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI, Tech. Rep., 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:160025533
[20] A. Chowdhery et al., “PaLM: Scaling Language Modeling with Pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[21] J. W. Rae et al., “Scaling Language Models: Methods, Analysis & Insights from Training Gopher,” 2022, arXiv:2112.11446 [cs.CL].
[22] J. Wei et al., “Finetuned Language Models are Zero-Shot Learners,” in International Conference on Learning Representations, 2022.
[23] D. Hendrycks et al., “Measuring Massive Multitask Language Understanding,” in International Conference on Learning Representations, 2021.
[24] J. Kaplan et al., “Scaling Laws for Neural Language Models,” 2020, arXiv:2001.08361 [cs.LG].
[25] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma, “Explaining neural scaling laws,” Proceedings of the National Academy of Sciences, vol. 121, no. 27, p. e2311878121, 2024.
[26] R. Taylor et al., “Galactica: A Large Language Model for Science,” 2022, arXiv:2211.09085 [cs.CL].
[27] H. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models,” 2023, arXiv:2302.13971 [cs.CL].
[28] ——, “Llama 2: Open Foundation and Fine-Tuned Chat Models,” 2023, arXiv:2307.09288 [cs.CL].
[29] Anthropic, “Introducing the next generation of Claude,” Mar. 2024. [Online]. Available: https://www.anthropic.com/news/claude-3-family
[30] Gemini Team et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” 2024, arXiv:2403.05530 [cs.CL].
[31] A. Q. Jiang et al., “Mistral 7B,” 2023, arXiv:2310.06825 [cs.CL].
[32] S. Bubeck et al., “Sparks of Artificial General Intelligence: Early experiments with GPT-4,” 2023, arXiv:2303.12712 [cs.CL].
[33] H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, “Capabilities of GPT-4 on Medical Challenge Problems,” 2023, arXiv:2303.13375 [cs.CL].
[34] R. Mao, G. Chen, X. Zhang, F. Guerin, and E. Cambria, “GPTEval: A Survey on Assessments of ChatGPT and GPT-4,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, May 2024, pp. 7844–7866.
[35] J. López Espejel, E. H. Ettifouri, M. S. Yahaya Alassan, E. M. Chouham, and W. Dahhane, “GPT-3.5, GPT-4, or BARD? Evaluating LLMs reasoning ability in zero-shot setting and performance boosting through prompts,” Natural Language Processing Journal, vol. 5, p. 100032, 2023.
[36] P. Liang et al., “Holistic Evaluation of Language Models,” Transactions on Machine Learning Research, 2023.
[37] H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou, and Y. Zhang, “Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4,” 2023, arXiv:2304.03439 [cs.CL].
[38] R. Schaeffer, B. Miranda, and S. Koyejo, “Are Emergent Abilities of Large Language Models a Mirage?” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 55 565–55 581.
[39] M. Agrawal, S. Hegselmann, H. Lang, Y. Kim, and D. Sontag, “Large language models are few-shot clinical information extractors,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, Dec. 2022, pp. 1998–2022.
[40] V. Sanh et al., “Multitask Prompted Training Enables Zero-Shot Task Generalization,” in International Conference on Learning Representations, 2022.
[41] A. Lampinen et al., “Can language models learn from explanations in context?” in Findings of the Association for Computational Linguistics: EMNLP 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, Dec. 2022, pp. 537–563.
[42] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large Language Models are Zero-Shot Reasoners,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 22 199–22 213.
[43] J. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 24 824–24 837.
[44] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer, “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), R. Barzilay and M.-Y. Kan, Eds. Association for Computational Linguistics, Jul. 2017, pp. 1601–1611.
[45] I. Jahan, M. T. R. Laskar, C. Peng, and J. X. Huang, “A comprehensive evaluation of large language models on benchmark biomedical text processing tasks,” Computers in Biology and Medicine, vol. 171, p. 108189, 2024.
[46] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018.
[47] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, Jul. 2023, pp. 38 087–38 099.
[48] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: A survey,” ACM Comput. Surv., vol. 55, no. 6, Dec. 2022.
[49] S. Li et al., “Evaluating quantized large language models,” in Forty-first International Conference on Machine Learning, 2024.
[50] S. Kim et al., “SqueezeLLM: Dense-and-sparse quantization,” in Forty-first International Conference on Machine Learning, 2024.
[51] J. Guo et al., “Compressing large language models by joint sparsification and quantization,” in Forty-first International Conference on Machine Learning, 2024.
[52] R. Jin et al., “A comprehensive evaluation of quantization strategies for large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds. Association for Computational Linguistics, Aug. 2024, pp. 12 186–12 215.
[53] P. Lee, S. Bubeck, and J. Petro, “Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine,” New England Journal of Medicine, vol. 388, no. 13, pp. 1233–1239, 2023.
[54] M. A. Fink, “Große Sprachmodelle wie ChatGPT und GPT-4 für eine patientenzentrierte Radiologie [Large language models such as ChatGPT and GPT-4 for patient-centered care in radiology],” Radiologie, vol. 63, no. 9, pp. 665–671, Sep. 2023.
[55] Q. Lyu et al., “Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential,” Visual Computing for Industry, Biomedicine, and Art, vol. 6, no. 1, p. 9, May 2023.
[56] L. C. Adams et al., “Leveraging GPT-4 for Post Hoc Transformation of Free-text Radiology Reports into Structured Reporting: A Multilingual Feasibility Study,” Radiology, vol. 307, no. 4, p. e230725, 2023.
[57] R. Bhayana, R. R. Bleakney, and S. Krishna, “GPT-4 in Radiology: Improvements in Advanced Reasoning,” Radiology, vol. 307, no. 5, p. e230987, 2023.
[58] Z. Wu et al., “Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task,” 2023, arXiv:2304.09138 [cs.CL].
[59] M. Ranjit, G. Ganapathy, R. Manuel, and T. Ganu, “Retrieval Augmented Chest X-Ray Report Generation using OpenAI GPT models,” in Proceedings of the 8th Machine Learning for Healthcare Conference, ser. Proceedings of Machine Learning Research, K. Deshpande et al., Eds., vol. 219. PMLR, Aug. 2023, pp. 650–666.
[60] B. Meskó and E. J. Topol, “The imperative for regulatory oversight of large language models (or generative AI) in healthcare,” npj Digital Medicine, vol. 6, no. 1, p. 120, Jul. 2023.
[61] D. Gala and A. N. Makaryus, “The Utility of Language Models in Cardiology: A Narrative Review of the Benefits and Concerns of ChatGPT-4,” International Journal of Environmental Research and Public Health, vol. 20, no. 15, 2023.
[62] S. B. Atallah, N. R. Banda, A. Banda, and N. A. Roeck, “How large language models including generative pre-trained transformer (GPT) 3 and 4 will impact medicine and surgery,” Techniques in Coloproctology, vol. 27, no. 8, pp. 609–614, Aug. 2023.
[63] K. Cheng, Q. Guo, Y. He, Y. Lu, S. Gu, and H. Wu, “Exploring the Potential of GPT-4 in Biomedical Engineering: The Dawn of a New Era,” Annals of Biomedical Engineering, vol. 51, no. 8, pp. 1645–1653, Aug. 2023.
[64] J. Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, Sep. 2019.
[65] I. Beltagy, K. Lo, and A. Cohan, “SciBERT: A Pretrained Language Model for Scientific Text,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan, Eds. Association for Computational Linguistics, Nov. 2019, pp. 3613–3618.
[66] Y. Gu et al., “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing,” ACM Trans. Comput. Heal., vol. 3, no. 1, pp. 2:1–2:23, Oct. 2022.
[67] H. Shin et al., “BioMegatron: Larger Biomedical Domain Language Model,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, Eds. Association for Computational Linguistics, Nov. 2020, pp. 4700–4706.
[68] Z. Hong, A. Ajith, J. G. Pauloski, E. Duede, K. Chard, and I. T. Foster, “The Diminishing Returns of Masked Language Models to Science,” in Findings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. L. Boyd-Graber, and N. Okazaki, Eds. Association for Computational Linguistics, 2023, pp. 1270–1283.
[69] R. Luo et al., “BioGPT: generative pre-trained transformer for biomedical text generation and mining,” Briefings in Bioinformatics, vol. 23, no. 6, Sep. 2022.
[70] E. Alsentzer et al., “Publicly Available Clinical BERT Embeddings,” in Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics, Jun. 2019, pp. 72–78.
[71] T. Han et al., “MedAlpaca – An Open-Source Collection of Medical Conversational AI Models and Training Data,” 2023, arXiv:2304.08247 [cs.CL].
[72] C. Wu, W. Lin, X. Zhang, Y. Zhang, W. Xie, and Y. Wang, “PMC-LLaMA: toward building open-source language models for medicine,” Journal of the American Medical Informatics Association: JAMIA, vol. 31, no. 9, pp. 1833–1843, Apr. 2024.
[73] K. Singhal et al., “Towards Expert-Level Medical Question Answering with Large Language Models,” 2023, arXiv:2305.09617 [cs.CL].
[74] X. Yang et al., “A large language model for electronic health records,” npj Digital Medicine, vol. 5, no. 1, p. 194, Dec. 2022.
[75] C. Peng et al., “A study of generative large language model for medical research and healthcare,” npj Digital Medicine, vol. 6, no. 1, p. 210, Nov. 2023.
[76] G. Wang, G. Yang, Z. Du, L. Fan, and X. Li, “ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation,” 2023, arXiv:2306.09968 [cs.CL].
[77] H. Zhang et al., “HuatuoGPT, Towards Taming Language Model to Be a Doctor,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computational Linguistics, Dec. 2023, pp. 10 859–10 885.
[78] C. Li et al., “LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 28 541–28 564.
[79] J. Zhou, X. Chen, and X. Gao, “Path to Medical AGI: Unify Domain-specific Medical LLMs with the Lowest Cost,” 2023, arXiv:2306.10765 [cs.AI].
[80] W. Gao et al., “OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue,” 2023, arXiv:2306.12174 [cs.CV].
[81] C. Shu, B. Chen, F. Liu, Z. Fu, E. Shareghi, and N. Collier, “Visual Med-Alpaca: A Parameter-Efficient Biomedical LLM with Visual Capabilities,” 2023. [Online]. Available: https://github.com/cambridgeltl/visual-med-alpaca
[82] M. Moor et al., “Med-Flamingo: a Multimodal Medical Few-shot Learner,” in Proceedings of the 3rd Machine Learning for Health Symposium, ser. Proceedings of Machine Learning Research, S. Hegselmann et al., Eds., vol. 225. PMLR, Dec. 2023, pp. 353–367.
[83] E. Tiu, E. Talius, P. Patel, C. P. Langlotz, A. Y. Ng, and P. Rajpurkar, “Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning,” Nature Biomedical Engineering, vol. 6, no. 12, pp. 1399–1406, Dec. 2022.
[84] H. Zhou et al., “A Survey of Large Language Models in Medicine: Progress, Application, and Challenge,” 2024, arXiv:2311.05112 [cs.CL].
[85] S. Soni and K. Roberts, “Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering,” in Proceedings of the 12th Language Resources and Evaluation Conference, N. Calzolari et al., Eds. European Language Resources Association, May 2020, pp. 5532–5538.
[86] E. Lehman et al., “Do we still need clinical language models?” in Proceedings of the Conference on Health, Inference, and Learning, ser. Proceedings of Machine Learning Research, B. J. Mortazavi, T. Sarker, A. Beam, and J. C. Ho, Eds., vol. 209. PMLR, Aug. 2023, pp. 578–597.
[87] Y. Li, R. M. Wehbe, F. S. Ahmad, H. Wang, and Y. Luo, “A comparative study of pretrained language models for long clinical text,” Journal of the American Medical Informatics Association, vol. 30, no. 2, pp. 340–347, Nov. 2022.
[88] Sentence Transformers - Cross-Encoders, “cross-encoder/nli-deberta-base,” 2021. [Online]. Available: https://huggingface.co/cross-encoder/nli-deberta-base
[89] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” 2019, arXiv:1907.11692 [cs.CL].
[90] F. Liu, E. Shareghi, Z. Meng, M. Basaldella, and N. Collier, “Self-Alignment Pretraining for Biomedical Entity Representations,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies NAACL-HLT 2021, K. Toutanova et al., Eds. Association for Computational Linguistics, Jun. 2021, pp. 4228–4238.
[91] F. Remy, K. Demuynck, and T. Demeester, “BioLORD: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, Dec. 2022, pp. 1454–1465.
[92] C. Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[93] Google, “google/t5-v1_1,” 2023. [Online]. Available: https://huggingface.co/google
[94] AI at Meta, “facebook/bart-large-mnli,” 2023. [Online]. Available: https://huggingface.co/facebook/bart-large-mnli
[95] V. Sanh et al., “Multitask Prompted Training Enables Zero-Shot Task Generalization,” in The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net, 2022.
[96] Q. Lu, D. Dou, and T. Nguyen, “ClinicalT5: A Generative Language Model for Clinical Text,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, Dec. 2022, pp. 5436–5443.
[97] Writer Engineering team, “Palmyra-base Parameter Autoregressive Language Model,” Jan. 2023. [Online]. Available: https://dev.writer.com
[98] X. Geng and H. Liu, “OpenLLaMA: An Open Reproduction of LLaMA,” May 2023. [Online]. Available: https://github.com/openlm-research/open_llama
[99] Writer Engineering team, “Camel-5B InstructGPT,” Apr. 2023. [Online]. Available: https://dev.writer.com
[100] B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model,” May 2021. [Online]. Available: https://github.com/kingoflolz/mesh-transformer-jax
[101] NLP Cloud, “nlpcloud/instruct-gpt-j-fp16,” 2023. [Online]. Available: https://huggingface.co/nlpcloud/instruct-gpt-j-fp16
[102] E. Almazrouei et al., “The Falcon Series of Open Language Models,” 2023, arXiv:2311.16867 [cs.CL].
[103] MosaicML NLP Team, “Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs,” May 2023. [Online]. Available: www.mosaicml.com/blog/mpt-7b
[104] R. Taori et al., “Stanford Alpaca: An Instruction-following LLaMA model,” GitHub, 2023. [Online]. Available: https://github.com/tatsu-lab/stanford_alpaca
[105] Y. Papanikolaou, “healx/gpt-2-pubmed,” 2020. [Online]. Available: https://huggingface.co/healx
[106] B. Boecking et al., “MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing (version 0.1),” PhysioNet, 2022. [Online]. Available: https://doi.org/10.13026/b90j-vb87
[107] ——, “Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing,” in Computer Vision – ECCV 2022: 17th European Conference. Cham: Springer Nature Switzerland, Oct. 2022, pp. 1–21. [Online]. Available: https://doi.org/10.1007/978-3-031-20059-5_1
[108] A. L. Goldberger et al., “PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals,” Circulation [Online], vol. 101, no. 23, pp. e215–e220, Jun. 2000.
[109] A. E. W. Johnson, T. Pollard, R. Mark, S. Berkowitz, and S. Horng, “The MIMIC-CXR Database,” PhysioNet, 2019. [Online]. Available: https://doi.org/10.13026/C2JT1Q
[110] A. E. W. Johnson et al., “MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports,” Sci Data, vol. 6, no. 1, p. 317, 2019. [Online]. Available: https://doi.org/10.1038/s41597-019-0322-0
[111] A. E. W. Johnson, D. J. Stone, L. A. Celi, and T. J. Pollard, “The MIMIC Code Repository: enabling reproducibility in critical care research,” Journal of the American Medical Informatics Association, vol. 25, no. 1, pp. 32–39, 2018.
[112] A. Johnson et al., “MIT-LCP/mimic-code: MIMIC Code v2.2.1,” Zenodo, Jul. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.6818823

Appendix A Data

This section describes the data employed and outlines the corresponding preprocessing procedure.

A.1 Transcriptions

Transcriptions is a multi-label dataset with 40 different labels and $2,358$ data samples. The data were extracted from Kaggle, and additional information about the labels can be found on MTSamples.com.

A.1.1 Preprocessing

The preprocessing procedure involves removing samples that lack associated reports, adjusting the formatting of the reports, and selecting and renaming labels. Formatting adjustments are necessary because line breaks are encoded as comma patterns. To determine the final format, we used the original data source, MTSamples.com, as a guide, together with the outputs generated by ChatGPT as an indication of the format that language models are familiar with.

In terms of labels, less relevant categories were excluded because they are too general or are not associated with a specific medical specialty. Specifically, the eliminated labels are: “Consult - History and Phy.”, “Discharge Summary”, “Emergency Room Reports”, “General Medicine”, “Hospice - Palliative Care”, “IME-QME-Work Comp etc.”, “Letters”, “Office Notes”, “Pain Management”, and “SOAP / Chart / Progress Notes”. Additionally, several labels contained the “/” character, indicating “or”, which we replaced with the word “or”. For example, “Allergy / Immunology” was transformed into “Allergy or Immunology”. Subsequently, the labels “Chiropractic” and “Physical Medicine - Rehab” were merged into a unified category called “Physical Medicine and Rehabilitation, or Chiropractic”. Other modifications include transforming “ENT - Otolaryngology” into “Otolaryngology”, “Hematology - Oncology” into “Hematology or Oncology”, “Lab Medicine - Pathology” into “Laboratory Medicine or Clinical Pathology”, “Pediatrics - Neonatal” into “Pediatrics or Neonatal”, and “Speech - Language” into “Speech and Language”.

Upon completion of the preprocessing, the initial count of 40 different labels is reduced to 29, and the number of samples to consider is $2,074$.
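The label changes described above can be summarized in a short sketch. The excluded labels and the renaming rules are taken from the text; the function names (`normalize_label`, `normalize_labels`) are hypothetical, and the released code may organize this step differently.

```python
from typing import List, Optional

# Labels dropped for being too general or not tied to a specialty (from the text).
EXCLUDED = {
    "Consult - History and Phy.", "Discharge Summary", "Emergency Room Reports",
    "General Medicine", "Hospice - Palliative Care", "IME-QME-Work Comp etc.",
    "Letters", "Office Notes", "Pain Management", "SOAP / Chart / Progress Notes",
}

# Explicit renames and merges listed in the text.
MERGED = "Physical Medicine and Rehabilitation, or Chiropractic"
RENAMED = {
    "Chiropractic": MERGED,
    "Physical Medicine - Rehab": MERGED,
    "ENT - Otolaryngology": "Otolaryngology",
    "Hematology - Oncology": "Hematology or Oncology",
    "Lab Medicine - Pathology": "Laboratory Medicine or Clinical Pathology",
    "Pediatrics - Neonatal": "Pediatrics or Neonatal",
    "Speech - Language": "Speech and Language",
}


def normalize_label(label: str) -> Optional[str]:
    """Return the cleaned label, or None if the label is excluded."""
    label = label.strip()
    if label in EXCLUDED:
        return None
    if label in RENAMED:
        return RENAMED[label]
    # Remaining labels only need "/" spelled out as "or".
    return label.replace(" / ", " or ")


def normalize_labels(labels: List[str]) -> List[str]:
    """Clean a sample's labels, dropping excluded ones and merge duplicates."""
    cleaned: List[str] = []
    for raw in labels:
        new = normalize_label(raw)
        if new is not None and new not in cleaned:
            cleaned.append(new)
    return cleaned
```

Ten labels are excluded and two are merged into one, which matches the reduction from 40 to 29 labels reported above.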

A.1.2 Description

The class distribution is visualized in Fig. 8. Surgery is the most prevalent category (in $52.46\%$ of the samples), followed by Cardiovascular or Pulmonary ($17.89\%$) and Orthopedic ($17.11\%$). The least frequent categories are Allergy or Immunology ($0.33\%$), Autopsy ($0.38\%$), and Laboratory Medicine or Clinical Pathology ($0.38\%$). The number of labels per sample ranges from 1 to 4, with an average of 2 labels per sample. Additionally, some labels never co-occur within the same sample.

The results of analyzing label leakage, which refers to whether a label appears explicitly in the text to be classified, are shown in Fig. 8. For most labels, label leakage is minimal, except for Autopsy ($62.50\%$), Rheumatology ($40.00\%$), Speech and Language ($33.33\%$), and Surgery ($29.50\%$). Labels without label leakage are Allergy or Immunology; Cardiovascular or Pulmonary; Cosmetic or Plastic Surgery; Diets and Nutritions; Hematology or Oncology; Laboratory Medicine or Clinical Pathology; Obstetrics or Gynecology; Pediatrics or Neonatal; Physical Medicine and Rehabilitation, or Chiropractic; Psychiatry or Psychology; and Sleep Medicine. The presence of labels in texts of other labels is not considered because, in a multi-label dataset, such occurrences are inherently complex to analyze and interpret.
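As an illustration of the leakage measure, the following minimal sketch computes, for each label, the percentage of its samples whose text mentions the label. It assumes case-insensitive matching of the full label name, which is our assumption rather than a detail stated in the paper; the function name `label_leakage` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def label_leakage(samples: Iterable[Tuple[str, List[str]]]) -> Dict[str, float]:
    """Percentage of each label's samples whose text contains the label name.

    `samples` yields (text, labels) pairs. Matching is a case-insensitive
    substring test on the full label name (an assumption of this sketch).
    """
    totals: Dict[str, int] = defaultdict(int)
    leaked: Dict[str, int] = defaultdict(int)
    for text, labels in samples:
        lowered = text.lower()
        for label in labels:
            totals[label] += 1
            if label.lower() in lowered:
                leaked[label] += 1
    return {label: 100.0 * leaked[label] / totals[label] for label in totals}
```

Under this matching rule, composite label names such as “Allergy or Immunology” can only leak if the whole phrase appears in the report.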

A.2 MS-CXR

MS-CXR [106, 107, 108] is a multi-class dataset with 8 different classes and a corpus of $1,448$ data samples, comprising 718 unique samples. The data can be obtained from [108].

A.2.1 Preprocessing

The preprocessing procedure involves removing instances without associated reports and eliminating duplicates. To be precise, 730 samples ($50.41\%$) were identified as duplicates. Considering only repeated reports, a report has at most 82 duplicates and 3 on average. In addition, when duplicate reports disagree on the assigned label, any of these labels is accepted as the true one during evaluation.
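A minimal sketch of this deduplication, under the assumption that duplicates are detected by exact match of the whitespace-stripped report text, is given below. The names `deduplicate` and `is_correct` are hypothetical and only illustrate how conflicting labels of repeated reports can all be accepted at evaluation time.

```python
from typing import Dict, List, Set, Tuple


def deduplicate(samples: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each unique report to the set of labels assigned to its copies.

    Samples without a report are dropped. Exact matching on the stripped
    text is an assumption of this sketch.
    """
    accepted: Dict[str, Set[str]] = {}
    for report, label in samples:
        if report is None or not report.strip():
            continue
        accepted.setdefault(report.strip(), set()).add(label)
    return accepted


def is_correct(prediction: str, report: str, accepted: Dict[str, Set[str]]) -> bool:
    """A prediction counts as correct if it matches any label of the report."""
    return prediction in accepted[report.strip()]
```

With the figures reported above, removing 730 duplicates from $1,448$ samples leaves the 718 unique samples.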

A.2.2 Description

The class distribution is depicted in Fig. 10. Overall, the dataset does not exhibit class imbalance. The most frequent classes are Pneumonia ($24.37\%$), closely followed by Pneumothorax ($21.17\%$), while the least frequent classes are Cardiomegaly ($5.15\%$), preceded by Edema ($5.43\%$).

The analysis of label leakage, presented in Fig. 10, reveals high leakage for most classes; the exception is Lung Opacity, which has a low leakage rate of $1.33\%$. In particular, Consolidation, Edema, and Pneumothorax exhibit leakage rates that exceed $90\%$. Classes with leakage rates below $50\%$ are Pneumonia, Cardiomegaly, and, as mentioned earlier, Lung Opacity. Regarding the presence of labels in texts of other classes, notable occurrences include Consolidation in the classes of Edema ($12.82\%$) and Pneumonia ($24.00\%$), and Pleural Effusion in the Atelectasis class ($21.43\%$).

To conclude, each class’s word count per text is measured, and their distributions are presented in Fig. 12. Classes with shorter texts include Cardiomegaly and Pneumothorax. Although classes with longer texts are not observed, there are flatter distributions with heavy tails, suggesting that the length of texts in these classes is less concentrated around a specific value.

A.3 MIMIC-CXR

MIMIC-CXR [109, 110, 108] is a dataset of radiographic reports that encompasses $78,584$ samples. After extracting the most pertinent sections, $75,029$ samples are identified as informative. This dataset is accessible through [108].

A.3.1 Preprocessing

The preprocessing procedure involves extracting the most relevant sections from chest X-ray reports using the code [111] designed for this purpose and publicly available on GitHub [112]. In addition, texts lacking content and duplicate samples are removed. Texts lacking information are defined as those that are empty or match one of the following: “.”, “As above”, “As above.”, “As above..”, “None.”, “See above.”, “No changes.”, “___”, “___ earlier”, “___,”, or “___.”. These patterns were identified by carefully examining texts with a maximum length of two words. In total, these non-informative texts represent merely $0.26\%$ of the dataset. Regarding duplicates, $1.69\%$ of the total samples are duplicated, comprising $23.07\%$ of the dataset. The text with the most duplicates is “No acute cardiopulmonary process.”, representing $7.88\%$ of the samples. On average, each text appears twice in the dataset.

Upon completion of the preprocessing steps, the dataset comprises $57,711$ samples, composed mainly of impressions ($81.92\%$) and findings ($17.48\%$).
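The filtering of non-informative and duplicate texts can be sketched as follows. The list of non-informative patterns is copied from the description above; the function name `clean_reports` and the choice to keep the first occurrence of each duplicate are assumptions of this sketch, and section extraction with the MIMIC code repository is not covered.

```python
from typing import Iterable, List

# Non-informative texts listed in the preprocessing description.
NON_INFORMATIVE = {
    ".", "As above", "As above.", "As above..", "None.", "See above.",
    "No changes.", "___", "___ earlier", "___,", "___.",
}


def clean_reports(texts: Iterable[str]) -> List[str]:
    """Drop empty, non-informative, and duplicate texts, preserving order."""
    kept: List[str] = []
    seen = set()
    for text in texts:
        stripped = text.strip()
        if not stripped or stripped in NON_INFORMATIVE:
            continue  # empty or matches a non-informative pattern
        if stripped in seen:
            continue  # exact duplicate of an earlier text
        seen.add(stripped)
        kept.append(stripped)
    return kept
```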

A.3.2 Description

Considering the nature of this dataset, its description focuses mainly on the distribution of the number of words per sample, as shown in Fig. 12. This distribution is right-skewed, with a peak of around 10 words per sample. Moreover, there is a significant plateau between 20 and 40 words per sample. Interestingly, the distribution’s right tail extends beyond 150 words per sample. In summary, most texts ($75\%$) contain at most 51 words, with a pronounced peak of around 10 words per sample. However, this dataset also includes longer texts, some reaching up to 307 words.

Appendix B Models

Table 3: Details on the models studied. The total inference time represents the average time to process the entire dataset per experiment.

ID	Model	Type	Domain	Model size (no. parameters)	Input max. size (no. tokens)	Total inference time (seconds): Classification, MS-CXR	Total inference time (seconds): Generation, MIMIC-CXR
m00	BERT$_{\texttt{BASE}}$ [16]	Encoder-only	General	110 M	512	3.33	–
m01	BERT$_{\texttt{LARGE}}$ [16]	Encoder-only	General	340 M	512	5.25	–
m02	BiomedBERT (abstracts + full text) [66]	Encoder-only	Biomedical	110 M	512	3.10	–
m03	BiomedBERT (abstracts only) [66]	Encoder-only	Biomedical	110 M	512	3.09	–
m04	BiomedBERT-large (abstracts only) [66]	Encoder-only	Biomedical	340 M	512	4.56	–
m05	SciBERT [65]	Encoder-only	Biomedical	110 M	512	3.28	–
m06	SapBERT [90]	Encoder-only	Biomedical	110 M	512	3.14	–
m07	BioLORD-STAMB2-v1 [91]	Encoder-only	Biomedical	110 M	512	3.35	–
m08	BioLORD-STAMB2-v1-STS2 [91]	Encoder-only	Biomedical	110 M	512	3.31	–
m09	BioLORD-PMB [91]	Encoder-only	Biomedical	110 M	512	3.29	–
m10	Bio+Clinical BERT [70]	Encoder-only	Clinical	110 M	512	3.15	–
m11	NLI-DeBERTa$_{\texttt{base}}$ [88]	Encoder-only (cross-encoder)	General	100 M	512	8.84	–
m12	RoBERTa$_{\texttt{LARGE}}$-MNLI [89]	Encoder-only (cross-encoder)	General	355 M	512	20.38	–
m13	BART Large-MNLI [94]	Encoder-decoder	General	407 M	1,024	23.93	–
m14	T5-V1.1-Base [92, 93]	Encoder-decoder	General	220 M	512	6.16	–
m15	T5-V1.1-Large [92, 93]	Encoder-decoder	General	770 M	512	14.52	–
m16	T5-V1.1-3B [92, 93]	Encoder-decoder	General	3.0 B	512	38.57	–
m17	T5-V1.1-11B [92, 93]	Encoder-decoder	General	11.0 B	512	64.88	–
m18	Flan-T5-Base [17]	Encoder-decoder (instruction-tuned)	General	220 M	512	6.74	–
m19	Flan-T5-Large [17]	Encoder-decoder (instruction-tuned)	General	770 M	512	16.18	–
m20	Flan-T5-XL [17]	Encoder-decoder (instruction-tuned)	General	3.0 B	512	40.71	–
m21	Flan-T5-XXL [17]	Encoder-decoder (instruction-tuned)	General	11.0 B	512	69.1	–
m22	T0 3B [95]	Encoder-decoder (instruction-tuned)	General	3.0 B	512	38.62	–
m23	T0++ [95]	Encoder-decoder (instruction-tuned)	General	11.0 B	512	63.89	–
m24	ClinicalT5-base [96]	Encoder-decoder	Clinical	220 M	512	5.56	–
m25	ClinicalT5-large [96]	Encoder-decoder	Clinical	700 M	512	11.94	–
m26	GPT-2 Medium [19]	Decoder-only	General	355 M	1,024	–	3,169.67
m27	GPT-2 Large [19]	Decoder-only	General	774 M	1,024	–	5,206.18
m28	GPT-2 XL [19]	Decoder-only	General	1.5 B	1,024	–	5,330.05
m29	Palmyra Base 5B [97]	Decoder-only	General	5.0 B	512	94.54	11,890.56
m30	Camel 5B [99]	Decoder-only (instruction-tuned)	General	5.0 B	1,024	96.33	–
m31	GPT-J 6B [100]	Decoder-only	General	6.0 B	2,048	132.50	16,495.20
m32	Instruct GPT-J [101]	Decoder-only (instruction-tuned)	General	6.0 B	2,048	132.58	–
m33	Falcon-7B [102]	Decoder-only	General	7.0 B	2,048	151.15	17,496.83
m34	Falcon-7B-Instruct [102]	Decoder-only (instruction-tuned)	General	7.0 B	2,048	151.10	–
m35	MPT-7B [103]	Decoder-only	General	7.0 B	2,048	140.49	15,384.00
m36	MPT-7B-Instruct [103]	Decoder-only (instruction-tuned)	General	7.0 B	2,048	140.56	–
m37	LLaMA-7B [27]	Decoder-only	General	7.0 B	2,048	143.56	19,203.39
m38	LLaMA 2-7B [28]	Decoder-only	General	7.0 B	2,048	144.27	19,225.13
m39	Alpaca 7B [104]	Decoder-only (instruction-tuned)	General	7.0 B	512	146.31	–
m40	LLaMA 2-CHAT-7B [28]	Decoder-only (instruction-tuned)	General	7.0 B	2,048	144.50	–
m41	OpenLLaMA 3B [98]	Decoder-only	General	3.0 B	2,048	–	9,736.33
m42	OpenLLaMA 3Bv2 [98]	Decoder-only	General	3.0 B	2,048	–	9,914.52
m43	OpenLLaMA 7B [98]	Decoder-only	General	7.0 B	2,048	–	17,433.58
m44	OpenLLaMA 7Bv2 [98]	Decoder-only	General	7.0 B	2,048	–	27,589.57
m45	OpenLLaMA 13B [98]	Decoder-only	General	13.0 B	2,048	–	7,125.28
m46	GPT-2-PubMed Medium [105]	Decoder-only	Biomedical	355 M	1,024	–	2,023.37
m47	GPT-2-PubMed Large [105]	Decoder-only	Biomedical	774 M	1,024	–	3,213.49
m48	BioGPT [69]	Decoder-only	Biomedical	347 M	1,024	–	1,680.22
m49	BioGPT-Large [69]	Decoder-only	Biomedical	1.5 B	1,024	–	4,840.45
m50	Galactica 1.3B [26]	Decoder-only	Biomedical	1.3 B	2,048	–	3,941.80
m51	Galactica 6.7B [26]	Decoder-only	Biomedical	6.7 B	2,048	–	15,118.26
m52	MedAlpaca 7b [71]	Decoder-only	Clinical	7.0 B	512	146.88	–

Appendix C Prompts

The prompts used for the text classification task via contextual embedding similarity, natural language inference (NLI), and multiple-choice question answering (QA) are presented.

C.1 Prompts for text classification via contextual embedding similarity and NLI

The prompts proposed for text classification using contextual embedding similarity and natural language inference (NLI) are applied exclusively to the label (in the case of NLI, to the hypothesis). Table 4 lists the prompts used. Prompt template ID 0 is the default template for generating the hypothesis in zero-shot text classification with NLI, as documented by Hugging Face.

Table 4: Prompt templates to be used as contextual embedding similarity and NLI prompts. The column “Dataset” specifies the dataset in which the prompt template is applied.

ID	Prompt template	Dataset
0	This example is {label}.	Transcriptions, MS-CXR
1	This is an example of {label}.	Transcriptions, MS-CXR
2	This report belongs to the category {label}.	Transcriptions
3	This report belongs to the medical speciality {label}.	Transcriptions
4	This report belongs to the medical speciality: {label}.	Transcriptions
5	The diagnosis is {label}.	MS-CXR
6	There is evidence of {label}.	MS-CXR
7	These findings are consistent with {label}.	MS-CXR
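To illustrate how the templates in Table 4 are applied, the sketch below fills a template with each candidate label to obtain the label prompts (the hypotheses, in the NLI setting). The template strings are copied from Table 4; the names `TEMPLATES` and `build_hypotheses` are hypothetical.

```python
from typing import Dict, List

# Prompt templates from Table 4, keyed by template ID.
TEMPLATES: Dict[int, str] = {
    0: "This example is {label}.",
    1: "This is an example of {label}.",
    2: "This report belongs to the category {label}.",
    3: "This report belongs to the medical speciality {label}.",
    4: "This report belongs to the medical speciality: {label}.",
    5: "The diagnosis is {label}.",
    6: "There is evidence of {label}.",
    7: "These findings are consistent with {label}.",
}


def build_hypotheses(template_id: int, labels: List[str]) -> List[str]:
    """Apply one template to every candidate label."""
    template = TEMPLATES[template_id]
    return [template.format(label=label) for label in labels]


# Example with two MS-CXR classes and template 5.
hypotheses = build_hypotheses(5, ["Pneumonia", "Pneumothorax"])
```

Each resulting sentence is then either embedded and compared with the report embedding, or paired with the report as the hypothesis of an NLI model.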

C.2 Prompts for text classification via multiple-choice QA

The proposed prompts for text classification via multiple-choice question answering are based on the default prompt templates of several of the instruction-tuned models considered. These templates are systematically assessed using a set of questions, enabling us to quantify the influence of the question wording. For the MS-CXR dataset, we also incorporate role-based questions. The questions and the prompt structures, together with their corresponding datasets and specific requirements, are summarized in Table 5 and Table 6, respectively.

Each class or label is encoded with an uppercase letter denoting the option, followed by its name. For instance, if the first label is “ $y_{1}$ ”, it is represented as “(A) $y_{1}$ ” within the prompt. The Transcriptions dataset has 29 distinct labels. Because of this large number, we include only the 10 most frequent labels and group the remaining labels under an additional “Other” option. Specifically, for the Transcriptions dataset, we employ templates t01, t02, t03, t04, and t07 along with questions q07, q08, and q09. For the MS-CXR dataset, we employ templates t01, t02, t03, t04, t07, t11, and t13, and questions q03, q04, and q05.
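The option encoding can be sketched as follows, using template t02 from Table 6 as an example. The helper names (`build_options`, `build_prompt`) are hypothetical, and two details are assumptions of this sketch because the tables show each template on a single line: the line breaks between the template fields and the space used to separate the options.

```python
import string
from collections import Counter
from typing import Dict, List, Optional

# Template t02 from Table 6; the line breaks between fields are assumed.
T02 = "Context: {report}\nQuestion: {question}\nOptions: {options}\nAnswer: ("


def build_options(label_counts: Dict[str, int], top_k: Optional[int] = None) -> List[str]:
    """Encode labels as "(A) name", most frequent first.

    If `top_k` is given and there are more labels than that, the remaining
    labels are grouped under a final "Other" option.
    """
    ranked = [label for label, _ in Counter(label_counts).most_common(top_k)]
    if top_k is not None and len(label_counts) > top_k:
        ranked.append("Other")
    return [f"({letter}) {label}" for letter, label in zip(string.ascii_uppercase, ranked)]


def build_prompt(report: str, question: str, options: List[str]) -> str:
    """Fill template t02; options are joined with spaces (an assumption)."""
    return T02.format(report=report, question=question, options=" ".join(options))
```

For the Transcriptions dataset, calling `build_options` with `top_k=10` on the 29 label frequencies would yield eleven options, (A) through (K), the last one being “Other”. The prompt ends with an opening parenthesis so that the model only needs to produce the option letter.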

Table 5: Questions to be used for the multiple-choice QA templates. The column “Dataset” specifies the target dataset.

ID	Question	Dataset
q01	What is the most plausible diagnosis?	MS-CXR
q02	What is the patient’s diagnosis?	MS-CXR
q03	What is the diagnosis?	MS-CXR
q04	Which one of the following is the diagnosis?	MS-CXR
q05	Which one is the patient’s diagnosis?	MS-CXR
q06	Which of the options is the most likely to be the diagnosis?	MS-CXR
q07	Which category does the report belong to?	Transcriptions
q08	What is the field that best suits the report?	Transcriptions
q09	Which one is the topic of the report?	Transcriptions

Table 6: Prompt structures to be used as multiple-choice QA prompts. Regarding the column “Requirements”, “report” refers to the text sample, “options” to the labels provided as choices, and “question” to the question itself (see Table 5). Note that the term “question” sometimes appears capitalized, indicating that the question begins with an uppercase letter when integrated into the template.

ID	Prompt structure	Requirements	Dataset
t01	Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: {question} Select one of the following options: {options} ### Input: {report} ### Response: (	report, options, QUESTION	Transcriptions, MS-CXR
t02	Context: {report} Question: {question} Options: {options} Answer: (	report, options, QUESTION	Transcriptions, MS-CXR
t03	Context: {report} Question: Based on the context, {question} Options: {options} Answer: (	report, options, question	Transcriptions, MS-CXR
t04	{report}. Which one of the following, if true, most strengthens the argument? {options}. (	report, options	Transcriptions, MS-CXR
t05	Read the following and answer the question. {report} {question} {options} (	report, options, QUESTION	Transcriptions, MS-CXR
t06	{report} What’s the best answer to this question: {question} {options} (	report, options, QUESTION	Transcriptions, MS-CXR
t07	{report} {question} {options} (	report, options, QUESTION	Transcriptions, MS-CXR
t08	Read this chest x-ray report: “{report}” Now answer this question: “{question}” {options} (	report, options, QUESTION	MS-CXR
t09	Knowing that “{report}”, how would one answer “{question}” {options} (	report, options, QUESTION	Transcriptions, MS-CXR
t10	{report} Based on the above text, what’s the best answer to this question: {question} {options} (	report, options, QUESTION	Transcriptions, MS-CXR
t11	You are a doctor and have the following information about a patient from a chest x-ray: {report}. Which one of the following, if true, most strengthens the argument? {options}. (	report, options	MS-CXR
t12	You are a doctor and have the following information about a patient from a chest x-ray: {report}. {question} {options}. (	report, options, QUESTION	MS-CXR
t13	I want you to act as a virtual doctor. I will describe my symptoms and you will choose the most probable diagnosis among the following: {options}. You should only reply with the chosen diagnosis, and nothing else. My request is “{report}”. (	report, options	MS-CXR
t14	I want you to act as a virtual doctor. I will describe my symptoms and you will choose a diagnosis among the possible diagnoses. You should only reply with the chosen diagnosis, and nothing else. Do not write explanations. The possible diagnoses are: {options}. My request is “{report}”. (	report, options	MS-CXR
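
To illustrate how a prompt structure from Table 6 can be combined with a report, the encoded options, and a question, the following sketch fills three of the structures (t02, t03, and t04). It is only an illustration: the names `TEMPLATES`, `set_first_letter`, and `build_prompt` are hypothetical, and joining the options with single spaces is an assumption, since the table does not state the separator.

```python
# A few prompt structures from Table 6. The second element states how the
# question is inserted: True = first letter uppercase ("QUESTION"),
# False = first letter lowercase ("question"), None = no question required.
TEMPLATES = {
    "t02": ("Context: {report} Question: {question} Options: {options} Answer: (", True),
    "t03": (
        "Context: {report} Question: Based on the context, {question} "
        "Options: {options} Answer: (",
        False,
    ),
    "t04": (
        "{report}. Which one of the following, if true, most strengthens "
        "the argument? {options}. (",
        None,
    ),
}


def set_first_letter(question, uppercase):
    """Return the question with its first letter upper- or lowercased."""
    first = question[0].upper() if uppercase else question[0].lower()
    return first + question[1:]


def build_prompt(template_id, report, options, question=None):
    """Fill a prompt structure; options are strings such as '(A) label'."""
    structure, capitalize = TEMPLATES[template_id]
    # Assumption: options are separated by single spaces inside the prompt.
    fields = {"report": report, "options": " ".join(options)}
    if capitalize is not None:
        if question is None:
            raise ValueError(f"template {template_id} requires a question")
        fields["question"] = set_first_letter(question, capitalize)
    return structure.format(**fields)
```

Every structure ends with an opening parenthesis, so the model's continuation is expected to start with the option letter.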

Appendix D Supplementary results

This appendix presents supplementary figures and tables to support the results presented. They are displayed first by dataset and then by task or approach. These results are not obtained by bootstrapping; they are single values computed on the complete inference dataset.

D.1 Text classification task

D.1.1 Contextual embedding similarity

Results are depicted in Figs. 2, 14 and 13. The mapping between the models and their IDs is:

m00: BERT ${}_{\texttt{BASE}}$	m06: SapBERT
m01: BERT ${}_{\texttt{LARGE}}$	m07: BioLORD-STAMB2-v1
m02: BiomedBERT (abstracts + full text)	m08: BioLORD-STAMB2-v1-STS2
m03: BiomedBERT (abstracts only)	m09: BioLORD-PMB
m04: BiomedBERT-large (abstracts only)	m10: Bio+Clinical BERT
m05: SciBERT

D.1.2 Natural language inference

Results are depicted in Figs. 16 and 15. The mapping between the models and their IDs is:

m11: NLI-DeBERTa ${}_{\texttt{base}}$
m12: RoBERTa ${}_{\texttt{LARGE}}$-MNLI
m13: BART Large-MNLI

D.1.3 Multiple choice question answering

Results are depicted in Figs. 18, 17, 20 and 19. The mapping between the models and their IDs is:

m14: T5-V1.1-Base	m21: Flan-T5-XXL	m31: GPT-J 6B	m38: LLaMA 2-7B
m15: T5-V1.1-Large	m22: T0-3B	m32: Instruct GPT-J	m39: Alpaca 7B
m16: T5-V1.1-3B	m23: T0++	m33: Falcon-7B	m40: LLaMA 2-CHAT-7B
m17: T5-V1.1-11B	m24: ClinicalT5-base	m34: Falcon-7B-Instruct	m52: MedAlpaca 7b
m18: Flan-T5-Base	m25: ClinicalT5-large	m35: MPT-7B
m19: Flan-T5-Large	m29: Palmyra Base 5B	m36: MPT-7B-Instruct
m20: Flan-T5-XL	m30: Camel 5B	m37: LLaMA-7B

D.2 Conditional text generation task

Results are depicted in Fig. 21. The mapping between the models and their IDs is:

m26: GPT-2 Medium	m35: MPT-7B	m44: OpenLLaMA 7Bv2	m50: Galactica 1.3B
m27: GPT-2 Large	m37: LLaMA-7B	m45: OpenLLaMA 13B	m51: Galactica 6.7B
m28: GPT-2 XL	m38: LLaMA 2-7B	m46: GPT-2-PubMed Medium
m29: Palmyra Base 5B	m41: OpenLLaMA 3B	m47: GPT-2-PubMed Large
m31: GPT-J 6B	m42: OpenLLaMA 3Bv2	m48: BioGPT
m33: Falcon-7B	m43: OpenLLaMA 7B	m49: BioGPT-Large