Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings

Andrea Posada, Daniel Rueckert, Felix Meissen, and Philip Müller
Abstract

Since the Transformer architecture emerged, language model development has accelerated, driven by the promising potential of these models. Releasing these models into production requires properly understanding their behavior, particularly in sensitive domains like medicine. Despite this need, the medical literature still lacks practical assessments of pre-trained language models, which are especially valuable in settings where only consumer-grade computational resources are available. To address this gap, we have conducted a comprehensive survey of language models in the medical field and evaluated a subset of these for medical text classification and conditional text generation. The subset includes 53 models with 110 million to 13 billion parameters, spanning the Transformer-based model families and knowledge domains. Different approaches are employed for text classification, including zero-shot learning, which allows a model to be adapted to a task without training it. These approaches are helpful in our target settings, where many users of language models find themselves. The results reveal remarkable performance across the tasks and datasets evaluated, underscoring the potential of certain models to contain medical knowledge, even without domain specialization. This study thus advocates for further exploration of model applications in medical contexts, particularly in computational resource-constrained settings, to benefit a wide range of users. The code is available on https://github.com/anpoc/Language-models-in-medicine.

1 Introduction

Natural language processing (NLP) holds great promise in the medical field. The medical community has recently shown substantial interest in leveraging state-of-the-art language models to address various medical challenges [1, 2]. In particular, generative large language models (LLMs) have showcased emergent abilities beyond their original training objectives, such as text summarization and question answering [3]. These newfound abilities have enabled LLMs to perform tasks of significant clinical importance, including passing medical examinations, summarizing clinical and radiological reports, as well as medical dialogues, extracting drug names from medical notes, responding to patient inquiries, and writing medical histories and physical assessments [2, 4].

The versatility of language models can be attributed to a convergence of factors [2, 4, 5]. The first factor is their ability to learn valuable patterns within large amounts of unlabeled data via self-supervision. The second factor revolves around the Transformer architecture [6] and its suitability for efficient parallel processing on modern computing hardware. Lastly, the third factor encompasses the crucial process of fine-tuning language models to align their responses with human expectations through instruction tuning.

Integration of language models in medical settings is becoming a reality as partnerships between developers and healthcare systems continue to grow [7]. The potential benefits are significant, as these models can derive broadly applicable representations from extensive medical corpora at scale and encapsulate clinical knowledge [8]. Nevertheless, it is essential to recognize that our understanding of the behavior of both small pre-trained and large language models remains incomplete [4]. Deploying these models also carries risks, such as the generation of inaccurate results, a phenomenon known as hallucination, and the potential amplification of existing biases [1, 4]. The implementation of language models in sensitive fields, such as healthcare, should therefore be approached with the utmost care [5].

The computing and energy resources that language models require for their development and operation are another critical and limiting factor, especially for LLMs. The standard computing resources available in hospitals are consumer-grade, and running models with hundreds of billions of parameters on them is currently infeasible. Such resource-constrained settings, i.e., settings with consumer-grade computing resources, are faced not only by healthcare agents and institutions but also by research groups.

When large language models do not represent a cost-effective or viable solution, smaller pre-trained language models can be an alternative. LLMs, albeit more massive, have similar architectures and pre-training tasks to smaller pre-trained language models [9]. With the same computing budget, a smaller model trained with more high-quality data can perform better than its larger counterparts, because the latter tend to be undertrained [10]. Using curated scientific and biomedical corpora in pre-trained language models has also been effective for discriminative and generative language modeling [11]. Furthermore, these smaller models align with the crucial imperative of environmental sustainability and open up the possibility for organizations to develop applications that run directly on commodity hardware and small devices rather than relying on cloud-based services [12]. Language models in resource-constrained settings thereby address practical challenges and have great potential in local computing.

To further understand the performance of language models in clinical scenarios with limited computational resources, we conducted a comprehensive evaluation focusing on the classification and conditional generation of medical texts in open-source models. The datasets employed enable the assessment of general and radiology-specific medical knowledge. In total, 53 models are tested, ranging from 110M to 13B parameters, spanning all Transformer-based model families and knowledge domains from general to clinical. For conditional text generation, solely decoder-only models are used. The approaches adopted for text classification, together with prompt engineering, allow for improved model performance without the need for training or fine-tuning. An analysis of the impact of the prompts on performance is also included. To the best of our knowledge, this is the first work to evaluate such a large number of small pre-trained language models for medical tasks.

2 Preliminaries

The evolution of natural language processing can be condensed into four major groups of models: (1) statistical models, (2) neural language models, (3) pre-trained language models, and (4) large language models [9]. Each of these groups represents a paradigm shift in natural language modeling and has contributed significantly to the conception of language models as we know them today.

The first transition, from statistical to neural language models, entailed a shift from word prediction based on minimal local context to probabilistic evaluation of word sequences using neural networks. This transition also introduced the representation of words as low-dimensional continuous embeddings based on their contextual usage (distributional semantics). The second transition, from neural to pre-trained language models, involved turning from developing task-specific models to pre-training and fine-tuning methodologies. The third transition to large language models moved the focus from discriminative AI to generative AI, from model-centric to data-centric approaches, and from fine-tuning to prompt engineering and prompt tuning [9, 13, 14]. These advances have paved the way for more sophisticated language models with broader applications and improved capabilities.

2.1 Pre-trained language models

[Figure 1, panels: (a) Encoder-only models; (b) Decoder-only models; (c) Encoder-decoder models]
Figure 1: Graphical representation of the three families of Transformer-based models: encoder-only, decoder-only, and encoder-decoder models. Colors signal the correspondence between outputs and targets. Encoder-only models are mainly used for discriminative tasks. Their input is tokenized, and some of these tokens are masked. They are then fed into Transformer blocks with self-attention to obtain contextualized output embeddings, which are further processed by next sentence prediction (NSP) and language model (LM) heads or used by downstream task-specific heads. Depending on the training objective, the NSP head may or may not be necessary. Decoder-only models focus on generation tasks. Their input is tokenized and fed to Transformer blocks with causal self-attention. The causal self-attention ensures that the information flows unidirectionally from left to right. Encoder-decoder models are used for text-to-text tasks. Their encoder processes the input text, similar to encoder-only models but excluding the NSP head, and flows information to the decoder via the cross-attention mechanism. This information is used with the target output so that the decoder learns to produce the latter generatively.

The emergence of pre-trained language models represented a paradigm shift, driving research toward designing more efficient architectures and refining pre-training strategies. These pre-trained models have been commonly adapted or specialized to downstream tasks via fine-tuning, which involves transferring knowledge by further training a model on new data. These models have demonstrated significant advantages in language understanding and in performance across various tasks [13, 9].

ELMo is one of the earliest attempts at pre-trained language models [15]. This model was developed to capture context-aware word representations by pre-training a bidirectional Long Short-Term Memory (biLSTM) network and fine-tuning it for subsequent downstream tasks. Later, the Transformer architecture was introduced, revolutionizing the NLP field by offering highly parallelizable structures and self-attention mechanisms. The Transformer [6] follows the autoencoder archetype, from which three families of models arose: (1) BERT-family or encoder-only models, (2) GPT-family or decoder-only models, and (3) text-to-text or encoder-decoder models. The graphical representations of these families are shown in Fig. 1.

2.1.1 Encoder-only models

Encoder-only models, exemplified by BERT [16], are based on masked language modeling (MLM), where parts of the input are masked to encourage the model to reconstruct the original sequence, leveraging contextual information bidirectionally. These models can be stated as $v_{1:n} \rightarrow \phi(v_{1:n})$. In particular, their contextual embeddings have been proven highly effective as general-purpose semantic features, significantly boosting performance in discriminative NLP tasks.

2.1.2 Decoder-only models

Decoder-only models focus on autoregressive language modeling, i.e., predicting the next token in a sequence based on previous tokens. These models produce contextual embeddings and a distribution over the subsequent token $v_{i+1}$, which can be stated as $v_{1:i} \rightarrow \phi(v_{1:i}), \mathbb{P}(v_{i+1} \mid v_{1:i})$. However, the contextual embeddings they generate depend solely on the left context. Most research efforts are currently directed toward decoder-only models due to their exceptional performance in conditional generation tasks and their demonstrated emergent capabilities.
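The difference between bidirectional context and left-only context can be made concrete through the attention mask. The following is a minimal sketch, not taken from any particular model implementation; the function names are hypothetical, and a value of 1 at row i, column j means that position i may attend to position j.

```python
def bidirectional_mask(n):
    """Every position may attend to every position (encoder-only style)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions j <= i (decoder-only style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Print the lower-triangular mask for a sequence of four tokens.
    for row in causal_mask(4):
        print(row)
```

The lower-triangular shape of the causal mask is what restricts each contextual embedding to the left context.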

2.1.3 Encoder-decoder models

Text-to-text models, or encoder-decoder models, are trained to learn the correspondence between a pair of texts and can be stated as $v_{1:n} \rightarrow \phi(v_{1:n}), \mathbb{P}(w_{1:m} \mid \phi(v_{1:n}))$. These models combine bidirectional contextual embeddings with the capability to generate output sequences, making them versatile in various text-to-text tasks without requiring additional heads for fine-tuning. Moreover, by having a broad spectrum of language tasks that can be translated into text-to-text representation, these models can potentially be used for a wide range of applications.

2.2 Large language models

Scaling of language models has often resulted in improved model capabilities in various tasks [17, 10, 18, 19, 20, 21, 22], including those requiring specialized scientific knowledge and reasoning [23]. Research by Kaplan et al. [24] revealed that there is an empirical power-law relationship between the language model performance, in terms of cross-entropy loss, and the model size, dataset size, and amount of compute used for training. It was further found that architectural details, such as network width or depth, had minimal effects on performance. Scaling laws have further been studied by Hoffmann et al. [10] and Bahri et al. [25].
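As an illustration of what such a power law looks like, the loss can be written as a function of a single limiting factor when the other two are not bottlenecks. The symbols below are illustrative and are not defined elsewhere in this text: $X$ stands for the model size $N$, the dataset size $D$, or the training compute $C$, while $X_c$ and $\alpha_X$ are constants fitted empirically.

```latex
L(X) \approx \left( \frac{X_c}{X} \right)^{\alpha_X},
\qquad X \in \{N, D, C\},
\qquad\text{so that}\qquad
\log L(X) \approx \alpha_X \left( \log X_c - \log X \right).
```

The second form shows why such relationships appear as straight lines on log-log axes, with slope $-\alpha_X$.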

Following these empirical results, several studies have trained progressively larger language models of up to hundreds of billions of parameters, such as GPT-3 [18], PaLM [20], Galactica [26], LLaMA [27, 28], Claude [29], Gemini 1.5 [30], and Mistral [31]. Among these, GPT-3 and ChatGPT can be considered the precursors of large language models, the name by which these large-scale language models are known [13, 9]. GPT-4, a later successor of GPT-3, stands out for its exceptional performance, often matching or surpassing human performance on a variety of tasks [11, 32, 33], even in specialized domains [34]. Extensive evaluations have been conducted on GPT-4 [23, 35, 36, 37], even exploring the path toward Artificial General Intelligence (AGI) [32].

LLMs can be adapted to different tasks via prompt engineering, which, unlike fine-tuning, does not require retraining the model and updating its weights. These prompting techniques have led to observing unexpected emergent capabilities in LLMs, demonstrating the potential to address a wide range of complex tasks and exhibit apparent reasoning abilities [38, 3, 8, 14, 39, 18, 22, 40, 41, 42]. In the medical field, for example, Chain of Thought (CoT) has been used for explainability [43] and in-context learning to mitigate the need for costly medical annotations [13]. Numerous studies have even highlighted the competence of large language models as implicit knowledge bases [8, 23, 26, 44].

In-context learning techniques, such as zero-shot and few-shot learning, have also proven to be remarkably effective on instruction-tuned models and models to which reinforcement learning techniques have been applied [22, 39, 8, 45]. Zero-shot learning consists of asking a trained model to complete a task without providing explicit examples of this task, whereas in few-shot learning, some examples are provided. Nonetheless, prompting techniques are not exclusive to LLMs but are also applicable to smaller pre-trained language models, especially encoder-decoder and decoder-only models.
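The distinction between zero-shot and few-shot learning comes down to what the prompt contains. The sketch below is only illustrative: the `Text:`/`Label:` layout, the function name, and the example strings are assumptions made for this example, not a format prescribed by any of the cited works.

```python
def build_prompt(task, query, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    parts = [task]
    # Few-shot: each solved example is shown before the actual query.
    for text, label in examples:
        parts.append(f"Text: {text}\nLabel: {label}")
    # The query is left unlabeled so that the model completes it.
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    task = "Classify the report by medical specialty."  # hypothetical instruction
    print(build_prompt(task, "Patient presents with chest pain."))
    print("---")
    print(build_prompt(task, "Patient presents with chest pain.",
                       examples=[("Fracture of the left tibia.", "Orthopedic")]))
```

In both cases the model weights stay untouched; only the input text changes.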

Despite their advantages, LLMs also have limitations. Their high computational resource requirements and associated computational challenges represent a major limitation of these models. For example, conducting studies to better understand the behavior of LLMs and assess important criteria, such as faithfulness and biases, can be costly and time-consuming. The detection of biases and hallucinations, i.e., the generation of inaccurate results, is crucial in sensitive domains such as medicine.

Due to the significance of the computing limitation, alternatives such as model quantization [46] have been introduced. Quantization is a technique that reduces the computational and memory costs of model inference by representing its weights and activations with low-precision data types, such as 8-bit integers, instead of the usual 32-bit floating point. In natural language processing, this technique is currently being extensively studied, with [47, 48, 49, 50, 51, 52] being some examples in the literature.
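To make the idea concrete, the following sketch applies symmetric absmax quantization to a small list of weights. This is one simple scheme chosen for illustration and is not necessarily the scheme used in the works cited above; the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]  # illustrative values
    q, scale = quantize_int8(weights)
    print(q, dequantize(q, scale))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most about half the scale per weight.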

Recommendations on the optimal use of computational resources have also been proposed. Chinchilla’s scaling law [10], one of these recommendations, states that the optimal model size and the number of tokens for training a language model should scale equally for compute-optimal training under a given computational budget. In [10], it is further shown that current large language models are significantly undertrained due to the recent focus on scaling language models while keeping the amount of training data constant. A smaller model trained with more high-quality data can thus achieve better performance than its larger counterparts with the same computing budget.
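The equal-scaling recommendation can be sketched numerically. The example below assumes that training compute is roughly proportional to the product of the number of parameters and the number of training tokens, which is a common approximation and not a statement from this text; under that assumption, multiplying the compute budget by a factor k means growing both quantities by the square root of k. The function name and the starting values are hypothetical.

```python
import math


def scale_compute_optimal(n_params, n_tokens, compute_factor):
    """Scale parameters and tokens equally when compute grows by compute_factor.

    Assumes compute is proportional to n_params * n_tokens, so equal scaling
    means that each quantity grows by sqrt(compute_factor).
    """
    growth = math.sqrt(compute_factor)
    return n_params * growth, n_tokens * growth


if __name__ == "__main__":
    # Hypothetical starting point: 1B parameters trained on 20B tokens.
    print(scale_compute_optimal(1e9, 2e10, 4.0))
```

Under this assumption, a fourfold compute budget doubles both the model size and the training data, rather than quadrupling the model alone.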

2.3 Language models in the biomedical/clinical context

Broadly speaking, language models used in specialized domains are (i) models trained solely on target-domain data, (ii) models pre-trained on a general-domain corpus and adapted with tuning strategies, and (iii) models pre-trained on a specialized-domain corpus, with or without tuning strategies. Examples of tuning strategies are fine-tuning with target-domain data and prompt engineering. Pre-training in (ii) can also be domain-adaptive continual pre-training (i.e., pre-training on a specialized-domain corpus after pre-training on a general-domain corpus) or mixed-domain pre-training (i.e., pre-training on a mix of general and specialized-domain corpora simultaneously).

GPT-4 is an example of a general domain language model that has been studied in medical applications. Research has ranged from its utility as a medical chatbot [53] and in medical competency exams [33] to its applications in radiology [54, 11, 55, 56, 57, 58, 59], among others [60, 61, 62, 63]. Nevertheless, the models studied in the medical context are mostly domain-specific, either biomedical or clinical. These models include pre-trained models such as BioBERT [64], SciBERT [65], BioMedBERT [66], BioMegatron [67], ScholarBERT [68], BioGPT [69], and ClinicalBERT [70], as well as large language models such as Galactica [26], MedAlpaca [71], PMC-LLaMA [72], Med-PaLM 2 [73], GatorTron [74], GatorTronGPT [75], and ClinicalGPT [76].

Domain-specific models usually contain general domain data within the pre-training data, with exceptions such as BioMedBERT, Galactica, GatorTron, and GatorTronGPT. For large language models, instruction fine-tuning is the most common tuning technique, as in MedAlpaca, Med-PaLM 2, GatorTronGPT, and ClinicalGPT. Reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF) have also been adopted, although less frequently, with HuatuoGPT [77] being an example. Recent studies also indicate a multimodal trend that supports various types of healthcare data, including electronic health records (EHR), medical images, and medical sequence signals. Examples of these developments include LLaVAMed [78], MedAGI [79], OphGLM [80], Visual Med-Alpaca [81], MedFlamingo [82], and CheXzero [83].

3 Related Work

Comparative studies investigating language models are crucial to advance our understanding of them, shed light on their functionalities, and pinpoint their constraints. Despite previous research, a notable gap persists in the literature due to, among other causes, the current pace of development in NLP. This gap is particularly significant in fields that require heightened sensitivity, such as medicine, where a thorough understanding of models is imperative [45]. Existing research in medicine mainly focuses on specific tasks, datasets, or models [5, 39, 14, 11, 8]. Moreover, most of the discursive and practical assessments focus on LLMs, as can be seen below. To the best of our knowledge, there is no practical assessment in the clinical context that includes a large number of pre-trained models, covers all Transformer-based model families, and targets settings where only consumer-grade computing resources are available.

The work by He et al. [13] stands out among existing descriptive studies, comprehensively addressing the capabilities, limitations, development, and integration of language models in healthcare. The language models in scope are pre-trained and large language models. The development process is explained in detail, covering aspects such as training data, methodologies, and optimization strategies. Concerns related to the integration of LLMs into healthcare, such as fairness, accountability, transparency, and ethics, are also investigated.

Zhou et al. [84] also provide a comprehensive overview of the development and deployment of LLMs in medicine, together with the challenges and opportunities these models face. Their study is both discursive and practical, which is one of its highlights. The authors detail the principles of existing medical LLMs, comprising basic model structures, number of parameters, and data sources and scales used for model development. A comparison of the performance of different LLMs across various medical tasks, including against state-of-the-art lightweight models, is also provided.

Continuing with practical reviews, Soni et al. [85] assessed the cost-effectiveness of pre-training and fine-tuning in BERT, BioBERT, Clinical BERT, and XLNet for medical question answering tasks. Their results indicate that BERT-based models exhibit superior performance when fine-tuned with mixed datasets (i.e., general and clinical domain data), highlighting a gap in well-generalizable medical QA datasets. The results also suggest that initial fine-tuning on general domain datasets, such as SQuAD, before doing it on clinical datasets can enhance performance. Prompting techniques were not included in their evaluations.

In a similar vein, Jahan et al. [45] studied the impact of data size for fine-tuning and that of prompts in zero-shot learning on model performance. Four large language models are evaluated on six benchmark biomedical text processing tasks across 26 datasets. Zero-shot LLMs outperform state-of-the-art fine-tuned models, such as BioBERT, BioGPT, and BioBART, when fine-tuning data is scarce. As the amount of fine-tuning data increases, so does the performance of these state-of-the-art fine-tuned models, surpassing zero-shot LLMs. The study also highlights LLMs’ sensitivity to prompts, as variations in these led to significant differences in outcomes. No single LLM consistently excelled across all datasets and tasks. The authors advocate the training of biomedical LLMs on domain-specific corpora while recognizing LLMs’ potential for biomedical applications that lack large annotated data.

Lehman et al. [86] further explored whether LLMs trained primarily on general web text are suitable for highly specialized, safety-critical domains such as medicine, or if domain-specific models are a better alternative. A total of 12 language models, ranging from 220 million to 175 billion parameters, are evaluated on three clinical tasks. As part of the experiments, T5 models were trained from scratch using MIMIC-III and MIMIC-IV clinical notes to investigate the efficiency of clinical tokens. Their findings suggest that relatively small, specialized clinical models significantly outperform all in-context learning approaches, even when fine-tuned on limited annotated data. Neither the models’ ability to handle long texts nor decoder-only and instruction-tuned models are accounted for in their work.

Lastly, Li et al. [87] focus on pre-trained language models for long clinical text. A core limitation of Transformer-based models is their substantial memory consumption, leading to performance degradation in long clinical texts. To overcome this limitation, the authors pre-trained Longformer and BigBird, two long-sequence Transformers, on a large-scale clinical corpus, extending the maximum input length from 512 to 4 096. These models consistently and significantly outperformed ClinicalBERT and other short-sequence Transformers across ten tasks. According to the results, long-sequence Transformers enriched with clinical knowledge are thus capable of learning long-term dependencies in long clinical texts. Only encoder-only models are considered in their evaluations, and no generative tasks are included.

4 Methodology

A series of experiments on medical text classification and conditional text generation are carried out to better understand the behavior of language models under resource-constrained settings, i.e., settings with consumer-grade computing resources. In total, 53 language models are evaluated, ranging in size from 110 million to 13 billion parameters. The selection of these models spans the general, biomedical, and clinical knowledge domains and includes the three families of Transformer-based models. Moreover, only open-source models with up to 13 B parameters are considered. Details on the selected models are found in Table 1 and Appendix B.

All experiments are performed using a Quadro RTX 8000 GPU and CUDA version 12.2. To guarantee that the selected models align with consumer-grade computing resources, models with more than 8 billion parameters (i.e., OpenLLaMA 13B, Flan-T5-XXL, T5-V1.1-11B, and T0++) are run with float16 precision. By halving the floating-point precision, these 11 and 13 billion parameter model versions are still viable in computational resource-constrained settings.
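The effect of halving the precision can be checked with simple arithmetic. The sketch below estimates the memory needed for the weights alone, assuming 4 bytes per parameter for float32 and 2 bytes for float16; it deliberately ignores activations, caches, and framework overhead, so the figures are lower bounds rather than measured values. The function name is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(n_params, dtype):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for n_params in (11e9, 13e9):
        for dtype in ("float32", "float16"):
            print(f"{n_params / 1e9:.0f}B parameters, {dtype}: "
                  f"{weight_memory_gb(n_params, dtype):.0f} GB")
```

For a 13 billion parameter model, the weights alone drop from about 52 GB to about 26 GB when moving from float32 to float16.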

The three families of Transformer-based models are considered for the text classification task via different approaches (described in Section 4.1.2), whereas solely decoder-only models are used for the conditional text generation task. Transcriptions, MIMIC-CXR, and MS-CXR have been chosen as evaluation datasets. Transcriptions covers a broad spectrum of medical specialties, allowing a general assessment of medical knowledge. MIMIC-CXR and its labeled version, MS-CXR, enable testing focused on radiology, one of the most promising fields for AI integration, narrowing the evaluation to specialized medical knowledge.

Table 1: The models used in this study are categorized by their type, domain, and size. Each model is presented with its number of parameters and may have one or more superscripts. Superscripts are 0: model used for contextual embedding similarity, 1: model used for natural language inference (NLI), 2: model used for multiple-choice questions, 3: model used for text generation, †: instruction-tuned model, ‡: cross-encoder model.

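| ID | Model | Family | Domain | Size class | Parameters | Superscripts | Ref. |
|---|---|---|---|---|---|---|---|
| m00 | BERT_BASE | Encoder-only | General | S | 110 M | 0 | [16] |
| m01 | BERT_LARGE | Encoder-only | General | M | 340 M | 0 | [16] |
| m11 | NLI-DeBERTa_base | Encoder-only | General | S | 100 M | ‡, 1 | [88] |
| m12 | RoBERTa_LARGE-MNLI | Encoder-only | General | M | 355 M | ‡, 1 | [89] |
| m02 | BiomedBERT (abstracts + full text) | Encoder-only | Biomedical | S | 110 M | 0 | [66] |
| m03 | BiomedBERT (abstracts only) | Encoder-only | Biomedical | S | 110 M | 0 | [66] |
| m04 | BiomedBERT-large (abstracts only) | Encoder-only | Biomedical | M | 340 M | 0 | [66] |
| m05 | SciBERT | Encoder-only | Biomedical | S | 110 M | 0 | [65] |
| m06 | SapBERT | Encoder-only | Biomedical | S | 110 M | 0 | [90] |
| m07 | BioLORD-STAMB2-v1 | Encoder-only | Biomedical | S | 110 M | 0 | [91] |
| m08 | BioLORD-STAMB2-v1-STS2 | Encoder-only | Biomedical | S | 110 M | 0 | [91] |
| m09 | BioLORD-PMB | Encoder-only | Biomedical | S | 110 M | 0 | [91] |
| m10 | Bio+Clinical BERT | Encoder-only | Clinical | S | 110 M | 0 | [70] |
| m13 | BART Large-MNLI | Encoder-decoder | General | M | 407 M | 1 | [94] |
| m14 | T5-V1.1-Base | Encoder-decoder | General | S | 220 M | 2 | [92, 93] |
| m15 | T5-V1.1-Large | Encoder-decoder | General | L | 770 M | 2 | [92, 93] |
| m16 | T5-V1.1-3B | Encoder-decoder | General | XL | 3.0 B | 2 | [92, 93] |
| m17 | T5-V1.1-11B | Encoder-decoder | General | XXL | 11.0 B | 2 | [92, 93] |
| m18 | Flan-T5-Base | Encoder-decoder | General | S | 220 M | †, 2 | [17] |
| m19 | Flan-T5-Large | Encoder-decoder | General | L | 770 M | †, 2 | [17] |
| m20 | Flan-T5-XL | Encoder-decoder | General | XL | 3.0 B | †, 2 | [17] |
| m21 | Flan-T5-XXL | Encoder-decoder | General | XXL | 11.0 B | †, 2 | [17] |
| m22 | T0 3B | Encoder-decoder | General | XL | 3.0 B | †, 2 | [95] |
| m23 | T0++ | Encoder-decoder | General | XXL | 11.0 B | †, 2 | [95] |
| m24 | ClinicalT5-base | Encoder-decoder | Clinical | S | 220 M | 2 | [96] |
| m25 | ClinicalT5-large | Encoder-decoder | Clinical | L | 700 M | 2 | [96] |
| m26 | GPT-2 Medium | Decoder-only | General | M | 355 M | 3 | [19] |
| m27 | GPT-2 Large | Decoder-only | General | L | 774 M | 3 | [19] |
| m28 | GPT-2 XL | Decoder-only | General | XL | 1.5 B | 3 | [19] |
| m29 | Palmyra Base 5B | Decoder-only | General | XXL | 5.0 B | 2, 3 | [97] |
| m30 | Camel 5B | Decoder-only | General | XXL | 5.0 B | †, 2 | [99] |
| m31 | GPT-J 6B | Decoder-only | General | XXL | 6.0 B | 2, 3 | [100] |
| m32 | Instruct GPT-J | Decoder-only | General | XXL | 6.0 B | †, 2 | [101] |
| m33 | Falcon-7B | Decoder-only | General | XXL | 7.0 B | 2, 3 | [102] |
| m34 | Falcon-7B-Instruct | Decoder-only | General | XXL | 7.0 B | †, 2 | [102] |
| m35 | MPT-7B | Decoder-only | General | XXL | 7.0 B | 2, 3 | [103] |
| m36 | MPT-7B-Instruct | Decoder-only | General | XXL | 7.0 B | †, 2 | [103] |
| m37 | LLaMA-7B | Decoder-only | General | XXL | 7.0 B | 2, 3 | [27] |
| m38 | LLaMA 2-7B | Decoder-only | General | XXL | 7.0 B | 2, 3 | [28] |
| m39 | Alpaca 7B | Decoder-only | General | XXL | 7.0 B | †, 2 | [104] |
| m40 | LLaMA 2-CHAT-7B | Decoder-only | General | XXL | 7.0 B | †, 2 | [28] |
| m41 | OpenLLaMA 3B | Decoder-only | General | XL | 3.0 B | 3 | [98] |
| m42 | OpenLLaMA 3Bv2 | Decoder-only | General | XL | 3.0 B | 3 | [98] |
| m43 | OpenLLaMA 7B | Decoder-only | General | XXL | 7.0 B | 3 | [98] |
| m44 | OpenLLaMA 7Bv2 | Decoder-only | General | XXL | 7.0 B | 3 | [98] |
| m45 | OpenLLaMA 13B | Decoder-only | General | XXL | 13.0 B | 3 | [98] |
| m46 | GPT-2-PubMed Medium | Decoder-only | Biomedical / Scientific | M | 355 M | 3 | [105] |
| m47 | GPT-2-PubMed Large | Decoder-only | Biomedical / Scientific | L | 774 M | 3 | [105] |
| m48 | BioGPT | Decoder-only | Biomedical / Scientific | M | 347 M | 3 | [69] |
| m49 | BioGPT-Large | Decoder-only | Biomedical / Scientific | XL | 1.5 B | 3 | [69] |
| m50 | Galactica 1.3B | Decoder-only | Biomedical / Scientific | XL | 1.3 B | 3 | [26] |
| m51 | Galactica 6.7B | Decoder-only | Biomedical / Scientific | XXL | 6.7 B | 3 | [26] |
| m52 | MedAlpaca 7b | Decoder-only | Clinical | XXL | 7.0 B | †, 2 | [71] |

Size classes are Small (S), Medium (M), Large (L), XL, and XXL. No biomedical encoder-decoder model is included.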

4.1 Text classification

Text classification is addressed using the Transcriptions and MS-CXR datasets and three different approaches: (i) contextual embedding similarity, (ii) natural language inference (NLI), and (iii) multiple-choice question answering (MCQA). The contextual embedding similarity approach is intended for encoder-only models, the NLI approach for encoder-only and encoder-decoder models pre-trained for NLI, and the MCQA approach for encoder-decoder and decoder-only models.

Model tuning is implemented through zero-shot learning. To analyze the impact of prompting on text classification performance, different prompts are applied during inference. These prompts, grouped into two sets, are defined according to the classification approach. The first set of prompts is used for contextual embedding similarity and NLI. Since neither of these approaches requires a prompt to work, the no-prompt case is also included in the analysis. The second set of prompts is used for MCQA, an approach that needs a prompt to work. Prompts from the second set are defined based on those most commonly used in instruction-tuning models for multiple-choice question answering tasks.

Let $x \in X$ be a text sample and $y \in Y$ be a class, not necessarily corresponding to $x$. A prompt from the first prompt set, $p \in P_1$, is defined as a function of a prompt template and a label. For example, $p_1(y) =$ “This is an example of $y$”. The set $P_1$ is only applied to the classes. Meanwhile, a prompt from the second prompt set, $p \in P_2$, combines a prompt template (consisting of the prompt structure and a question), a text sample, and the classes. For example, $p(x, Y) =$ “You are a doctor and have the following information about a patient from a chest x-ray: $x$. What is the diagnosis? $Y$. (”. In this example, the prompt template consists of the question “What is the diagnosis?” and the prompt structure, which is the rest of the text. Prompts are presented in detail in Appendix C.
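The two example prompts can be written as small functions to show which parts are fixed and which are filled in. This is a sketch under assumptions: the function names are hypothetical, and the way the class set is rendered (lettered options such as "(a) ...") is an assumption made here for illustration, since the exact rendering is given in Appendix C rather than in this section. The class names in the usage example are illustrative as well.

```python
def p1(label):
    """Prompt from the first set: a template applied to a class label only."""
    return f"This is an example of {label}"


def p2(sample, classes):
    """Prompt from the second set: prompt structure, sample, question, and classes."""
    # Assumption: classes are rendered as lettered options, (a), (b), ...
    options = " ".join(f"({chr(ord('a') + i)}) {c}" for i, c in enumerate(classes))
    return (
        "You are a doctor and have the following information about a patient "
        f"from a chest x-ray: {sample}. What is the diagnosis? {options}. ("
    )


if __name__ == "__main__":
    print(p1("pneumonia"))
    print(p2("opacity in the left lower lobe", ["pneumonia", "pneumothorax"]))
```

Note that `p1` never sees the text sample, whereas `p2` embeds both the sample and all candidate classes in a single input.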

4.1.1 Datasets

The datasets evaluated in text classification are Transcriptions and MS-CXR. Each of these datasets is introduced below. Their preprocessing and characterization details are given in Appendix A.

Transcriptions is a multi-label collection of electronic health records (EHRs) covering many medical specialties. Preprocessing is applied to the data, removing null entries, organizing the EHR format, and selecting the final set of labels. After preprocessing, 2 074 samples and 29 classes are available. Performance is measured by the AUC score since the dataset is multi-label.

Due to the length of some EHRs, certain token vectors exceed the maximum input length allowed by some models. To cope with this limit, the input sequence is processed using a non-overlapping sliding window method [87], as detailed in Section 4.1.2.

MS-CXR is a multi-class dataset composed of X-ray report sections, each accompanied by annotations made by a radiologist [106, 107, 108]. There are 718 unique samples representing eight well-distributed classes. Preprocessing of this dataset consists of removing samples with missing information and duplicates. Contrary to Transcriptions, no sample exceeds the maximum allowed input length for any of the models. Performance is measured by accuracy, F1-score, precision, and recall in their macro-averaged version to ensure a comprehensive assessment.
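As a minimal sketch of what macro-averaging means here, the snippet below computes per-class precision, recall, and F1 and then takes their unweighted mean over the classes. The function name and the convention of scoring 0 when a ratio is undefined are assumptions of this illustration, not details taken from the evaluation code.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Assumption: an undefined ratio (zero denominator) counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    n = len(classes)
    return {
        "precision": sum(s[0] for s in per_class) / n,
        "recall": sum(s[1] for s in per_class) / n,
        "f1": sum(s[2] for s in per_class) / n,
    }


scores = macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(scores)
```

Because every class contributes equally to the mean, a rare class weighs as much as a frequent one.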

4.1.2 Approaches

Text classification is performed through (i) contextual embedding similarity, (ii) natural language inference, and (iii) multiple-choice question answering.

Contextual embedding similarity is grounded in the cosine similarity between the contextual embeddings of the sample text and the classes. The contextual or sentence embedding is determined by three distinct pooling strategies: CLS-token embedding, average token-level embedding pooling, and maximum token-level embedding pooling.

For this approach, encoder-only models are employed, with a total of 11 models evaluated. These models have a maximum input size of 512 tokens. Therefore, the samples’ token vectors that exceed this limit are processed with the non-overlapping sliding window method. The fragments are aggregated according to the pooling strategy, as follows.

  • CLS pooling: The contextual embedding of each fragment from a sample is computed as its CLS-token output embedding. These embeddings are then aggregated using the element-wise average to obtain the contextual embedding representing the sample.

  • Maximum pooling: The contextual embedding of each fragment from a sample is computed by applying the element-wise maximum at the token level over the output embeddings. These embeddings are then aggregated, again using the element-wise maximum, to obtain the contextual embedding representing the sample.

  • Average pooling: The contextual embedding of each fragment from a sample is computed by applying the element-wise average at the token level over the output embeddings. These embeddings are then aggregated using the element-wise weighted average to obtain the contextual embedding representing the sample. The weights correspond to the number of non-padding tokens in each fragment.
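The three aggregation rules above can be sketched in plain Python. This is an illustration under stated assumptions: each fragment is given as a list of output embeddings of its non-padding tokens, the CLS token is assumed to be the first token of each fragment, and the function names are hypothetical.

```python
import math


def token_pool(tokens, how):
    """Pool the token embeddings of one fragment element-wise."""
    dims = range(len(tokens[0]))
    if how == "max":
        return [max(t[d] for t in tokens) for d in dims]
    if how == "avg":
        return [sum(t[d] for t in tokens) / len(tokens) for d in dims]
    raise ValueError(f"unknown pooling: {how}")


def sample_embedding(fragments, how):
    """Aggregate fragment embeddings into a single sample embedding."""
    dims = range(len(fragments[0][0]))
    if how == "cls":
        # Assumption: the CLS embedding is the first token of each fragment.
        vecs = [frag[0] for frag in fragments]
        return [sum(v[d] for v in vecs) / len(vecs) for d in dims]
    if how == "max":
        vecs = [token_pool(frag, "max") for frag in fragments]
        return [max(v[d] for v in vecs) for d in dims]
    if how == "avg":
        vecs = [token_pool(frag, "avg") for frag in fragments]
        weights = [len(frag) for frag in fragments]  # non-padding token counts
        total = sum(weights)
        return [sum(w * v[d] for w, v in zip(weights, vecs)) / total for d in dims]
    raise ValueError(f"unknown pooling: {how}")


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


fragments = [[[1.0, 0.0], [3.0, 2.0]], [[5.0, 4.0]]]
for how in ("cls", "max", "avg"):
    print(how, sample_embedding(fragments, how))
```

Weighting the fragment averages by their token counts makes the result equal to the average over all non-padding tokens of the sample, so a short last fragment does not dominate. The predicted class would then be the one whose embedding has the highest cosine similarity with the sample embedding.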

Natural language inference is the task of determining whether a hypothesis is true (entailment), false (contradiction), or indeterminate (neutral) given a premise. When applied for text classification, the premise represents a test sample, and the hypothesis represents the classes. For multi-class datasets, the predicted label is calculated from the entailment logits of each hypothesized class. For multi-label datasets, the entailment and contradiction logits are transformed into binary probabilities, which indicate whether or not a particular hypothesized class is predicted. This could be viewed as having $n$ binary text classifiers, where $n$ is the number of classes.
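A minimal sketch of both decision rules follows. It assumes that the multi-class prediction is the class with the highest entailment logit and that the binary probability is a softmax over the entailment and contradiction logits with the neutral logit ignored; this is a common choice, but the text does not confirm the exact transformation. Function names are hypothetical.

```python
import math


def multiclass_prediction(entailment_logits):
    """Pick the class whose hypothesis has the highest entailment logit."""
    return max(entailment_logits, key=entailment_logits.get)


def multilabel_probability(entailment, contradiction):
    """Binary probability of a class from its entailment/contradiction logits.

    Assumption: softmax over the two logits; the neutral logit is dropped.
    """
    m = max(entailment, contradiction)  # subtract the max for stability
    e = math.exp(entailment - m)
    c = math.exp(contradiction - m)
    return e / (e + c)


print(multiclass_prediction({"Pneumonia": 1.2, "Edema": -0.4}))
print(multilabel_probability(2.0, -2.0))
```

In the multi-label case, each class is scored independently, which matches the view of $n$ binary classifiers.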

This approach employs encoder-only (cross-encoder) and encoder-decoder models. These models have a lower maximum input token size than some of the test samples. Therefore, the token vectors of these particular samples are processed with the non-overlapping sliding window method. They are divided into fragments, whose scores are calculated individually, and then these scores are averaged to get the score of the whole sample.
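The non-overlapping sliding window with score averaging can be sketched as follows. For simplicity, the sketch splits only the sample tokens and leaves out special tokens and the hypothesis that each fragment would also carry in practice; `score_fragment` is a hypothetical callable standing in for the model, and the input is assumed to be non-empty.

```python
def split_into_windows(token_ids, max_len):
    """Split a token sequence into consecutive, non-overlapping fragments."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


def sample_score(token_ids, max_len, score_fragment):
    """Score every fragment individually and average the fragment scores."""
    fragments = split_into_windows(token_ids, max_len)
    scores = [score_fragment(fragment) for fragment in fragments]
    return sum(scores) / len(scores)


tokens = list(range(5))
print(split_into_windows(tokens, 2))
# Toy scorer: the fragment length stands in for a model score.
print(sample_score(tokens, 2, len))
```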

Multiple-choice question answering enables generative models, i.e., encoder-decoder and decoder-only models, to perform text classification. A total of 27 models are assessed in this approach, including both pre-trained models and their instruction-tuned versions.

As multiple-choice question answering is not intended for an extensive number of choices, the Transcriptions dataset is evaluated using a reduced version with eleven classes instead of the 29 available. These eleven classes consist of the ten most frequent labels plus an “Other” class. The number of samples evaluated is not affected. Additionally, the models’ logit space has been constrained to align with the response options of a multiple-choice scenario and, thereby, allow for automated evaluation. The token identifiers associated with the feasible response options are determined and used to filter the logit space.
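The logit filtering can be sketched as below. The sketch assumes that each response option maps to a single token identifier and that the answer is read from the logits of the next generated token; the names and the toy vocabulary are hypothetical.

```python
def constrained_choice(logits, option_token_ids):
    """Return the option whose token has the highest logit.

    logits: next-token logits indexed by token id.
    option_token_ids: mapping from option label to its token id (assumed
    to be a single token per option).
    """
    return max(option_token_ids, key=lambda option: logits[option_token_ids[option]])


# Token 1 has the highest logit overall but is not a valid option,
# so it is ignored and the best feasible option is returned.
logits = [0.1, 5.0, 0.3, 0.2]
print(constrained_choice(logits, {"A": 2, "B": 3}))
```

Restricting the comparison to the option tokens guarantees that every output maps to a class, which is what makes automated evaluation possible.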

4.2 Conditional text generation task

Conditional text generation is assessed with the MIMIC-CXR dataset, using perplexity as the performance evaluation metric. Perplexity (PPL) is a measure of uncertainty on the value of a sample from a discrete probability distribution. Let $X = (x_0, x_1, \ldots, x_T)$ be a tokenized sequence, then

$$\mathrm{PPL}(X) = \exp\left\{ -\frac{1}{T} \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t}) \right\}$$

where $\log p_{\theta}(x_t \mid x_{<t})$ is the log-likelihood of the $t$-th token conditioned on the preceding tokens $x_{<t}$.
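As a small worked example of this definition, the sketch below computes perplexity from per-token log-likelihoods. It assumes natural logarithms and that the list holds one value for each predicted token $x_1, \ldots, x_T$; the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the log-likelihoods log p(x_t | x_<t), t = 1..T."""
    T = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / T)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four tokens.
print(perplexity([math.log(0.25)] * 4))
```

Lower values indicate that the model assigns higher probability to the observed text.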

Decoder-only models are employed for evaluation, with a total of 20 models considered. Among these models is Galactica, whose tokenizer lacks special tokens. As a consequence, two scenarios are analyzed: the inclusion and the non-inclusion of the beginning-of-sequence (BOS) token. The BOS token is a special token typically used by generative models to indicate the start of a text. In the first scenario, this token is included during tokenization, and perplexity is calculated from the first token in the texts. When a model’s tokenizer does not have a predefined BOS token, as is the case for Falcon-7B, the tokenizer’s first special token is used as the BOS token. In the second scenario, the BOS token is excluded, and perplexity is calculated from the text’s second token.

4.2.1 Dataset

The dataset evaluated in conditional text generation is MIMIC-CXR, introduced below. Details on its preprocessing and characterization are in Appendix A.

MIMIC-CXR is an X-ray reports dataset [109, 110, 108]. Relevant sections of these reports are extracted using the code provided by Johnson et al. [111, 112]. Subsequently, null and duplicate samples are removed, with the resulting dataset having 57 711 samples. None of these samples exceeds the maximum input size allowed for the proposed models.

5 Results and Discussion

The main findings are outlined below. For comparability, AUC scores reported in this section correspond to evaluating the eleven-class reduced version of Transcriptions (see Section 4.1.2). In addition, to ensure the robustness of the results, bootstrapping with 1 000 iterations is applied to each experiment. Supplementary results are found in Appendix D.
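The bootstrapping step can be sketched as follows. The sketch assumes that the evaluated samples are resampled with replacement and that the metric is recomputed on each resample; the function name, the seed, and the toy correctness vector are illustrative choices, not details from the evaluation code.

```python
import random
import statistics


def bootstrap(values, metric, n_iter=1000, seed=0):
    """Mean and standard deviation of a metric over bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_iter):
        # Draw n items with replacement from the original values.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(metric(resample))
    return statistics.mean(estimates), statistics.stdev(estimates)


def accuracy(flags):
    """Fraction of correct predictions (1 = correct, 0 = incorrect)."""
    return sum(flags) / len(flags)


mean, std = bootstrap([1, 0, 1, 1, 0, 1, 1, 1], accuracy)
print(mean, std)
```

The standard deviation of the bootstrap estimates is the spread reported next to each mean score.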

5.1 Text classification analysis

Table 2: Highest-performing models for text classification per approach and metric. The scores presented correspond to the mean and, in parentheses, the standard deviation over 1 000 bootstrap iterations. Approaches are encoded as follows: CES stands for contextual embedding similarity, NLI for natural language inference, and MCQA for multiple-choice question answering.

| Dataset | Metric | CES model | CES score | CES prompt | CES pooling | NLI model | NLI score | NLI prompt | MCQA model | MCQA score |
|---|---|---|---|---|---|---|---|---|---|---|
| MS-CXR | Accuracy | BioLORD-STAMB2-v1-STS2 | 69.68 (1.70) | x | Avg. | RoBERTa$_{\texttt{LARGE}}$-MNLI | 76.49 (1.59) | x | T0++ | **81.74 (1.45)** |
| MS-CXR | F1-score | BioLORD-STAMB2-v1-STS2 | 69.24 (1.67) | | Avg. | RoBERTa$_{\texttt{LARGE}}$-MNLI | 78.15 (1.44) | x | T0++ | **83.86 (1.24)** |
| MS-CXR | Precision | BioLORD-PMB | 83.34 (1.11) | | CLS | RoBERTa$_{\texttt{LARGE}}$-MNLI | 80.72 (1.42) | x | Alpaca 7B | **85.83 (0.95)** |
| MS-CXR | Recall | BioLORD-STAMB2-v1-STS2 | 72.62 (1.34) | | Avg. | RoBERTa$_{\texttt{LARGE}}$-MNLI | 82.27 (1.33) | x | T0++ | **89.22 (0.82)** |
| Transcriptions | AUC score | BioLORD-STAMB2-v1-STS2 | 89.03 (0.31) | x | Avg. | BART Large-MNLI | 80.75 (0.46) | x | Flan-T5-XXL | **92.37 (0.26)** |
Figure 2: Highest model classification scores achieved by approach for the evaluated datasets: (a) results for the MS-CXR dataset; (b) results for the Transcriptions dataset. Each point corresponds to the mean of 1 000 bootstrap iterations. Error bars are calculated as three times the standard deviation of the mean. The highest-performing models are consistent across datasets: BioLORD models (m07-m09) for contextual embedding similarity, MNLI fine-tuned RoBERTa and BART (m12-m13) for NLI, and the largest instruction-tuned models within the T5 family (m20-m23) and instruction-tuned models within the LLaMA family (m39-m40, m52) for multiple-choice QA. Overall, the larger instruction-tuned T5 models emerge as the top performers. The correspondence between the model and ID is found in Table 1.

The highest F1 and AUC scores are achieved with the largest instruction-tuned T5 models, i.e., Flan-T5 (m19-m21) and T0 (m22-m23), as shown in Table 2 and Fig. 2. Some of these scores are above 80% in the F1-score and 90% in the AUC score. These instruction-tuned T5 models stand out among all models considered, ranking within the top 10 highest-performing models on both datasets. Nevertheless, the optimal choice of models may vary when considering precision as the target metric, where Alpaca (m39) and LLaMA 2-CHAT-7B (m40) demonstrate high competence.

Conversely, the lowest F1 and AUC scores are paradoxically obtained with the base T5 models (m14-m17), as evidenced in Fig. 2. These models, along with their clinically fine-tuned versions (m24-m25), rank within the top 10 lowest-performing models on both datasets. Moreover, 100% (1/1) and 75% (6/8) of the models that underperform a random evaluator on the Transcriptions and MS-CXR datasets, respectively, are base or clinically fine-tuned T5 models.

Delving into each approach, BioLORD models (m07-m09) are consistently the best choice for the contextual embedding similarity approach. The MS-CXR dataset is relatively more complex than the Transcriptions dataset for these models, as reflected by their ranking in performance: 11th versus 3rd place, respectively. BART Large-MNLI (m13) represents the best overall model for the NLI approach. For both datasets, BART Large-MNLI is included in the top 10 highest-performing models, while RoBERTa$_{\texttt{LARGE}}$-MNLI (m12) only does so for the MS-CXR dataset. For the multiple-choice QA approach, instruction-tuned models stand out, which include instruction-tuned LLaMA models (m39-m40) in addition to the aforementioned instruction-tuned T5 models. Notably, LLaMA 2-CHAT-7B (m40) is within the top 10 highest-performing models on both datasets. These highest performers per approach consistently show results indicative of clinical knowledge or clinical notions.

The results of instruction-tuned T5 models support the feasibility of representing discriminative tasks as generative ones by framing them as instructions. These results also underline that generative tasks are not exclusive to decoder-only models, and text-to-text models may be a promising architecture to explore further. For example, versions of T5 tuned to instructions with 3B parameters (m20, m22) provide superior results to decoder-only models, almost three times larger, on both evaluated datasets.

Model size – More parameters alone do not always translate into better results

Figure 3: Analysis of the impact of the logarithm of size on model performance: (a) results for the MS-CXR dataset; (b) results for the Transcriptions dataset. Model performance is defined as the highest performance achieved per model over the configurations evaluated. Due to either the lack of size diversity or the low number of samples, Spearman’s coefficient, i.e., testing for monotonic relationships, is only reported for the multiple-choice QA approach. An analysis of this coefficient suggests that there is not enough evidence to establish the statistical significance of the correlation, as reflected by the p-values.

The experiments yield findings questioning the claim that larger models consistently deliver superior performance. The performance of the models as a function of the logarithm of their size is depicted in Fig. 3. Testing for monotonic relationships via Spearman’s correlation is only reported for the multiple-choice QA approach due to the lack of diversity in sizes or number of samples. There is insufficient evidence to conclude that the Spearman’s correlation between size and performance is statistically significant in either dataset.

The trend of performance improvement with increasing size is almost nonexistent in the contextual embedding similarity approach. As seen in Fig. 2, for instance, models such as SapBERT (m06) and BioLORD models (m07-m09), which excel in this approach, outperform models up to three times larger on both datasets. Within the same model family, the deltas in performance associated with increasing the number of parameters are inconclusive. BERT$_{\texttt{LARGE}}$ (m01) marginally outperforms BERT$_{\texttt{BASE}}$ (m00) on the Transcriptions dataset, whereas the opposite is observed in all metrics on the MS-CXR dataset. BiomedBERT-large (abstracts only) (m04) surpasses, albeit marginally, BiomedBERT (abstracts only) (m03) on both evaluated datasets, except in precision. Furthermore, performance gains are evidenced when more training data is used, as shown by comparing BiomedBERT (abstracts only) and BiomedBERT (abstracts + full text) (m02).

Similarly, the effect of increasing the size on performance is not sufficiently clear or strong in the multiple-choice question answering approach. While positive Spearman’s correlations are obtained, there is insufficient evidence to deem them statistically significant. None of the p-values are below 0.05, so the null hypothesis that the two variables have no ordinal correlation cannot be rejected. Within T5 models, the effect is minimal or inconsistent when considering their non-instruction-tuned versions (m14-m17, m24-m25). Within the instruction-tuned T5 models (m18-m23), a consistent positive effect of size on performance is observed for both Flan-T5 and T0 models on Transcriptions and only for the latter on MS-CXR.

On the other hand, the results in NLI align with the expectation that larger models lead to better performance. However, more models are needed to draw a solid conclusion. To estimate a lower bound on the performance gap between NLI-DeBERTa$_{\texttt{base}}$ (m11) and the largest models evaluated, the difference between the lowest value obtained by the largest models and the highest value obtained by NLI-DeBERTa$_{\texttt{base}}$ among the evaluated prompts is calculated. Considering all metrics, these differences range between 36.32 and 51.59 on MS-CXR, so the largest models always lead to a performance improvement. Reaching the same conclusion on the Transcriptions dataset is not straightforward, given the results obtained for RoBERTa$_{\texttt{LARGE}}$-MNLI (m12), as depicted in Fig. 2. This model’s performance is closer to the performance of NLI-DeBERTa$_{\texttt{base}}$ than to that of BART Large-MNLI (m13).

Altogether, the results do not provide sufficient evidence that only increasing the model size, in number of parameters, leads to an improvement in performance, whether comparing different or the same models. Although model size may be a relevant factor in determining performance, it is hypothesized that training data and objectives are more decisive in small pre-trained language models. This hypothesis aligns with findings in [10] and [12]. Expanding the sample size and diversity could be essential to validate these observations, considering a minimum of 30 or 35 models per approach.

Model domain – More than a specialized domain; model architecture, training data, and training objective

Current medical datasets remain relatively small compared to those of the general domain, covering only a tiny region of the medical knowledge space [84]. Domain specialization of models using only one of these datasets in question may limit their generalization ability [45].

The effectiveness of domain specialization in improving performance is not evident in the contextual embedding similarity approach, as displayed in Fig. 2. The domain-specific models considered in this approach are Bio+Clinical BERT (m10), BiomedBERT models (m02-m04), and SciBERT (m05). Bio+Clinical BERT achieves lower scores than expected, positioning around the middle of the performance ranking for this approach. Similarly, some of the BiomedBERT models are outperformed by BERT$_{\texttt{BASE}}$ (m00) and BERT$_{\texttt{LARGE}}$ (m01), their general domain counterparts. These findings, present in both datasets, challenge the superiority of domain-specific models over general domain ones in the task being evaluated via contextual embedding similarity.

Evidence supporting the effectiveness of domain specialization exists in the multiple-choice question answering approach, but it is still limited and unclear. The models to be compared are T5 models (m14-m15) versus their clinically specialized versions (m24-m25), and Alpaca (m39) versus MedAlpaca (m52). Differences between ClinicalT5 and T5 models are 5.75 and -4.11 in AUC scores and 5.24 and 0.00 in F1-scores. Similarly, differences between MedAlpaca and Alpaca are -1.86 in AUC scores and 26.97 in F1-scores. Given these values, it cannot be clearly stated that domain specialization positively impacts performance.

Considering the insights discussed and the remarkable performance of BioLORD (m07-m09) models, SapBERT (m06), Flan-T5 (m18-m21) models, and T0 (m22-m23) models in their respective approaches, the training data, training objectives, and model architectures are possibly critical in determining model generalization. Continual pre-training for named entity recognition or medical entity linkage using contrastive learning on UMLS data is likely one of the factors for the success of SapBERT and BioLORD models. Likewise, employing instruction-tuned text-to-text models represents a compelling approach to achieving high performance in multiple-choice QA. Due to the impossibility of concluding on the NLI approach, expanding the analysis to incorporate domain-specialized NLI models in biomedical and clinical domains could be valuable.

Prompting and instruction-tuning key to model performance

Figure 4: Distributions of the impact of prompting on model performance: (a) results for the MS-CXR dataset; (b) results for the Transcriptions dataset. In contextual embedding similarity and NLI, the impact of prompting is quantified as the difference in performance resulting from prompt usage, with positive values indicating improvement. As the distributions reveal, its usage only sometimes enhances performance. In multiple-choice QA, the impact of prompting is calculated as the variation in performance, expressed in standard deviations, when using different prompts. Optimal scenarios entail non-extreme values, suggesting that there is no strong dependence of performance on prompt wording. The distributions unveil some significant prompt-sensitive models in this case. These distributions are cut to the minimum and maximum observed values to avoid misleading remarks.

One of the central points of the study is to analyze the influence of prompting on the models and text classification approaches under investigation. Prompting impact is quantified as the difference in performance resulting from the prompt usage, with positive values indicating an improvement, in contextual embedding similarity and NLI. In multiple-choice QA, this impact is calculated as the variation in performance, expressed in standard deviations, when different instructions are used. The resulting distributions are shown in Fig. 4.

Using a prompt does not always confer benefits in contextual embedding similarity, as reflected by Fig. 4. On Transcriptions, the average impact on the AUC score is -2.25 points, with values ranging from -9.32 to 5.91. Using any of the proposed prompts improves performance for 45.45% of the model + pooling strategy combinations. In contrast, none of these prompts led to AUC score improvements for BioLORD-PMB, BiomedBERT models, BERT$_{\texttt{BASE}}$, and SciBERT. On MS-CXR, the impact of the prompt on performance is more positive on average, albeit with more variability. The average impact on the F1-score is 1.30 points, with values ranging from -25.43 to 40.40. Similar values are reported on accuracy, precision, and recall. Employing any of the proposed prompts benefits 69.70% to 84.85% of the model + pooling strategy combinations, depending on the metric. The performance of BiomedBERT (abstracts only) and BiomedBERT-large (abstracts only) is enhanced with any of the prompts, whereas the performance of the BioLORD models and Bio+Clinical BERT is hindered.

More consistent benefits are observed than in contextual embedding similarity when examining the prompt impact in the NLI approach. On Transcriptions, any of the proposed prompts yields performance improvements, with larger models benefiting the most. The average impact on the AUC score is 8.03 for BART Large-MNLI, 2.15 for NLI-DeBERTa$_{\texttt{base}}$, and 4.87 for RoBERTa$_{\texttt{LARGE}}$-MNLI. On MS-CXR, using a prompt only sometimes results in gains, particularly for NLI-DeBERTa$_{\texttt{base}}$. For this model, the average impact on the F1-score is -3.42, while for BART Large-MNLI and RoBERTa$_{\texttt{LARGE}}$-MNLI it is 2.18 and 2.78, respectively. Moreover, positive prompt impacts are only observed on precision for NLI-DeBERTa$_{\texttt{base}}$. In both datasets, certain prompts have a high positive impact, whereas others do not, mostly independent of the model.

Similarly, prompt importance is also evident in the multiple-choice question answering approach, given its significant observed influence on model performance. The proportion of models performing better than a random evaluator (AUC score of 50%) on Transcriptions increases from 52% to 96% with appropriate prompts. Similarly, the proportion of models performing better than a random evaluator (F1-score of 12.5%) on MS-CXR rises from 25% to 85%. Prompting importance is thus highlighted not only by the high performance achieved but also by the brittleness of the models. The latter is reflected by the variability in Fig. 4, and further supported by Figs. 17 and 18 in Appendix D. Between datasets, the highest sensitivity to the prompt is found when evaluating Transcriptions, such that, with certain prompts, the instruction-tuned models yield similar results to their base counterparts. Overall, no single prompt works universally well for all models.

Regarding instruction-tuning, instruction-tuned models generally outperform their non-instruction-tuned counterparts. The instruction-tuned T5 versions, whether T0 or Flan-T5, in any size considered, outperform their base counterparts. Instruction-tuning also improves performance consistently for the LLaMA models, whereas this is not always the case for other generative models: MPT and GPT-J are exceptions on the Transcriptions dataset and Falcon on the MS-CXR dataset. Overall, this tuning technique represents a gain, with an average increase of 21.45 points in the AUC score and 43.55 points in the F1-score.

In summary, the results endorse the crucial role of the prompt and its wording in the model’s performance, with both positive and negative effects presented. Consequently, we advocate using prompts and advanced prompting techniques to guide the model toward better results. This process should also not be limited to a single prompt due to the observed and well-known phenomenon of prompt brittleness [13]. Regarding instruction tuning, this technique proves to be beneficial for the models. More details on the prompt impact can be found in Figs. 19 and 20 in Appendix D.

5.2 Conditional text generation analysis

Figure 5: Mean perplexity scores for the MIMIC-CXR dataset, disaggregated by BOS token usage. Each point corresponds to the mean of 1 000 bootstrap iterations. Error bars are calculated as three times the standard deviation of the mean. The highest performers are the LLaMA models (m38-m39), whereas the lowest performers are the BioGPT models (m48-m49). Not using the BOS token is beneficial for 77.78% (14/18) of the models, with the exceptions of the GPT-2 models (m26-m28) and Palmyra Base 5B (m29). The correspondence between model and ID is found in Table 1.

LLaMA models (m38-m39) stand out as the ones with the highest predictive capacity among the models evaluated. Particularly, LLaMA 2-7B (m38) is the highest performer, with a mean perplexity of 9.12 when including the BOS token and 8.21 when not. LLaMA models are also notable for the low standard deviation of their mean, with approximate values of 0.05 and 0.13 depending on the BOS token usage. These standard deviations indicate higher confidence in the estimated value of the mean.

Conversely, BioGPT models (m48-m49) are the models with the most significant difficulty in comprehending the dataset. BioGPT (m48), the lowest performer, presents a mean perplexity of 80.34 when including the BOS token and 38.70 when not. The variability of the mean of these models is among the highest observed, with approximate standard deviations of 3.15 and 0.44 depending on the BOS token usage. These results are paradoxical considering that BioGPT is domain-specific while LLaMA 2-7B is not.

Similarly to previous findings for text classification, domain specialization does not necessarily imply surpassing general domain models. For the medium-size domain-specific models, it is observed that BioGPT (m48) does not outperform any of the general domain models, while GPT-2-PubMed Medium (m46) does. For the large size models, domain specialization proves beneficial; whereas for the XL and XXL sizes, neither Galactica (m50-m51) nor BioGPT-Large (m49) clearly outperforms general domain models. Consequently, the only specialized models that prove advantageous are GPT-2-PubMed (m46-m47).

On the other hand, increasing the model size contributes to improved performance, regardless of whether or not the BOS token is included. A slight performance improvement is also observed for the second versions (m42, m44) versus the first versions (m41, m43) of OpenLLaMA. This improvement is on average 1.58 and 1.95 points in perplexity for the 3B and 7B parameter versions, respectively. Considering that the difference between these versions of OpenLLaMA is the dataset used for pre-training, the results obtained for conditional text generation do not contradict those for text classification.

Furthermore, the standard deviations of perplexity reveal the presence of exceptionally challenging samples for the models, that is, outliers, which are visually depicted in Fig. 21 in Appendix D. Moderate outliers, above quantile 0.75 by 1.5 times the IQR, represent between 7% and 11% of the data, with BioGPT models having the highest percentages. Extreme outliers, above quantile 0.75 by three times the IQR, make up between 4% and 7% of the data, with most models exhibiting percentages around 4% and 5%.
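The two outlier thresholds can be illustrated with a short sketch. It assumes that quartiles are computed with linear interpolation (the "inclusive" method of Python's `statistics.quantiles`) and that the moderate share counts every value above the 1.5 IQR fence, extreme ones included; neither choice is confirmed by the text, and the function name is hypothetical.

```python
import statistics


def outlier_shares(values):
    """Shares of values above Q3 + 1.5 * IQR and above Q3 + 3 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    moderate = sum(v > q3 + 1.5 * iqr for v in values) / len(values)
    extreme = sum(v > q3 + 3.0 * iqr for v in values) / len(values)
    return moderate, extreme


# Toy per-sample perplexities with one very hard sample.
print(outlier_shares([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Only the upper fences are checked because the interest lies in samples with unusually high perplexity.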

Groups of generative models – LLaMA and GPT-2

Figure 6: Dendrograms of the UMAP’s principal components after being applied to the perplexities: (a) perplexities with BOS token; (b) perplexities without BOS token. Two major clusters of models are observed: the GPT-2 models and the LLaMA models.

Two procedures are carried out to determine whether the models exhibit similar perplexity behavior and identify potential clusters among them. The first procedure involves calculating the correlations between the models. Spearman’s and Pearson correlations are considered, assessing monotonic and linear relationships, respectively. The second procedure consists of dimensionality reduction via UMAP, followed by hierarchical clustering, represented by dendrograms in Fig. 6. Both procedures reveal the existence of two main groups of models: the GPT-2 and the LLaMA models.
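The first procedure can be sketched in plain Python. The snippet computes the Pearson correlation directly and the Spearman correlation as the Pearson correlation of average ranks; it covers neither significance testing nor the UMAP and clustering steps. The function names and the toy perplexity vectors are illustrative assumptions.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation (linear relationship)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation (monotonic relationship)."""
    return pearson(average_ranks(x), average_ranks(y))


# Per-sample perplexities of two hypothetical models on the same samples.
model_a = [8.0, 9.5, 12.0, 30.0]
model_b = [10.0, 12.0, 20.0, 90.0]
print(pearson(model_a, model_b), spearman(model_a, model_b))
```

A Spearman value close to 1 with a lower Pearson value would indicate that two models rank the samples similarly in difficulty even though their perplexities are not linearly related.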

In general, all models are positively correlated, indicating that most samples have a similar relative difficulty for these models. BioGPT models (m48-m49) are the only exception to this. Looking further at the Pearson correlations, clustering patterns emerge, with groups such as the LLaMA, the OpenLLaMA, and the GPT-2 models being identified. Although these clusters are somewhat expected, some unexpected associations are also evident, such as between Falcon-7B and MPT-7B and between Palmyra Base 5B and GPT-J 6B. Moreover, linear relationships between the LLaMA and OpenLLaMA models weaken, interestingly, when the BOS token is used, indicating more pronounced performance disparities. Possibly, training data plays a role, as it is essentially their main difference [98].

6 Conclusion

This study comprehensively explores small pre-trained language models with varying sizes, architectural families, and domains. The 52 models considered are tested on two fundamental medical natural language processing tasks: text classification and conditional text generation. The size of the models ranges from 110 million to 13 billion parameters, which is relatively small compared to recent language models but suitable for consumer-grade computing resources. Our findings have significant implications, particularly for researchers and organizations operating under computational resource-constrained settings.

For the text classification task, three distinct approaches are explored: contextual embedding similarity, natural language inference, and multiple-choice question answering. BioLORD and SapBERT models have demonstrated remarkable performance in text classification via contextual embedding similarity. Similarly, the instruction-tuned versions of T5, namely Flan-T5 and T0, followed by the instruction-tuned versions of LLaMA, have exhibited outstanding results in the multiple-choice question answering approach. Flan-T5 and T0 are remarkably good in both general medical and radiology-specific knowledge assessments. To fully understand NLI models’ potential, further exploration of this approach is needed, particularly in specialized domains.

A common thread running through our findings is the significance of the prompt in improving text classification performance across different datasets and approaches. This significance extends beyond performance gains: prompts present a viable alternative to the resource-intensive processes of training and fine-tuning language models, which are often associated with substantial financial and environmental costs. Effective prompt engineering is also essential to mitigate prompt brittleness, ensuring more robust and reliable outcomes. As prompt brittleness is evidenced during the study, and given its importance, further exploration in this line of research is recommended.

Medical datasets often remain relatively small and cover only a small region of the medical knowledge space [84], so domain-specific models specialized using these datasets might see their generalization ability hindered. This practice could explain, to some extent, the results obtained. The results also suggest that the architecture, training data, and training objectives are crucial in determining the model’s generalization abilities, possibly outweighing the relevance of model size as a single variable.

For the conditional text generation task, LLaMA models stand out due to their low perplexities with minimal variation. Two groups of models are also identified based on the perplexities obtained in MIMIC-CXR: a group consisting of GPT-2 models and another of LLaMA models. Further research is needed to identify and understand the outliers within these results, as they could hold important insights.

In conclusion, this research highlights the critical role of prompts in language model inference and reaffirms the effectiveness of instruction-tuned generative models in addressing downstream tasks. It also underscores the relevance of model architecture, training data, and training objectives, potentially even more so than model size alone, for a model’s generalization capacity. We advocate for further investigations into topics such as model calibration, i.e., how certain the model is about its output, prompt engineering and tuning, and performance concerning issues like hallucinations and biases, among others. Such studies can lead to more effective and ethical applications of language models in healthcare. Extensions to include quantized models and more medical NLP tasks will be considered in further research. Quantization is an interesting and promising approach to making LLMs viable on consumer-grade computing resources.

References

  • [1] W. F. Wiggins and A. S. Tejani, “On the Opportunities and Risks of Foundation Models for Natural Language Processing in Radiology,” Radiology: Artificial Intelligence, vol. 4, no. 4, p. e220119, Jul. 2022.
  • [2] N. H. Shah, D. Entwistle, and M. A. Pfeffer, “Creation and Adoption of Large Language Models in Medicine,” JAMA, vol. 330, no. 9, pp. 866–869, Sep. 2023.
  • [3] J. Wei et al., “Emergent Abilities of Large Language Models,” Transactions on Machine Learning Research, 2022.
  • [4] J. A. Omiye, H. Gui, S. J. Rezaei, J. Zou, and R. Daneshjou, “Large language models in medicine: The potentials and pitfalls: A narrative review,” Ann. Intern. Med., vol. 177, no. 2, pp. 210–220, Feb. 2024.
  • [5] V. Liévin, C. E. Hother, A. G. Motzfeldt, and O. Winther, “Can large language models reason about medical questions?” Patterns, vol. 5, no. 3, p. 100943, 2024.
  • [6] A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, I. Guyon et al., Eds., vol. 30.   Curran Associates, Inc., 2017.
  • [7] G. Kuling, B. Curpen, and A. L. Martel, “BI-RADS BERT and Using Section Segmentation to Understand Radiology Reports,” Journal of Imaging, vol. 8, no. 5, p. 131, 2022.
  • [8] K. Singhal et al., “Large language models encode clinical knowledge,” Nature, vol. 620, no. 7972, pp. 172–180, Aug. 2023.
  • [9] W. X. Zhao et al., “A Survey of Large Language Models,” 2023, arXiv:2303.18223 [cs.CL].
  • [10] J. Hoffmann et al., “An empirical analysis of compute-optimal large language model training,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 30 016–30 030.
  • [11] Q. Liu et al., “Exploring the Boundaries of GPT-4 in Radiology,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.   Association for Computational Linguistics, Dec. 2023, pp. 14 414–14 445.
  • [12] M. Abdin et al., “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone,” 2024, arXiv:2404.14219 [cs.CL].
  • [13] K. He et al., “A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics,” 2024, arXiv:2310.05694 [cs.CL].
  • [14] L. Tang et al., “Evaluating large language models on medical evidence summarization,” npj Digital Medicine, vol. 6, no. 1, p. 158, Aug. 2023.
  • [15] M. E. Peters et al., “Deep Contextualized Word Representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds.   Association for Computational Linguistics, Jun. 2018, pp. 2227–2237.
  • [16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds.   Association for Computational Linguistics, Jun. 2019, pp. 4171–4186.
  • [17] H. W. Chung et al., “Scaling Instruction-Finetuned Language Models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024.
  • [18] T. Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33.   Curran Associates, Inc., 2020, pp. 1877–1901.
  • [19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI, Tech. Rep., 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:160025533
  • [20] A. Chowdhery et al., “PaLM: Scaling Language Modeling with Pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
  • [21] J. W. Rae et al., “Scaling Language Models: Methods, Analysis & Insights from Training Gopher,” 2022, arXiv:2112.11446 [cs.CL].
  • [22] J. Wei et al., “Finetuned Language Models are Zero-Shot Learners,” in International Conference on Learning Representations, 2022.
  • [23] D. Hendrycks et al., “Measuring Massive Multitask Language Understanding,” in International Conference on Learning Representations, 2021.
  • [24] J. Kaplan et al., “Scaling Laws for Neural Language Models,” 2020, arXiv:2001.08361 [cs.LG].
  • [25] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma, “Explaining neural scaling laws,” Proceedings of the National Academy of Sciences, vol. 121, no. 27, p. e2311878121, 2024.
  • [26] R. Taylor et al., “Galactica: A Large Language Model for Science,” 2022, arXiv:2211.09085 [cs.CL].
  • [27] H. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models,” 2023, arXiv:2302.13971 [cs.CL].
  • [28] ——, “Llama 2: Open Foundation and Fine-Tuned Chat Models,” 2023, arXiv:2307.09288 [cs.CL].
  • [29] Anthropic, “Introducing the next generation of Claude,” Mar. 2024. [Online]. Available: https://www.anthropic.com/news/claude-3-family
  • [30] Gemini Team et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” 2024, arXiv:2403.05530 [cs.CL].
  • [31] A. Q. Jiang et al., “Mistral 7B,” 2023, arXiv:2310.06825 [cs.CL].
  • [32] S. Bubeck et al., “Sparks of Artificial General Intelligence: Early experiments with GPT-4,” 2023, arXiv:2303.12712 [cs.CL].
  • [33] H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, “Capabilities of GPT-4 on Medical Challenge Problems,” 2023, arXiv:2303.13375 [cs.CL].
  • [34] R. Mao, G. Chen, X. Zhang, F. Guerin, and E. Cambria, “GPTEval: A Survey on Assessments of ChatGPT and GPT-4,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).   ELRA and ICCL, May 2024, pp. 7844–7866.
  • [35] J. López Espejel, E. H. Ettifouri, M. S. Yahaya Alassan, E. M. Chouham, and W. Dahhane, “GPT-3.5, GPT-4, or BARD? Evaluating LLMs reasoning ability in zero-shot setting and performance boosting through prompts,” Natural Language Processing Journal, vol. 5, p. 100032, 2023.
  • [36] P. Liang et al., “Holistic Evaluation of Language Models,” Transactions on Machine Learning Research, 2023.
  • [37] H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou, and Y. Zhang, “Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4,” 2023, arXiv:2304.03439 [cs.CL].
  • [38] R. Schaeffer, B. Miranda, and S. Koyejo, “Are Emergent Abilities of Large Language Models a Mirage?” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 55 565–55 581.
  • [39] M. Agrawal, S. Hegselmann, H. Lang, Y. Kim, and D. Sontag, “Large language models are few-shot clinical information extractors,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.   Association for Computational Linguistics, Dec. 2022, pp. 1998–2022.
  • [40] V. Sanh et al., “Multitask Prompted Training Enables Zero-Shot Task Generalization,” in International Conference on Learning Representations, 2022.
  • [41] A. Lampinen et al., “Can language models learn from explanations in context?” in Findings of the Association for Computational Linguistics: EMNLP 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.   Association for Computational Linguistics, Dec. 2022, pp. 537–563.
  • [42] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large Language Models are Zero-Shot Reasoners,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 22 199–22 213.
  • [43] J. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 24 824–24 837.
  • [44] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer, “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), R. Barzilay and M.-Y. Kan, Eds.   Association for Computational Linguistics, Jul. 2017, pp. 1601–1611.
  • [45] I. Jahan, M. T. R. Laskar, C. Peng, and J. X. Huang, “A comprehensive evaluation of large language models on benchmark biomedical text processing tasks,” Computers in Biology and Medicine, vol. 171, p. 108189, 2024.
  • [46] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [47] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202.   PMLR, July 2023, pp. 38 087–38 099.
  • [48] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: A survey,” ACM Comput. Surv., vol. 55, no. 6, December 2022.
  • [49] S. Li et al., “Evaluating quantized large language models,” in Forty-first International Conference on Machine Learning, 2024.
  • [50] S. Kim et al., “SqueezeLLM: Dense-and-sparse quantization,” in Forty-first International Conference on Machine Learning, 2024.
  • [51] J. Guo et al., “Compressing large language models by joint sparsification and quantization,” in Forty-first International Conference on Machine Learning, 2024.
  • [52] R. Jin et al., “A comprehensive evaluation of quantization strategies for large language models,” in Findings of the Association for Computational Linguistics ACL 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds.   Association for Computational Linguistics, August 2024, pp. 12 186–12 215.
  • [53] P. Lee, S. Bubeck, and J. Petro, “Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine,” New England Journal of Medicine, vol. 388, no. 13, pp. 1233–1239, 2023.
  • [54] M. A. Fink, “Große Sprachmodelle wie ChatGPT und GPT-4 für eine patientenzentrierte Radiologie [Large language models such as ChatGPT and GPT-4 for patient-centered care in radiology],” Radiologie, vol. 63, no. 9, pp. 665–671, Sep. 2023.
  • [55] Q. Lyu et al., “Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential,” Visual Computing for Industry, Biomedicine, and Art, vol. 6, no. 1, p. 9, May 2023.
  • [56] L. C. Adams et al., “Leveraging GPT-4 for Post Hoc Transformation of Free-text Radiology Reports into Structured Reporting: A Multilingual Feasibility Study,” Radiology, vol. 307, no. 4, p. e230725, 2023.
  • [57] R. Bhayana, R. R. Bleakney, and S. Krishna, “GPT-4 in Radiology: Improvements in Advanced Reasoning,” Radiology, vol. 307, no. 5, p. e230987, 2023.
  • [58] Z. Wu et al., “Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task,” 2023, arXiv:2304.09138 [cs.CL].
  • [59] M. Ranjit, G. Ganapathy, R. Manuel, and T. Ganu, “Retrieval Augmented Chest X-Ray Report Generation using OpenAI GPT models,” in Proceedings of the 8th Machine Learning for Healthcare Conference, ser. Proceedings of Machine Learning Research, K. Deshpande et al., Eds., vol. 219.   PMLR, Aug. 2023, pp. 650–666.
  • [60] B. Meskó and E. J. Topol, “The imperative for regulatory oversight of large language models (or generative AI) in healthcare,” npj Digital Medicine, vol. 6, no. 1, p. 120, Jul. 2023.
  • [61] D. Gala and A. N. Makaryus, “The Utility of Language Models in Cardiology: A Narrative Review of the Benefits and Concerns of ChatGPT-4,” International Journal of Environmental Research and Public Health, vol. 20, no. 15, 2023.
  • [62] S. B. Atallah, N. R. Banda, A. Banda, and N. A. Roeck, “How large language models including generative pre-trained transformer (GPT) 3 and 4 will impact medicine and surgery,” Techniques in Coloproctology, vol. 27, no. 8, pp. 609–614, Aug. 2023.
  • [63] K. Cheng, Q. Guo, Y. He, Y. Lu, S. Gu, and H. Wu, “Exploring the Potential of GPT-4 in Biomedical Engineering: The Dawn of a New Era,” Annals of Biomedical Engineering, vol. 51, no. 8, pp. 1645–1653, Aug. 2023.
  • [64] J. Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, Sep. 2019.
  • [65] I. Beltagy, K. Lo, and A. Cohan, “SciBERT: A Pretrained Language Model for Scientific Text,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan, Eds.   Association for Computational Linguistics, Nov. 2019, pp. 3613–3618.
  • [66] Y. Gu et al., “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing,” ACM Trans. Comput. Heal., vol. 3, no. 1, pp. 2:1–2:23, Oct. 2022.
  • [67] H. Shin et al., “BioMegatron: Larger Biomedical Domain Language Model,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, Eds.   Association for Computational Linguistics, Nov. 2020, pp. 4700–4706.
  • [68] Z. Hong, A. Ajith, J. G. Pauloski, E. Duede, K. Chard, and I. T. Foster, “The Diminishing Returns of Masked Language Models to Science,” in Findings of the Association for Computational Linguistics: ACL 2023, A. Rogers, J. L. Boyd-Graber, and N. Okazaki, Eds.   Association for Computational Linguistics, 2023, pp. 1270–1283.
  • [69] R. Luo et al., “BioGPT: generative pre-trained transformer for biomedical text generation and mining,” Briefings in Bioinformatics, vol. 23, no. 6, Sep. 2022.
  • [70] E. Alsentzer et al., “Publicly Available Clinical BERT Embeddings,” in Proceedings of the 2nd Clinical Natural Language Processing Workshop.   Association for Computational Linguistics, Jun. 2019, pp. 72–78.
  • [71] T. Han et al., “MedAlpaca – An Open-Source Collection of Medical Conversational AI Models and Training Data,” 2023, arXiv:2304.08247 [cs.CL].
  • [72] C. Wu, W. Lin, X. Zhang, Y. Zhang, W. Xie, and Y. Wang, “PMC-LLaMA: toward building open-source language models for medicine,” Journal of the American Medical Informatics Association: JAMIA, vol. 31, no. 9, pp. 1833–1843, Apr. 2024.
  • [73] K. Singhal et al., “Towards Expert-Level Medical Question Answering with Large Language Models,” 2023, arXiv:2305.09617 [cs.CL].
  • [74] X. Yang et al., “A large language model for electronic health records,” npj Digital Medicine, vol. 5, no. 1, p. 194, Dec. 2022.
  • [75] C. Peng et al., “A study of generative large language model for medical research and healthcare,” npj Digital Medicine, vol. 6, no. 1, p. 210, Nov. 2023.
  • [76] G. Wang, G. Yang, Z. Du, L. Fan, and X. Li, “ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation,” 2023, arXiv:2306.09968 [cs.CL].
  • [77] H. Zhang et al., “HuatuoGPT, Towards Taming Language Model to Be a Doctor,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds.   Association for Computational Linguistics, Dec. 2023, pp. 10 859–10 885.
  • [78] C. Li et al., “LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 28 541–28 564.
  • [79] J. Zhou, X. Chen, and X. Gao, “Path to Medical AGI: Unify Domain-specific Medical LLMs with the Lowest Cost,” 2023, arXiv:2306.10765 [cs.AI].
  • [80] W. Gao et al., “OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue,” 2023, arXiv:2306.12174 [cs.CV].
  • [81] C. Shu, B. Chen, F. Liu, Z. Fu, E. Shareghi, and N. Collier, “Visual Med-Alpaca: A Parameter-Efficient Biomedical LLM with Visual Capabilities,” 2023. [Online]. Available: https://github.com/cambridgeltl/visual-med-alpaca
  • [82] M. Moor et al., “Med-Flamingo: a Multimodal Medical Few-shot Learner,” in Proceedings of the 3rd Machine Learning for Health Symposium, ser. Proceedings of Machine Learning Research, S. Hegselmann et al., Eds., vol. 225.   PMLR, Dec. 2023, pp. 353–367.
  • [83] E. Tiu, E. Talius, P. Patel, C. P. Langlotz, A. Y. Ng, and P. Rajpurkar, “Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning,” Nature Biomedical Engineering, vol. 6, no. 12, pp. 1399–1406, Dec. 2022.
  • [84] H. Zhou et al., “A Survey of Large Language Models in Medicine: Progress, Application, and Challenge,” 2024, arXiv:2311.05112 [cs.CL].
  • [85] S. Soni and K. Roberts, “Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering,” in Proceedings of the 12th Language Resources and Evaluation Conference, N. Calzolari et al., Eds.   European Language Resources Association, May 2020, pp. 5532–5538.
  • [86] E. Lehman et al., “Do we still need clinical language models?” in Proceedings of the Conference on Health, Inference, and Learning, ser. Proceedings of Machine Learning Research, B. J. Mortazavi, T. Sarker, A. Beam, and J. C. Ho, Eds., vol. 209.   PMLR, Aug. 2023, pp. 578–597.
  • [87] Y. Li, R. M. Wehbe, F. S. Ahmad, H. Wang, and Y. Luo, “A comparative study of pretrained language models for long clinical text,” Journal of the American Medical Informatics Association, vol. 30, no. 2, pp. 340–347, Nov. 2022.
  • [88] Sentence Transformers - Cross-Encoders, “cross-encoder/nli-deberta-base,” 2021. [Online]. Available: https://huggingface.co/cross-encoder/nli-deberta-base
  • [89] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” 2019, arXiv:1907.11692 [cs.CL].
  • [90] F. Liu, E. Shareghi, Z. Meng, M. Basaldella, and N. Collier, “Self-Alignment Pretraining for Biomedical Entity Representations,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies NAACL-HLT 2021, K. Toutanova et al., Eds.   Association for Computational Linguistics, Jun. 2021, pp. 4228–4238.
  • [91] F. Remy, K. Demuynck, and T. Demeester, “BioLORD: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.   Association for Computational Linguistics, Dec. 2022, pp. 1454–1465.
  • [92] C. Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
  • [93] Google, “google/t5-v1_1,” 2023. [Online]. Available: https://huggingface.co/google
  • [94] AI at Meta, “facebook/bart-large-mnli,” 2023. [Online]. Available: https://huggingface.co/facebook/bart-large-mnli
  • [95] V. Sanh et al., “Multitask Prompted Training Enables Zero-Shot Task Generalization,” in The Tenth International Conference on Learning Representations, ICLR 2022.   OpenReview.net, 2022.
  • [96] Q. Lu, D. Dou, and T. Nguyen, “ClinicalT5: A Generative Language Model for Clinical Text,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.   Association for Computational Linguistics, Dec. 2022, pp. 5436–5443.
  • [97] Writer Engineering team, “Palmyra-base Parameter Autoregressive Language Model,” Jan. 2023. [Online]. Available: https://dev.writer.com
  • [98] X. Geng and H. Liu, “OpenLLaMA: An Open Reproduction of LLaMA,” May 2023. [Online]. Available: https://github.com/openlm-research/open_llama
  • [99] Writer Engineering team, “Camel-5B InstructGPT,” Apr. 2023. [Online]. Available: https://dev.writer.com
  • [100] B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model,” May 2021. [Online]. Available: https://github.com/kingoflolz/mesh-transformer-jax
  • [101] NLP Cloud, “nlpcloud/instruct-gpt-j-fp16,” 2023. [Online]. Available: https://huggingface.co/nlpcloud/instruct-gpt-j-fp16
  • [102] E. Almazrouei et al., “The Falcon Series of Open Language Models,” 2023, arXiv:2311.16867 [cs.CL].
  • [103] MosaicML NLP Team, “Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs,” May 2023. [Online]. Available: www.mosaicml.com/blog/mpt-7b
  • [104] R. Taori et al., “Stanford Alpaca: An Instruction-following LLaMA model,” GitHub, 2023. [Online]. Available: https://github.com/tatsu-lab/stanford_alpaca
  • [105] Y. Papanikolaou, “healx/gpt-2-pubmed,” 2020. [Online]. Available: https://huggingface.co/healx
  • [106] B. Boecking et al., “MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing (version 0.1),” PhysioNet, 2022. [Online]. Available: https://doi.org/10.13026/b90j-vb87
  • [107] ——, “Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing,” in Computer Vision – ECCV 2022: 17th European Conference.   Cham: Springer Nature Switzerland, Oct. 2022, pp. 1–21. [Online]. Available: https://doi.org/10.1007/978-3-031-20059-5_1
  • [108] A. L. Goldberger et al., “PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals,” Circulation [Online], vol. 101, no. 23, pp. e215–e220, Jun. 2000.
  • [109] A. E. W. Johnson, T. Pollard, R. Mark, S. Berkowitz, and S. Horng, “The MIMIC-CXR Database,” PhysioNet, 2019. [Online]. Available: https://doi.org/10.13026/C2JT1Q
  • [110] A. E. W. Johnson et al., “MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports,” Sci Data, vol. 6, no. 1, p. 317, 2019. [Online]. Available: https://doi.org/10.1038/s41597-019-0322-0
  • [111] A. E. W. Johnson, D. J. Stone, L. A. Celi, and T. J. Pollard, “The MIMIC Code Repository: enabling reproducibility in critical care research,” Journal of the American Medical Informatics Association, vol. 25, no. 1, pp. 32–39, 2018.
  • [112] A. Johnson et al., “MIT-LCP/mimic-code: MIMIC Code v2.2.1,” Zenodo, Jul. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.6818823

Appendix A Data

This section describes the data employed and outlines the corresponding preprocessing procedure.

A.1 Transcriptions

Transcriptions is a multi-label dataset with 40 different labels and 2,358 data samples. The data were extracted from Kaggle, and additional information about the labels can be found at MTSamples.com.

A.1.1 Preprocessing

The preprocessing procedure involves removing samples that lack associated reports, adjusting the formatting of the reports, and selecting and renaming labels. Formatting adjustments are necessary because line breaks are encoded as comma patterns. To determine the final format, we considered the original data source, MTSamples.com, and, as a proxy for the knowledge of language models, the results generated by ChatGPT.

In terms of labels, less relevant categories were excluded due to their broad level of generality or lack of association with a specific medical specialty. Specifically, the eliminated labels are: “Consult - History and Phy.”, “Discharge Summary”, “Emergency Room Reports”, “General Medicine”, “Hospice - Palliative Care”, “IME-QME-Work Comp etc.”, “Letters”, “Office Notes”, “Pain Management”, and “SOAP / Chart / Progress Notes”. Additionally, several labels contained the “/” character, indicating “or”, which we replaced with the word itself. For example, “Allergy / Immunology” was transformed into “Allergy or Immunology”. Subsequently, the labels “Chiropractic” and “Physical Medicine - Rehab” were merged into a unified category called “Physical Medicine and Rehabilitation, or Chiropractic”. Other modifications include transforming “ENT - Otolaryngology” into “Otolaryngology”, “Hematology - Oncology” into “Hematology or Oncology”, “Lab Medicine - Pathology” into “Laboratory Medicine or Clinical Pathology”, “Pediatrics - Neonatal” into “Pediatrics or Neonatal”, and “Speech - Language” into “Speech and Language”.
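The label selection and renaming rules above can be summarized in a short sketch. The excluded labels and the renaming pairs are taken from the text; the function names, the whitespace stripping, and the assumption that the slash is always surrounded by single spaces are illustrative and not confirmed by the study.

```python
# Labels removed for being too general or not tied to a medical specialty.
EXCLUDED = {
    "Consult - History and Phy.", "Discharge Summary", "Emergency Room Reports",
    "General Medicine", "Hospice - Palliative Care", "IME-QME-Work Comp etc.",
    "Letters", "Office Notes", "Pain Management", "SOAP / Chart / Progress Notes",
}

# Explicit renames, including the merge of two labels into one category.
MERGED = "Physical Medicine and Rehabilitation, or Chiropractic"
RENAMED = {
    "Chiropractic": MERGED,
    "Physical Medicine - Rehab": MERGED,
    "ENT - Otolaryngology": "Otolaryngology",
    "Hematology - Oncology": "Hematology or Oncology",
    "Lab Medicine - Pathology": "Laboratory Medicine or Clinical Pathology",
    "Pediatrics - Neonatal": "Pediatrics or Neonatal",
    "Speech - Language": "Speech and Language",
}


def clean_label(raw):
    """Map a raw label to its final name, or None if the label is excluded."""
    label = raw.strip()
    if label in EXCLUDED:
        return None
    if label in RENAMED:
        return RENAMED[label]
    # Assumption: "/" appears as " / " and stands for "or".
    return label.replace(" / ", " or ")


def clean_labels(raw_labels):
    """Clean the labels of one sample, dropping excluded and repeated labels."""
    cleaned = []
    for raw in raw_labels:
        label = clean_label(raw)
        if label is not None and label not in cleaned:
            cleaned.append(label)
    return cleaned
```

Excluding ten labels and merging two into one takes the label set from 40 to 29, which matches the count reported after preprocessing.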

Upon completion of the preprocessing, the initial count of 40 different labels is reduced to 29, and the number of samples to consider is 2,074.

A.1.2 Description

The class distribution is visualized in Fig. 7. Surgery is the most prevalent category (in 52.46% of the samples), followed by Cardiovascular or Pulmonary (17.89%) and Orthopedic (17.11%). On the other hand, Allergy or Immunology (0.33%) is the least frequent category, preceded by Autopsy (0.38%) and Laboratory Medicine or Clinical Pathology (0.38%). The number of labels per sample ranges from 1 to 4, with an average of 2 labels per sample. Additionally, some labels never co-occur within the same sample.

The results of analyzing label leakage, which refers to whether a label appears explicitly in the text to be classified, are shown in Fig. 8. For most labels, label leakage is minimal, except for Autopsy (62.50%), Rheumatology (40.00%), Speech and Language (33.33%), and Surgery (29.50%). Labels without label leakage are Allergy or Immunology, Cardiovascular or Pulmonary, Cosmetic or Plastic Surgery, Diets and Nutritions, Hematology or Oncology, Laboratory Medicine or Clinical Pathology, Obstetrics or Gynecology, Pediatrics or Neonatal, Physical Medicine and Rehabilitation, or Chiropractic, Psychiatry or Psychology, and Sleep Medicine. The presence of labels in texts of other labels is not considered, given that this is a multi-label dataset, and the analysis and interpretation of such occurrences are inherently complex.
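A minimal sketch of how such a leakage rate could be computed is given below. It assumes that leakage means a case-insensitive occurrence of the full label string in the text of a sample carrying that label; the exact matching rule used in the study is not stated, and the function name and toy samples are hypothetical.

```python
from collections import defaultdict


def label_leakage(samples):
    """Return {label: percentage of its samples whose text contains the label}.

    `samples` is an iterable of (text, labels) pairs.
    Assumption: matching is a case-insensitive substring test on the full label.
    """
    totals = defaultdict(int)
    leaked = defaultdict(int)
    for text, labels in samples:
        lowered = text.lower()
        for label in labels:
            totals[label] += 1
            if label.lower() in lowered:
                leaked[label] += 1
    return {label: 100.0 * leaked[label] / totals[label] for label in totals}


# Toy samples for illustration only, not taken from the dataset.
toy_samples = [
    ("Autopsy performed on the following day.", ["Autopsy"]),
    ("External examination of the body.", ["Autopsy"]),
    ("The patient underwent knee surgery.", ["Surgery", "Orthopedic"]),
]
rates = label_leakage(toy_samples)
```

Under this rule, a compound label such as “Allergy or Immunology” only counts as leaked if the whole phrase appears in the text.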

Figure 7: Label distribution of the transcriptions dataset after preprocessing
Figure 8: Label leakage for the transcriptions dataset

A.2 MS-CXR

MS-CXR [106, 107, 108] is a multi-class dataset with 8 different classes and a corpus of 1,448 data samples, comprising 718 unique samples. The data can be obtained from [108].

A.2.1 Preprocessing

The preprocessing procedure involves removing instances without associated reports and eliminating duplicates. To be precise, 730 samples (50.41%) were identified as duplicates, with a maximum of 82 and an average of 3 duplicates, considering only repeated reports. In addition, when duplicate reports disagree on the assigned label, any of these labels is accepted as the true one during evaluation.
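The handling of duplicates with conflicting labels can be sketched as follows. The sketch assumes that samples are (report, label) pairs and that duplicates are exact string matches; the function names and the toy reports are hypothetical.

```python
def deduplicate(samples):
    """Collapse repeated reports, keeping every label seen for each report.

    Returns a dict mapping each unique, non-empty report to its label set.
    """
    merged = {}
    for report, label in samples:
        if not report or not report.strip():
            continue  # drop instances without an associated report
        merged.setdefault(report, set()).add(label)
    return merged


def is_correct(prediction, accepted_labels):
    """A prediction counts as correct if it matches any accepted label."""
    return prediction in accepted_labels


# Toy samples for illustration only, not taken from MS-CXR.
toy_samples = [
    ("Large right pneumothorax.", "Pneumothorax"),
    ("Bibasilar opacity.", "Pneumonia"),
    ("Bibasilar opacity.", "Atelectasis"),
    ("", "Edema"),
]
unique_reports = deduplicate(toy_samples)
```

With the toy data, the repeated report keeps both of its labels, so predicting either one is counted as correct.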

A.2.2 Description

The class distribution is depicted in Fig. 9. Overall, the dataset does not exhibit class imbalance. The most frequent classes are Pneumonia (24.37%), closely followed by Pneumothorax (21.17%), while the least frequent classes are Cardiomegaly (5.15%), preceded by Edema (5.43%).

Upon analysis of label leakage, as presented in Fig. 10, a high label leakage is observed, except for Lung Opacity, which has a low leakage rate of 1.33%. In particular, Consolidation, Edema, and Pneumothorax exhibit leakage rates that exceed 90%. Classes with leakage rates below 50% include Pneumonia, Cardiomegaly, and Lung Opacity, as mentioned earlier. Regarding the presence of labels in text from other labels, notable occurrences include Consolidation in the classes of Edema (12.82%) and Pneumonia (24.00%), and Pleural Effusion in the Atelectasis class (21.43%).

Figure 9: Label distribution of the MS-CXR dataset
Figure 10: Label leakage and presence of class label on other classes for MS-CXR dataset

To conclude, each class’s word count per text is measured, and their distributions are presented in Fig. 11. Classes with shorter texts include Cardiomegaly and Pneumothorax. Although classes with longer texts are not observed, there are flatter distributions with heavy tails, suggesting that the length of texts in these classes is less concentrated around a specific value.

A.3 MIMIC-CXR

MIMIC-CXR [109, 110, 108] is a dataset of radiographic reports that encompasses 78,584 samples. After extracting the most pertinent sections, 75,029 samples are identified as informative. This dataset is accessible through [108].

A.3.1 Preprocessing

The preprocessing procedure involves extracting the most relevant sections from chest X-ray reports using the codes [111] designed for this purpose and publicly available on GitHub [112]. In addition, texts lacking content and duplicate samples are removed. Texts lacking information are defined as those that are empty or match one of the following: “.”, “As above”, “As above.”, “As above..”, “None.”, “See above.”, “No changes.”, “___”, “___ earlier”, “___,”, or “___.”. Those mentioned above were identified after meticulously examining texts with a maximum length of two words. In total, these non-informative texts represent merely 0.26% of the dataset. Regarding duplicates, 1.69% of the total samples are duplicated, comprising 23.07% of the dataset. The text with the most duplicates is “No acute cardiopulmonary process.”, representing 7.88% of the samples. On average, each text appears twice in the dataset.

Upon completion of the preprocessing steps, the dataset results in 57,711 samples, composed mainly of impressions (81.92%) and findings (17.48%).
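
The filtering steps described in this subsection can be sketched as follows. The set of non-informative strings is copied from the text; the function name `clean_reports`, the whitespace stripping, and the choice to keep the first occurrence of each duplicate are assumptions of this example. The section extraction with the codes from [111] is not reproduced here.

```python
# Strings treated as non-informative, as listed in the text (plus the empty string).
NON_INFORMATIVE = {
    "", ".", "As above", "As above.", "As above..", "None.", "See above.",
    "No changes.", "___", "___ earlier", "___,", "___.",
}


def clean_reports(texts):
    """Drop non-informative texts, then drop exact duplicates.

    Keeping the first occurrence of each text is an assumption of this sketch.
    """
    seen = set()
    kept = []
    for text in texts:
        stripped = text.strip()
        if stripped in NON_INFORMATIVE or stripped in seen:
            continue
        seen.add(stripped)
        kept.append(stripped)
    return kept


# Toy input (hypothetical reports, for illustration only).
reports = [
    "No acute cardiopulmonary process.",
    "As above.",
    "No acute cardiopulmonary process.",
    "___",
    "Small left pleural effusion.",
    "",
]
cleaned = clean_reports(reports)
```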

A.3.2 Description

Considering the nature of this dataset, its description focuses mainly on the distribution of the number of words per sample, as shown in Fig. 12. This distribution is right-skewed, with a peak of around 10 words per sample. Moreover, there is a significant plateau between 20 and 40 words per sample. Interestingly, the distribution’s right tail extends beyond 150 words per sample. In summary, most texts (75%) contain at most 51 words, with a pronounced peak of around 10 words per sample. However, this dataset also includes longer texts, some reaching up to 307 words.

Figure 11: Length distribution of reports, in number of words, for MS-CXR dataset
Figure 12: Length distribution of reports in number of words for MIMIC-CXR dataset

Appendix B Models

Table 3: Details on the models studied. The total inference time represents the average time to process the entire dataset per experiment.
| ID | Model | Type | Domain | Model size (no. parameters) | Input max. size (no. tokens) | Total inference time (s): classification, MS-CXR | Total inference time (s): generation, MIMIC-CXR |
|---|---|---|---|---|---|---|---|
| m00 | BERT_BASE [16] | Encoder-only | General | 110 M | 512 | 3.33 | - |
| m01 | BERT_LARGE [16] | Encoder-only | General | 340 M | 512 | 5.25 | - |
| m02 | BiomedBERT (abstracts + full text) [66] | Encoder-only | Biomedical | 110 M | 512 | 3.10 | - |
| m03 | BiomedBERT (abstracts only) [66] | Encoder-only | Biomedical | 110 M | 512 | 3.09 | - |
| m04 | BiomedBERT-large (abstracts only) [66] | Encoder-only | Biomedical | 340 M | 512 | 4.56 | - |
| m05 | SciBERT [65] | Encoder-only | Biomedical | 110 M | 512 | 3.28 | - |
| m06 | SapBERT [90] | Encoder-only | Biomedical | 110 M | 512 | 3.14 | - |
| m07 | BioLORD-STAMB2-v1 [91] | Encoder-only | Biomedical | 110 M | 512 | 3.35 | - |
| m08 | BioLORD-STAMB2-v1-STS2 [91] | Encoder-only | Biomedical | 110 M | 512 | 3.31 | - |
| m09 | BioLORD-PMB [91] | Encoder-only | Biomedical | 110 M | 512 | 3.29 | - |
| m10 | Bio+Clinical BERT [70] | Encoder-only | Clinical | 110 M | 512 | 3.15 | - |
| m11 | NLI-DeBERTa_base [88] | Encoder-only (cross-encoder) | General | 100 M | 512 | 8.84 | - |
| m12 | RoBERTa_LARGE-MNLI [89] | Encoder-only (cross-encoder) | General | 355 M | 512 | 20.38 | - |
| m13 | BART Large-MNLI [94] | Encoder-decoder | General | 407 M | 1 024 | 23.93 | - |
| m14 | T5-V1.1-Base [92, 93] | Encoder-decoder | General | 220 M | 512 | 6.16 | - |
| m15 | T5-V1.1-Large [92, 93] | Encoder-decoder | General | 770 M | 512 | 14.52 | - |
| m16 | T5-V1.1-3B [92, 93] | Encoder-decoder | General | 3.0 B | 512 | 38.57 | - |
| m17 | T5-V1.1-11B [92, 93] | Encoder-decoder | General | 11.0 B | 512 | 64.88 | - |
| m18 | Flan-T5-Base [17] | Encoder-decoder (instruction-tuned) | General | 220 M | 512 | 6.74 | - |
| m19 | Flan-T5-Large [17] | Encoder-decoder (instruction-tuned) | General | 770 M | 512 | 16.18 | - |
| m20 | Flan-T5-XL [17] | Encoder-decoder (instruction-tuned) | General | 3.0 B | 512 | 40.71 | - |
| m21 | Flan-T5-XXL [17] | Encoder-decoder (instruction-tuned) | General | 11.0 B | 512 | 69.1 | - |
| m22 | T0 3B [95] | Encoder-decoder (instruction-tuned) | General | 3.0 B | 512 | 38.62 | - |
| m23 | T0++ [95] | Encoder-decoder (instruction-tuned) | General | 11.0 B | 512 | 63.89 | - |
| m24 | ClinicalT5-base [96] | Encoder-decoder | Clinical | 220 M | 512 | 5.56 | - |
| m25 | ClinicalT5-large [96] | Encoder-decoder | Clinical | 700 M | 512 | 11.94 | - |
| m26 | GPT-2 Medium [19] | Decoder-only | General | 355 M | 1 024 | - | 3 169.67 |
| m27 | GPT-2 Large [19] | Decoder-only | General | 774 M | 1 024 | - | 5 206.18 |
| m28 | GPT-2 XL [19] | Decoder-only | General | 1.5 B | 1 024 | - | 5 330.05 |
| m29 | Palmyra Base 5B [97] | Decoder-only | General | 5.0 B | 512 | 94.54 | 11 890.56 |
| m30 | Camel 5B [99] | Decoder-only (instruction-tuned) | General | 5.0 B | 1 024 | 96.33 | - |
| m31 | GPT-J 6B [100] | Decoder-only | General | 6.0 B | 2 048 | 132.50 | 16 495.20 |
| m32 | Instruct GPT-J [101] | Decoder-only (instruction-tuned) | General | 6.0 B | 2 048 | 132.58 | - |
| m33 | Falcon-7B [102] | Decoder-only | General | 7.0 B | 2 048 | 151.15 | 17 496.83 |
| m34 | Falcon-7B-Instruct [102] | Decoder-only (instruction-tuned) | General | 7.0 B | 2 048 | 151.10 | - |
| m35 | MPT-7B [103] | Decoder-only | General | 7.0 B | 2 048 | 140.49 | 15 384.00 |
| m36 | MPT-7B-Instruct [103] | Decoder-only (instruction-tuned) | General | 7.0 B | 2 048 | 140.56 | - |
| m37 | LLaMA-7B [27] | Decoder-only | General | 7.0 B | 2 048 | 143.56 | 19 203.39 |
| m38 | LLaMA 2-7B [28] | Decoder-only | General | 7.0 B | 2 048 | 144.27 | 19 225.13 |
| m39 | Alpaca 7B [104] | Decoder-only (instruction-tuned) | General | 7.0 B | 512 | 146.31 | - |
| m40 | LLaMA 2-CHAT-7B [28] | Decoder-only (instruction-tuned) | General | 7.0 B | 2 048 | 144.50 | - |
| m41 | OpenLLaMA 3B [98] | Decoder-only | General | 3.0 B | 2 048 | - | 9 736.33 |
| m42 | OpenLLaMA 3Bv2 [98] | Decoder-only | General | 3.0 B | 2 048 | - | 9 914.52 |
| m43 | OpenLLaMA 7B [98] | Decoder-only | General | 7.0 B | 2 048 | - | 17 433.58 |
| m44 | OpenLLaMA 7Bv2 [98] | Decoder-only | General | 7.0 B | 2 048 | - | 27 589.57 |
| m45 | OpenLLaMA 13B [98] | Decoder-only | General | 13.0 B | 2 048 | - | 7 125.28 |
| m46 | GPT-2-PubMed Medium [105] | Decoder-only | Biomedical | 355 M | 1 024 | - | 2 023.37 |
| m47 | GPT-2-PubMed Large [105] | Decoder-only | Biomedical | 774 M | 1 024 | - | 3 213.49 |
| m48 | BioGPT [69] | Decoder-only | Biomedical | 347 M | 1 024 | - | 1 680.22 |
| m49 | BioGPT-Large [69] | Decoder-only | Biomedical | 1.5 B | 1 024 | - | 4 840.45 |
| m50 | Galactica 1.3B [26] | Decoder-only | Biomedical | 1.3 B | 2 048 | - | 3 941.80 |
| m51 | Galactica 6.7B [26] | Decoder-only | Biomedical | 6.7 B | 2 048 | - | 15 118.26 |
| m52 | MedAlpaca 7b [71] | Decoder-only | Clinical | 7.0 B | 512 | 146.88 | - |

Appendix C Prompts

The prompts used for the text classification task via contextual embedding similarity, natural language inference (NLI), and multiple-choice question answering (QA) are presented.

C.1 Prompts for text classification via contextual embedding similarity and NLI

The prompts proposed for text classification using contextual embedding similarity and Natural Language Inference (NLI) are exclusively applied to the label (in the case of NLI, to the hypothesis). Table 4 lists the prompts used. Prompt template ID 0 is the default to generate the hypothesis in the zero-shot text classification using the NLI setting, as documented in HuggingFace.

Table 4: Prompt templates to be used as contextual embedding similarity and NLI prompts. The column “Dataset” specifies the dataset in which the prompt template is applied.
| ID | Prompt template | Dataset |
|---|---|---|
| 0 | This example is {label}. | Transcriptions, MS-CXR |
| 1 | This is an example of {label}. | Transcriptions, MS-CXR |
| 2 | This report belongs to the category {label}. | Transcriptions |
| 3 | This report belongs to the medical speciality {label}. | Transcriptions |
| 4 | This report belongs to the medical speciality: {label}. | Transcriptions |
| 5 | The diagnosis is {label}. | MS-CXR |
| 6 | There is evidence of {label}. | MS-CXR |
| 7 | These findings are consistent with {label}. | MS-CXR |
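
As a small illustration of how the templates in Table 4 are applied to the label only, the sketch below turns each label into the text that is either embedded (contextual embedding similarity) or used as the hypothesis (NLI). The helper `build_hypotheses` is a hypothetical name introduced for this example, and only three of the eight templates are included for brevity.

```python
# A subset of the prompt templates from Table 4, keyed by template ID.
TEMPLATES = {
    0: "This example is {label}.",
    5: "The diagnosis is {label}.",
    7: "These findings are consistent with {label}.",
}


def build_hypotheses(labels, template_id=None):
    """Map each label to its prompt text.

    With `template_id=None`, the bare label is used (the "without template"
    setting); otherwise the label is inserted into the chosen template.
    """
    if template_id is None:
        return {label: label for label in labels}
    template = TEMPLATES[template_id]
    return {label: template.format(label=label) for label in labels}


hypotheses = build_hypotheses(["Pneumonia", "Edema"], template_id=5)
```

The text to be classified is left unchanged in both approaches; only the label side varies with the template.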

C.2 Prompts for text classification via multiple-choice QA

The proposed prompts for text classification via multiple-choice question answering are based on the default prompt templates of several of the considered instruction-tuned models. These templates are systematically assessed using a set of questions, enabling us to quantify the influence of the question wording. For the MS-CXR dataset, we also incorporate role-based questions. The prompts, their corresponding datasets, and specific requirements are summarized in Table 5 and Table 6.

Each class or label is encoded with an uppercase letter denoting the option, followed by its name. For instance, if the first label is “$y_1$”, it is represented as “(A) $y_1$” within the prompt. The transcriptions dataset has 29 distinct labels. Because of this large number, we include the 10 most frequent labels and group the remaining labels under an additional “Other” option. Specifically, for the transcriptions dataset, we employ templates t01, t02, t03, t04, and t07 along with questions q07, q08, and q09. For the MS-CXR dataset, we employ templates t01, t02, t03, t04, t07, t11, and t13, and questions q03, q04, and q05.
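
The option encoding described above can be sketched as follows. The function names and the single-space separator between options are assumptions of this example; the released code may format the option string differently.

```python
from collections import Counter
from string import ascii_uppercase


def top_labels_with_other(label_lists, k=10):
    """Keep the k most frequent labels and append an "Other" option."""
    counts = Counter(label for labels in label_lists for label in labels)
    return [label for label, _ in counts.most_common(k)] + ["Other"]


def encode_options(labels):
    """Encode labels as "(A) first (B) second ..." (separator is an assumption)."""
    return " ".join(
        f"({letter}) {label}" for letter, label in zip(ascii_uppercase, labels)
    )


# Toy multi-label annotations (hypothetical, for illustration only).
annotations = [["Surgery", "Orthopedic"], ["Surgery"], ["Neurology"]]
options = encode_options(top_labels_with_other(annotations, k=1))
```

With 10 retained labels plus “Other”, the options run from (A) to (K).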

Table 5: Questions to be used for the multiple-choice QA templates. The column “Dataset” specifies the target dataset.
| ID | Question | Dataset |
|---|---|---|
| q01 | What is the most plausible diagnosis? | MS-CXR |
| q02 | What is the patient’s diagnosis? | MS-CXR |
| q03 | What is the diagnosis? | MS-CXR |
| q04 | Which one of the following is the diagnosis? | MS-CXR |
| q05 | Which one is the patient’s diagnosis? | MS-CXR |
| q06 | Which of the options is the most likely to be the diagnosis? | MS-CXR |
| q07 | Which category does the report belong to? | Transcriptions |
| q08 | What is the field that best suits the report? | Transcriptions |
| q09 | Which one is the topic of the report? | Transcriptions |

Table 6: Prompt structures to be used as multiple-choice QA prompts. Regarding the column “Requirements”, “report” refers to the text sample, “options” to the labels provided as choices, and “question” to the question itself (see Table 5). Note that the term “question” sometimes appears capitalized, indicating that the question begins with an uppercase letter when integrated into the template.
| ID | Prompt structure | Requirements | Dataset |
|---|---|---|---|
| t01 | Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: {question} Select one of the following options: {options} ### Input: {report} ### Response: ( | report, options, QUESTION | Transcriptions, MS-CXR |
| t02 | Context: {report} Question: {question} Options: {options} Answer: ( | report, options, QUESTION | Transcriptions, MS-CXR |
| t03 | Context: {report} Question: Based on the context, {question} Options: {options} Answer: ( | report, options, question | Transcriptions, MS-CXR |
| t04 | {report}. Which one of the following, if true, most strengthens the argument? {options}. ( | report, options | Transcriptions, MS-CXR |
| t05 | Read the following and answer the question. {report} {question} {options} ( | report, options, QUESTION | Transcriptions, MS-CXR |
| t06 | {report} What’s the best answer to this question: {question} {options} ( | report, options, QUESTION | Transcriptions, MS-CXR |
| t07 | {report} {question} {options} ( | report, options, QUESTION | Transcriptions, MS-CXR |
| t08 | Read this chest x-ray report: “{report}” Now answer this question: “{question}” {options} ( | report, options, QUESTION | MS-CXR |
| t09 | Knowing that “{report}”, how would one answer “{question}” {options} ( | report, options, QUESTION | Transcriptions, MS-CXR |
| t10 | {report} Based on the above text, what’s the best answer to this question: {question} {options} ( | report, options, QUESTION | Transcriptions, MS-CXR |
| t11 | You are a doctor and have the following information about a patient from a chest x-ray: {report}. Which one of the following, if true, most strengthens the argument? {options}. ( | report, options | MS-CXR |
| t12 | You are a doctor and have the following information about a patient from a chest x-ray: {report}. {question} {options}. ( | report, options, QUESTION | MS-CXR |
| t13 | I want you to act as a virtual doctor. I will describe my symptoms and you will choose the most probable diagnosis among the following: {options}. You should only reply with the chosen diagnosis, and nothing else. My request is “{report}”. ( | report, options | MS-CXR |
| t14 | I want you to act as a virtual doctor. I will describe my symptoms and you will choose a diagnosis among the possible diagnoses. You should only reply with the chosen diagnosis, and nothing else. Do not write explanations. The possible diagnoses are: {options}. My request is “{report}”. ( | report, options | MS-CXR |
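
To make the “Requirements” column concrete, the following sketch fills templates t02 and t03 with a report, an option string, and a question. The line breaks inside the templates and the helper `fill_prompt` are assumptions of this example, since the table does not preserve the original line layout of the prompts.

```python
# Templates t02 and t03 from Table 6; the "\n" separators are an assumption.
T02 = "Context: {report}\nQuestion: {question}\nOptions: {options}\nAnswer: ("
T03 = (
    "Context: {report}\nQuestion: Based on the context, {question}\n"
    "Options: {options}\nAnswer: ("
)


def fill_prompt(template, report, options, question, capitalize_question=True):
    """Fill a template; the question starts upper- or lowercase as required."""
    first = question[0].upper() if capitalize_question else question[0].lower()
    return template.format(
        report=report, options=options, question=first + question[1:]
    )


report = "Small right apical pneumothorax."  # hypothetical report text
options = "(A) Pneumothorax (B) Edema"
prompt_t02 = fill_prompt(T02, report, options, "What is the diagnosis?")
prompt_t03 = fill_prompt(
    T03, report, options, "What is the diagnosis?", capitalize_question=False
)
```

Every prompt ends with an opening parenthesis, so the most natural continuation for the model is the letter of the chosen option.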

Appendix D Supplementary results

This appendix presents supplementary figures and tables to support the results presented. They are displayed first by dataset and then by task or approach. These results are not obtained by bootstrapping; they are single values computed on the complete inference dataset.

D.1 Text classification task

D.1.1 Contextual embedding similarity

Results are depicted in Figs. 2, 14 and 13. The mapping between the models and their ID is

m00: BERT_BASE        m06: SapBERT
m01: BERT_LARGE        m07: BioLORD-STAMB2-v1
m02: BiomedBERT (abstracts + full text)        m08: BioLORD-STAMB2-v1-STS2
m03: BiomedBERT (abstracts only)        m09: BioLORD-PMB
m04: BiomedBERT-large (abstracts only)        m10: Bio+Clinical BERT
m05: SciBERT
(a) Performance scores without template
(b) Performance scores with template
Figure 13: Performance scores for the transcriptions dataset using contextual embedding similarity, disaggregated by template usage, overflow usage, and pooling strategy. Using overflow improves performance in 87.88% of the cases, where a case is a combination of model and pooling strategy. On average, overflow changes the AUC score by 0.90 points, so most of the trends observed without overflow are preserved. Regarding the pooling strategy, when overflow is used, average pooling produces the best results (8 of 11 models), followed by CLS pooling. SapBERT and the BioLORD models stand out, with a gap of more than 10 AUC points to the rest of the models when the best configurations are compared.
(a) Performance scores without template
(b) Performance scores with template
Figure 14: Performance scores for the MS-CXR dataset using contextual embedding similarity, disaggregated by template usage and pooling strategy. SapBERT and the BioLORD models stand out, with a gap of at least 12, 16, 10, and 16 points in accuracy, F1-score, precision, and recall, respectively, to the rest of the models when the best configurations are compared. However, this gap disappears when templates are used. Templates therefore play a key role in determining model performance, as they can either boost or hinder it. Specifically, they boost performance in 69.70% to 84.85% of the cases, depending on the performance metric, where a case is a combination of model and pooling strategy. Regarding the pooling strategy, average pooling produces the best results (5-8 of 11 models, depending on the metric).
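
The two pooling strategies compared in Figs. 13 and 14 can be illustrated with a minimal, dependency-free sketch. It assumes that the first token vector plays the role of the CLS token, that cosine similarity is the similarity measure, and that the label with the highest similarity is predicted. The toy vectors are hand-made and do not come from any of the evaluated models.

```python
import math


def cls_pool(token_vectors):
    """CLS pooling: use the vector of the first token (assumed to be CLS)."""
    return list(token_vectors[0])


def average_pool(token_vectors):
    """Average pooling: mean of all token vectors, dimension by dimension."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(text_vector, label_vectors):
    """Predict the label whose embedding is most similar to the text embedding."""
    return max(label_vectors, key=lambda label: cosine(text_vector, label_vectors[label]))


# Toy token embeddings for one text and two label embeddings (hypothetical).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
label_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
pred_cls = classify(cls_pool(tokens), label_vectors)
pred_avg = classify(average_pool(tokens), label_vectors)
```

In this toy case the two strategies disagree, which mirrors the observation that the pooling choice can change the outcome for a given model.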

D.1.2 Natural language inference

Results are depicted in Figs. 16 and 15. The mapping between the models and their ID is

m11: NLI-DeBERTa_base        m12: RoBERTa_LARGE-MNLI        m13: BART Large-MNLI
Figure 15: Performance scores for the transcriptions dataset using natural language inference, disaggregated by template and overflow usage. Using overflow improves all three models, with an impact of 2.15, 4.87, and 8.03 on NLI-DeBERTa_base, RoBERTa_LARGE-MNLI, and BART Large-MNLI, respectively. The benefit of overflow is particularly noteworthy for RoBERTa_LARGE-MNLI, which performs worse than NLI-DeBERTa_base without overflow but outperforms it with overflow. Using a template consistently improves performance, particularly template 2, resulting in AUC scores of up to 80.54.
Figure 16: Performance scores for the MS-CXR dataset using natural language inference, disaggregated by template. Using overflow improves the large models, with an impact on the F1-score of -3.43, 2.78, and 2.18 for NLI-DeBERTa_base, RoBERTa_LARGE-MNLI, and BART Large-MNLI, respectively. Template 5 is a good choice for the large models (6.65 for RoBERTa_LARGE-MNLI and 3.49 for BART Large-MNLI), while template 6 is not (-3.09 for RoBERTa_LARGE-MNLI and 0.86 for BART Large-MNLI). Because the two large models perform quite similarly, the choice of template can determine whether RoBERTa_LARGE-MNLI outperforms BART Large-MNLI.

D.1.3 Multiple choice question answering

Results are depicted in Figs. 18, 17, 20 and 19. The mapping between the models and their ID is

m14: T5-V1.1-Base m21: Flan-T5-XXL m31: GPT-J 6B m38: LLaMA 2-7B
m15: T5-V1.1-Large m22: T0-3B m32: Instruct GPT-J m39: Alpaca 7B
m16: T5-V1.1-3B m23: T0++ m33: Falcon-7B m40: LLaMA 2-CHAT-7B
m17: T5-V1.1-11B m24: ClinicalT5-base m34: Falcon-7B-Instruct m52: MedAlpaca 7b
m18: Flan-T5-Base m25: ClinicalT5-large m35: MPT-7B
m19: Flan-T5-Large m29: Palmyra Base 5B m36: MPT-7B-Instruct
m20: Flan-T5-XL m30: Camel 5B m37: LLaMA-7B
Figure 17: Performance scores for the transcriptions dataset using multiple-choice question answering. The T5 family of models represents text-to-text models, whereas the rest of the models represent autoregressive models. Instruction tuning usually leads to a performance increase, with an impact of 21.45 on the AUC score when comparing the best performance per model. Considering the model size, text-to-text models perform similarly to or better than their autoregressive counterparts, and Flan-T5-XXL is the best-performing model overall. The models are highly sensitive to the prompt used, as reflected in the clustering of the intra-model performance together with its visible variability.
Figure 18: Performance scores for the MS-CXR dataset using multiple-choice question answering. The T5 family of models represents text-to-text models, whereas the rest of the models represent autoregressive models. Instruction tuning leads to a performance increase, except for Falcon-7B, with an impact of 43.55 on the AUC score when comparing the best performance per model. Considering the model size, text-to-text models perform similarly to or better than their autoregressive counterparts: T0++ achieves the best scores for accuracy, F1-score, and recall, whereas Alpaca 7B achieves the best precision. On the other hand, the T5 models, in all their sizes, and ClinicalT5-large are not suitable for this task. The models are highly sensitive to the prompt used, as reflected in the clustering of the intra-model performance together with its visible variability.
Figure 19: Prompt analysis for the transcriptions dataset using multiple-choice question answering, disaggregated by template. No single template or question works best for all the models. Regarding the impact, measured in terms of standard deviations, templates have an average impact of 3.84. Questions, in turn, have an impact of 0.80 within a template. Thus, the template’s wording plays a more important role than the question itself. In general, prompting has a great impact on performance.
Figure 20: Prompt analysis for the MS-CXR dataset using multiple-choice question answering, disaggregated by template. No single template or question works best for all the models. However, some templates are less suitable for this task and dataset. For example, template 4 does not work well for the T0 and LLaMA families, and template 7 does not work well for the Flan-T5 models and some non-LLaMA generative models. Also, for some models, the role prompting strategy does not give good results, particularly template 11. Regarding the impact, measured in terms of F1-score standard deviations, templates have an average impact of 11.02. Questions, in turn, have an impact of 1.74 within a template. Thus, the template’s wording plays a more important role than the question itself. In general, prompting greatly impacts performance and can even be decisive for the ranking of models.
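
One possible way to quantify prompt impact in terms of standard deviations, as reported in Figs. 19 and 20, is sketched below for a single model. The exact aggregation is not specified in the captions, so this reading is an assumption: the template impact is taken as the standard deviation of the per-template mean scores, and the question impact as the mean of the per-template standard deviations across questions. The use of the population standard deviation and the toy scores are also assumptions.

```python
from statistics import mean, pstdev


def prompt_impact(scores):
    """Return (template_impact, question_impact) for one model.

    `scores` maps (template_id, question_id) to a performance score.
    """
    by_template = {}
    for (template, _question), score in scores.items():
        by_template.setdefault(template, []).append(score)
    # Spread of the average score across templates.
    template_impact = pstdev([mean(values) for values in by_template.values()])
    # Average spread across questions within the same template.
    question_impact = mean([pstdev(values) for values in by_template.values()])
    return template_impact, question_impact


# Toy F1-scores for one hypothetical model (not results from the study).
scores = {
    ("t01", "q03"): 60.0, ("t01", "q04"): 62.0,
    ("t02", "q03"): 40.0, ("t02", "q04"): 44.0,
}
template_impact, question_impact = prompt_impact(scores)
```

Averaging these two quantities over models would give summary figures comparable in spirit to those quoted in the captions.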

D.2 Conditional text generation task

Results are depicted in Fig. 21. The mapping between the models and their ID is

m26: GPT-2 Medium m35: MPT-7B m44: OpenLLaMA 7Bv2 m50: Galactica 1.3B
m27: GPT-2 Large m37: LLaMA-7B m45: OpenLLaMA 13B m51: Galactica 6.7B
m28: GPT-2 XL m38: LLaMA 2-7B m46: GPT-2-PubMed Medium
m29: Palmyra Base 5B m41: OpenLLaMA 3B m47: GPT-2-PubMed Large
m31: GPT-J 6B m42: OpenLLaMA 3Bv2 m48: BioGPT
m33: Falcon-7B m43: OpenLLaMA 7B m49: BioGPT-Large
Figure 21: Performance scores for the MIMIC-CXR dataset, disaggregated by BOS token usage. The perplexities are displayed in logarithmic scale. Not using the BOS token is beneficial for 77.78% (14 out of 18) of the models, with the exceptions of the GPT-2 models (m26-m28) and Palmyra Base 5B (m29). Concerning outliers, their presence is quite strong. Moderate outliers, above quantile 0.75 by 1.5 times the IQR, represent between 7% and 11% of the data, with BioGPT models (m48-m49) having the highest percentages. Extreme outliers, above quantile 0.75 by three times the IQR, make up between 4% and 7% of the data, with most models exhibiting percentages around 4% and 5%.
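
The outlier criteria used in the caption of Fig. 21 can be sketched as follows. The function name `outlier_shares`, the inclusive quartile method, and the toy values are assumptions of this example; in this sketch, the moderate share also counts the extreme outliers, since both are defined as lying above a threshold.

```python
from statistics import quantiles


def outlier_shares(values):
    """Return the shares of values above Q3 + 1.5 * IQR and above Q3 + 3 * IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    moderate = sum(v > q3 + 1.5 * iqr for v in values) / len(values)
    extreme = sum(v > q3 + 3.0 * iqr for v in values) / len(values)
    return moderate, extreme


# Toy per-sample perplexities (hypothetical, not results from the study).
perplexities = [1, 2, 3, 4, 5, 6, 7, 8, 20, 100]
moderate_share, extreme_share = outlier_shares(perplexities)
```

For the toy values, Q1 = 3.25 and Q3 = 7.75, so the thresholds are 14.5 and 21.25; two values exceed the first and one exceeds the second.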