R2Gen-Mamba: A Selective State Space Model for Radiology Report Generation†

†This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Abstract

Radiology report generation is crucial in medical imaging, but manual report writing by physicians is time-consuming and labor-intensive, which motivates the development of automatic report generation methods. Existing research predominantly uses Transformers to generate radiology reports, which can be computationally intensive and limits their use in real applications. In this work, we present R2Gen-Mamba, a novel automatic radiology report generation method that combines the efficient sequence processing of Mamba with the contextual benefits of Transformer architectures. Owing to the lower computational complexity of Mamba, R2Gen-Mamba not only improves training and inference efficiency but also produces high-quality reports. Experimental results on two benchmark datasets with more than 210,000 X-ray image-report pairs demonstrate the effectiveness of R2Gen-Mamba in terms of report quality and computational efficiency compared with several state-of-the-art methods. The source code can be accessed online.

Index Terms— Radiology, Report Generation, Selective State Space Model, Transformer, Mamba

1 INTRODUCTION

Radiology report generation is crucial in medical imaging, offering key information necessary for diagnosing and managing patient conditions. Traditionally, these reports are manually annotated by physicians, which is time-consuming and labor-intensive. This challenge is further exacerbated by the ever-increasing volume of medical image data, making it difficult for radiologists to meet the demands for timely and accurate reporting. There has been a growing interest in developing automatic report generation methods that can alleviate the burden on medical professionals while maintaining the high standards required in clinical settings.

Numerous approaches have been introduced for automatic radiology report generation [1, 2, 3]. Most existing studies rely on Transformer models [4] that have demonstrated impressive performance in a variety of natural language processing tasks such as image captioning and text generation. Transformers leverage self-attention mechanisms to model long-range dependencies, making them particularly well-suited for generating coherent and contextually relevant reports from complex medical images. However, Transformer models are often criticized for their high computational complexity, limiting their use in real applications. Recently, the Mamba model [5], designed to reduce computational complexity without compromising performance, has attracted increasing attention. Mamba’s efficient sequence processing capabilities make it an attractive alternative to Transformers, but its potential for radiology report generation has not yet been fully explored.

In this work, we propose a novel radiology report generation method, called R2Gen-Mamba, which leverages the strengths of both Mamba and Transformer architectures. Specifically, R2Gen-Mamba uses Mamba, with its low computational complexity, as the encoder and a Transformer as the decoder to retain powerful contextual processing capability. By combining these complementary models, R2Gen-Mamba provides a new pathway for reducing the computational burden of radiology report generation while ensuring high-quality, contextually relevant reports. Experimental results on two benchmark datasets, IU X-Ray [6] and MIMIC-CXR [7], suggest that R2Gen-Mamba outperforms traditional Transformer-based models in report quality and computational efficiency. Compared with state-of-the-art (SOTA) studies, R2Gen-Mamba provides a more resource-efficient solution for automatic radiology report generation.

2 METHODOLOGY

Fig. 1: Architecture of the proposed R2Gen-Mamba framework, with visual extractor and decoder denoted by gray dashed boxes. The Mamba encoder is highlighted within green dashed boxes. Conv: convolution; SSM: selective state space model; Linear: linear projection.

Radiology report generation can be framed as a sequence-to-sequence problem, where the image patch features serve as the input sequence and the corresponding report as the target sequence. The input sequence $\mathbf{X}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{S}\}$, where $S$ is the number of patches and each $\mathbf{x}_{s}\in\mathbb{R}^{d}$, consists of visual features extracted from the image patches by a pre-trained visual extractor such as a convolutional neural network. The output sequence $Y=\{y_{1},y_{2},\dots,y_{T}\}$, where $T$ is the maximum report length and each $y_{t}$ is a token from a predefined vocabulary, represents the generated report. This sequence-to-sequence framework is optimized by maximizing the likelihood of the correct report given the input image. R2Gen-Mamba contains three major parts (visual extractor, Mamba encoder, and Transformer decoder), which are described in the following subsections.

2.1 Visual Extractor

To produce radiology reports, we begin by extracting visual features from the radiology images using a convolutional neural network such as VGG or ResNet. As illustrated in Fig. 1, the image is passed through the visual extractor to obtain a feature map. Each spatial position in the feature map corresponds to a patch in the original image. The feature vectors at these positions are flattened into a sequence that serves as the input to the subsequent Mamba encoder. This process is formally represented as $\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{S}\}=f_{v}(Img)$, where $f_{v}(\cdot)$ is the visual extractor and $Img$ is the input image.
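The flattening step can be illustrated with a minimal sketch. It uses nested Python lists in place of tensors, and both the function name `flatten_feature_map` and the row-major ordering of spatial positions are assumptions made for illustration, not details taken from the released code.

```python
from typing import List


def flatten_feature_map(fmap: List[List[List[float]]]) -> List[List[float]]:
    """Turn a (d, H, W) feature map into S = H * W patch vectors of length d."""
    d = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    # Assumed row-major scan over spatial positions; each position
    # collects its d channel values into one patch feature x_s.
    return [
        [fmap[c][i][j] for c in range(d)]
        for i in range(height)
        for j in range(width)
    ]


# Toy feature map: d = 2 channels over a 2 x 3 grid (arbitrary values).
toy_map = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],        # channel 0
    [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]],  # channel 1
]
patches = flatten_feature_map(toy_map)
print(len(patches), len(patches[0]))  # S and d
```

Here each of the $S=6$ entries of `patches` plays the role of one $\mathbf{x}_{s}\in\mathbb{R}^{d}$ with $d=2$.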

2.2 Mamba Encoder

To extract contextual semantic information, we use Mamba as the encoder. Mamba is designed to process sequence data: whereas Transformers have quadratic computational complexity, Mamba has linear complexity in the number of tokens. Given the input sequence $\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{S}\}$, the output sequence $\mathbf{Z}$ is obtained by $\{\mathbf{z}_{1},\mathbf{z}_{2},\dots,\mathbf{z}_{S}\}=f_{e}(\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{S})$, where $f_{e}(\cdot)$ denotes the Mamba encoder. As for the core state space model (SSM) of Mamba, given the input sequence $\mathbf{U}$, the output sequence $\mathbf{V}$ is obtained by $\{\mathbf{v}_{1},\mathbf{v}_{2},\dots,\mathbf{v}_{S}\}=\text{SSM}(\mathbf{u}_{1},\mathbf{u}_{2},\dots,\mathbf{u}_{S})$. Specifically, as illustrated in Fig. 1, each $\mathbf{u}_{t}$, $t\in\{1,2,\dots,S\}$, is fed into linear layers to obtain the continuous parameters: $\mathbf{B}_{t},\mathbf{C}_{t},\Delta_{t}=\text{Project}(\mathbf{u}_{t})$. Discretization is then performed by zero-order hold (ZOH): $\bar{\mathbf{A}}_{t}=\exp(\Delta_{t}\mathbf{A})$; $\bar{\mathbf{B}}_{t}=(\Delta_{t}\mathbf{A})^{-1}(\exp(\Delta_{t}\mathbf{A})-\mathbf{I})\cdot\Delta_{t}\mathbf{B}_{t}$, where $\mathbf{A}$ is a learnable parameter. Finally, the sequence-to-sequence transformation is achieved in two stages: $\mathbf{h}_{t}=\bar{\mathbf{A}}_{t}\mathbf{h}_{t-1}+\bar{\mathbf{B}}_{t}\mathbf{u}_{t}$; $\mathbf{v}_{t}=\mathbf{C}_{t}\mathbf{h}_{t}$.
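The ZOH discretization and the recurrence above can be sketched for a single feature channel as follows. This is a simplified illustration under several assumptions: $\mathbf{A}$ is taken to be diagonal (stored as the list `A_diag`), the per-step quantities $\mathbf{B}_{t}$, $\mathbf{C}_{t}$, and $\Delta_{t}$ are passed in directly instead of being produced by learned linear layers, the initial state is zero, and the loop is sequential, not the parallel scan used in practical Mamba implementations. The function name `selective_ssm_scan` is hypothetical.

```python
import math
from typing import List, Sequence


def selective_ssm_scan(
    u: Sequence[float],            # input sequence u_1..u_S (one channel)
    delta: Sequence[float],        # step sizes Delta_t, one per time step
    B: Sequence[Sequence[float]],  # B_t, each of length N (state size)
    C: Sequence[Sequence[float]],  # C_t, each of length N
    A_diag: Sequence[float],       # diagonal of A, length N (assumed diagonal)
) -> List[float]:
    n = len(A_diag)
    h = [0.0] * n  # assumed zero initial state h_0
    v = []
    for t in range(len(u)):
        for k in range(n):
            dA = delta[t] * A_diag[k]
            A_bar = math.exp(dA)  # ZOH: exp(Delta_t * A)
            # ZOH: (Delta_t A)^{-1} (exp(Delta_t A) - I) * Delta_t B_t,
            # using the limit value 1 when Delta_t * A is exactly zero.
            scale = (A_bar - 1.0) / dA if dA != 0.0 else 1.0
            B_bar = scale * delta[t] * B[t][k]
            h[k] = A_bar * h[k] + B_bar * u[t]  # h_t = A_bar h_{t-1} + B_bar u_t
        v.append(sum(C[t][k] * h[k] for k in range(n)))  # v_t = C_t h_t
    return v


# Toy run: state size N = 2, sequence length S = 4 (arbitrary values).
outputs = selective_ssm_scan(
    u=[1.0, 0.5, 0.0, -1.0],
    delta=[0.1, 1.0, 0.5, 0.1],
    B=[[1.0, 0.5]] * 4,
    C=[[1.0, -1.0]] * 4,
    A_diag=[-1.0, -2.0],
)
print(outputs)
```

Because $\Delta_{t}$, $\mathbf{B}_{t}$, and $\mathbf{C}_{t}$ vary with $t$, each token can decide how strongly it overwrites or preserves the hidden state, and the single pass over the sequence is what gives the linear cost in $S$.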

2.3 Transformer Decoder

In the proposed R2Gen-Mamba, the decoder is built upon the standard Transformer architecture. The decoding procedure is formulated as: $y_{t}=f_{d}(\mathbf{z}_{1},\mathbf{z}_{2},\dots,\mathbf{z}_{S},y_{1},\dots,y_{t-1})$, where $f_{d}(\cdot)$ is the Transformer decoder. As noted in [4], the decoder needs to rely on the generation results of the previous step due to its auto-regressive nature and requires additional attention mechanisms, so we repeat the decoder layer $N_{d}$ times. In our experiments, we set $N_{d}$ to 3.

2.4 Objective Function

The overall generation process in R2Gen-Mamba can be mathematically framed as a recursive implementation of the chain rule, where the probability of the target sequence $\{y_{1},y_{2},\dots,y_{T}\}$ provided the input image $Img$ is expressed as: $p(Y\mid Img)=\prod\limits_{t=1}^{T}p(y_{t}\mid y_{1},\dots,y_{t-1},Img)$ . The model is trained by maximizing the likelihood of the target sequence conditioned on the input image:

$$\theta^{*}=\operatorname*{argmax}_{\theta}\sum\nolimits_{t=1}^{T}\log p(y_{t}\mid y_{1},\dots,y_{t-1},Img;\theta) \tag{1}$$

where $\theta^{*}$ denotes the optimal parameters of R2Gen-Mamba. This objective trains the model to generate the report text from the visual features extracted from the input image. During inference, we use beam search to decode the predicted report. To facilitate reproducible research, we have released the source code publicly on GitHub.
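The link between the chain-rule factorization and beam search can be sketched as follows. The next-token distribution `toy_model` is a hypothetical stand-in for the Transformer decoder conditioned on the image, with hand-picked probabilities. The helper names are likewise illustrative, and the search ranks hypotheses by their summed log-probability without length normalization, which may differ from the released implementation.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]
NextTokenFn = Callable[[Prefix], Dict[str, float]]


def toy_model(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical p(y_t | y_1..y_{t-1}, Img); the image is left implicit."""
    if not prefix:
        return {"no": 0.6, "mild": 0.4}
    if prefix[-1] == "no":
        return {"effusion": 0.5, "pneumothorax": 0.5}
    if prefix[-1] == "mild":
        return {"cardiomegaly": 0.9, "effusion": 0.1}
    return {"<eos>": 1.0}


def sequence_log_prob(tokens: Prefix, model: NextTokenFn) -> float:
    """Chain rule: log p(Y | Img) = sum_t log p(y_t | y_<t, Img)."""
    return sum(math.log(model(tokens[:t])[tok]) for t, tok in enumerate(tokens))


def beam_search(model: NextTokenFn, beam_size: int = 3, max_len: int = 10,
                eos: str = "<eos>") -> Tuple[Prefix, float]:
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    finished: List[Tuple[Prefix, float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, p in model(prefix).items():
                if p > 0.0:
                    candidates.append((prefix + (tok,), score + math.log(p)))
        # Keep only the beam_size highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


best_seq, best_score = beam_search(toy_model, beam_size=3)
greedy_seq, greedy_score = beam_search(toy_model, beam_size=1)
print(best_seq, math.exp(best_score))
print(greedy_seq, math.exp(greedy_score))
```

In this toy setting, greedy decoding (beam size 1) commits to the locally most likely first token, while a beam of size 3 keeps the alternative alive and ends with a more probable full sequence.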

3 EXPERIMENTS

Table 1: Details of two benchmark datasets used in this work.

Dataset	IU X-Ray			MIMIC-CXR
Split	Train	Validation	Test	Train	Validation	Test
Image #	5.23K	0.75K	1.50K	368.96K	2.99K	5.16K
Report #	2.77K	0.40K	0.79K	222.76K	1.81K	3.27K
Patient #	2.77K	0.40K	0.79K	64.59K	0.50K	0.29K
Average Length	37.56	36.78	33.62	53.00	53.05	66.40

Table 2: Comparisons of different methods on IU X-Ray and MIMIC-CXR. ‘BLEU-x’: BLEU score with an n-gram size of x. The best results are highlighted in bold.

Data	Method	NLG Metrics						CE Metrics
Data	Method	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	Precision	Recall	F1 score
IU X-Ray	R2Gen	0.423	0.275	0.203	0.160	0.176	0.358	-	-	-
	R2Gen-CMN	0.470	0.300	0.215	0.166	0.189	0.367	-	-	-
	R2Gen-RL	0.291	0.178	0.121	0.086	0.096	0.312	-	-	-
	R2Gen-Mamba (Ours)	0.482	0.315	0.228	0.176	0.208	0.382	-	-	-
MIMIC-CXR	R2Gen	0.371	0.223	0.148	0.105	0.141	0.271	0.429	0.243	0.310
	R2Gen-CMN	0.352	0.214	0.141	0.099	0.139	0.274	0.441	0.326	0.375
	R2Gen-RL	0.122	0.067	0.042	0.028	0.047	0.137	0.061	0.027	0.038
	R2Gen-Mamba (Ours)	0.352	0.222	0.152	0.110	0.141	0.284	0.483	0.325	0.389

3.1 Experimental Setup

We perform experiments on two benchmark datasets: IU X-Ray [6] and MIMIC-CXR [7]. The IU X-Ray dataset includes 7,470 chest X-ray images paired with 3,955 reports, while MIMIC-CXR comprises 473,057 images and 206,563 reports. Following prior studies [1, 2, 3], we exclude samples without reports. We use a 70%/10%/20% split for training, validation, and testing on IU X-Ray, and the official split for MIMIC-CXR, as detailed in Table 1. Two types of evaluation metrics are employed: traditional natural language generation (NLG) metrics (BLEU [8], METEOR [9], and ROUGE-L [10]) and clinical efficacy (CE) metrics. For the CE metrics, we use the CheXbert [11] tool to automatically label the generated reports and compare these labels with those of the ground-truth reports across 14 thoracic disease categories using precision, recall, and F1 score.
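As an illustration of how the CE metrics are computed from binary disease labels, the sketch below assumes micro-averaging over all report-label pairs. The text does not state the averaging scheme, so this is an assumption, and the toy example uses 3 label columns in place of the 14 categories for brevity. The function names are hypothetical.

```python
from typing import List, Tuple


def micro_prf(pred: List[List[int]], ref: List[List[int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over binary label matrices."""
    tp = fp = fn = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            tp += int(p == 1 and r == 1)
            fp += int(p == 1 and r == 0)
            fn += int(p == 0 and r == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy labels: 2 reports x 3 findings (1 = finding present).
generated = [[1, 0, 1], [0, 1, 0]]
reference = [[1, 1, 0], [0, 1, 0]]
print(micro_prf(generated, reference))

# Consistency check against Table 2 (R2Gen-Mamba on MIMIC-CXR):
# precision 0.483 and recall 0.325 should give the reported F1 of 0.389.
print(round(f1_from(0.483, 0.325), 3))
```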

3.2 Implementation Details

Following [1, 2, 3], we use two images per patient for IU X-Ray and one image for MIMIC-CXR as input. The visual extractor utilizes a ResNet101 model pre-trained on ImageNet, with patch features projected to a dimension of 512. The Mamba encoder is set to a dimension of 512, with an SSM state expansion factor of 16, a local convolution width of 4, and a block expansion factor of 2. The Transformer decoder also has a dimension of 512, with 3 layers, 8 heads, and a dropout rate of 0.1. We use the Adam optimizer and set learning rates of $5\times 10^{-5}$ for the visual extractor and $1\times 10^{-4}$ for other parameters, decayed by 0.8 per epoch. The model that achieved the best BLEU-4 score on the validation sets is selected, with a beam size of 3 for inference to balance between generation quality and computational efficiency.
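The learning-rate schedule described above can be written out as a short sketch. It assumes the factor 0.8 is applied once after every epoch, starting from the base rate at epoch 0. The function name `decayed_lr` is hypothetical.

```python
def decayed_lr(base_lr: float, epoch: int, gamma: float = 0.8) -> float:
    """Learning rate at a given epoch under per-epoch exponential decay."""
    return base_lr * gamma ** epoch


VISUAL_EXTRACTOR_LR = 5e-5  # base rate for the visual extractor
OTHER_PARAMS_LR = 1e-4      # base rate for all other parameters

for epoch in range(4):
    print(
        epoch,
        f"{decayed_lr(VISUAL_EXTRACTOR_LR, epoch):.3e}",
        f"{decayed_lr(OTHER_PARAMS_LR, epoch):.3e}",
    )
```

Under this assumption, both rates fall to roughly half of their initial values after three epochs, since $0.8^{3}=0.512$.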

3.3 Visual and Quantitative Results

To evaluate the effectiveness of R2Gen-Mamba, we performed a comparative analysis against existing SOTA methods, namely R2Gen [1], R2Gen-CMN [2], and R2Gen-RL [3]. Using the same data, R2Gen and R2Gen-CMN were evaluated using their released code and checkpoints for inference, and R2Gen-RL was retrained from scratch using its released code. Several typical reports generated by the different methods are shown in Fig. 2. The reports generated by R2Gen-Mamba contain more precise information and are more accurate and clearer than those of the competing methods. The quantitative results for the NLG and CE metrics are summarized in Table 2, from which we draw several key findings.

First, R2Gen-Mamba, which incorporates Mamba and Transformer, outperforms existing approaches in most cases, suggesting the advantages of Mamba for report generation and the feasibility of combining Mamba with Transformer. Second, R2Gen-Mamba slightly underperforms R2Gen on BLEU-1 and BLEU-2 for MIMIC-CXR but surpasses it on BLEU-3, BLEU-4, and ROUGE-L, and matches it on METEOR. BLEU-1 and BLEU-2 measure the overlap of single words and word pairs, reflecting basic vocabulary matching. BLEU-3 and BLEU-4 measure the overlap of three- and four-word sequences, capturing longer context dependencies. Higher BLEU-3 and BLEU-4 scores suggest that R2Gen-Mamba generates text with better grammatical and semantic structure, reflecting stronger context modeling and grammatical consistency. METEOR combines lexical matching, word order, and morphological changes, while ROUGE-L assesses the longest common subsequence between generated and reference texts. The results of R2Gen-Mamba on these metrics suggest sound vocabulary choice, grammatical structure, and alignment with the reference text. Third, R2Gen-Mamba achieves the best precision and F1 score among the clinical efficacy (CE) metrics, with recall nearly on par with the best competing method, suggesting that the generated reports offer more valuable clinical information for diagnosis and decision-making. This highlights the clinical relevance and utility of R2Gen-Mamba compared with the competing methods.
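The difference between low-order and high-order BLEU components can be made concrete with the clipped n-gram precision that underlies BLEU [8]. The sketch below computes only this precision term; full BLEU additionally combines several orders through a geometric mean and applies a brevity penalty. The function names and example sentences are illustrative assumptions.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


reference = "the heart size is normal".split()
shuffled = "normal is size heart the".split()

for n in (1, 2, 3, 4):
    print(n, modified_precision(shuffled, reference, n))
```

The shuffled sentence uses exactly the reference vocabulary, so its unigram precision is perfect, yet it shares no bigram or longer n-gram with the reference. This is the sense in which BLEU-3 and BLEU-4 are more sensitive to word order and phrase structure than BLEU-1.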

3.4 Computation Complexity Analysis

The Mamba encoder in the proposed R2Gen-Mamba framework significantly reduces model complexity: it has only 594.944 K parameters and requires 58.216 M floating-point operations (FLOPs). In comparison, the Transformer encoder used in the SOTA R2Gen model has 4.728 M parameters and requires 462.422 M FLOPs, so the Mamba encoder is roughly eight times smaller and cheaper on both counts. This reduction in parameter count and computational cost highlights the efficiency of the Mamba encoder, making it more suitable for resource-constrained environments while maintaining strong performance in radiology report generation.
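The size of the reduction follows directly from the figures reported above, as the following arithmetic sketch shows. The variable names are illustrative.

```python
# Encoder statistics as reported in this section.
mamba_params = 594.944e3        # parameters, Mamba encoder
transformer_params = 4.728e6    # parameters, R2Gen Transformer encoder
mamba_flops = 58.216e6          # FLOPs, Mamba encoder
transformer_flops = 462.422e6   # FLOPs, R2Gen Transformer encoder

param_ratio = transformer_params / mamba_params
flop_ratio = transformer_flops / mamba_flops

print(f"parameters: {param_ratio:.2f}x fewer")
print(f"FLOPs:      {flop_ratio:.2f}x fewer")
```

Both ratios come out slightly below 8, which means the Mamba encoder uses about one eighth of the parameters and FLOPs of the Transformer encoder it replaces.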

4 CONCLUSION

This paper presents R2Gen-Mamba, a novel radiology report generation model that leverages Mamba’s efficient sequence processing and Transformer’s contextual strengths. R2Gen-Mamba reduces computational complexity while producing high-quality radiology reports. Experiments on two datasets show that R2Gen-Mamba surpasses existing methods in both natural language generation and clinical efficacy metrics. Our findings highlight the effectiveness of merging Mamba with Transformer techniques for radiology report generation.

5 COMPLIANCE WITH ETHICAL STANDARDS

This research was conducted retrospectively using human subject data made available in open access by IU X-Ray and MIMIC-CXR. Ethical approval was not required, as confirmed by the license attached to the open-access data.

6 ACKNOWLEDGMENTS

The research of M. Liu and H. Zhu was supported in part by NIH grants AG073297 and AG082938. The research of C. Lian was supported in part by NSFC Grants (Nos. 12326616, 62101431, and 62101430) and Natural Science Basic Research Program of Shaanxi (No. 2024JC-TBZC-09).

References

[1] Z. Chen, Y. Song, T.-H. Chang, and X. Wan, “Generating radiology reports via memory-driven Transformer,” in EMNLP, 2020, pp. 1439–1449.
[2] Z. Chen, Y. Shen, Y. Song, and X. Wan, “Cross-modal memory networks for radiology report generation,” in ACL/IJCNLP, 2021, pp. 5904–5914.
[3] H. Qin and Y. Song, “Reinforced cross-modal alignment for radiology report generation,” in ACL, 2022, pp. 448–458.
[4] A. Vaswani et al., “Attention is all you need,” in NeurIPS, 2017.
[5] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
[6] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald, “Preparing a collection of radiology examinations for distribution and retrieval,” J. Am. Med. Inform. Assoc., vol. 23, no. 2, pp. 304–310, 2016.
[7] A. E. W. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-Y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng, “MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs,” arXiv preprint arXiv:1901.07042, 2019.
[8] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in ACL, 2002, pp. 311–318.
[9] M. Denkowski and A. Lavie, “Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems,” in WMT, 2011, pp. 85–91.
[10] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summ. Branches Out, 2004, pp. 74–81.
[11] A. Smit, S. Jain, P. Rajpurkar, A. Pareek, A. Y. Ng, and M. P. Lungren, “Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT,” in EMNLP, 2020, pp. 1500–1519.