R2Gen-Mamba: A Selective State Space Model for Radiology Report Generation

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Abstract

Radiology report generation is crucial in medical imaging, but the manual annotation process by physicians is time-consuming and labor-intensive, necessitating the development of automatic report generation methods. Existing research predominantly utilizes Transformers to generate radiology reports, which can be computationally intensive, limiting their use in real applications. In this work, we present R2Gen-Mamba, a novel automatic radiology report generation method that combines the efficient sequence processing of Mamba with the contextual benefits of Transformer architectures. Owing to the lower computational complexity of Mamba, R2Gen-Mamba not only enhances training and inference efficiency but also produces high-quality reports. Experimental results on two benchmark datasets with more than 210,000 X-ray image-report pairs demonstrate the effectiveness of R2Gen-Mamba regarding report quality and computational efficiency compared with several state-of-the-art methods. The source code can be accessed online.

Index Terms— Radiology, Report Generation, Selective State Space Model, Transformer, Mamba

1 INTRODUCTION

Radiology report generation is crucial in medical imaging, offering key information necessary for diagnosing and managing patient conditions. Traditionally, these reports are manually annotated by physicians, which is time-consuming and labor-intensive. This challenge is further exacerbated by the ever-increasing volume of medical image data, making it difficult for radiologists to meet the demands for timely and accurate reporting. There has been a growing interest in developing automatic report generation methods that can alleviate the burden on medical professionals while maintaining the high standards required in clinical settings.

Numerous approaches have been introduced for automatic radiology report generation [1, 2, 3]. Most existing studies rely on Transformer models [4] that have demonstrated impressive performance in a variety of natural language processing tasks such as image captioning and text generation. Transformers leverage self-attention mechanisms to model long-range dependencies, making them particularly well-suited for generating coherent and contextually relevant reports from complex medical images. However, Transformer models are often criticized for their high computational complexity, limiting their use in real applications. Recently, the Mamba model [5], designed to reduce computational complexity without compromising performance, has attracted increasing attention. Mamba’s efficient sequence processing capabilities make it an attractive alternative to Transformers, but its potential for radiology report generation has not yet been fully explored.

In this work, we propose a novel radiology report generation method, called R2Gen-Mamba, which leverages the strengths of both Mamba and Transformer architectures. Specifically, R2Gen-Mamba uses Mamba, with its low computational complexity, as the encoder, and a Transformer, with its powerful contextual processing capability, as the decoder. By combining these complementary models, R2Gen-Mamba provides a new pathway for reducing the computational burden in radiology while ensuring high-quality, contextually relevant reports. Experimental results on two benchmark datasets, IU X-Ray [6] and MIMIC-CXR [7], suggest that R2Gen-Mamba outperforms traditional Transformer-based models regarding report quality and computational efficiency. Compared with state-of-the-art (SOTA) studies, R2Gen-Mamba provides a more resource-efficient solution for automatic radiology report generation.

2 METHODOLOGY

Fig. 1: Architecture of the proposed R2Gen-Mamba framework, with visual extractor and decoder denoted by gray dashed boxes. The Mamba encoder is highlighted within green dashed boxes. Conv: convolution; SSM: selective state space model; Linear: linear projection.

Radiology report generation can be framed as a sequence-to-sequence problem, where the image patch features serve as the input sequence and the corresponding report as the target sequence. The input patch feature sequence is $\mathbf{X}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{S}\}$, where $S$ is the number of patches and each $\mathbf{x}_{s}\in\mathbb{R}^{d}$ consists of visual features extracted from an image patch using a pre-trained visual extractor such as a convolutional neural network. The output sequence $Y=\{y_{1},y_{2},\dots,y_{T}\}$ represents the generated report, where $T$ is the maximum report length and each $y_{t}$ is a token from a predefined vocabulary. This sequence-to-sequence framework is optimized by maximizing the likelihood of generating the correct report given the input image. Our R2Gen-Mamba contains three major parts (i.e., visual extractor, Mamba encoder, and Transformer decoder), which are outlined in the subsequent subsections.

2.1 Visual Extractor

To produce radiology reports, we begin by extracting visual features from the radiology images using convolutional neural networks such as VGG or ResNet. As illustrated in Fig. 1, the image is passed through the visual extractor to obtain a feature map. Each spatial location in the feature map corresponds to a patch in the original image. These spatial locations are flattened to obtain a sequence representation that serves as the input sequence for the subsequent Mamba encoder. This process is formally represented as $\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{S}\}=f_{v}(Img)$, where $f_{v}(\cdot)$ is the visual extractor and $Img$ is the input image.
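The flattening step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the nested-list layout (height, width, channels), the row-major ordering, and the name `flatten_feature_map` are hypothetical and not taken from the released code. As a point of reference, a ResNet-style backbone with a total stride of 32 applied to a 224×224 input (an assumed input size, not stated in the text) would yield a 7×7 feature map, i.e. $S=49$.

```python
from typing import List

# Hypothetical layout: feature_map[row][col] is a d-dimensional feature vector.
FeatureMap = List[List[List[float]]]


def flatten_feature_map(feature_map: FeatureMap) -> List[List[float]]:
    """Flatten an H x W x d feature map into S = H * W patch features.

    Row-major order is an assumption made for this sketch.
    """
    return [list(vector) for row in feature_map for vector in row]


# Toy 2 x 3 feature map with d = 2 channels per spatial location.
toy_map = [
    [[float(10 * r + c), float(r - c)] for c in range(3)]
    for r in range(2)
]
patch_sequence = flatten_feature_map(toy_map)

# S = 2 * 3 = 6 patch features, each of dimension d = 2.
print(len(patch_sequence), len(patch_sequence[0]))
```

Each element of `patch_sequence` plays the role of one $\mathbf{x}_{s}$ in the notation above.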

2.2 Mamba Encoder

To extract contextual semantic information, we use Mamba as the encoder. Mamba is designed to process sequence data. Compared with Transformers, whose self-attention has quadratic computational complexity, Mamba has linear complexity in the number of tokens. Given the input sequence $\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{S}\}$, the output sequence $\mathbf{Z}$ is obtained by $\{\mathbf{z}_{1},\mathbf{z}_{2},\dots,\mathbf{z}_{S}\}=f_{e}(\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{S})$, where $f_{e}$ denotes the Mamba encoder. As for the core state space model (SSM) of Mamba, given the input sequence $\mathbf{U}$, the output sequence $\mathbf{V}$ is obtained by $\{\mathbf{v}_{1},\mathbf{v}_{2},\dots,\mathbf{v}_{S}\}=\text{SSM}(\mathbf{u}_{1},\mathbf{u}_{2},\dots,\mathbf{u}_{S})$.
Specifically, as illustrated in Fig. 1, each $\mathbf{u}_{t}$, $t\in\{1,2,\dots,S\}$, is fed into linear layers to obtain the continuous parameters: $\mathbf{B}_{t},\mathbf{C}_{t},\Delta_{t}=\text{Project}(\mathbf{u}_{t})$. Discretization is then performed by zero-order hold (ZOH): $\bar{\mathbf{A}}_{t}=\exp(\Delta_{t}\mathbf{A})$; $\bar{\mathbf{B}}_{t}=(\Delta_{t}\mathbf{A})^{-1}(\exp(\Delta_{t}\mathbf{A})-\mathbf{I})\cdot\Delta_{t}\mathbf{B}_{t}$, where $\mathbf{A}$ is a learnable parameter matrix.
Finally, the sequence-to-sequence transformation is achieved in two stages: $\mathbf{h}_{t}=\bar{\mathbf{A}}_{t}\mathbf{h}_{t-1}+\bar{\mathbf{B}}_{t}\mathbf{u}_{t}$; $\mathbf{v}_{t}=\mathbf{C}_{t}\mathbf{h}_{t}$.
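To make the recurrence concrete, the following is a minimal single-channel sketch of the selective scan in plain Python. It assumes a diagonal $\mathbf{A}$ (as is common in Mamba implementations), so the ZOH formulas reduce to element-wise expressions: $\bar{a}=\exp(\Delta_t a)$ and $\bar{b}=(\exp(\Delta_t a)-1)/a\cdot b_t$. The projection used here (scalar weights `w_b`, `w_c`, `w_dt` and a softplus on the step size) is a hypothetical stand-in for the `Project` layers; it is not the authors' implementation, which relies on an optimized parallel scan.

```python
import math
from typing import List, Sequence


def softplus(x: float) -> float:
    """Keep the step size Delta_t strictly positive."""
    return math.log1p(math.exp(x))


def selective_scan(
    u: Sequence[float],      # input sequence u_1..u_S (one channel)
    a: Sequence[float],      # diagonal of A, nonzero (typically negative)
    w_b: Sequence[float],    # hypothetical projection weights for B_t
    w_c: Sequence[float],    # hypothetical projection weights for C_t
    w_dt: float,             # hypothetical projection weight for Delta_t
    b_dt: float = 0.0,       # hypothetical bias for Delta_t
) -> List[float]:
    h = [0.0] * len(a)       # hidden state h_0 = 0
    outputs = []
    for u_t in u:            # one pass over the sequence: linear in S
        # Input-dependent (selective) parameters B_t, C_t, Delta_t.
        dt = softplus(w_dt * u_t + b_dt)
        b_t = [w * u_t for w in w_b]
        c_t = [w * u_t for w in w_c]
        # Zero-order-hold discretization for a diagonal A.
        a_bar = [math.exp(dt * a_n) for a_n in a]
        b_bar = [(ab - 1.0) / a_n * b_n for ab, a_n, b_n in zip(a_bar, a, b_t)]
        # State update h_t = A_bar h_{t-1} + B_bar u_t, then read-out v_t = C_t h_t.
        h = [ab * h_n + bb * u_t for ab, h_n, bb in zip(a_bar, h, b_bar)]
        outputs.append(sum(c_n * h_n for c_n, h_n in zip(c_t, h)))
    return outputs


# Toy run with a one-dimensional state.
toy_outputs = selective_scan([1.0, 1.0], a=[-1.0], w_b=[1.0], w_c=[1.0], w_dt=0.0)
print(toy_outputs)
```

The loop visits each token once and keeps a fixed-size state, which is where the linear cost in $S$ comes from; self-attention, by contrast, compares every token with every other token.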

2.3 Transformer Decoder

In the proposed R2Gen-Mamba, the decoder is built upon the standard Transformer architecture. The decoding procedure is formulated as $y_{t}=f_{d}(\mathbf{z}_{1},\mathbf{z}_{2},\dots,\mathbf{z}_{S},y_{1},\dots,y_{t-1})$, where $f_{d}(\cdot)$ is the Transformer decoder. As noted in [4], the decoder relies on the tokens generated at previous steps due to its auto-regressive nature and requires additional attention mechanisms, so we stack the decoder layer $N_{d}$ times. In our experiments, we set $N_{d}$ to 3.
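The auto-regressive decoding procedure can be illustrated with a small beam search, the strategy used at inference time with a beam size of 3. The sketch below is a simplified illustration, not the released implementation: `step_fn` is a hypothetical stand-in for $f_{d}$ that returns next-token probabilities as log values given the tokens generated so far (the encoder outputs are assumed to be captured inside it), the toy bigram table is invented for the example, and no length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens so far, summed log-probability)


def beam_search(
    step_fn: Callable[[List[str]], Dict[str, float]],
    bos: str,
    eos: str,
    beam_size: int = 3,
    max_len: int = 10,
) -> List[str]:
    beams: List[Hypothesis] = [([bos], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for token, log_prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + log_prob))
        # Keep the best `beam_size` candidates; set aside completed ones.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


# Hypothetical toy model: the next-token distribution depends only on the last token.
TOY_TABLE = {
    "<bos>": {"lungs": 0.6, "heart": 0.4},
    "lungs": {"clear": 0.5, "normal": 0.5},
    "heart": {"normal": 1.0},
    "clear": {"<eos>": 1.0},
    "normal": {"<eos>": 1.0},
}


def toy_step(tokens: List[str]) -> Dict[str, float]:
    return {tok: math.log(p) for tok, p in TOY_TABLE[tokens[-1]].items()}


print(beam_search(toy_step, "<bos>", "<eos>", beam_size=3))
print(beam_search(toy_step, "<bos>", "<eos>", beam_size=1))  # greedy decoding
```

In this toy table, greedy decoding (beam size 1) commits to the locally most probable first token and ends with a sequence of probability 0.3, whereas a beam of 3 recovers the sequence with probability 0.4. This is the usual motivation for keeping several hypotheses alive.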

2.4 Objective Function

The overall generation process in R2Gen-Mamba can be mathematically framed as a recursive application of the chain rule, where the probability of the target sequence $\{y_{1},y_{2},\dots,y_{T}\}$ given the input image $Img$ is expressed as $p(Y\mid Img)=\prod_{t=1}^{T}p(y_{t}\mid y_{1},\dots,y_{t-1},Img)$. The model is trained by maximizing the likelihood of the target sequence conditioned on the input image:

$$\theta^{*}=\operatorname*{arg\,max}_{\theta}\sum_{t=1}^{T}\log p(y_{t}\mid y_{1},\dots,y_{t-1},Img;\theta) \tag{1}$$

where $\theta^{*}$ represents the optimal parameters of R2Gen-Mamba. This optimization encourages the model to generate report text that is consistent with the visual features extracted from the input image. During inference, we use beam search to generate predictions. To facilitate reproducible research, we have released the source code publicly on GitHub.
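As a small numerical illustration of the objective, the log-likelihood of a target report is the sum of the per-token log-probabilities, and training minimizes its negative (the usual cross-entropy loss under teacher forcing). The function and variable names below are hypothetical, and the per-step distributions are toy values rather than model outputs.

```python
import math
from typing import Dict, List, Sequence


def sequence_log_likelihood(
    step_probs: Sequence[Dict[str, float]],  # step_probs[t][y] = p(y | y_<t, Img)
    target: Sequence[str],                   # ground-truth tokens y_1..y_T
) -> float:
    """Sum over t of log p(y_t | y_1..y_{t-1}, Img), the quantity maximized in training."""
    return sum(math.log(probs[y]) for probs, y in zip(step_probs, target))


# Toy distributions for a two-token report.
toy_probs: List[Dict[str, float]] = [
    {"clear": 0.5, "opacity": 0.5},
    {"clear": 0.25, "lungs": 0.75},
]
toy_target = ["clear", "lungs"]

log_likelihood = sequence_log_likelihood(toy_probs, toy_target)
loss = -log_likelihood  # negative log-likelihood to be minimized

# By the chain rule, exp(log-likelihood) is the product of per-token
# probabilities: 0.5 * 0.75 = 0.375.
print(math.exp(log_likelihood), loss)
```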

3 EXPERIMENTS

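Table 1: Details of two benchmark datasets used in this work.

| Dataset | IU X-Ray (Train) | IU X-Ray (Validation) | IU X-Ray (Test) | MIMIC-CXR (Train) | MIMIC-CXR (Validation) | MIMIC-CXR (Test) |
| --- | --- | --- | --- | --- | --- | --- |
| Image # | 5.23K | 0.75K | 1.50K | 368.96K | 2.99K | 5.16K |
| Report # | 2.77K | 0.40K | 0.79K | 222.76K | 1.81K | 3.27K |
| Patient # | 2.77K | 0.40K | 0.79K | 64.59K | 0.50K | 0.29K |
| Average Length | 37.56 | 36.78 | 33.62 | 53.00 | 53.05 | 66.40 |

Table 2: Comparisons of different methods on IU X-Ray and MIMIC-CXR. ‘BLEU-x’: BLEU score with an n-gram size of x. NLG metrics: BLEU-1 to ROUGE-L; CE metrics: Precision, Recall, and F1 score. The best results are highlighted in bold.

| Data | Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IU X-Ray | R2Gen | 0.423 | 0.275 | 0.203 | 0.160 | 0.176 | 0.358 | - | - | - |
| IU X-Ray | R2Gen-CMN | 0.470 | 0.300 | 0.215 | 0.166 | 0.189 | 0.367 | - | - | - |
| IU X-Ray | R2Gen-RL | 0.291 | 0.178 | 0.121 | 0.086 | 0.096 | 0.312 | - | - | - |
| IU X-Ray | R2Gen-Mamba (Ours) | **0.482** | **0.315** | **0.228** | **0.176** | **0.208** | **0.382** | - | - | - |
| MIMIC-CXR | R2Gen | **0.371** | **0.223** | 0.148 | 0.105 | **0.141** | 0.271 | 0.429 | 0.243 | 0.310 |
| MIMIC-CXR | R2Gen-CMN | 0.352 | 0.214 | 0.141 | 0.099 | 0.139 | 0.274 | 0.441 | **0.326** | 0.375 |
| MIMIC-CXR | R2Gen-RL | 0.122 | 0.067 | 0.042 | 0.028 | 0.047 | 0.137 | 0.061 | 0.027 | 0.038 |
| MIMIC-CXR | R2Gen-Mamba (Ours) | 0.352 | 0.222 | **0.152** | **0.110** | **0.141** | **0.284** | **0.483** | 0.325 | **0.389** |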

3.1 Experimental Setup

We perform experiments on two benchmark datasets: IU X-Ray [6] and MIMIC-CXR [7]. The IU X-Ray dataset includes 7,470 chest X-ray images paired with 3,955 reports, while MIMIC-CXR comprises 473,057 images and 206,563 reports. Following prior studies [1, 2, 3], we exclude samples without reports. We use a 70%/10%/20% split for training, validation, and testing on IU X-Ray, and the official split for MIMIC-CXR, as detailed in Table 1. Two types of evaluation metrics are employed: traditional natural language generation (NLG) metrics (BLEU [8], METEOR [9], and ROUGE-L [10]) and clinical efficacy (CE) metrics. For CE metrics, we use the CheXbert [11] tool to automatically label the generated reports and compare these labels with those of the ground-truth reports across 14 thoracic disease categories using precision, recall, and F1 score.
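The CE metrics can be sketched as follows, assuming each report has already been converted into a binary label vector (one entry per disease category) by an automatic labeler. The micro-averaging over all report-category pairs and the function names are assumptions made for illustration; the text does not specify the averaging scheme. The toy vectors use 4 categories instead of 14 to stay readable.

```python
from typing import Sequence, Tuple


def micro_prf(
    predicted: Sequence[Sequence[int]],  # labels of generated reports (0/1)
    gold: Sequence[Sequence[int]],       # labels of ground-truth reports (0/1)
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all report-category pairs."""
    tp = fp = fn = 0
    for pred_row, gold_row in zip(predicted, gold):
        for p, g in zip(pred_row, gold_row):
            tp += int(p == 1 and g == 1)
            fp += int(p == 1 and g == 0)
            fn += int(p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_pr(precision, recall)


def f1_from_pr(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy example: 2 reports, 4 categories.
gold_labels = [[1, 0, 1, 0], [0, 1, 0, 0]]
pred_labels = [[1, 0, 0, 0], [0, 1, 1, 0]]
print(micro_prf(pred_labels, gold_labels))

# Consistency check against Table 2 (R2Gen-Mamba on MIMIC-CXR):
# precision 0.483 and recall 0.325 give an F1 that rounds to 0.389.
print(round(f1_from_pr(0.483, 0.325), 3))
```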

3.2 Implementation Details

Following [1, 2, 3], we use two images per patient for IU X-Ray and one image for MIMIC-CXR as input. The visual extractor utilizes a ResNet101 model pre-trained on ImageNet, with patch features projected to a dimension of 512. The Mamba encoder is set to a dimension of 512, with an SSM state expansion factor of 16, a local convolution width of 4, and a block expansion factor of 2. The Transformer decoder also has a dimension of 512, with 3 layers, 8 heads, and a dropout rate of 0.1. We use the Adam optimizer and set learning rates of $5\times 10^{-5}$ for the visual extractor and $1\times 10^{-4}$ for other parameters, each decayed by a factor of 0.8 per epoch. The model that achieves the best BLEU-4 score on the validation set is selected, with a beam size of 3 for inference to balance generation quality and computational efficiency.
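The learning-rate schedule described above can be written out explicitly. The sketch assumes that the decay is applied once at the end of every epoch and that epochs are counted from 0; both are assumptions, as is the function name. In PyTorch, a comparable schedule could be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.8)`, although whether the released code uses this scheduler is not stated here.

```python
def learning_rate(base_lr: float, epoch: int, gamma: float = 0.8) -> float:
    """Learning rate in a given epoch (0-indexed) under per-epoch exponential decay."""
    return base_lr * gamma ** epoch


# The two parameter groups described in the text.
BASE_LR_VISUAL_EXTRACTOR = 5e-5
BASE_LR_OTHER = 1e-4

for epoch in range(4):
    print(
        epoch,
        learning_rate(BASE_LR_VISUAL_EXTRACTOR, epoch),
        learning_rate(BASE_LR_OTHER, epoch),
    )
```

After three decays the rate for the non-extractor parameters is $10^{-4}\times 0.8^{3}=5.12\times 10^{-5}$, roughly half of its initial value.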

Fig. 2: Examples of ground truth and generated reports by different methods, with similar findings marked in the same color.

3.3 Visual and Quantitative Results

To evaluate the effectiveness of our R2Gen-Mamba, we performed a comparative analysis against existing SOTA methods, namely R2Gen [1], R2Gen-CMN [2], and R2Gen-RL [3]. Using the same data, R2Gen and R2Gen-CMN were evaluated using their released code and checkpoints for inference, and R2Gen-RL was retrained from scratch using its released code. Several typical reports generated by different methods are shown in Fig. 2. It can be seen from this figure that the reports generated by R2Gen-Mamba contain more precise information and are more accurate and clearer than those of the competing methods. The quantitative results regarding NLG and CE metrics are summarized in Table 2, from which we have several key findings.

Firstly, our R2Gen-Mamba, which incorporates Mamba and Transformer, outperforms existing approaches in most cases, suggesting the advantages of Mamba for report generation and the feasibility of combining Mamba with Transformer. Secondly, on MIMIC-CXR, R2Gen-Mamba slightly underperforms R2Gen on BLEU-1 and BLEU-2, matches it on METEOR (0.141 for both), and surpasses it on BLEU-3, BLEU-4, and ROUGE-L. BLEU-1 and BLEU-2 measure the overlap of single words and word pairs, reflecting basic vocabulary matching. BLEU-3 and BLEU-4 measure the overlap of three- and four-word sequences, capturing longer context dependencies. Higher BLEU-3 and BLEU-4 scores therefore suggest that R2Gen-Mamba generates text with better grammatical and semantic structure, reflecting stronger context modeling. METEOR combines lexical matching, word order, and morphological variation, while ROUGE-L assesses the longest common subsequence between generated and reference texts. The gains of R2Gen-Mamba on ROUGE-L, together with its gains on METEOR for IU X-Ray, point to better word choice, sentence structure, and alignment with the reference text. Thirdly, R2Gen-Mamba achieves the best precision and F1 score on the clinical efficacy (CE) metrics, with recall (0.325) nearly equal to the best value (0.326, R2Gen-CMN), suggesting that the generated reports offer more valuable clinical information for diagnosis and decision-making. This highlights the clinical relevance and utility of our R2Gen-Mamba compared with the competing methods.
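To illustrate what these metrics count, the sketch below computes the clipped n-gram precision that underlies BLEU-n and the longest-common-subsequence (LCS) length that underlies ROUGE-L. It is deliberately partial: full BLEU also combines several n-gram orders and applies a brevity penalty, and ROUGE-L turns the LCS length into an F-measure. The example sentences and function names are invented for illustration.

```python
from collections import Counter
from typing import List, Sequence


def ngram_precision(candidate: Sequence[str], reference: Sequence[str], n: int) -> float:
    """Clipped n-gram precision: fraction of candidate n-grams found in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (dynamic programming, one row)."""
    row: List[int] = [0] * (len(b) + 1)
    for x in a:
        diagonal = 0  # value of the previous row at column j - 1
        for j, y in enumerate(b, start=1):
            above = row[j]
            row[j] = diagonal + 1 if x == y else max(row[j], row[j - 1])
            diagonal = above
    return row[-1]


reference = "the lungs are clear without focal consolidation".split()
candidate = "the lungs are clear no focal consolidation".split()

for n in (1, 2, 4):
    print(n, ngram_precision(candidate, reference, n))
print(lcs_length(candidate, reference))
```

A single substituted word costs one of seven unigrams but three of four 4-grams, which is why the higher-order BLEU scores are more sensitive to local word order and phrasing than BLEU-1.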

3.4 Computation Complexity Analysis

With the Mamba encoder in the proposed R2Gen-Mamba framework, we can significantly reduce model complexity, with only 594.944 K parameters and incurring a computational load of 58.216 M floating-point operations (FLOPs). This represents a substantial improvement over the Transformer encoder utilized in the SOTA R2Gen model, which comprises 4.728 M parameters and incurs a computational complexity of 462.422 M FLOPs. The considerable reduction in both parameter count and computational cost highlights the efficiency of the Mamba encoder, making it more suitable for resource-constrained environments while maintaining superior performance in radiology report generation.
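The size of the reduction follows directly from the reported figures; the short calculation below only restates them as ratios. The variable names are chosen for this example.

```python
# Reported encoder statistics (parameters and FLOPs).
MAMBA_PARAMS = 594.944e3
TRANSFORMER_PARAMS = 4.728e6
MAMBA_FLOPS = 58.216e6
TRANSFORMER_FLOPS = 462.422e6

param_ratio = TRANSFORMER_PARAMS / MAMBA_PARAMS
flop_ratio = TRANSFORMER_FLOPS / MAMBA_FLOPS
param_reduction = 1.0 - MAMBA_PARAMS / TRANSFORMER_PARAMS

print(f"parameter ratio: {param_ratio:.2f}x")
print(f"FLOP ratio: {flop_ratio:.2f}x")
print(f"parameter reduction: {param_reduction:.1%}")
```

Both ratios come out just under 8, i.e. the Mamba encoder uses roughly one eighth of the parameters and FLOPs of the Transformer encoder it replaces.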

4 CONCLUSION

This paper presents R2Gen-Mamba, a novel radiology report generation model that leverages Mamba’s efficient sequence processing and Transformer’s contextual strengths. R2Gen-Mamba reduces computational complexity while producing high-quality radiology reports. Experiments on two datasets show that R2Gen-Mamba surpasses existing methods in both natural language generation and clinical efficacy metrics. Our findings highlight the effectiveness of merging Mamba with Transformer techniques for radiology report generation.

5 COMPLIANCE WITH ETHICAL STANDARDS

This research was conducted retrospectively using human subject data made available in open access by IU X-Ray and MIMIC-CXR. Ethical approval was not required as confirmed by the license attached with the open-access data.

6 ACKNOWLEDGMENTS

The research of M. Liu and H. Zhu was supported in part by NIH grants AG073297 and AG082938. The research of C. Lian was supported in part by NSFC Grants (Nos. 12326616, 62101431, and 62101430) and the Natural Science Basic Research Program of Shaanxi (No. 2024JC-TBZC-09).

References

  • [1] Z. Chen, Y. Song, T.-H. Chang, and X. Wan, “Generating radiology reports via memory-driven Transformer,” in EMNLP, 2020, pp. 1439–1449.
  • [2] Z. Chen, Y. Shen, Y. Song, and X. Wan, “Cross-modal memory networks for radiology report generation,” in ACL/IJCNLP, 2021, pp. 5904–5914.
  • [3] H. Qin and Y. Song, “Reinforced cross-modal alignment for radiology report generation,” in ACL, 2022, pp. 448–458.
  • [4] A. Vaswani et al., “Attention is all you need,” in NeurIPS, 2017.
  • [5] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  • [6] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald, “Preparing a collection of radiology examinations for distribution and retrieval,” J. Am. Med. Inform. Assoc., vol. 23, no. 2, pp. 304–310, 2016.
  • [7] A. E. W. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-Y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng, “MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs,” arXiv preprint arXiv:1901.07042, 2019.
  • [8] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in ACL, 2002, pp. 311–318.
  • [9] M. Denkowski and A. Lavie, “Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems,” in WMT, 2011, pp. 85–91.
  • [10] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summ. Branches Out, 2004, pp. 74–81.
  • [11] A. Smit, S. Jain, P. Rajpurkar, A. Pareek, A. Y. Ng, and M. P. Lungren, “Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT,” in EMNLP, 2020, pp. 1500–1519.