1 Introduction

In December 2019, the COVID-19 outbreak emerged and then spread rapidly around the world. It has not only drained the world economy but also threatened the lives of people everywhere (Nicola et al. 2020). Early detection, early diagnosis, and early treatment are therefore important means of improving patients' survival rates (Yan et al. 2020). However, it is difficult to confirm the severity of infection by direct clinical assessment alone (Munusamy et al. 2021); thus, doctors diagnose COVID-19 lung involvement from CT images. CT can reveal distinctive features, including ground-glass opacity (GGO), pulmonary fibrosis (PF), pleural effusion (PE), and pulmonary consolidation (PC), which have important research value and practical significance for the early diagnosis of lung lesions (Shi et al. 2020; Kanne 2019). In diagnosis, doctors must mentally reconstruct 3D anatomy from 2D CT slices to determine the location and size of pathological tissue. Nevertheless, with the large increase in confirmed and suspected COVID-19 cases, doctors must spend considerable time and effort manually labeling CT lesion areas. Computer-aided systems can therefore help doctors diagnose lung infection and quantitatively evaluate the effect before and after treatment. This not only improves doctors' efficiency in interpreting medical images but also strengthens their clinical diagnosis ability and improves the patient cure rate.

Recently, deep learning-based computerized imaging diagnostic systems have been used to help examine infected patients; these systems learn features that identify areas of lung infection. For example, Kumar Singh et al. (2021) proposed LungINFseg, a model for segmenting COVID-19 infections in lung CT images based on a receptive-field-aware (RFA) module. The RFA module enlarges the receptive field of the segmentation model and learns context information. Wang et al. (2020) proposed a deep convolutional neural network (COVID-Net) that screens patients with suspected infection by identifying obvious signs of COVID-19 in chest X-rays. To relieve the diagnostic pressure caused by the lack of labeled data, Zhou et al. (2019) adopted a self-supervised learning strategy to effectively improve the utilization of a mass of unlabeled images; moreover, they used Models Genesis to achieve 3D transfer learning on medical images. Alhudhaif et al. (2021) designed a generalized convolutional neural network capable of identifying COVID-19 through feature extraction from chest X-ray images. Wang et al. (2021) proposed DeepSC-COVID, a model for 3D lesion segmentation and classification of COVID-19, and realized assisted diagnosis through multitask learning.

Despite the emergence of intelligent diagnostic systems for COVID-19 and the active exploration of lung infection regions, many challenges remain. First, feature analysis and information extraction are hampered by the large morphological differences and variable locations of infected regions in lung CT images. Second, compared with natural images, CT images have low contrast and are susceptible to noise, resulting in blurred edges between different tissues or between tissues and lesions, which increases the segmentation difficulty. In addition, collecting and labeling data for such studies is difficult. Therefore, producing reliable pseudo-labels is essential for assisting doctors in diagnosing patients.

To deal with the above challenges, we propose a novel semi-supervised dual-task balanced fusion network (DBF-Net) to produce high-quality pseudo-labels of lung infection areas in CT images. Inspired by the way radiologists detect infected regions, we first roughly locate the infected region and then determine its outline according to local characteristics. In our view, relatively clear regions and boundaries are the key features for determining whether the lung is infected. Our deep learning model extracts boundary information layer by layer through a fusiform equilibrium fusion pyramid. The original CT image and the enhanced image are fed into the two network branches for training and learning, thus extracting more complete image information. In addition, we design a semi-supervised learning framework that combines unlabeled and labeled data for training and effectively produces pseudo-labels to expand the infected-region segmentation training dataset. In summary, our main contributions are threefold:

(1) We design a novel deep learning network (DBF-Net) that realizes dual-task learning for lung CT image segmentation through a unique dual-branch training scheme. A lightweight double convolution module is used for down-sampling; it is simpler and more effective than an ordinary down-sampling module.

(2) We propose the fusiform equilibrium fusion pyramid (FEFP), used during down-sampling for layer-by-layer feature extraction. The pyramid convolution is divided into levels, each corresponding to a convolution kernel of a different size. The kernels at the top of one pyramid are then fused, in sequence, with the kernels at the bottom of the other pyramid. The aggregated features exchange context information, reduce the number of convolution parameters, and achieve balanced feature fusion.

(3) We combine the image enhancement approach with a semi-supervised learning strategy in our model to generate pseudo-labels, which improves the utilization of unlabeled data by selecting a specific quantity of unlabeled data and labeled data for mixed training each time.

2 Related work

2.1 Lung CT image segmentation

The diagnostic results of lung CT images can be used as the evaluation basis for patients with COVID-19 (Sluimer et al. 2006; Kamble et al. 2020). Radiologists segment the lung lesion area by viewing lung CT images and combining them with clinical information to diagnose COVID-19 patients. After the global outbreak of COVID-19, many researchers carried out research on deep learning-based COVID-19 lung CT image segmentation. Based on U-Net, Chen et al. (2020) utilized aggregated residual transformations and a soft attention mechanism to learn robust and expressive feature representations, thereby improving the model’s ability to distinguish various symptoms of COVID-19. Rajamani et al. (2021) proposed a dynamic deformable attention network (DDANet) for COVID-19 lesion semantic segmentation. The model is based on a deformable criss-cross attention block, which continuously learns sparse attention filter offsets to capture sufficient context information and improve segmentation performance. To solve the problem of insufficient training samples, Shan et al. (2021) proposed a VB-Net model based on a “bottleneck structure” to segment COVID-19 CT images and a “human-in-the-loop (HITL)” semi-supervised training strategy involving professional doctors to reduce network training time and improve segmentation efficiency. In addition, some studies combine classification and segmentation. Wang et al. (2020) developed a weakly supervised deep learning framework using 3D CT volumes, which can accurately predict the probability of COVID-19 infection and find lesion regions in chest CT without lesion-level labels for training. This easily trained, high-performance deep learning algorithm provides a method for quickly identifying COVID-19 patients, which is conducive to controlling the outbreak of SARS-CoV-2. Li et al. (2020) created a fully automated framework for detecting COVID-19 in lung CT and distinguishing community-acquired pneumonia from other non-pneumonic lung diseases.

2.2 Semi-supervised learning

Semi-supervised learning (SSL) has been extensively studied in various computer vision tasks. Recently, to reduce the labeling burden, an increasing number of scholars have devoted themselves to deep learning models for semi-supervised medical image segmentation. Existing semi-supervised methods fall into two main categories. The first category is based on pseudo-labels (Fan et al. 2020; Bai et al. 2017): a segmentation model is trained on labeled images and then applied to unlabeled images to obtain pseudo-labels, which in turn improve the model. Fan et al. (2020) incrementally augmented the training dataset with unlabeled data and then generated pseudo-labels for training. Bai et al. (2017) refined pseudo-labels by improving pseudo-segmentation labels, adjusting network parameters, or using a conditional random field (CRF). However, this category of methods ignores the quality of the pseudo-labels, which may not improve the network’s learning performance. The second category learns from both labeled and unlabeled images, usually combining a supervised loss function for labeled images with an unsupervised regularization loss function for all images. Cui et al. (2019) proposed a consistency loss to exploit unlabeled data and added an exponential moving average to prevent overfitting. Li et al. (2020) introduced more data perturbations and model perturbations into the teacher-student model to enforce consistency of the same input under different perturbations. Chen et al. (2019) simultaneously optimized supervised segmentation and unsupervised segmentation-reconstruction targets, where the reconstruction targets adopt an attention mechanism to separate the image reconstruction regions corresponding to different categories. Nie et al. (2018) proposed a novel deep adversarial network that encourages partially unlabeled images to approximate labeled images for biological-image segmentation.

3 Methods

This section introduces our proposed balanced fusion network (DBF-Net) based on dual-task consistency, as well as its key modules. Our model combines image enhancement with a semi-supervised learning framework (Shan et al. 2021) to augment the training dataset from limited labeled data, thereby improving the segmentation accuracy of lung CT images. In addition, we extend DBF-Net to utilize pseudo-labels for segmentation tasks. Experimental comparison with mainstream segmentation algorithms shows the superiority of the proposed algorithm.

3.1 Dual-task balanced fusion network (DBF-Net)

The architecture of our DBF-Net is shown in Fig. 1. It adopts the encoder-decoder structure commonly and effectively used for medical image segmentation (Zhou et al. 2020; Liu et al. 2017; Wang et al. 2019). In the encoder, we first feed the original image and the enhanced image simultaneously into the two network branches and expand the channel dimension with a 1\(\times\)1 standard convolution. Then, the lightweight double convolution (LDC) module performs the down-sampling operation to extract image information layer by layer. Meanwhile, the fusiform equilibrium fusion pyramid (FEFP) is embedded behind the LDC module of each branch, and in each FEFP operation, features from different feature layers are fused with the other branch to reduce information loss. Next, we utilize the attentional feature fusion (AFF) module (Dai et al. 2021) to optimize the results of the two branches, and then use the atrous spatial pyramid pooling (ASPP) module (Chen et al. 2017) to enlarge the convolutional receptive field and effectively learn the edge features of lung-infected regions. Finally, we use a decoding structure similar to U-Net to complete up-sampling and obtain the final mask of the lung-infected regions.

Fig. 1
figure 1

The architecture of our proposed DBF-Net, which consists of a lightweight double convolution (LDC) module connected to the fusiform equilibrium fusion pyramid (FEFP) convolution
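For intuition, the following minimal PyTorch skeleton mirrors the data flow of Fig. 1. It is a sketch under stated assumptions, not the authors' implementation: the LDC and FEFP blocks (detailed in Sects. 3.2 and 3.3) and the cited AFF/ASPP modules are replaced by simple stand-in layers, and all channel widths are ours.

```python
import torch
import torch.nn as nn

class DBFNetSketch(nn.Module):
    """Data-flow skeleton of Fig. 1. Stand-in layers mark where the real
    LDC, FEFP, AFF, and ASPP modules would sit."""
    def __init__(self, ch=32, depth=4):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, 1)  # 1x1 conv for channel expansion
        def block():  # stand-in for LDC + FEFP (stride-2 down-sampling)
            return nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1),
                                 nn.ReLU(inplace=True))
        self.enc_a = nn.ModuleList(block() for _ in range(depth))
        self.enc_b = nn.ModuleList(block() for _ in range(depth))
        self.fuse = nn.Conv2d(ch, ch, 1)  # stand-in for AFF + ASPP
        self.decoder = nn.Sequential(     # stand-in for the U-Net-like decoder
            nn.Upsample(scale_factor=2 ** depth, mode="bilinear",
                        align_corners=False),
            nn.Conv2d(ch, 2, 1))          # 2 classes: GGO, consolidation

    def forward(self, x_orig, x_enh):
        a, b = self.stem(x_orig), self.stem(x_enh)
        for la, lb in zip(self.enc_a, self.enc_b):
            a, b = la(a), lb(b)           # the real FEFP also cross-fuses a <-> b
        return self.decoder(self.fuse(a + b))

# e.g., mask = DBFNetSketch()(torch.randn(1, 1, 384, 384),
#                             torch.randn(1, 1, 384, 384))
```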

3.2 Lightweight double convolution (LDC)

A lightweight module reduces the time and space complexity of feature extraction (Li et al. 2019); by reducing the number of parameters, it improves the rate of feature extraction and transfer. We extract features from the input image through a lightweight double convolution module, whose structure is shown in Fig. 2.

Fig. 2
figure 2

The structure of lightweight double convolution module (LDC), which is used as the down-sampling operation module

Specifically, we take the convolution operation, batch normalization, and activation function as a basic processing unit and integrate a max-pooling operation in the middle layer to achieve down-sampling. From a global perspective, the residual structure we propose preserves, during feature extraction and transmission, as much as possible of the boundary information that ordinary down-sampling operations lose, which is critical for accurate medical image segmentation.

The final output feature F can be formulated as follows:

$$\begin{aligned} F&= f_2\left( f_1\left( x\right) \right) + f_1\left[ f_2\left( f_1\left( x\right) \right) \right] + f_2\left( f_1\left( x\right) \right) \\&= 2f_2\left( f_1\left( x\right) \right) + f_1\left[ f_2\left( f_1\left( x\right) \right) \right] \end{aligned}$$
(1)

In this formula, \({f_1}\) and \({f_2}\) can be expressed as:

$$\begin{aligned} f_1 = \mathrm{ReLU}\left[ \mathrm{BN}\left( \mathrm{Conv}_3\left( x\right) \right) \right] + \mathrm{ReLU}\left[ \mathrm{BN}\left( \mathrm{Conv}_1\left( x\right) \right) \right] \end{aligned}$$
(2)
$$\begin{aligned} f_2 = \mathrm{MP}\left[ 2\,\mathrm{ReLU}\left( \mathrm{BN}\left( \mathrm{Conv}_3\left( x\right) \right) \right) \right] \end{aligned}$$
(3)

where x is the input of the LDC module, \(\mathrm{Conv}_3(\cdot )\) denotes the 3\(\times\)3 convolutional layer, \(\mathrm{Conv}_1(\cdot )\) denotes the 1\(\times\)1 convolutional layer, + denotes element-wise addition, and \(\mathrm{MP}(\cdot )\) denotes max-pooling.
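To make Eqs. (1)-(3) concrete, here is a minimal PyTorch sketch of the LDC block. The channel widths and the 2\(\times\)2 pooling window are our assumptions; the 3\(\times\)3/1\(\times\)1 conv-BN-ReLU units, the max-pooling step, and the combination of Eq. (1) come from the text.

```python
import torch
import torch.nn as nn

class F1(nn.Module):
    """f1 in Eq. (2): ReLU(BN(Conv3x3(x))) + ReLU(BN(Conv1x1(x)))."""
    def __init__(self, cin, cout):
        super().__init__()
        self.b3 = nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.b1 = nn.Sequential(nn.Conv2d(cin, cout, 1),
                                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.b3(x) + self.b1(x)

class F2(nn.Module):
    """f2 in Eq. (3): max pooling of a scaled conv-BN-ReLU unit."""
    def __init__(self, cin, cout):
        super().__init__()
        self.b3 = nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.mp = nn.MaxPool2d(2)  # 2x2 window is an assumption
    def forward(self, x):
        return self.mp(2 * self.b3(x))

class LDC(nn.Module):
    """Sketch of the LDC block, Eq. (1): F = 2*f2(f1(x)) + f1(f2(f1(x)))."""
    def __init__(self, cin, cout):
        super().__init__()
        self.f1_in = F1(cin, cout)    # first f1 on the raw input
        self.f2 = F2(cout, cout)      # down-sampling unit
        self.f1_out = F1(cout, cout)  # second f1 on the pooled features
    def forward(self, x):
        y = self.f2(self.f1_in(x))    # f2(f1(x)), at half resolution
        return 2 * y + self.f1_out(y) # residual-style combination, Eq. (1)
```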

3.3 Fusiform equilibrium fusion pyramid (FEFP)

Different from ordinary pyramidal convolution (Duta et al. 2020) (PyConv), our fusiform equilibrium fusion pyramid (FEFP) module not only contains different levels of kernels with different sizes and depths, but also balances the features extracted from large- and small-scale kernels in symmetric form. Therefore, in addition to expanding the convolution receptive field, FEFP can also capture richer multiscale details than PyConv.

Fig. 3
figure 3

The structure of fusiform equilibrium fusion pyramid convolution (FEFP), which is utilized to achieve balanced feature fusion

Our FEFP is shown in Fig. 3; it is composed of two symmetrical pyramids spliced together. To be able to use kernels of different depths at each level of the FEFP, we divide the input feature maps into groups by grouped convolution and apply the kernels independently to each group. The input feature map \(FM_i\) is divided into two feature blocks \(FM_{i1}\) and \(FM_{i2}\). Each level \(\left\{ 1,2,3,4\right\}\) of the FEFP convolution corresponds to a kernel of a different spatial size, \(\left\{ K_1^2,K_2^2,K_3^2,K_4^2\right\}\) and \(\left\{ K_{1^*}^2,K_{2^*}^2,K_{3^*}^2,K_{4^*}^2\right\}\). The kernel depths obtained by grouping in Fig. 3 are
$$\left\{ FM_{i1},\ \frac{FM_{i1}}{K_2^2/K_1^2},\ \frac{FM_{i1}}{K_3^2/K_1^2},\ \frac{FM_{i1}}{K_4^2/K_1^2}\right\} \quad \text {and}\quad \left\{ FM_{i2},\ \frac{FM_{i2}}{K_{2^*}^2/K_{1^*}^2},\ \frac{FM_{i2}}{K_{3^*}^2/K_{1^*}^2},\ \frac{FM_{i2}}{K_{4^*}^2/K_{1^*}^2}\right\} .$$

The output feature maps of the two pyramids are \(\left\{ FM_{o11},FM_{o12},FM_{o13},FM_{o14}\right\}\) and \(\left\{ FM_{o21},FM_{o22},FM_{o23},FM_{o24}\right\}\). The kernel outputs at the top of one pyramid and at the bottom of the other are combined pairwise to achieve feature fusion, and the fused maps at each level are then connected along the channel dimension to give the output feature map:
$$FM_o = \mathrm {Concat}\left( FM_{o14}+FM_{o21},\ FM_{o13}+FM_{o22},\ FM_{o12}+FM_{o23},\ FM_{o11}+FM_{o24}\right) .$$

The FEFP kernel layout is a symmetric pyramid: as the kernel size increases from level 1 to level n, the kernel depth decreases, and vice versa. Kernels of different sizes exchange information to maximize feature complementarity. Through the interconnection of receptive fields of different sizes, features from kernels of different scales are fused, and the recognition of infected areas in CT images is improved.
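The following PyTorch sketch gives one plausible reading of FEFP. The concrete kernel sizes (3/5/7/9), group counts, channel split, and output width are our assumptions; what is taken from the text is the four levels per pyramid, grouped convolutions of varying depth, the pairing of the top of one pyramid with the bottom of the other by addition, and the channel-wise concatenation.

```python
import torch
import torch.nn as nn

class FEFP(nn.Module):
    """Sketch of the fusiform equilibrium fusion pyramid (FEFP)."""
    def __init__(self, ch=64, sizes=(3, 5, 7, 9), groups=(1, 2, 4, 8)):
        super().__init__()
        half = ch // 2                  # input split into FM_i1 and FM_i2
        per_level = half // len(sizes)  # assumed per-level output width
        def level(k, g):                # one pyramid level: grouped conv
            return nn.Sequential(
                nn.Conv2d(half, per_level, k, padding=k // 2, groups=g),
                nn.BatchNorm2d(per_level), nn.ReLU(inplace=True))
        self.pyr1 = nn.ModuleList(level(k, g) for k, g in zip(sizes, groups))
        self.pyr2 = nn.ModuleList(level(k, g) for k, g in zip(sizes, groups))

    def forward(self, fm):
        fm1, fm2 = torch.chunk(fm, 2, dim=1)
        o1 = [lvl(fm1) for lvl in self.pyr1]  # FM_o11 .. FM_o14
        o2 = [lvl(fm2) for lvl in self.pyr2]  # FM_o21 .. FM_o24
        # pair the top of one pyramid with the bottom of the other, then
        # concatenate the fused levels along the channel dimension (FM_o)
        fused = [a + b for a, b in zip(o1, reversed(o2))]
        return torch.cat(fused, dim=1)

# e.g., out = FEFP(64)(torch.randn(1, 64, 96, 96))  # 32 output channels here
```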

3.4 Semi-supervised learning strategy

Manual labeling of the infected regions in lung CT images is time-consuming and labor-intensive, resulting in very little labeled data. To augment the dataset, we adopt the combination of image enhancement and semi-supervised learning strategy to improve DBF-Net.

A small quantity of labeled data is used to help the unlabeled data generate pseudo-labels that augment the training set. First, we pretrain the DBF-Net model on the training set composed of labeled images and their enhanced versions. Then, N unlabeled images are randomly fed in for prediction to obtain N corresponding pseudo-labels. The training set is mixed with the N pseudo-labeled images and fed to the network for training again, so that the weights are continuously updated. Repeating the above operations, we periodically feed the training set of labeled images and N unlabeled images and complete network training after 200 epochs, thereby generating the desired high-quality pseudo-labels.

Specifically, the dataset we use contains 1600 unlabeled images and 100 labeled images. During the experiments, 60 labeled images and their corresponding enhanced images were used as the training set, 10 labeled images as the validation set, and 30 labeled images as the test set. Then, 8 unlabeled images were randomly fed in for prediction each time, i.e., N = 8. The semi-supervised learning framework is shown in Fig. 4.

Fig. 4
figure 4

Semi-supervised DBF-Net architecture diagram, where blue refers to labeled images and enhanced images, yellow refers to unlabeled images, and red refers to our proposed DBF-Net
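In outline, the loop of this section can be sketched as follows. Here `pretrain`, `predict`, and `train_one_epoch` are hypothetical helpers supplied by the caller, and the loop granularity (one prediction round per epoch) is our assumption.

```python
import random

def semi_supervised_training(model, labeled, unlabeled, pretrain, predict,
                             train_one_epoch, n_unlabeled=8, epochs=200):
    """Sketch of the Sect. 3.4 loop. `labeled` holds (image, label) pairs,
    `unlabeled` holds raw images; the three callables are hypothetical."""
    pretrain(model, labeled)                    # supervised pre-training
    for _ in range(epochs):
        batch_u = random.sample(unlabeled, n_unlabeled)
        pseudo = [(u, predict(model, u)) for u in batch_u]  # pseudo-labels
        train_one_epoch(model, labeled + pseudo)            # mixed training
    return model
```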

3.5 Image enhancement for medical images

The choice of image enhancement method can affect model performance. In general, the fundamental purpose of such processing is to help the model learn the critical information in the image. Given our needs and the characteristics of medical image processing, we apply four transformations to lung CT images (Zhou et al. 2019), as shown in Fig. 5.

Fig. 5
figure 5

Methods of lung CT image enhancement. a Original image. b Nonlinear transformation. c Local pixel change. d Internal pixel change. e External pixel change

(1) Nonlinear transformation: The pixel value in a CT image is the value corresponding to the X-ray attenuation coefficient of each tissue, also known as the Hounsfield unit (HU) value; different HU values correspond to different tissues. A nonlinear function is applied to the HU values of the input image, and global contrast enhancement is achieved by adjusting the transformation parameters so that different tissues can be distinguished.

(2) Local pixel change: In CT image A, a small cube c is chosen at random. The pixel positions in cube c are randomly scrambled to obtain \(c'\), and c is then replaced by \(c'\). This process is repeated several times to obtain the transformed CT image Ã. Provided the overall image shape does not change greatly, the model can learn local structure and texture features (see the sketch after Algorithm 1).

(3) Internal pixel change: In CT image A, two cubes \(c_1\) and \(c_2\) are randomly selected such that \(c_1 \cap c_2 = \emptyset\). The pixel values of the two cubes are exchanged, i.e., \(c'_1 = c_2\) and \(c'_2 = c_1\). This process is repeated several times to obtain the transformed image Ã.

(4) External pixel change: Irregular masking of the outer edge of the original image prompts the model to analyze internal structural information to infer the external structure and extract more critical visual features. The overall algorithm flow is shown in Algorithm 1.

Algorithm 1
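As an illustration, here is a minimal NumPy sketch of transformations (2) and (3). The 2D per-slice formulation, the patch size, and the repeat counts are our assumptions.

```python
import numpy as np

def local_pixel_change(img, cube=8, repeats=20, rng=None):
    """Transformation (2): shuffle pixels inside small random patches."""
    rng = rng or np.random.default_rng()
    out = img.copy()
    h, w = out.shape
    for _ in range(repeats):
        y, x = rng.integers(0, h - cube), rng.integers(0, w - cube)
        patch = out[y:y + cube, x:x + cube]
        out[y:y + cube, x:x + cube] = \
            rng.permutation(patch.ravel()).reshape(patch.shape)
    return out

def internal_pixel_change(img, cube=8, repeats=10, rng=None):
    """Transformation (3): swap the contents of two disjoint patches."""
    rng = rng or np.random.default_rng()
    out = img.copy()
    h, w = out.shape
    for _ in range(repeats):
        y1, x1 = rng.integers(0, h - cube), rng.integers(0, w - cube)
        y2, x2 = rng.integers(0, h - cube), rng.integers(0, w - cube)
        # re-draw until the two patches are disjoint (c1 ∩ c2 = ∅)
        while abs(y1 - y2) < cube and abs(x1 - x2) < cube:
            y2, x2 = rng.integers(0, h - cube), rng.integers(0, w - cube)
        a = out[y1:y1 + cube, x1:x1 + cube].copy()
        out[y1:y1 + cube, x1:x1 + cube] = out[y2:y2 + cube, x2:x2 + cube]
        out[y2:y2 + cube, x2:x2 + cube] = a
    return out
```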

4 Experiments

4.1 Lung datasets

In this paper, the COVID-SemiSeg dataset (Fan et al. 2020) and the COVID-19 CT segmentation dataset (Milletari et al. 2016) are used for the experiments and the comparison with mainstream approaches.

(1) The COVID-SemiSeg dataset targets semi-supervised COVID-19 infection segmentation. It is built from 3D CT images of more than 20 COVID-19 patients and is extended with a large number of unlabeled CT images.

(2) The COVID-19 CT segmentation dataset consists of 100 labeled axial CT images from over 40 COVID-19 patients. The CT images were collected by the Italian Society of Medical and Interventional Radiology, and radiologists segmented them using three labels, ground-glass opacity (GGO), consolidation, and pleural effusion, to delineate regions of lung infection.

We strictly split the COVID-19 CT segmentation dataset containing 100 labeled axial CT images. Specifically, 60 are used for training, 10 for validation, and 30 for testing. The COVID-SemiSeg dataset contains 1600 unlabeled CT images. We perform training on this dataset and the training set in the COVID-19 CT segmentation dataset following the semi-supervised learning strategy in Sect. 3.4.

4.2 Experimental settings

All experiments on the proposed DBF-Net are conducted on an Intel Core i7-11700K CPU with an NVIDIA RTX 3080 Ti GPU. The development environment is based on the Ubuntu 20.04 operating system with CUDA 11.4 and PyTorch 1.9, and the programming language is Python 3.8.

Since resizing images affects image quality, we first resample the original image slices and then uniformly crop all slices to 384 \(\times\) 384. Training uses the Adam optimizer with a momentum of 0.9 and a weight decay of 0.0005. The initial learning rate is 0.01, the batch size is set to 4, and a total of 200 epochs are trained.
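Under these settings, the optimizer can be configured as below. Interpreting the stated momentum of 0.9 as Adam's first-moment coefficient \(\beta _1\) is our assumption; the model here is a stand-in module.

```python
import torch

model = torch.nn.Conv2d(1, 2, 1)  # stand-in for DBF-Net; any nn.Module works
optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                             betas=(0.9, 0.999), weight_decay=0.0005)
```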

For image segmentation, the cross-entropy loss function is widely used as the main function. To solve the problem of CT image category imbalance and difficult-to-classify samples, this paper trains the DBF-Net model by combining the Dice loss function and the Focal loss function. Then, the final loss function is:

$$\begin{aligned} L&= \gamma L_{Dice} + \left( 1-\gamma \right) L_{Focal} \\&= \gamma \left( C - \sum \limits _{c=0}^{C-1} \frac{TP_{p}\left( c\right) }{TP_{p}\left( c\right) + \alpha FN_{p}\left( c\right) + \beta FP_{p}\left( c\right) }\right) \\&\quad - \left( 1-\gamma \right) \frac{1}{N}\sum \limits _{c=0}^{C-1} \sum \limits _{n=1}^{N} g_{n}\left( c\right) \left( 1-P_{n}\left( c\right) \right) ^{2}\log \left( P_{n}\left( c\right) \right) \end{aligned}$$
(4)

In the equation, c is a specific class; \(TP_{p}\left( c\right)\), \(FN_{p}\left( c\right)\), and \(FP_{p}\left( c\right)\) are the true-positive, false-negative, and false-positive counts for that class; \(P_{n}\left( c\right)\) is the predicted probability that pixel n belongs to class c; \(g_{n}\left( c\right)\) is the ground-truth indicator that pixel n is of class c; C is the total number of classes; N is the total number of pixels; \(\alpha\) and \(\beta\) are the penalty weights for false negatives and false positives, respectively, both set to 0.5; and \(\gamma\) and \(1-\gamma\) are the weights of the Dice loss and the Focal loss, with \(\gamma\) set to 0.3.
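A soft (probabilistic) PyTorch rendering of Eq. (4) is sketched below. Treating softmax probabilities as soft TP/FN/FP counts is a common implementation choice, not something the paper specifies, and the smoothing term eps is ours.

```python
import torch
import torch.nn.functional as F

def dbf_loss(logits, target, alpha=0.5, beta=0.5, gamma=0.3, eps=1e-6):
    """Sketch of Eq. (4): gamma * Dice + (1 - gamma) * Focal.
    logits: (B, C, H, W) raw scores; target: (B, H, W) integer class map."""
    num_classes = logits.shape[1]
    p = logits.softmax(dim=1)                                       # P_n(c)
    g = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()  # g_n(c)

    tp = (p * g).sum(dim=(0, 2, 3))        # soft TP_p(c), per class
    fn = ((1 - p) * g).sum(dim=(0, 2, 3))  # soft FN_p(c)
    fp = (p * (1 - g)).sum(dim=(0, 2, 3))  # soft FP_p(c)
    dice = num_classes - (tp / (tp + alpha * fn + beta * fp + eps)).sum()

    # focal term: -(1/N) sum_c sum_n g_n(c) (1 - P_n(c))^2 log(P_n(c))
    focal = -(g * (1 - p) ** 2 * torch.log(p + eps)).sum(dim=1).mean()
    return gamma * dice + (1 - gamma) * focal
```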

4.3 Evaluation metrics

To evaluate the performance of our proposed method on the lung lesion segmentation task, we use several evaluation metrics. Sensitivity, Specificity, Dice, and Precision are defined as follows:

$$\begin{aligned} Sensitivity= & {} \frac{TP}{TP+FN} \end{aligned}$$
(5)
$$\begin{aligned} Specificity= & {} \frac{TN}{TN+FP} \end{aligned}$$
(6)
$$\begin{aligned} Dice= & {} \frac{2TP}{2TP+FP+FN} \end{aligned}$$
(7)
$$\begin{aligned} Precision= & {} \frac{TP}{TP+FP} \end{aligned}$$
(8)

where TP refers to the number of infected and accurately predicted regions; TN refers to the number of uninfected and accurately predicted regions; FP refers to the number of uninfected regions that are wrongly predicted to be infected; FN refers to the number of infected regions that are wrongly predicted to be uninfected.

Note that the increment of an evaluation metric between two experimental subjects is computed by the following expression:

$$\begin{aligned} I=\frac{ES_{a} -ES_{b}}{ES_{b}} \times 100\% \end{aligned}$$
(9)

where I denotes the incremental ratio of an evaluation metric, and \(ES_{a}\) and \(ES_{b}\) denote the metric values of the two experimental subjects (e.g., the proposed model and a baseline), respectively.
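Since these formulas are simple pixel counts, they reduce to a few lines of NumPy; the sketch below assumes boolean prediction and ground-truth masks and also covers Eq. (9).

```python
import numpy as np

def seg_metrics(pred, gt):
    """Pixel-wise metrics of Eqs. (5)-(8); pred and gt are boolean masks."""
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return {"Sensitivity": tp / (tp + fn),
            "Specificity": tn / (tn + fp),
            "Dice": 2 * tp / (2 * tp + fp + fn),
            "Precision": tp / (tp + fp)}

def increment(es_a, es_b):
    """Eq. (9): relative improvement of metric value es_a over es_b, in %."""
    return (es_a - es_b) / es_b * 100.0
```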

4.4 Comparison of segmentation performance for different algorithms

To verify the segmentation performance of the proposed DBF-Net, we used it as the lung CT image segmentation model and compared it with existing classical algorithms. The segmentation results are shown in Fig. 6. The first row shows the original CT images; the second row shows the manual annotations by radiologists, which serve as the evaluation standard; the third row shows the segmentation results of FCN-8s (Long et al. 2015); the fourth row shows U-Net++ (Zhou et al. 2019); the fifth row shows the combination of a ResNet50 encoder (He et al. 2016) with the U-Net decoder (ResUNet); and the sixth row shows our proposed DBF-Net.

Fig. 6
figure 6

From top to bottom, there are several exemplar results in 2D views (a) obtained by the corresponding ground truth (b) on the COVID-19 CT segmentation dataset, FCN-8s (c), U-Net++ (d), ResUNet (e), and our DBF-Net Model (f), where the red and green labels indicate the GGO and consolidation, respectively

The experimental results show that, compared with U-Net++ and the other algorithms, the proposed DBF-Net gives the best segmentation performance and high image quality. For the same CT images, FCN-8s performs worst, with relatively rough boundary segmentation, while U-Net++ and ResUNet over-segment the images to different degrees. In contrast, our proposed algorithm segments the boundary contours of lung lesion regions markedly better, with fewer incorrectly segmented regions, and its results are close to the images manually labeled by the radiologists.

Although subjective evaluation is simple and direct, it is susceptible to subjective factors, so quantitative evaluation of the segmentation results is still necessary. The quantitative results of the different segmentation algorithms are shown in Table 1. DBF-Net outperforms the other models in all four evaluation metrics: its sensitivity reaches 70.6%, specificity 92.8%, Dice coefficient 68.7%, and precision 67.5%. In summary, under the semi-supervised learning strategy, the proposed DBF-Net model greatly improves CT image segmentation performance.

Table 1 Comparison of quantitative results of different CT segmentation algorithms

4.5 Performance comparison of different semi-supervised models

In this section, we compare the performance of DBF-Net with recent semi-supervised models, namely COPLE-Net (Wang et al. 2020) and Semi-Inf-Net (Fan et al. 2020). Table 2 presents the segmentation performance of the competing approaches across different measures, with all methods implemented under the same experimental settings. COPLE-Net performed relatively poorly on the different measures. Compared with COPLE-Net, Semi-Inf-Net showed better performance, but its multistage training limited the realization of optimal performance. Relative to COPLE-Net, DBF-Net improved sensitivity by 4.1%, specificity by 8.9%, the Dice coefficient by 5.0%, and precision by 5.6%, attaining robust segmentation performance with improvements over the competing models. Fig. 7 shows a graphical comparison of the segmentation results on different real-world COVID-19 axial slices.

Table 2 Comparison between DBF-Net and other semi-supervised models
Fig. 7
figure 7

Visual comparison of segmentation performance for different semi-supervised models

4.6 Ablation experiments

In this section, for a deeper analysis of DBF-Net's performance, we performed ablation experiments to understand the model behavior under different settings. In these experiments, we select U-Net as the base backbone for our DBF-Net.

(1) Impact of different modules on model performance

Table 3 lists the impact of different modules on model performance. Based on U-Net, our FEFP module and image enhancement are added one after another, demonstrating the effectiveness of the FEFP module's multiscale feature extraction and of the image enhancement methods for lung image segmentation. Then, combined with the semi-supervised learning strategy, we replace the down-sampling structure of U-Net with the LDC and FEFP modules to form DBF-Net, while adding the image enhancement methods to improve the network's results. Finally, pseudo-labels are generated, and DBF-Net is used to achieve lung lesion segmentation.

Table 3 Performance of the network with different blocks

As seen in the first and second rows of the table, the FEFP module improves performance by up to 15.2% over U-Net, making the model more accurate in segmenting the infected region. The second and fourth rows show that the LDC module yields a maximum performance improvement of 10.4% over the original down-sampling module of U-Net. The fourth and fifth rows show that the image enhancement approach for medical images yields a maximum performance improvement of 3.2% over DBF-Net alone, effectively improving the segmentation results of the proposed model.

To further visualize segmentation performance, we plot the loss curves when different module combinations are trained, as shown in Fig. 8. During training, as modules are added, our network converges progressively faster, and its accuracy after convergence is relatively high. Therefore, our proposed DBF-Net is easier to train and better localizes the lung lesion region.

Fig. 8
figure 8

Training progress of different module combinations

(2) Impact of image enhancement on model performance

To reflect the important role of image enhancement in our model, we compare two input combinations, two original images versus an original image paired with its enhanced version, each fed into DBF-Net. The comparison results are shown in Table 4. When the original image is transformed by the image enhancement dedicated to medical image segmentation and then fed to DBF-Net, segmentation improves noticeably (sensitivity +3.5%; specificity +2.8%; Dice +3.5%; precision +2.4%). Owing to the specificity of the image enhancement method, the structure and texture features of lung CT images are well highlighted during segmentation, and the definition and accuracy of the pseudo-labels are improved. The corresponding visualization is shown in Fig. 9.

Table 4 Performance comparison of different image combination methods
Fig. 9
figure 9

Visual comparison of segmentation effects for different image combinations. a Original image. b Ground truth. c Segmentation effect of two original image combinations. d Segmentation effect of a combination of the original image and enhanced image

(3) Impact of different pyramids on model performance

The main purpose of this experiment is to investigate the impact of different pyramids on CT image segmentation performance. We conducted a comparison between classical PyConv (Duta et al. 2020) and the proposed FEFP module on the segmentation task; the experimental results are shown in Table 5. Every evaluation metric of the FEFP module is better than that of pyramid convolution. Given the complex texture of CT images and their susceptibility to noise, a single PyConv used for down-sampling extracts insufficient features from the infected region, and the model struggles to learn the edge features, which reduces training accuracy. Our FEFP communicates pixel information by fusing features from different feature layers; compared with PyConv, its performance improves by at least 4.9% in all metrics.

Table 5 Comparison between FEFP and PyConv
(4) Impact of semi-supervised learning strategy on model performance

The model combined with the semi-supervised learning strategy uses unlabeled images to produce pseudo-labels, which helps significantly reduce manual labeling costs. To investigate the impact of the semi-supervised learning strategy on model performance, under the assumption of insufficient manual labeling, we fed 60 labeled images and a mixed set (labeled plus unlabeled images) into the model for fully supervised and semi-supervised training, respectively. In both cases, the original images are augmented by the image enhancement method to keep the experimental conditions consistent. The experimental results are shown in Fig. 10.

Fig. 10
figure 10

Visual comparison of the impact of semi-supervised learning strategy on the model. a Original image. b Ground truth. c Pseudo-labels generated by training with 60 labeled images. d Pseudo-labels generated by training with mixed images that include 60 labeled images and 1600 unlabeled images

With only a small amount of manually labeled data, using labeled data alone leads to overfitting, so incomplete segmentation occurs when predicting on the test dataset. In contrast, our semi-supervised learning strategy generates pseudo-labels, which compensates for the insufficient data and avoids the overfitting caused by training on labeled data alone. Therefore, within the same number of training epochs, DBF-Net combined with the semi-supervised learning strategy obtains more complete segmentation and higher-quality pseudo-labels.

(5) Impact of different training scales on model performance

In this experiment, we train DBF-Net at different training scales to compare the quality of the resulting pseudo-labels. The qualitative results are shown in Fig. 11. Initially, we trained on 32 unlabeled CT images together with the 60 labeled images and found that little edge information of the infected area was extracted and the results were blurred. We then tested 16 and 8 unlabeled images with the labeled images. The experiments showed that the combination of 8 unlabeled and 60 labeled images trains better than 16 unlabeled and 60 labeled images: its boundaries are more accurate, and the segmentation is significantly better. The quantitative results are shown in Table 6; across the metrics, the combination of 8 unlabeled and 60 labeled images performs best and can accurately segment the GGO and consolidation infections.

Fig. 11
figure 11

Comparison of visual effects of segmentation results of different training scales

Table 6 Comparison of segmentation performance of different training scales

5 Conclusion

In this paper, we propose a novel semi-supervised dual-task balanced fusion network (DBF-Net), which can help doctors identify infected regions in CT images of COVID-19 patients and reduce the variability of manual diagnosis. The model utilizes a lightweight double convolution module and a fusiform equilibrium fusion pyramid convolution for down-sampling to maximize the localization of infected regions, and it combines a semi-supervised learning strategy to alleviate the shortage of labeled data. Additionally, we adopt an image enhancement method designed specifically for medical images to extract more critical visual features and obtain richer pixel information. A series of experiments on the test set shows that the DBF-Net model is superior to other segmentation models in three evaluation metrics: Sensitivity, Specificity, and Precision. The proposed algorithm is highly competitive in segmenting COVID-19 lung-infected regions. In future work, we will continue to improve the DBF-Net segmentation model, for example by combining segmentation with vision transformers, to address the problems of limited data and inaccurate lesion localization. This can not only assist doctors in clinical diagnosis but also has important implications for medical research in the big data era.