1. Introduction
Brain tumors pose a serious threat to human life. Currently, there are more than 100 types of brain tumors affecting humans [1]. The treatment methods for such diseases include surgery, chemotherapy, and radiotherapy. With the continuous development of artificial intelligence, tumor diagnosis and surgical pre-assessment based on artificial intelligence are playing an increasingly important role. Fine segmentation of brain tumors using techniques such as voxel analysis can help in studying their progression and assist in preoperative planning [2]. Segmentation of brain tumors from brain images is currently at the forefront of research [3,4,5].
Magnetic resonance imaging (MRI) can provide images of different contrasts (i.e., modalities) and is a non-invasive imaging technique with high soft-tissue contrast [6]. A complete MRI study includes four modalities: T1-weighted (T1), T1-weighted with contrast enhancement (T1-ce), T2-weighted (T2), and T2 fluid-attenuated inversion recovery (Flair). Each of the four modalities captures specific features of the underlying anatomical information, and combining multiple modalities provides highly comprehensive information for analyzing different subregions of organs and lesions. T2 and Flair images are suitable for detecting the edema around the lesion, while T1 and T1-ce images are suitable for detecting the lesion core. Generally speaking, the lesion area and normal tissue differ clearly in gray level in Flair images, while the boundary of the lesion area is more distinct in T1-ce images [3,4,5]. MRI can produce high-quality brain images without tissue damage or skull artifacts and provides comprehensive information for the diagnosis and treatment of brain tumors, including the shape, size, and location of organs and lesions. It therefore plays a key role in diagnosis and is the main imaging modality for brain tumor diagnosis and treatment.
For computer-aided diagnosis with medical images, some studies focus on image fusion. For example, CSID [7] proposes an algorithm for fusing CT and MRI images to enhance the detail available for clinical diagnosis. However, the most active research direction at present is image segmentation, for example, evaluating brain tumors using a modified moth-flame algorithm with Kapur's thresholding [8], or segmenting brain tumor images with deep learning methods.
In recent years, deep convolutional neural networks, such as Alex-Net [9], VGG-Net [10], ResNet [11], and Google-Net [12], have been successfully applied to many computer vision tasks and have maintained state-of-the-art (SOTA) performance. Owing to their powerful feature extraction capability, deep convolutional neural networks were soon applied to medical image processing and analysis [13,14,15]. In brain tumor segmentation, a method using fuzzy edge detection and U-NET CNN classification [16] can exceed the performance of traditional machine learning methods, and convolutional-neural-network-based segmentation methods have achieved state-of-the-art performance in various benchmarks [17,18,19,20]. However, the convolutional kernel of a CNN is small and fixed in size; although dilated convolution [21,22] enlarges the receptive field and deformable convolution [23] allows some offset in the kernel sampling grid, neither extends adaptively and flexibly to the entire feature map. This limits their ability to learn global and long-range features, which are crucial for the accurate segmentation of tumors of various shapes and sizes.
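To make the receptive-field argument concrete, the following PyTorch sketch (an illustration added here, not code from SwinBTS) shows that a 3 × 3 × 3 kernel with dilation 2 covers a 5 × 5 × 5 neighborhood, yet its sampling grid remains fixed and cannot adapt to the whole feature map:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32, 32)  # toy 3D feature map (N, C, D, H, W)

# A standard 3x3x3 convolution sees a 3-voxel neighborhood at each position.
conv = nn.Conv3d(1, 1, kernel_size=3, padding=1)

# With dilation=2 the same 3x3x3 kernel covers a 5x5x5 neighborhood,
# but the sampling grid is still fixed and local.
dilated = nn.Conv3d(1, 1, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape, dilated(x).shape)  # both preserve the 32^3 spatial size
```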
Transformer models have excellent performance in natural language processing tasks [24,25]. In the last two years, ViT [26], which introduced the transformer to computer vision tasks, has also achieved state-of-the-art performance. ViT relies on the global and long-range modeling capability of the transformer and can achieve better performance than CNNs, but the huge number of parameters in this class of models, while manageable in 2D segmentation tasks, runs into GPU memory limitations in 3D segmentation tasks. For example, the U-Netr method [27] designed a network model using ViT as the encoder; although it performed well, the number of model parameters exceeded 100 M, making training time consuming and labor intensive. The Swin Transformer [28] proposed a hierarchical visual transformer using shifted windows, which restricts self-attention to local windows. SwinUnet [29], built on the Swin Transformer, is also a powerful method, and the reduced number of parameters makes it easier to introduce the transformer into 3D segmentation tasks. In this work, we propose a novel method called SwinBTS, which uses an encoder–decoder architecture: 3D Swin Transformer modules serve as the encoder to extract contextual information and are connected to a decoder of the same resolution by skip connections, and the decoder also uses 3D Swin Transformer modules. An NFCE (Neighbor-Feature Connection Enhancement) module is placed between the encoder and the strided convolutional downsampling to enhance the feature information passed between the transformer structure and the downsampling, and an NFCE module is likewise added between the upsampling and the decoder. Experiments showed that this design alone is still insufficient for extracting detailed features. Effective means of extracting local detail features include channel attention, spatial attention, and convolution. After comparative experiments, and inspired by the ELSA method [30], we propose an ETrans (Enhanced Transformer) module in the bottleneck at the bottom of the model; it combines the Hadamard product with a transformer-style structure that is implemented mainly with convolution operations to extract local information.
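As a rough illustration of this idea (our own hedged sketch, not the actual ETrans implementation, whose details are given in Section 3), a transformer-style block can be realized with convolutions, a Hadamard (element-wise) product, and an MLP as follows:

```python
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    """Illustrative transformer-style block built from convolutions.

    This sketch mimics the described idea (convolution-generated gating applied
    via a Hadamard product, followed by an MLP sub-block), but it is NOT the
    authors' exact ETrans design.
    """
    def __init__(self, channels: int, hidden_ratio: int = 2):
        super().__init__()
        self.norm1 = nn.InstanceNorm3d(channels)
        self.gate = nn.Sequential(nn.Conv3d(channels, channels, 3, padding=1),
                                  nn.Sigmoid())
        self.value = nn.Conv3d(channels, channels, 3, padding=1)
        self.norm2 = nn.InstanceNorm3d(channels)
        self.mlp = nn.Sequential(nn.Conv3d(channels, hidden_ratio * channels, 1),
                                 nn.GELU(),
                                 nn.Conv3d(hidden_ratio * channels, channels, 1))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.gate(h) * self.value(h)   # Hadamard (element-wise) product
        x = x + self.mlp(self.norm2(x))        # transformer-style MLP sub-block
        return x

# Usage example on a toy bottleneck feature map
block = ConvAttentionBlock(48)
y = block(torch.randn(1, 48, 8, 8, 8))
```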
Experiments on the BraTS 2019, BraTS 2020, and BraTS 2021 datasets demonstrate the effectiveness of our model.
The main contributions of this work are as follows:
We propose a new transformer-based method for 3D medical image segmentation.
In this method, we design a combination of the Transformer structure and CNN to achieve better performance, and we also design an ETrans module to enhance detail feature extraction.
The proposed model achieves excellent segmentation results on the BraTS 2019, BraTS 2020, and BraTS 2021 datasets.
The structure of this paper is as follows. Section 2 reviews 3D medical image segmentation methods based on CNNs and on transformers. Section 3 introduces the overall network structure and its components. Section 4 describes the experimental datasets, environment, and hyperparameters, and compares and analyzes the experimental results. Finally, Section 5 presents a summary.
4. Experiments
4.1. Datasets
The experiments were conducted mainly on three public multimodal brain tumor datasets: BraTS 2019, BraTS 2020, and BraTS 2021 [3,4,5,46,47]. All three datasets are competition data provided by the BraTS challenge, which aims to evaluate state-of-the-art methods for the semantic segmentation of brain tumors by providing 3D MRI datasets with ground truth annotated by physicians. BraTS 2019 contains 335 cases of brain images for training, with each case consisting of four brain MRI scans, namely T1-weighted (T1), T1-enhanced contrast (T1-ce), T2-weighted (T2), and T2 fluid-attenuated inversion recovery (Flair). The volume of each modality is 240 × 240 × 155 voxels, and all volumes have been aligned into the same space. The labels contain four categories: background (Label 0), necrotic and non-enhancing tumor (Label 1), peritumoral edema (Label 2), and GD-enhancing tumor (Label 4), which are used for segmenting the enhancing tumor region (ET, Label 4), the tumor core region (TC, Labels 1 and 4), and the whole tumor region (WT, Labels 1, 2, and 4). The BraTS 2020 dataset contains 369 cases of training data and 125 cases of validation data (unlabeled, for online validation), and the BraTS 2021 dataset contains 1251 cases of training data and 219 cases of validation data (unlabeled, for online validation). Apart from the number of cases, the settings of BraTS 2020 and BraTS 2021 are the same as those of BraTS 2019.
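As an illustration of this label convention, the following NumPy sketch (our own helper, not part of the official BraTS tooling) converts a label volume into the three evaluated region masks:

```python
import numpy as np

def brats_regions(label_volume: np.ndarray) -> dict:
    """Convert a BraTS label map (values 0, 1, 2, 4) into the evaluated regions.

    WT = labels {1, 2, 4}, TC = labels {1, 4}, ET = label {4}, as described above.
    """
    wt = np.isin(label_volume, [1, 2, 4])
    tc = np.isin(label_volume, [1, 4])
    et = label_volume == 4
    return {"WT": wt, "TC": tc, "ET": et}
```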
4.2. Implementation Details
The model parameters are initialized with weights pre-trained by Swin-T on ImageNet-1K. For training, we use the Adam optimizer with an initial learning rate of 1 × 10−4 and a cosine decay strategy. The source code can be found at https://github.com/langwangdezhexue/Swin_BTS (accessed on 18 May 2022). Data augmentation techniques are also applied during training.
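A minimal PyTorch sketch of this optimization setup is shown below; the model and epoch count are placeholders rather than the actual SwinBTS configuration:

```python
import torch
import torch.nn as nn

# Stand-ins for illustration only: the real network and training length differ.
model = nn.Conv3d(4, 4, kernel_size=1)   # placeholder for the SwinBTS network
max_epochs = 300                         # hypothetical value

# Adam with initial learning rate 1e-4 and cosine learning-rate decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

for epoch in range(max_epochs):
    # ... one training epoch over the augmented BraTS volumes ...
    scheduler.step()
```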
Our loss function is a combination of Dice loss and cross-entropy loss, and it can be computed in a voxel-wise manner according to Equations (4)–(6), where N is the number of voxels, C is the number of classes, and p and g denote the probability output and the one-hot encoded ground truth for class c at voxel i, respectively.
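A hedged PyTorch sketch of such a voxel-wise Dice plus cross-entropy loss is given below; it assumes the standard soft-Dice and cross-entropy forms, and the exact smoothing and weighting used in Equations (4)–(6) may differ:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target_onehot, eps: float = 1e-5):
    """Combined soft-Dice + cross-entropy loss (illustrative sketch).

    logits:        (N, C, D, H, W) raw network outputs
    target_onehot: (N, C, D, H, W) one-hot encoded ground truth
    """
    probs = torch.softmax(logits, dim=1)

    # Soft Dice, summed over voxels and averaged over classes.
    dims = (0, 2, 3, 4)
    intersection = (probs * target_onehot).sum(dims)
    denom = probs.sum(dims) + target_onehot.sum(dims)
    dice_loss = (1.0 - (2.0 * intersection + eps) / (denom + eps)).mean()

    # Voxel-wise cross-entropy against the class-index ground truth.
    ce_loss = F.cross_entropy(logits, target_onehot.argmax(dim=1))

    return dice_loss + ce_loss
```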
In the model, we use the Dropout operation, which deactivates each neuron with a certain probability during the forward propagation of training in order to reduce overfitting.
4.3. Evaluation Metrics
We use the Dice score and the 95% Hausdorff Distance (HD) to evaluate segmentation accuracy in our experiments. For a given semantic class, let Gᵢ and Pᵢ denote the ground truth and prediction values for voxel i, and let G′ and P′ denote the ground truth and prediction surface point sets, respectively. The Dice score and HD metrics are defined as

Dice(G, P) = 2 Σᵢ Gᵢ Pᵢ / (Σᵢ Gᵢ + Σᵢ Pᵢ),

HD(G′, P′) = max{ max_{g′ ∈ G′} min_{p′ ∈ P′} ‖g′ − p′‖, max_{p′ ∈ P′} min_{g′ ∈ G′} ‖p′ − g′‖ }.

The 95% HD uses the 95th percentile of the distances between the ground truth and prediction surface point sets; as a result, the impact of a very small subset of outliers is minimized when calculating HD.
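For illustration, a brute-force NumPy/SciPy sketch of these two metrics is shown below (our own simplified version; it assumes non-empty binary masks and is not optimized for full-size volumes):

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def dice_score(gt: np.ndarray, pred: np.ndarray) -> float:
    """Dice = 2|G ∩ P| / (|G| + |P|) for binary volumes."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    inter = np.logical_and(gt, pred).sum()
    return 2.0 * inter / (gt.sum() + pred.sum() + 1e-8)

def hd95(gt: np.ndarray, pred: np.ndarray) -> float:
    """95th-percentile symmetric Hausdorff distance between surface point sets."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    gt_surf = np.argwhere(gt & ~binary_erosion(gt))       # boundary voxels of G
    pred_surf = np.argwhere(pred & ~binary_erosion(pred))  # boundary voxels of P
    d = cdist(gt_surf, pred_surf)                          # pairwise surface distances
    gt_to_pred = d.min(axis=1)
    pred_to_gt = d.min(axis=0)
    return np.percentile(np.concatenate([gt_to_pred, pred_to_gt]), 95)
```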
4.4. Experiment Results
4.4.1. BraTS 2019 Dataset
On this dataset, we mainly perform model validation, dividing the 335 samples into 222, 57, and 56 cases as the training, validation, and test sets, respectively. The Dice scores of SwinBTS on this dataset for the ET, TC, and WT categories reach 74.43%, 79.28%, and 89.75%, respectively, with an average Dice score of 81.15%. We trained several SOTA models with the same dataset partitioning for comparison, and the experimental results are shown in Table 1.
In general, SwinBTS performs well in the ET, TC, and WT categories. Table 1 shows that, compared with the classical 3D U-Net, the Dice score of SwinBTS has a clear advantage, which demonstrates the benefit of the Transformer structure over convolution alone. U-Netr uses the transformer structure as the encoder, which is more conducive to learning long-distance features, so its improvement is most obvious in the TC and WT categories. TransBTS adds a Transformer structure to 3D U-Net to model global relationships, but it does not perform well in the ET category, indicating insufficient extraction of local features. VTU-Net is built entirely on the transformer structure and its Dice scores are well balanced across categories, while SwinBTS excels in the ET and TC categories, indicating the excellent ability of our method to extract both global and local features.
The values in brackets in Table 1 are the standard deviations of the segmentation results, from which it can be seen that SwinBTS has the smallest variance for all three categories, indicating that the model produces both the best and the most stable segmentation, without large differences between cases.
In Table 2, we compare the mIoU results on the BraTS 2019 test set, together with the standard deviations. Compared with 3D U-Net, SwinBTS achieves a 10.07% higher mIoU and also improves by 1.03% over VTU-Net, which likewise uses the Transformer structure. The smaller standard deviation again shows the stability of the SwinBTS segmentation results.
We also draw boxplots of the Dice scores of the SwinBTS and TransBTS methods on the BraTS 2019 test set for comparative analysis, as shown in Figure 3. Both methods achieve high-performance segmentation on most test samples. However, because the scans were acquired under different conditions, some samples are affected by factors such as noise, which produces outliers with lower segmentation scores.
4.4.2. BraTS 2020 Dataset
This dataset is mainly used for comparison with SOTA models. We split the official training set into a training set and a validation set at a ratio of 8:2 for training, and then perform segmentation prediction on the BraTS 2020 validation dataset. The results were submitted to the BraTS 2020 Challenge official website for online verification and comparison with SOTA models. The results are evaluated with the main metrics of the challenge, the Dice score and the 95% Hausdorff distance, and are shown in Table 3.
The Dice scores of SwinBTS in the ET, TC, and WT categories are 77.36%, 80.30%, and 89.06%, respectively, and the Hausdorff distances are 26.84 mm, 15.78 mm, and 8.56 mm, respectively. Compared with traditional CNN methods, such as 3D U-Net, V-Net, and Residual U-Net, SwinBTS shows an obvious improvement, and it also improves on methods using transformer structures, such as U-Netr, TransBTS, and VTU-Net. The improvement of SwinBTS over VTU-Net in Table 3 is relatively limited, so we also compare the standard deviation of the Dice score and find that the standard deviation of SwinBTS is much lower, indicating that across a large number of segmentation cases the model is much more stable and does not exhibit large deviations.
4.4.3. BraTS 2021 Dataset
This dataset uses the same settings as BraTS 2020: the official training set is split 8:2 into training and validation sets for training, and the final predictions are verified online. The results are shown in Table 4.
SwinBTS also achieved excellent segmentation results on the BraTS 2021 dataset. ET, TC, and WT can achieve Dice scores of 83.21%, 84.75%, and 91.83%, respectively, and the Hausdorff distances are 16.03 mm, 14.51 mm, and 3.65 mm, respectively, exceeding the results of most methods.
4.5. Ablation Experiments and Analysis
We also conducted extensive ablation experiments to verify the effectiveness of the SwinBTS model. There are two main ablation experiments:
(1) We verify the effectiveness of each module, as shown in Table 5, which compares the segmentation results of the basic model, the model with the NFCE module added, the model with a Transformer module in the bottleneck, the model with a convolution module in the bottleneck, and the model with the ETrans module added.
We first convert SwinUnet to a 3D version as the basic model of this study. From the experimental results, the average Dice score obtained by SwinUnet3D is 78.96%, which is lower than that of TransBTS and other methods. After adding the NFCE module, the Dice score improves by 0.87%, with gains in all three categories. For the bottom bottleneck, we first tried adding a Transformer structure and a convolution module, respectively. The results show that adding the Transformer structure improves performance to a certain extent, but compared with SOTA models the segmentation of the ET and TC categories remains poor; these two categories require strong detail feature extraction. We therefore tried using convolution to enhance the model, but after adding the convolution module the performance dropped by 0.37%. A possible reason is that the convolution structure is only a second-order mapping, which limits the model's fitting ability. We therefore combine the Transformer structure to turn the convolution operation into a third-order mapping and add an MLP structure, yielding the ETrans module. The final experiment confirms the effectiveness of this module: the average Dice score is 1.32% higher than without it and 2.19% higher than the baseline model.
(2) Depth ablation of the ETrans module. The segmentation results when the number of stacked ETrans blocks is 1, 2, or 4 are shown in Table 6.
From Table 6, we can see that a larger number of stacked blocks does not necessarily yield a better segmentation effect: the result with a depth of four is not as good as that with a depth of two. Therefore, the depth of ETrans used in this paper is two.
4.6. Heatmap Analysis
Figure 4 shows the heatmaps of the model before and after adding the ETrans module. The role of the ETrans module is mainly to improve the model's extraction of local features, especially for small-size categories. From Figure 4, we can see that without the ETrans module the model poorly identifies the necrotic area and the enhancing tumor area within the whole tumor, so its segmentation of the ET and TC categories is weak. After adding the ETrans module, the network clearly pays more attention to the central area of the tumor.
4.7. The Impact of Dataset Noise on Experiments
Noise is an unavoidable problem in medical images. Due to factors such as equipment, operation, patients, and the environment, datasets always contain different levels of noise. Therefore, in this section, we explore the segmentation ability of the SwinBTS model on datasets with varying degrees of noise.
Our main approach is to add different levels of Gaussian noise to the BraTS 2019 dataset. A comparison of the data before and after adding noise is shown in Figure 5.
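A simple sketch of this perturbation is shown below; the exact intensity scaling of the noise levels used in the experiment is not reproduced here, and the function name is our own:

```python
import numpy as np

def add_gaussian_noise(volume: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise to an MRI volume.

    `sigma` plays the role of the "noise-sigma" level discussed below; how it
    relates to the intensity range of the normalized volumes is an assumption.
    """
    rng = np.random.default_rng(seed)
    noisy = volume + rng.normal(0.0, sigma, size=volume.shape)
    return noisy.astype(volume.dtype)
```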
In Table 7, we report the effect of adding different levels of noise on segmentation performance.
From Table 7, we can see that SwinBTS is sensitive to noise. When the noise level is low, the model still performs well, but with a larger noise level (noise sigma = 5) the final segmentation result drops by about 10%. Noise is therefore a key influencing factor in medical image analysis; however, MRI images are comparatively little affected by noise, so it does not have a great impact on brain tumor segmentation tasks.
4.8. Visual Comparison
In this section, we visually compare the brain tumor segmentation results of the proposed method with those of 3D U-Net, TransBTS, and VTU-Net, as shown in Figure 6. In Figure 6, the first row shows the cross-sectional (axial) view of the brain tumor, the second row the sagittal view, and the third row the coronal view. For ease of observation, we show all three views and set the coordinates of the intersection point to (109, 89, 78). The red label in the figure represents the ET area, the yellow label the TC area, and the green label the WT area. It can be seen from Figure 6 that all models segment the WT region best, while the segmentation of the two complex edges of ET and TC differs greatly between models. Compared with the ground truth, our model segments the edge details more accurately.
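For reference, the following sketch (our own illustration, with an assumed axis ordering) extracts the three views through a given voxel:

```python
import numpy as np

def extract_planes(volume: np.ndarray, point=(109, 89, 78)):
    """Extract the axial (cross-sectional), sagittal, and coronal slices
    through a given voxel, as used for the visual comparison in Figure 6.

    The axis ordering (sagittal, coronal, axial) is an assumption and should be
    adjusted to the actual orientation of the NIfTI volumes.
    """
    x, y, z = point
    return {
        "axial": volume[:, :, z],
        "coronal": volume[:, y, :],
        "sagittal": volume[x, :, :],
    }
```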
4.9. Discussion
Using artificial intelligence to assist doctors in diagnosis can greatly improve diagnostic efficiency. The use of deep learning methods for medical image segmentation is currently at the cutting edge of research and delivers the best performance. However, to truly apply this research to computer-aided diagnosis, higher segmentation accuracy is required to ensure the safety and effectiveness of diagnosis. The main purpose of our research is to improve the segmentation accuracy of brain tumor MRI images by designing and improving the network model, and the comparisons above show that our model has excellent segmentation performance (Dice score). However, our model also has some shortcomings: it can be seen from Table 3 that SwinBTS performs worse than U-Netr and TransBTS on the 95% Hausdorff Distance metric, indicating that its ability to segment image edges is relatively insufficient. We attribute this problem to the extensive use of the Transformer structure. Therefore, our next step will be to combine CNNs with SwinBTS to improve it and achieve better performance.
5. Conclusions
We proposed a novel segmentation method, SwinBTS, which can automatically segment diseased tissue in brain MRI images. The model effectively combines a 3D Swin Transformer, 3D convolutional structures, and an encoder–decoder structure to achieve efficient and accurate segmentation of MRI images. Unlike CNN-based methods, we use a 3D Swin Transformer as the encoder and decoder to effectively extract the global information of the feature map. After the encoder/decoder, combining the NFCE module with downsampling/upsampling reduces the information loss caused by resampling. The ETrans module added at the bottom of the model is designed by combining CNN and Transformer structures; it extracts local detailed features, so that the model also has strong segmentation capabilities for categories that occupy a small proportion of the image (such as ET). Finally, we validate the method on three datasets (BraTS 2019, BraTS 2020, and BraTS 2021). Experimental results show that our method performs better in brain tumor MRI image segmentation than several state-of-the-art methods (such as Residual U-Net, Attention U-Net, and TransBTS). The consistent results across the three datasets indicate its potential for practical application in auxiliary diagnostic systems, and the visualization results show that the proposed method segments all three lesion regions of brain tumors well. In future work, we will explore optimizing the self-attention structure of the Swin Transformer module to improve the overall performance of the method, and at the same time, explore how to more effectively combine the Transformer and CNN to improve the model's ability to segment the lesion edge region.