1. Introduction
The emergence of deep learning (DL) has brought a new age of data science study and development [
1]. Within a relatively short period of time, DL has had an impact on every aspect of life. The greatest immediate impact is felt in image processing [
2], robotics [
3], self-driving vehicles [
4], natural language processing [
5], computer games [
6], and many other fields. The excellent performance-to-cost ratio, along with the widespread availability of computer technology such as graphics processing units (GPUs) and multi-core processor chips, has made DL extremely popular among data scientists [
7]. The cornerstone of DL is the formalization of the concept that all brain functions are generated from neural activity [
8]. The McCulloch-Pitts neuron model was a groundbreaking investigation into the operation of neural networks that led to the creation of numerous additional neural models of the brain, e.g., feedback neural networks [
9], feed-forward neural networks [
10], perceptrons [
11], etc. While previous networks were either single-layer (input-output) or featured a single hidden layer (input-hidden-output), the DL paradigm takes advantage of adding depth to the network by utilizing multiple hidden layers. Learning can be supervised or unsupervised. In supervised learning, the algorithm is taught using training data paired with the ground-truth (GT) desired output, and by the end of training the system learns to recognize complicated patterns. Unsupervised learning is the process through which an algorithm learns to recognize complicated patterns and processes without the assistance of a human observer. There are several DL adaptations for medical imaging. The deep belief network (DBN) is a DL adaptation for unsupervised learning in which the top two layers work as associative memory [
12,
13]. DBN has found major uses in the creation and recognition of images [
14,
15], video surveillance [
16], and motion capture data [
17]. Owing to its segmentation quality and accuracy, the active contour model (ACM) has become a prominent technique in image segmentation [
18]. A stable and robust segmentation method based on pre-fitting energy was developed by Ge et al. [
19]. Autoencoder is a deep learning-based network that is used for unsupervised learning [
20]. An autoencoder’s input and output layers are composed of the same number of nodes, with one or more hidden layers linking them. The hidden nodes are trained to encode the input data in a representation from which it can be regenerated. As a result, instead of traditional GT, the input data itself is used to train the autoencoder. The convolutional neural network (CNN) is a form of DL that is employed especially in computer vision [
21]. It is inspired by the way the biological visual system works. CNNs, like the animal visual cortex, take advantage of spatially local correlation by imposing a local connection pattern between neurons in neighboring layers. There are several CNN models available, including GoogLeNet [
22], AlexNet [
21], and LeNet [
23], etc. It has been observed that as the depth of a DL-based system increases, its performance stagnates and then rapidly deteriorates. Deep Residual Networks (DRNs) allow for increased depth without sacrificing performance [
24]. Scan images of infected/abnormal areas are often obtained in medical imaging utilizing computed tomography (CT) [
25], magnetic resonance imaging (MRI) [
26] and ultrasound (US) [
27]. In most cases, professional physicians are responsible for identifying diseased tissues or other abnormalities. The development of computer vision and machine learning (ML) has produced a new generation of technologies for computer-assisted diagnosis (CAD) of disease (
Table 1). Image segmentation is complicated by intensity inhomogeneity, slow segmentation speed, and a narrow area of application, issues that can be corrected by an additive bias correction (ABC) model [
28]. In [
29], the authors created an active deformable model for cervical tumor identification in 2008. Suri and his colleagues developed a feature-based recognition and edge-based segmentation method for measuring carotid intima-media thickness (cIMT) in 2011 [
30]. In 2014, the same group developed an ML-based method for ovarian tissue characterization [
31].
During the same year, an attempt was made to create a CAD system for detecting Hashimoto thyroiditis in US images from a Polish population [
46]. Suri and his colleagues created a method for semi-automated segmentation of carotid artery wall thickness in MRI in 2014 [
47]. The selection of a specific set of feature extraction methods is part of the ML characterization process. The selected features are integrated, in various ways, by ML-based algorithms for successful characterization. An open-loop feature extraction procedure usually yields poor results. The introduction of DL in medical imaging has reduced the need for feature extraction techniques, as DL systems create features internally, avoiding the ineffective feature extraction stage. Deformable models are commonly employed in segmentation to estimate the shape of an infected/abnormal area in a medical image [
48]. However, noise or missing data in an image reduces the accuracy of deformable models, resulting in a poor border shape. DL uses pixel-to-pixel characterization to determine the form of an infected/abnormal shape in an image, which enables it to delineate shapes accurately. For 3D segmentation in ML, a 3D atlas feature vector is generated from each voxel (3D image unit), coupled with probability maps, and then training/testing is performed to define the inferred shape [
49]. Such feature vector estimation is task dependent and may not be accurate for various types of 3D datasets. Internal feature extraction is performed in DL to approximate the position of the desired shape. As a result, DL provides a generic technique for segmenting 3D images that may also be expanded to accommodate 4D data such as video. Unlike ML, which updates weights concurrently, DL changes weights layer by layer during training, which aids in the training of DL systems.
Machine learning, particularly deep learning, has proliferated in the diagnostic imaging industry over the past decade [
50]. Deep learning algorithms, also known as deep neural networks, are constructed by stacking huge numbers of discrete artificial neurons, each of which performs elementary mathematical operations such as multiplication, summation, and thresholding. One of the fundamental factors behind the success of these new deep neural networks is the concept of representation learning, which is the process of automatically learning valuable characteristics from data as opposed to manual selection by experienced staff [
1]. A convolutional neural network (CNN) is specifically intended to extract characteristics from two-dimensional grid data, such as images, using a sequence of learned filters and non-linear activation functions. This set of characteristics may subsequently be utilized to accomplish different downstream tasks such as image classification, object recognition, and semantic or instance segmentation [
1]. Lately, U-Net [
51], an end-to-end fully convolutional network (FCN) [
52], was published for semantic segmentation of various structures in medical images. The U-Net design is composed of a contracting path that collects high-resolution, contextual data while downsampling at each layer, and an expanding path that boosts output resolution by upsampling at each layer [
51]. Via skip connections, the features from the contracting path are joined with those from the expanding path, ensuring that the retrieved contextual characteristics are localized [
53]. Originally designed for cell tracking, the U-Net model has lately been extended to additional medical segmentation applications such as brain vascular segmentation [
54], brain tumor segmentation, and retinal segmentation [
55]. In the medical image segmentation literature, many multi-path architectures have been developed to retrieve features from the provided data at different levels [
37,
56,
57]. Inception modules have also realized the notion of extracting and aggregating features at multiple scales [
23]. Plain feature extraction techniques, however, differ from those of multi-path systems [
37,
56,
57]. In this paper, we present an end-to-end brain tumor segmentation system that combines a modified U-Net architecture with Inception modules to achieve multi-scale feature extraction. Furthermore, we assess the impact of training different models to directly segment glioma sub-regions rather than intra-tumoral features. All learning procedures were combined in a new loss function based on the Dice Similarity Coefficient (DSC). The suggested scheme is a fusion of the CNN and U-Net architectures. We propose four architectures and compare their performance: the first is a recurrent-inception U-Net, the second is a recurrent-inception depth-wise separable U-Net, the third is a hybrid recurrent-inception U-Net, and the fourth is a depth-wise separable hybrid recurrent-inception U-Net. Each is explained further in the paper. For accurate segmentation, it is preferable to eliminate class imbalance using ROI detection. Using a CNN design, slices with and without tumor are categorized in the first stage. The slices containing tumors are then sent to the segmentation network for pixel-by-pixel classification. The classification network uses the FLAIR and T2 MRI modalities to highlight whole tumor areas, whereas the segmentation network uses all four modalities (i.e., T1, T2, T1c, and FLAIR). More details about these modalities are provided in
Section 2.2.
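Since all four networks are trained with a DSC-based loss, a minimal Keras sketch of such a loss is given below for illustration; the smoothing constant and exact formulation are assumptions, not the authors' published code:

```python
from tensorflow.keras import backend as K

def dice_coefficient(y_true, y_pred, smooth=1.0):
    # Overlap measure between the predicted and ground-truth masks.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    # Minimizing this loss maximizes segmentation overlap.
    return 1.0 - dice_coefficient(y_true, y_pred)
```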
6. Method and Experiment
6.1. Recurrent-Inception UNET
Convolutional neural networks are characterized by two structural parameters: depth and width. Width denotes the number of filters at each layer, whereas depth represents the number of layers. Incorporating more layers into the network rapidly increases the number of parameters to be tuned. Too many parameters can result in overfitting, and deeper networks are more likely to suffer from vanishing gradients. GoogLeNet used 1×1 bottleneck convolution layers for channel-wise pooling of feature maps, reducing the number of maps while retaining their high-quality characteristics and avoiding a large parameter space. Inception modules feature several filter sizes that aid in learning the various types of variation found in distinct images, enhancing the handling of multiple object scales. First, the features learned at a layer are delivered to distinct paths; second, each path learns features using its own filter size; and lastly, the features from all paths are concatenated and passed to the next layer.
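As an illustration of this three-step pattern, a minimal Keras sketch of a multi-path block follows; the 1×1/3×3/5×5 branch sizes are generic Inception choices, not necessarily the exact sizes used here:

```python
from tensorflow.keras import layers

def inception_block(x, filters):
    # Steps 1-2: route the input through parallel branches, each with its own
    # filter size, so different scales of variation are learned.
    b1 = layers.Conv2D(filters, (1, 1), padding='same', activation='relu')(x)
    b3 = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    b5 = layers.Conv2D(filters, (5, 5), padding='same', activation='relu')(x)
    # Step 3: concatenate the branch outputs and pass them to the next layer.
    return layers.Concatenate()([b1, b3, b5])
```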
Inception
Inception modules enhance network scalability by capturing data at multiple levels. As we go deeper into a convolutional network, the spatial concentration of features decreases. Large kernels are important in the early stages for capturing more global information, whereas small kernels are preferable in the later stages for capturing more local information. Across the network, different inception modules are employed based on the varying dimensions of the features, since larger filter sizes are more beneficial for learning key aspects of images with large spatial sizes while having an averaging effect on images with small spatial sizes. At the beginning of the encoder, the Inc.Block contains a high proportion of large kernels relative to small kernels; within deeper levels, the Inc.Block contains a small proportion of large kernels relative to small kernels. Furthermore, to address the slow convergence of deep models, a batch normalization layer is employed after every inception layer to normalize the features.
Figure 4 and Figure 5 depict the first and second Inc.Block. In both blocks, different filter sizes are employed, and the features from the several branches are concatenated (Inception block). The first Inc.Block consists of larger filter sizes, whereas the second Inc.Block learns smaller filters. The first proposed architecture is shown in Figure 6.
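A sketch of the two block variants is given below. The exact kernel sizes are not legible in this version of the text, so the 7×7/5×5/3×3/1×1 choices here are placeholders that only preserve the stated large-to-small ratios:

```python
from tensorflow.keras import layers

def inc_block_1(x, filters):
    # Early encoder / late decoder: large kernels outnumber small ones.
    # Kernel sizes are placeholders, not the paper's values.
    large_a = layers.Conv2D(filters, (7, 7), padding='same', activation='relu')(x)
    large_b = layers.Conv2D(filters, (5, 5), padding='same', activation='relu')(x)
    small = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    merged = layers.Concatenate()([large_a, large_b, small])
    return layers.BatchNormalization()(merged)

def inc_block_2(x, filters):
    # Deep encoder / early decoder: small kernels outnumber large ones.
    small_a = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    small_b = layers.Conv2D(filters, (1, 1), padding='same', activation='relu')(x)
    large = layers.Conv2D(filters, (5, 5), padding='same', activation='relu')(x)
    merged = layers.Concatenate()([small_a, small_b, large])
    return layers.BatchNormalization()(merged)
```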
6.2. Recurrent-Inception UNET with DS-Convolution
Depthwise Separable Convolution
Convolution is performed on all image channels simultaneously by the standard convolution kernel, and each convolution kernel is associated with a feature map, so the depth-wise and spatial convolutions are learned simultaneously [85] (Figure 7).
Let $K_h$ represent the height and $K_w$ the width of the convolution kernel, and let $(P, Q, C)$ be the input feature map, where $P$ corresponds to width, $Q$ to height, and $C$ to the number of input channels. Consequently, the output feature map has size $(P, Q, D)$, with $D$ being the number of output channels. The cost $F_s$ of the standard convolution is calculated as:

$$F_s = K_h \cdot K_w \cdot C \cdot D \cdot P \cdot Q$$
Depth-wise separable convolution comprises a depth-wise convolution followed by a point-wise convolution: the depth-wise convolution is in charge of filtering, while the point-wise convolution is in charge of mapping the output features. The depth-wise convolution operates on each channel separately, combining the input channels in the 2D plane with distinct kernels. Its cost $F_d$ is calculated as:

$$F_d = K_h \cdot K_w \cdot C \cdot P \cdot Q$$
The point-wise convolution cost $F_p$ is calculated as $F_p = C \cdot D \cdot P \cdot Q$, whereas $F_{ds}$ denotes the cost of the depth-wise separable convolution, calculated as the sum of the depth-wise and point-wise costs: $F_{ds} = F_d + F_p$.
When only a single feature is extracted, the depth-wise separable convolution performs worse than the conventional convolution. However, as the network depth and the number of extracted features rise, depth-wise separable convolution can save a considerable amount of computation time [86]. Its cost relative to standard convolution can be calculated as follows:

$$\frac{F_{ds}}{F_s} = \frac{K_h \cdot K_w \cdot C \cdot P \cdot Q + C \cdot D \cdot P \cdot Q}{K_h \cdot K_w \cdot C \cdot D \cdot P \cdot Q} = \frac{1}{D} + \frac{1}{K_h \cdot K_w}$$
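To make the saving concrete, the short sketch below evaluates these cost formulas; the slice size, channel counts, and kernel size are illustrative choices (the 4 input channels mirror the four MRI modalities), not measurements from the paper:

```python
def conv_costs(P, Q, C, D, Kh, Kw):
    # Multiply-accumulate counts following the formulas above.
    standard = Kh * Kw * C * D * P * Q   # F_s
    depthwise = Kh * Kw * C * P * Q      # F_d
    pointwise = C * D * P * Q            # F_p
    return standard, depthwise + pointwise  # F_s and F_ds

# Illustrative numbers: a 240x240 slice, 4 input channels, 32 output maps, 3x3 kernel.
fs, fds = conv_costs(P=240, Q=240, C=4, D=32, Kh=3, Kw=3)
print(f"F_ds / F_s = {fds / fs:.3f}")  # ~0.142, i.e., 1/D + 1/(Kh*Kw)
```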
The proposed network employs the residual dense connection approach in the encoder-decoder network to address the issue of a limited number of information streams [
87].
The residual dense block is a fundamental network unit in which the first convolution layer of the first encoder block is added to the first convolution layer of all subsequent encoder blocks. This improves feature propagation, allowing the image to be reproduced more faithfully. Full-scale skip connections are established between the different network blocks [88]. As network depth rises, the number of image features increases. Since too many convolutional layers would cause information redundancy, our implementation fuses local features before upsampling to extract and fuse the effective features within every base unit. The U-Net framework performs upsampling four times, since carrying too many features across the four connections would result in an unacceptably long training period. This work is based on the concept of a residual network: we send the context information from the first residual dense block to the following residual dense blocks and incorporate the global characteristics. As a result, SDCN-Net can acquire deep features in a hierarchical framework.
Shallow feature information is extracted using depth-wise separable convolutional layers. The residual dense block structure is made up of three major components: shallow feature extraction, local adaptive feature learning, and global feature fusion. Global feature fusion and local features are combined to reduce dimensionality. The full-scale skip connections used in the SDCN-Net module improve network generalization and minimize network degradation. Through cascading operations, U-Net’s long skip connections and the residual network’s short skip connections are merged, so that the bottom layers also influence the output [
89].
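The sketch below shows one reading of this residual dense idea in Keras: the first encoder block's features are downsampled, projected, and added into every later encoder block. The widths and the exact bridging operators are assumptions made for illustration:

```python
from tensorflow.keras import layers

def residual_dense_encoder(x, widths=(32, 64, 128)):
    # First encoder block; its output is re-injected into all later blocks.
    first = layers.SeparableConv2D(widths[0], (3, 3), padding='same',
                                   activation='relu')(x)
    feat = first
    for i, w in enumerate(widths[1:], start=1):
        feat = layers.MaxPooling2D((2, 2))(feat)
        conv = layers.SeparableConv2D(w, (3, 3), padding='same',
                                      activation='relu')(feat)
        # Downsample and project the first block's stream so shapes match.
        bridge = layers.AveragePooling2D(pool_size=2 ** i)(first)
        bridge = layers.SeparableConv2D(w, (1, 1), padding='same')(bridge)
        feat = layers.Add()([conv, bridge])
    return feat
```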
The kernel size is fixed, and the number of modal channels is set to 4. The computational performance benefit becomes increasingly apparent as the number of channels rises.
In the recurrent-inception network, the regular convolution operation is replaced by the depth-wise separable convolution operation. Inc.Blocks are identical to the ones given in the previous section (
Figure 4 and
Figure 5), with the only difference being that depthwise separable convolution is used here, as shown in
Figure 8 and
Figure 9.
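In Keras terms, this swap amounts to replacing each Conv2D in the Inc.Block with SeparableConv2D; a minimal sketch (branch sizes are again placeholders) follows:

```python
from tensorflow.keras import layers

def ds_inception_block(x, filters):
    # Same multi-path pattern, but every branch is depth-wise separable.
    b3 = layers.SeparableConv2D(filters, (3, 3), padding='same', activation='relu')(x)
    b5 = layers.SeparableConv2D(filters, (5, 5), padding='same', activation='relu')(x)
    merged = layers.Concatenate()([b3, b5])
    return layers.BatchNormalization()(merged)
```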
The depth-wise separable recurrent-inception network is shown in
Figure 10.
6.3. Hybrid Recurrent-Inception UNET
In the hybrid recurrent-inception network, we combine the regular U-Net blocks and the recurrent-inception blocks to form a U-shaped architecture with skip connections. The design is shown in
Figure 11.
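A compact sketch of this hybrid assembly is shown below, reusing the inception_block sketch defined earlier; which stages use plain blocks versus inception blocks, and all sizes, are illustrative assumptions rather than the paper's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def hybrid_unet(input_shape=(240, 240, 4), base=32):
    inp = layers.Input(input_shape)
    # Encoder: a plain U-Net block followed by an inception block.
    e1 = layers.Conv2D(base, (3, 3), padding='same', activation='relu')(inp)
    p1 = layers.MaxPooling2D((2, 2))(e1)
    e2 = inception_block(p1, base * 2)  # from the earlier sketch
    p2 = layers.MaxPooling2D((2, 2))(e2)
    # Bottleneck.
    b = layers.Conv2D(base * 4, (3, 3), padding='same', activation='relu')(p2)
    # Decoder with skip connections back to the encoder stages.
    u2 = layers.Concatenate()([layers.UpSampling2D((2, 2))(b), e2])
    d2 = inception_block(u2, base * 2)
    u1 = layers.Concatenate()([layers.UpSampling2D((2, 2))(d2), e1])
    d1 = layers.Conv2D(base, (3, 3), padding='same', activation='relu')(u1)
    out = layers.Conv2D(1, (1, 1), activation='sigmoid')(d1)
    return tf.keras.Model(inp, out)
```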
DS-Hybrid Recurrent-Inception Unet
The depth-wise separable hybrid recurrent-inception U-Net is shown in
Figure 12. It is identical to the hybrid recurrent-inception U-Net, except that the depth-wise separable convolution is used for depth-wise feature learning.
6.4. Experimental Setup
Each network was built using the Keras framework with a TensorFlow backend. Experiments were performed on a GPU-based system with 128 GB RAM and an Nvidia K80 (12 GB VRAM).
The model was fed cropped slices. The classification network was trained with the Adam optimizer for 200 epochs with a batch size of 25. A Keras initializer class was used to initialize all of the convolutional layers in the segmentation UNET architecture.
Experiments were carried out using a variety of CNN models with different numbers of convolutional and dense layers. The CNN model optimized to achieve the best performance has nine layers: the first five are convolutional layers with 32, 64, 128, 256, and 512 filters, followed by fully connected layers. The activation function in all layers is ReLU, except for the last layer, which uses a sigmoid. During training, we employed data augmentation techniques such as horizontal flip, vertical flip, zoom, and a 0.2 shear range.
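A runnable sketch of this training setup follows. The stand-in classifier, slice size, learning rate, zoom value, and loss are placeholders (the paper's nine-layer CNN is described only partially above); the flips, shear range, epoch count, and batch size mirror the text:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Minimal stand-in for the tumor/no-tumor slice classifier.
model = tf.keras.Sequential([
    layers.Input((64, 64, 2)),  # FLAIR + T2 slices; size is illustrative
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # LR is a placeholder
              loss='binary_crossentropy', metrics=['accuracy'])

# Augmentation as described: flips, zoom, and a 0.2 shear range (zoom value assumed).
augmenter = ImageDataGenerator(horizontal_flip=True, vertical_flip=True,
                               zoom_range=0.2, shear_range=0.2)

x_train = np.random.rand(50, 64, 64, 2).astype('float32')  # dummy data for the sketch
y_train = np.random.randint(0, 2, size=(50, 1))
model.fit(augmenter.flow(x_train, y_train, batch_size=25), epochs=200)
```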
All UNET settings are kept the same for Inception-UNET, except that an Inc.Block is inserted in every block. Features are acquired in the Inc.Block using kernels at different scales. To accomplish feature fusion, the features extracted from the input by the differently sized convolutional layers are concatenated and batch-normalized to enhance convergence.
RI-UNET uses different Inc.Blocks at different UNET levels, depending on the spatial feature concentration at each stage.
Because of the high spatial feature concentration at these stages, Inc.Block 1 is applied at the encoder’s first two stages and at the later stages of the decoder. The first Inc.Block contains a higher number of large-sized filters than small ones. To accomplish feature fusion, the feature maps extracted from the input by the different branches are combined.
Due to the minimal spatial concentration of features at these levels, Inc.Block 2 is incorporated in the encoder’s later stages and in the first layers of the decoder. Here, the small-sized filters outnumber the large ones. To accomplish feature fusion, the feature maps extracted from the input by the different convolutional layers are combined.
8. Comparison of Training, Validation, and Test Results
Figure 19,
Figure 20 and
Figure 21 show the comparison of all four models when they are trained and validated using the same set of images. It can be seen that the depth-wise separable hybrid model performed better than the rest of the models. Similarly,
Figure 22,
Figure 23 and
Figure 24 show the validation results.
Figure 25 shows the loss of each model when tested on the test images. It can be seen that MI-Unet and DS-MIUnet have almost identical loss curves, whereas hybrid-Unet and DS-Hybrid Unet have lower loss values, indicating better performance than the rest of the models.
Figure 26 and
Figure 27 show the dice coefficient and accuracy for these models when evaluated on test images. It can be observed that Hybrid Unet and DS-Hybrid Unet outperformed the MI-Unet and DS-MIUnet with a remarkable increase in performance.
Each model is tested on the test images and the results are compared and are shown in
Table 2. The performance differences between the models are calculated and shown in the following tables. In
Table 3, the baseline architecture is compared with the four proposed architectures. There is a considerable performance improvement (the visual results are shown in
Figure 28,
Figure 29,
Figure 30,
Figure 31,
Figure 32 and
Figure 33). From
Table 3, it can be seen that MI-Unet improved on the baseline Unet architecture in dice score, sensitivity, and specificity. Depth-wise separable MI-Unet, the hybrid Unet, and the depth-wise separable hybrid Unet architectures likewise outperformed the baseline Unet architecture in dice score, sensitivity, and specificity; the exact margins are given in Table 3.
In Table 4, the proposed architectures are compared with the MI-Unet architecture. It can be seen from the table that the depth-wise separable MI-Unet architecture increased the dice coefficient and specificity, and that the hybrid model improved the dice coefficient and specificity further relative to MI-Unet; the exact margins are given in Table 4. A decrease in sensitivity can be seen in comparison with the MI-Unet architecture.
Table 5 shows a comparison of the depth-wise separable MI-Unet architecture with the hybrid model and the depth-wise separable hybrid model. Both the hybrid Unet and the depth-wise separable hybrid Unet improve on the depth-wise separable MI-Unet in dice coefficient score and sensitivity, with only a small decrease on the remaining metric for each; the exact margins are given in Table 5.
A performance comparison of the depth-wise separable hybrid U-Net architecture with the hybrid Unet is shown in Table 6. An improvement in dice coefficient and sensitivity can be highlighted, with a slight decrease in specificity.
Overall, it can be concluded that the DS-Hybrid model outperforms all other models presented in this work. Finally, we compared our results with state-of-the-art methods.
To identify tumor pixels, a probabilistic approach that combines sparse representation and a Markov random field has been suggested in [
90]. Random decision trees are trained on image characteristics in [
91] to categorize voxels. The experimental findings obtained in this study are compared to state-of-the-art approaches in the benchmark brain tumor segmentation challenge [
79] and methods presented in [
79,
90,
91], as shown in
Table 7. The suggested model’s high specificity values indicate that it is effective in detecting the primary tumor area while avoiding false positives. It is evident that the proposed architectures performed better than the state-of-the-art methods presented in
Table 7.