1. Introduction
Image semantic segmentation assigns a semantic category to each input image pixel to obtain a pixel-wise dense classification. This is a fundamental task in computer vision and is vital in various applications such as medical image analysis [1], target detection [2], and autonomous driving [3]. Semantic segmentation methods can be categorized into traditional and deep neural network-based methods. The former primarily rely on image features such as texture, color, shape, and edges, dividing the image into regions with similar characteristics through different algorithms. Traditional image segmentation methods involve techniques based on thresholds [4], edges [5], and regions [6]. Threshold-based methods typically classify image pixels into foreground and background by setting a grayscale threshold and partitioning the image's grayscale histogram. Although this strategy is simple to implement and computationally efficient, its segmentation results on complex images are poor. Region-based methods segment images based on spatial information, classifying pixels according to similarity features and grouping them into regions. Although this process suits complex images, region-based methods are sensitive to noise and tend to under-segment or over-segment images. Edge-based methods determine potential boundaries by detecting differences in grayscale values between adjacent pixels and connecting these boundaries to form edge contours. Typical edge detection operators are Roberts [7], Sobel [8], Prewitt [9], LoG [10], and Canny [11], which produce relatively clear edge contours but are sensitive to noise and do not consider regional information between pixels, leading to over-segmentation. Although traditional semantic segmentation methods perform well on images with a uniform grayscale or only slight grayscale differences between the object and background, they overlook the image's semantic, spatial, and other feature information. Furthermore, these methods are susceptible to noise and thus unsuitable for complex images. In conclusion, traditional approaches rely heavily on handcrafted features and suffer from the abovementioned drawbacks. To provide a clearer understanding of these traditional methods, a summary table is presented below (Table 1).
Neural network technology has recently become mainstream in image segmentation, aiming to overcome these limitations. For instance, AlexNet [12] employs multiple convolutions to extract local features, significantly improving segmentation beyond traditional methods. In semantic segmentation, dual-branch networks have been widely utilized. For instance, BiSeNet [13] introduces a detail branch that focuses on extracting details and boundary information and complements the semantic branch, which captures global contextual information. This dual-branch design enriches the spatial features. Additionally, BiSeNetv2 [14] builds on BiSeNet with a powerful detail branch and a lightweight semantic branch, realizing better performance and real-time inference speed. Similarly, DANet [15] integrates local and global features using spatial and channel attention modules, enhancing feature representation for precise segmentation without relying on multi-scale feature fusion. Furthermore, ACNet [16] utilizes Asymmetric Convolution Blocks (ACBs) to replace standard convolutions, enhancing accuracy and robustness to distortions and integrating seamlessly into existing architectures for improved performance. Moreover, OCRNet [17] aggregates object-contextual representations to enhance pixel classification in semantic segmentation by leveraging the relationship between pixels and object regions, resulting in improved segmentation accuracy. In addition, BiMSANet [18] tackles scale variation in oblique aerial images by using bidirectional multi-scale attention networks, which fuse features adaptively for more effective semantic segmentation. Twins [19] revisits spatial attention design in vision transformers, proposing the efficient architectures Twins-PCPVT and Twins-SVT, which excel in classification, detection, and segmentation with optimized matrix multiplications. HMANet [20] improves VHR aerial image segmentation by integrating class-augmented attention and region shuffle attention to effectively capture global correlations, enhancing efficiency and performance on benchmark datasets. Meanwhile, DDRNet [21] effectively fuses features through bidirectional and dense connections and introduces a multi-scale feature fusion technique that enhances the network's ability to process multi-scale features. Overall, these neural network models aim to improve segmentation accuracy through multi-scale feature fusion and information recovery.
Encoder–decoder structures have also been widely used in image segmentation. For example, U-Net [22] is known for its encoder–decoder architecture: it extracts features through the encoder, restores the original resolution through the decoder, and ingeniously utilizes skip connections to fuse low-level and deep-level information. Moreover, U2-Net [23] adopts a two-level nested U-shaped structure, with each encoder and decoder module resembling U-Net. Residual U-blocks (RSU) with different-sized receptive fields are adopted to capture more contextual information, while pooling operations are also employed to increase the network's depth. This architecture effectively improves performance, reduces training costs, and alleviates the problems of insufficient scale and contextual information in image segmentation. Similarly, SegFormer [24] advances the field by unifying Transformers with MLP decoders, using a hierarchically structured encoder and a lightweight decoder to efficiently aggregate multi-scale features. This approach achieves superior performance without relying on positional encoding, marking a significant step forward. In addition, UVid-Net [25] enhances UAV video semantic segmentation by incorporating temporal information into an encoder–decoder CNN; a feature-refiner module further ensures accurate, temporally consistent labeling and localization, demonstrating the importance of temporal coherence in video data. In contrast, SETR [26] approaches semantic segmentation as a sequence-to-sequence task, using a pure transformer to encode images into patch sequences. This method provides global context and powerful segmentation capabilities without the need for convolutions, highlighting the versatility of transformer architectures. Furthering the efficiency of vision modeling, Swin Transformer [27] introduces a hierarchical architecture with shifted windows for efficient self-attention. This design enables scalable vision modeling with linear computational complexity relative to image size, addressing the computational challenges posed by large images. To address issues specific to large vision models, Swin V2 [28] introduces residual post-norm, log-spaced position bias, and SimMIM self-supervised pre-training. These innovations tackle training instability, resolution gaps, and the need for extensive labeled data. Swin-UNet [29] utilizes a pure Transformer architecture, combining a hierarchical shifted-window Swin Transformer encoder with a symmetric Swin Transformer decoder that uses patch-expanding layers for high-performance image semantic segmentation. Swin-UNet overcomes the challenges of global–local feature learning and spatial resolution recovery and effectively alleviates the limitations of the traditional U-Net in processing large-scale images. SegNet [30] is known for its memory and computational efficiency, comprising an encoder network, a decoder network, and a pixel-wise classification layer. Despite its low parameter count and ease of end-to-end training, its max-pooling and subsampling may produce coarse segmentation results. Thus, SegNeXt [31] introduces the Multi-Scale Convolutional Attention (MSCA) structure, which leverages depth-wise separable convolutions, multi-branch depth-wise separable convolutions, and 1 × 1 convolutions in the decoder to construct a lightweight global context. This approach enhances semantic segmentation performance, addresses the scale diversity issues in scene understanding, and effectively mitigates the model's reliance on information at different scales. DeepLabV3+ [32] is a deep, fully convolutional neural network that combines an encoder–decoder structure with an improved Atrous Spatial Pyramid Pooling (ASPP) module. DeepLabV3+ possesses strong semantic understanding capabilities, enabling it to better capture details through dilated convolutions and multi-scale contextual information. DeepLabV3-SAM [33] combines traditional semantic segmentation with the Segment Anything Model (SAM), fully utilizing SAM's zero-shot transfer capability alongside traditional algorithms for low-cost, accurate, and fully automated image segmentation. LM-DeepLabv3+ [34] is a lightweight semantic segmentation method that adopts MobileNetV2 [35] as the backbone network and introduces the ECA-Net [36] attention mechanism and EPSA-Net [37] cross-dimensional channel attention to reduce the number of parameters and computational complexity while enhancing the feature representation capability and the interaction of important features. In light of these advancements, a summary of the key neural network-based image segmentation methods and their contributions is presented in Table 2.
Despite the significant progress in semantic segmentation networks, they still suffer from insufficient contextual information, inadequate attention allocation, a lack of semantic information in low-level features, and insufficient resolution in high-level features. Therefore, this paper introduces an improved Deeplabv3+ model that utilizes ResNet101 as the backbone network. The proposed model incorporates a novel Semantic Channel Space Details Module (SCSDM) to extract richer information through multi-scale feature fusion and adaptive feature selection, enabling the network to focus on important features. Additionally, a Semantic Features Fusion Module (SFFM) is designed to better integrate low-level and high-level semantic information. Moreover, to address the problem of insufficient contextual information, the developed model utilizes a Dilated Convolutional Atrous Spatial Pyramid Pooling (DCASPP) module, which enriches the feature representation through densely connected dilated convolutions, concatenations, and global average pooling operations. This strategy expands the receptive field and enhances semantic information.
A Dilated Convolutional Atrous Spatial Pyramid Pooling (DCASPP) module is constructed with multi-level dilated convolutions and a global average pooling structure. DCASPP effectively extracts contextual information for semantic segmentation, significantly enhancing the model's understanding of complex scenes and its segmentation accuracy.
SCSDM is introduced, integrating multi-scale receptive field fusion with spatial attention. SCSDM significantly enhances the model's ability to perceive critical regions, thereby improving the accuracy and efficiency of semantic segmentation.
The SFFM is proposed, which incorporates feature-weighted fusion, channel attention, and spatial attention structures. This design effectively overcomes the challenges of limited semantic information in low-level features and reduced resolution in high-level features in semantic segmentation. As a result, overall image comprehension and segmentation accuracy are significantly enhanced, demonstrating the SFFM's crucial role in advancing the performance of semantic segmentation models.
Extensive experiments on two public datasets validate the effectiveness of SDAMNet. Specifically, SDAMNet achieves a Mean Intersection over Union (MIOU) of 67.2% on the Aerial Semantic Segmentation Drone dataset and 75.3% on the UDD6 dataset.
The rest of this paper is organized as follows: Section 2 reviews the related literature. Section 3 describes the proposed SDAMNet. Section 4 presents and discusses the experimental results. Finally, Section 5 provides the conclusions.
3. Semantic Segmentation Network Based on Adaptive Attention and Deep Fusion with the Multi-Scale Dilated Convolutional Pyramid
This section introduces the overall architecture of SDAMNet. Subsequently, the designed DCASPP, SCSDM, and SFFM are introduced.
3.1. Overall Architecture
Although the segmentation network based on Deeplabv3+ has achieved good results, it suffers from two pressing problems: it neglects shallow features, which leads to poor edge segmentation, and it reduces the feature map resolution. The latter is due to the multiple downsampling operations in deep convolutional neural networks, which reduce prediction accuracy and discard boundary information. To tackle these issues, this paper develops a new semantic segmentation network named SDAMNet, whose architecture is presented in Figure 4. The operation flowchart of SDAMNet is shown in Figure 5.
Figure 5. Simplified summary of the SDAMNet algorithm. The figure outlines (i) the encoder–decoder architecture, with a ResNet101 encoder that extracts multi-scale features and a decoder that refines and fuses high- and low-level features via the SFFM before upsampling to the original image size; (ii) the key modules, namely DCASPP (multi-scale context with global information), SCSDM (attention to important features, boundaries, and details), and SFFM (fusion of high- and low-level features); and (iii) the processing steps of feature extraction, feature enhancement, feature fusion, and upsampling and prediction.
SDAMNet is an end-to-end neural network that adopts an encoder–decoder architecture. The encoder utilizes ResNet101 as the backbone, achieves multi-scale feature fusion through the DCASPP module, and introduces global information aggregation. The SCSDM module further enhances the model's ability to perceive important features and effectively captures target boundaries and details. The resulting feature map is then pre-processed by a 1 × 1 convolution that adjusts the number of channels, followed by an upsampling operation. The decoder achieves efficient feature fusion through the SFFM module, adaptively integrating the advantages of low-level and high-level features. By comprehensively utilizing multi-scale features, introducing global information, and capturing detailed information, the model substantially improves semantic segmentation accuracy and effectiveness.
Regarding the encoder of SDAMNet, the backbone ResNet101 generates two outputs: the low-level semantic information sent to the decoder and the high-level semantic information processed further in the encoder. The output resolution of the high-level semantics is H/16 × W/16, where H and W denote the height and width of the input image. The DCASPP module then obtains rich feature information at various scales, and global average pooling integrates the global information of the entire image into the feature maps. Next, the obtained feature maps are input into the SCSDM module to further enhance the model's perception of important features and its ability to focus on key areas. This architecture allows the segmentation results to capture target boundaries and details more accurately, substantially enhancing the model's semantic segmentation capability. Finally, the number of channels in the feature map is reduced through a 1 × 1 convolution, and the result is upsampled by a factor of 4 and passed to the decoder.
In the decoder, the output resolution of the low-level semantics is H/4 × W/4, and the number of channels of the low-level semantics is adjusted with a 1 × 1 convolution to match the number of channels in the feature map output by the encoder. Next, the SFFM module fuses the high-level and low-level semantic feature maps, adaptively and selectively combining feature information from different levels and thus exploiting the advantages of both. The fused feature map is then processed by a 3 × 3 convolution and transformed to the original input image size by a ×4 upsampling operation. Finally, the resized feature map is used for prediction. The whole process enables the model to capture detailed information better and improves the accuracy and effectiveness of semantic segmentation.
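As a rough illustration of this data flow (not the authors' code), the following PyTorch-style sketch wires the stages together. Several details are assumptions: the backbone is assumed to return the stride-4 and stride-16 feature maps, the submodule constructors and the 256-channel decoder width are hypothetical, and bilinear interpolation is assumed for the upsampling steps.

```python
import torch.nn as nn
import torch.nn.functional as F


class SDAMNetSketch(nn.Module):
    """Hypothetical sketch of the SDAMNet forward pass described in Section 3.1."""

    def __init__(self, backbone, dcaspp, scsdm, sffm,
                 encoder_channels, low_level_channels, num_classes, width=256):
        super().__init__()
        self.backbone = backbone   # ResNet101; returns low-level (stride 4) and high-level (stride 16) features
        self.dcaspp = dcaspp       # multi-scale context module (Section 3.2)
        self.scsdm = scsdm         # channel/spatial attention module (Section 3.3)
        self.sffm = sffm           # low/high-level fusion module (Section 3.4)
        self.reduce_high = nn.Conv2d(encoder_channels, width, 1)    # 1x1 conv after SCSDM
        self.reduce_low = nn.Conv2d(low_level_channels, width, 1)   # 1x1 conv on low-level features
        self.refine = nn.Conv2d(width, width, 3, padding=1)         # 3x3 conv after fusion
        self.classifier = nn.Conv2d(width, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        low, high = self.backbone(x)                 # H/4 x W/4 and H/16 x W/16 features
        high = self.scsdm(self.dcaspp(high))         # context extraction + attention
        high = self.reduce_high(high)                # adjust the channel count
        high = F.interpolate(high, scale_factor=4, mode="bilinear", align_corners=False)
        low = self.reduce_low(low)                   # match the decoder channel count
        out = self.refine(self.sffm(low, high))      # adaptive fusion + 3x3 refinement
        out = F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)
        return self.classifier(out)                  # per-pixel class logits
```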
3.2. Dilated Convolutional Atrous Spatial Pyramid Pooling
The context information extraction module is crucial in semantic segmentation, significantly enhancing the model's understanding of images and improving segmentation accuracy. Introducing a larger receptive field and strengthening contextual correlation improves the neural network's understanding of the overall image structure and the relationships between pixels. The context information extraction module is particularly effective for handling complex scenes, size variations, and pixel details. An understanding of the global context enables the model to perceive distant objects and related areas, while multi-scale feature fusion helps address different scales and semantic dependencies. At the same time, it reduces spatial information loss, enriches semantic features, and significantly improves the segmentation result.
This paper introduces the DCASPP context information extraction module to enhance the model's understanding of the overall image structure and pixel relationships and to address the demands of complex scenes and multi-scale features. The structure of the DCASPP module in SDAMNet is shown in detail in Figure 4, where "c" denotes concatenation and "Rate" is the dilation rate. This module further strengthens the acquisition and utilization of context information, enhancing the model's capability in complex scenes and in processing image details.
The output of DCASPP is mathematically formulated as follows. Let x be the input feature map, let Concat(a, b) denote the concatenation of feature maps a and b along the channel dimension, let Convd(x) denote a dilated convolution with dilation rate d, let GAP(x) denote global average pooling, let ReLU(Norm(Conv1×1(x))) denote a 1 × 1 convolution followed by normalization and a ReLU activation, and let Upsample(x, size) denote bilinear interpolation upsampling. The output of DCASPP is then

y1 = Concat(Conv3(x), x),
y2 = Concat(Conv6(y1), y1),
y3 = Concat(Conv12(y2), y2),
y4 = Concat(Conv18(y3), y3),
y5 = Concat(Conv24(y4), y4),
g = Upsample(ReLU(Norm(Conv1×1(GAP(x)))), size(x)),
DCASPP(x) = Concat(y5, g).
Initially, the input feature map x undergoes a convolution operation with a dilation rate of 3, resulting in a new feature map. This new feature map is concatenated with the original input feature map x, forming a richer feature representation. The concatenated feature map is then passed to the next layer module as input, where a convolution operation with a dilation rate of 6 is performed. The result of this convolution is concatenated with the previously concatenated feature map to continue integrating multi-scale semantic information. Following this, the concatenated result from the previous step is used as input for convolution operations with dilation rates of 12, 18, and 24 successively, with each result concatenated with the previously concatenated result, continuing this integration process.
The input feature map x subsequently undergoes global average pooling to obtain a feature map containing the overall contextual information. The feature map generated by global average pooling is processed by applying a 1 × 1 convolution, followed by normalization and ReLU activation to adjust the feature values. To integrate the global information with the previous results, bilinear interpolation is employed to upsample the result of global average pooling to the same size as the input feature map x. This upsampled global pooling feature map is concatenated with the previous concatenation result.
Ultimately, a feature map is obtained that integrates multi-scale dilation rate information and global contextual information. This feature map exhibits a more robust expressive capability in semantic segmentation, enabling better comprehension of semantic information for objects at different scales in the image and providing more accurate prediction results for segmentation.
Throughout this process, the multiple convolutions, concatenations, and global average pooling operations gradually enrich the feature representation, expanding the receptive field and enhancing the semantic information.
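A minimal PyTorch-style sketch of this densely concatenated pyramid is given below. It is an illustration rather than the authors' implementation: the 3 × 3 kernel for the dilated convolutions, BatchNorm as the normalization, and the branch width of 256 channels are assumptions not fixed by the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DCASPP(nn.Module):
    """Sketch of the densely concatenated dilated-convolution pyramid with a global branch."""

    def __init__(self, in_channels, branch_channels=256, rates=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList()
        channels = in_channels
        for r in rates:
            # Each branch sees the concatenation of the input and all previous branch outputs.
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, branch_channels, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True),
            ))
            channels += branch_channels
        # Global-context branch: GAP -> 1x1 conv -> norm -> ReLU (upsampled in forward()).
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, branch_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(branch_channels),
            nn.ReLU(inplace=True),
        )
        self.out_channels = channels + branch_channels

    def forward(self, x):
        feat = x
        for branch in self.branches:
            feat = torch.cat([branch(feat), feat], dim=1)   # dense concatenation across rates
        g = self.global_branch(x)
        g = F.interpolate(g, size=x.shape[2:], mode="bilinear", align_corners=False)
        return torch.cat([feat, g], dim=1)                  # fuse multi-scale and global context
```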
3.3. Semantic Channel Space Details Module
Attention mechanisms are essential for high-quality semantic segmentation. They force the model to focus on regions relevant to the task, reduce background interference, and enhance segmentation accuracy. However, conventional attention mechanisms lack operations on multi-scale feature maps and thus cannot handle feature representations and dynamic weighting across different scales. Therefore, the SCSDM attention module (Figure 6) is proposed to address these two limitations simultaneously.
The original feature map X is convolved with kernels of different sizes, producing the corresponding feature maps U1 and U2, which represent information from different receptive fields. Then, U1 and U2 are summed to obtain the fused feature map U, integrating information from multiple receptive fields. U has a size of [c, h, w], where c denotes the number of channels and h and w denote the height and width. Next, global average pooling is applied to U along the height and width dimensions to obtain channel-level information, forming a one-dimensional tensor s of size [c, 1, 1] that indicates the importance of each channel:

U = U1 + U2,
sc = (1/(h × w)) Σi=1..h Σj=1..w Uc(i, j),

where U is the fused feature map, U1 and U2 are the feature maps obtained by convolving with kernels of different sizes, and sc is the global average pooling value of channel c.
A linear transformation maps the original C-dimensional channel information to a D-dimensional space to process this information further, and a subsequent linear transformation remaps the D-dimensional information back to the original C-dimensional space, completing the information extraction in the channel dimension:

z = δ(B(Ws)),

where z is the transformed compact feature vector, δ is the ReLU activation function, B is the batch normalization operation, W is the weight matrix, and s is the channel-level information tensor. Next, the Softmax function normalizes the attention scores, providing for each channel a pair of scores, ac and bc, that indicate its importance. The resulting attention scores are multiplied with the corresponding U1 and U2 to obtain V:

Vc = ac · U1,c + bc · U2,c, with ac + bc = 1,

where V is the generated feature map.
Applying attention weighting to the feature maps at the various scales provides the final feature map V, which integrates information from different receptive fields. Compared to the initial feature map X, V contains refined information integrated from multiple receptive fields, enhancing the model's ability to perceive key regions in semantic segmentation and thereby improving the accuracy and performance of the segmentation process.
Next, V undergoes average pooling to obtain avg_out and max pooling to obtain max_out:

avg_out = AvgPool(V), max_out = MaxPool(V).

The resulting avg_out and max_out are concatenated to form V′, which undergoes a convolution operation to learn the attention weights for different spatial positions:

V′ = Concat(avg_out, max_out), V″ = Conv(V′).

The output V″ is then normalized using the sigmoid function to obtain the attention weights attention_weights for the different spatial positions:

attention_weights = σ(V″),

where σ represents the sigmoid function. By applying these attention weights to the input feature map V, each channel is multiplied by the corresponding attention weights (V ⊗ attention_weights, where ⊗ denotes element-wise multiplication) to introduce spatial attention.
This spatial attention mechanism enhances the model’s perception capability and prioritizes information from different spatial positions within the input feature map V. Consequently, during tasks such as semantic segmentation, the model focuses more effectively on crucial spatial regions, improving segmentation accuracy and effectiveness.
Ultimately, through the aforementioned operations, the multi-scale receptive field information and the attention weights from different spatial positions are integrated, enhancing the model's perception of crucial regions. This further strengthens semantic understanding and segmentation capabilities, improving overall performance.
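The following PyTorch-style sketch summarizes the two stages of SCSDM (multi-scale channel weighting followed by spatial attention). It is illustrative only: the two kernel sizes (a 3 × 3 convolution and a dilated 3 × 3 convolution), the reduction ratio for the C→D mapping, and the 7 × 7 spatial convolution are assumptions, since the text above does not fix these hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SCSDM(nn.Module):
    """Sketch of SCSDM: selective multi-scale channel attention followed by spatial attention."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        d = max(channels // reduction, 32)
        # Two receptive fields (assumed: 3x3 and dilated 3x3, i.e., an effective 5x5 field).
        self.conv3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.conv5 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Channel squeeze (C -> D) and per-branch excitation (D -> C).
        self.fc_z = nn.Sequential(nn.Conv2d(channels, d, 1, bias=False),
                                  nn.BatchNorm2d(d), nn.ReLU(inplace=True))
        self.fc_a = nn.Conv2d(d, channels, 1)
        self.fc_b = nn.Conv2d(d, channels, 1)
        # Spatial attention over channel-wise average and max maps.
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        u1, u2 = self.conv3(x), self.conv5(x)
        u = u1 + u2                                    # fuse the two receptive fields
        s = F.adaptive_avg_pool2d(u, 1)                # [B, C, 1, 1] channel descriptor
        z = self.fc_z(s)                               # compact feature vector
        scores = torch.softmax(torch.stack([self.fc_a(z), self.fc_b(z)], dim=0), dim=0)
        v = scores[0] * u1 + scores[1] * u2            # channel-weighted multi-scale fusion
        avg_out = v.mean(dim=1, keepdim=True)          # [B, 1, H, W]
        max_out = v.max(dim=1, keepdim=True).values    # [B, 1, H, W]
        weights = torch.sigmoid(self.spatial_conv(torch.cat([avg_out, max_out], dim=1)))
        return v * weights                             # spatial attention applied to every channel
```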
3.4. Semantic Features Fusion Module
Feature fusion is crucial for semantic segmentation, as it effectively integrates features from different levels or branches and addresses issues such as the lack of semantic information in low-level features and the lower resolution of high-level features. Feature fusion enables the model to comprehensively capture image details and semantic information, significantly improving segmentation performance, particularly when handling complex scenes and details. It is equally important in related deep learning tasks, improving performance in object detection and image segmentation.
To further optimize the feature fusion process, this paper develops the SFFM fusion module illustrated in Figure 7. SFFM correlates features at different levels and combines information of different scales and semantics to improve the quality of feature fusion and enhance the model's performance in the semantic segmentation task. This design fully leverages the advantages of both feature types, effectively compensates for their deficiencies, and enhances the model's ability to capture target shapes and semantic information, thus improving the performance of semantic segmentation models.
The output of SFFM is computed as follows. The SFFM first performs an addition operation on the input features to obtain the fused feature xa. Then, the Channel Attention (CA) module computes the channel attention weights, aggregating feature information through pooling and convolutional operations to enhance the perception of crucial semantic information. CA also aids in accurately distinguishing pixels of different categories, thereby enhancing segmentation accuracy. In parallel, the Spatial Attention (SA) module computes the spatial attention weights: average pooling and max pooling capture spatial information, and convolutional operations then generate attention maps emphasizing the importance of the various spatial locations, thereby enhancing the perception of features at different scales. This is important because improving the model's perception of object shapes and positions increases the accuracy and robustness of semantic segmentation models.
Merging the channel attention and spatial attention yields the attention map θ. The low-level semantic features are multiplied by θ, the high-level semantic features are multiplied by 1 − θ, and the two products are added to obtain the weighted features xo. The weighted features xo, which contain the information fused from the channel and spatial attention, are passed through a convolutional layer to further extract features and enhance the model's representational capacity, and the convolution output is added back to xo. This addition integrates the feature information emphasized by the attention mechanism with the feature information extracted by the convolutional operation, yielding a richer and more accurate semantic representation.
The entire design concept and structure enable the model to integrate low-level and high-level features effectively. Furthermore, by focusing on information from different channels and spatial positions through the channel and spatial attention, the model enhances its performance in semantic segmentation. This process improves the perception of features at different scales and positions, thereby enhancing accuracy and generalization capability.
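A PyTorch-style sketch of the fusion path described above is given below. It is a hypothetical reading of the text, not the authors' code: the pooling/convolution layout inside CA and SA, the merging of the two attention maps by summation before the sigmoid, and the residual addition of the convolution output to xo are assumptions consistent with the description.

```python
import torch
import torch.nn as nn


class SFFM(nn.Module):
    """Sketch of SFFM: attention-guided fusion of low-level and high-level features."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention (CA): global pooling followed by 1x1 convolutions.
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention (SA): convolution over channel-wise average and max maps.
        self.sa = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)
        # Convolutional refinement of the weighted features xo.
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, low, high):
        xa = low + high                                      # initial fusion of the two inputs
        ca = self.ca(xa)                                     # [B, C, 1, 1] channel weights
        sa = self.sa(torch.cat([xa.mean(dim=1, keepdim=True),
                                xa.max(dim=1, keepdim=True).values], dim=1))  # [B, 1, H, W]
        theta = torch.sigmoid(ca + sa)                       # merged attention map
        xo = low * theta + high * (1.0 - theta)              # attention-weighted fusion
        return xo + self.refine(xo)                          # add back the convolved features
```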