1. Introduction
Agricultural parcels are fundamental units in agricultural practice and applications [1], serving as the essential material basis for agricultural production and food security [2,3]. The accurate identification and localization of these parcels are critical for crop recognition, yield estimation, and the strategic allocation of agricultural resources [4,5]. In recent years, remote sensing imagery has become the primary tool for extracting agricultural parcels [6,7,8]. While actual parcels can be easily distinguished by clear boundaries formed by physical features such as ditches and roads, the complex spectral, structural, and textural characteristics of land features in remote sensing images present significant challenges for accurate parcel extraction [9,10,11].
Traditional manual methods have enabled the extraction of agricultural parcels [12,13]. These methods can be categorized into three types: edge detection [14,15,16,17,18], region segmentation [19,20,21,22,23,24,25], and machine learning [26,27]. However, these methods are often time-consuming, labor-intensive, and perform poorly in complex scenarios and tasks.
Deep learning, with its ability to automatically learn features, has revolutionized remote sensing applications [28,29,30,31,32,33]. Methods based on Convolutional Neural Networks (CNNs) [34,35,36,37,38,39] and Fully Convolutional Networks (FCNs) [40,41,42,43] have shown great potential. However, spatial diversity results in agricultural parcels having complex shapes and sizes, and remote sensing imagery is often affected by complex backgrounds such as grasslands and bare land. Consequently, existing methods face three main issues: first, difficulty in preserving the unique morphological characteristics of agricultural parcels; second, an inability to ensure the high integrity of extraction results; and third, the challenge of balancing boundary and other detailed morphological features while maintaining high completeness.
To enhance the morphological accuracy of agricultural parcels, some studies have employed instance segmentation and multi-task learning methods. For example, Potlapally et al. [44] utilized Mask R-CNN for instance segmentation, which improved the precision of parcel morphology by independently identifying the boundaries of each parcel. However, these methods exhibit certain limitations when dealing with complex backgrounds and variations in scale. Multi-task learning methods such as ResUNet-a [45] and SEANet [46] have improved morphological accuracy by jointly learning features for different tasks. Nevertheless, these methods often increase the complexity of model training and lack direct task correlations. Although instance segmentation and multi-task learning have somewhat enhanced the morphological accuracy of agricultural parcels, issues such as model complexity, sensitivity to background noise, and reliance on multi-task features result in incomplete extraction outcomes.
To achieve high integrity in segmentation results, some studies have explored constructing networks capable of capturing contextual information using Transformer [47] technology [48,49]. In building extraction, BuildFormer [50] significantly improved the accuracy of building detection by utilizing window-based linear tokens, convolution, MLP, and batch normalization. Chen et al. [51] introduced a dual-channel Transformer framework that achieves more complete building segmentation by leveraging long-distance dependencies in spatial and channel dimensions. Xiao et al. [52] developed the Swin-Transformer with a sliding window mechanism, resulting in more complete segmentation outcomes. Although these methods, primarily based on Transformer networks, have enhanced the completeness of segmentation results to some extent, they face challenges in complex scenes, such as the loss of boundary morphology and detailed textures.
To ensure high integrity while preserving certain morphological characteristics, some researchers have begun exploring hybrid models. For instance, Wang et al. [53] integrated Transformer technology into the traditional CNN framework and developed the CCTNet model for barley segmentation in remote sensing images. Xia et al. [54] proposed a Dual-Stream Feature Extraction network that integrates CNN and Transformer technologies to fuse boundary and semantic information, achieving superior results on multiple building datasets. WiCoNet [55] combines CNN and Transformer to fuse global and local information, achieving strong performance on the BLU, GID, and Potsdam remote sensing datasets. STranFuse [56] combines the Swin Transformer with convolutional networks and uses an adaptive fusion module to manage feature representations across different semantic scales, achieving significant performance improvements on the Vaihingen dataset. Wang et al. [57] proposed a dual-stream hybrid structure based on SAM to achieve the fusion of local and global information. However, these combinations of Transformer and CNN typically involve a simple integration of global and local information without specific feature analysis, making it challenging to balance morphological characteristics and segmentation completeness.
Based on the limitations of existing methods in agricultural parcel extraction, this study proposes DSTBA-Net, a segmentation network designed for agricultural parcel extraction. DSTBA-Net processes image and boundary data through Dual-Stream Feature Extraction (DSFE) and effectively fuses these data using a Transformer-dominated Global Feature Fusion Module (GFFM), enhancing boundary morphology and the integrity of extraction results. The decoder employs Feature Compensation Restoration (FCR) to reduce information loss. We propose a boundary-aware weighted loss algorithm to optimize boundary segmentation results. Experimental results demonstrate that DSTBA-Net performs exceptionally well on Danish and Shandong agricultural parcel datasets, exhibiting good generalization ability and robustness.
The main contributions of this study are as follows:
- (1) DSTBA-Net, a novel segmentation network framework designed to accurately extract agricultural parcels from remote sensing images, is proposed.
- (2) Dual-Stream Feature Extraction (DSFE) is designed to perform multi-level feature extraction on image and boundary data, guiding the model to focus on image edges, thereby preserving the unique morphological characteristics of parcels.
- (3) A Transformer-dominated Global Feature Fusion Module (GFFM) is designed to effectively capture long-distance dependencies and merge them with detailed features, enhancing the completeness of feature extraction.
- (4) A boundary-aware weighted loss algorithm is designed to balance the weights of image interiors and edges, effectively improving feature discrimination.
2. Methodology
This study proposes a semantic segmentation network, termed DSTBA-Net, for extracting agricultural parcels from remote sensing images. The network adopts an encoder–decoder architecture, where the encoder consists of a Dual-Stream Feature Extraction (DSFE) mechanism designed for both image and boundary data, and a Global Feature Fusion Module (GFFM). The decoder achieves accurate upsampling through Feature Compensation Restoration (FCR). Unlike conventional CNN-based algorithms, this study employs a Transformer network to construct the GFFM, facilitating the effective integration of global and detailed information. This approach not only addresses the limitations of using convolutional neural networks alone in handling remote dependencies but also resolves the deficiency of Transformer networks in capturing low-level detail information. The segmentation framework proposed in this study is illustrated in Figure 1.
Additionally, to address challenges arising from boundary imprecision and absence, we propose a boundary-aware weighted loss algorithm. This algorithm incorporates an effective Dice loss function that emphasizes boundary regions. By integrating this function with a weighted binary cross-entropy loss, the network achieves a refined segmentation performance.
2.1. Framework Introduction
The segmentation network framework utilized in this study adopts an end-to-end encoder–decoder architecture. Within this framework, the network processes both image and boundary data simultaneously through a Dual-Stream Feature Extraction (DSFE) mechanism. Image data are processed using an embedded Residual Block to extract complex image features. In contrast, boundary data are captured through an external Boundary Feature Guidance (BFG) mechanism, which flexibly delineates boundary-specific features. Notably, boundary data are obtained from ground truth parcel labels using morphological dilation techniques. Subsequently, in the final segment of the encoder, a Global Feature Fusion Module (GFFM) is employed to construct long-range dependency relationships, facilitating the effective fusion of detailed and global features. To mitigate information loss during the upsampling process, the Feature Compensation Restoration (FCR) technique is applied to accomplish hierarchical upsampling tasks. Furthermore, the model output is refined by utilizing a boundary-aware weighted loss algorithm, thereby enhancing boundary optimization. The detailed network design is depicted in Figure 2.
In the encoder, as depicted in Figure 3, we have designed a Dual-Stream Feature Extraction (DSFE) mechanism specifically for processing image and boundary data using separate Residual Blocks and Boundary Feature Guidance (BFG).
Specifically, RGB image data with dimensions H and W undergo sequential operations including 7 × 7 convolution, group normalization, ReLU activation, and 3 × 3 max-pooling. Subsequently, an enhanced deep residual network [58] consisting of three blocks, each detailed in Figure 4b, is employed. After being processed by the Residual Blocks, the image feature maps are reduced to 1/16 of the original size in both height and width, with a channel count of 1024, effectively extracting and enhancing high-level image feature representations.
Simultaneously, grayscale boundary data with dimensions H and W undergo a series of operations via BFG, including 3 × 3 convolution, three consecutive 3 × 3 convolutions followed by 2 × 2 max-pooling, a 3 × 3 convolution layer, average pooling, and ReLU activation. The resulting boundary feature maps are also reduced to 1/16 of the original size, with a channel count of 1024. The multiple convolution stages and pooling operations of BFG effectively refine the feature representation of grayscale boundary data, highlighting significant structural elements and spatial relationships within the boundaries. These refinements contribute to achieving precise segmentation and analysis.
Ultimately, through convolution operations, the model captures high-dimensional feature maps that encompass both image texture and boundary information. These feature maps serve as the foundational input for subsequent operations in the Global Feature Fusion Module and decoding section. DSFE integrates the processing requirements of both image and boundary data. By applying Residual Blocks and Boundary Feature Guidance (BFG), the model effectively integrates image and boundary features, significantly enhancing its ability to understand complex scenes and improve segmentation accuracy.
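As a shape-level illustration of the DSFE image stream described above, the following sketch tracks the feature-map dimensions through the stem and the three residual blocks. The per-stage strides and intermediate channel counts are assumptions following the standard ResNet layout; the paper only states the final 1/16 spatial scale and 1024 channels.

```python
def out_size(size, stride):
    """Spatial size after a stride-s downsampling layer (assuming
    'same'-style padding, so only the stride changes the resolution)."""
    return size // stride

def dsfe_image_stream_shapes(h, w):
    """Track (height, width, channels) through the assumed image stream."""
    shapes = []
    h, w = out_size(h, 2), out_size(w, 2)          # 7x7 conv, stride 2 (assumed)
    shapes.append(("7x7 conv + GN + ReLU", (h, w, 64)))
    h, w = out_size(h, 2), out_size(w, 2)          # 3x3 max-pool, stride 2 (assumed)
    shapes.append(("3x3 max-pool", (h, w, 64)))
    # Three residual blocks; channel counts/strides are ResNet-style guesses.
    for channels, stride in [(256, 1), (512, 2), (1024, 2)]:
        h, w = out_size(h, stride), out_size(w, stride)
        shapes.append((f"residual block -> {channels}ch", (h, w, channels)))
    return shapes

shapes = dsfe_image_stream_shapes(256, 256)
# Final map is 1/16 of the input in each spatial dimension, with 1024 channels.
print(shapes[-1])  # ('residual block -> 1024ch', (16, 16, 1024))
```

Under these assumed strides, a 256 × 256 input ends at 16 × 16 × 1024, matching the 1/16 reduction and channel count stated in the text.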
2.2. Global Feature Fusion Module (GFFM)
We propose a Global Feature Fusion Module (GFFM), led by a Transformer, to integrate global contextual information from feature maps containing abundant detailed information; the specific design is shown in Figure 5. Initially, the module captures the two classes of features from the image and the boundary through a stacking operation, followed by a series of auxiliary operations to reshape the high-dimensional feature map into a sequence of one-dimensional patch vectors $\{x_p^i \in \mathbb{R}^{P^2 \cdot C}\}_{i=1}^{N}$, where each patch has a size of $P \times P$ and $N$ denotes the sequence length. Subsequently, trainable linear projections map the vectorized patches to a $D$-dimensional embedding space. Specific positional embeddings are incorporated to retain positional information. Mathematically, this is represented as $z_0 = [x_p^1 E; x_p^2 E; \dots; x_p^N E] + E_{pos}$, where $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the patch embedding projection and $E_{pos} \in \mathbb{R}^{N \times D}$ is the positional embedding.
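As an illustrative sketch (not the authors' code), the patch embedding described above can be written in NumPy. The sizes `Hp`, `Wp`, `C`, `P`, and `D` below are placeholder assumptions, and the projection matrices are random stand-ins for trainable parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
Hp, Wp, C, P, D = 16, 16, 64, 2, 768   # assumed feature-map and patch sizes

x = rng.standard_normal((Hp, Wp, C))   # stacked image + boundary features

# Split into N = (Hp/P) * (Wp/P) patches and flatten each to length P^2 * C.
N = (Hp // P) * (Wp // P)
patches = x.reshape(Hp // P, P, Wp // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)

# Trainable linear projection E and positional embedding E_pos (random here).
E = rng.standard_normal((P * P * C, D)) * 0.01
E_pos = rng.standard_normal((N, D)) * 0.01

z0 = patches @ E + E_pos               # z_0, the sequence fed to the Transformer
print(z0.shape)                        # (64, 768)
```

The result is a length-`N` sequence of `D`-dimensional tokens, one per patch, with position information added before the Transformer layers.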
Then, for the reshaped sequence, this module employs a Transformer to establish long-range dependencies, aiming to generate feature maps containing global contextual information. Specifically, the Transformer consists of $L$ layers of multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks. The output of the $\ell$-th layer is expressed as $z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}$ and $z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z'_{\ell})) + z'_{\ell}$. Here, $\mathrm{LN}(\cdot)$ denotes the layer normalization operator, and $z_{\ell}$ represents the encoded image representation. The one-dimensional vector input to the Transformer block undergoes this operation 12 times. Finally, the model reshapes the features back into a two-dimensional feature map through a series of operations.
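A minimal NumPy sketch of one such pre-norm layer follows, using a single attention head and random weights purely for illustration; the dimensions are placeholders, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 8, 16                       # sequence length, embedding dim (illustrative)

def layer_norm(z, eps=1e-5):
    mu = z.mean(-1, keepdims=True)
    var = z.var(-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def softmax(a):
    a = a - a.max(-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((D, 4 * D)) * 0.1
W2 = rng.standard_normal((4 * D, D)) * 0.1

def msa(z):
    """Single-head self-attention (a multi-head version splits D into heads)."""
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))
    return (attn @ v) @ Wo

def mlp(z):
    return np.maximum(z @ W1, 0.0) @ W2

def transformer_layer(z):
    z = msa(layer_norm(z)) + z     # z' = MSA(LN(z)) + z
    z = mlp(layer_norm(z)) + z     # z  = MLP(LN(z')) + z'
    return z

z = rng.standard_normal((N, D))
for _ in range(12):                # the paper applies the block 12 times
    z = transformer_layer(z)
print(z.shape)                     # (8, 16)
```

The residual connections leave the sequence shape unchanged, so after 12 layers the tokens can be reshaped back into a 2D feature map for the decoder.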
2.3. Feature Compensation Restoration (FCR)
The purpose of image restoration is to transform feature maps from the feature space to the image space through convolutional layers. During the process of image recovery, relying solely on convolutional operations may lead to the loss of important information. To mitigate information loss in the feature maps after multiple convolutional layers, we adopt the Feature Compensation Restoration (FCR) design. In the decoder, we introduce three skip connections, utilizing multi-level features from dual feature extraction to compensate for boundary features. Specifically, we use a padding strategy to select boundary features at the pixel level and add appropriate padding values around the boundary pixels to achieve information restoration. Finally, in the feature restoration module, we concatenate these compensated features along the channels to the global contextual features. Consequently, the fused image is recovered through the Feature Compensation Restoration module.
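The paper's padding strategy is only loosely specified, so the following NumPy sketch is a hypothetical rendering of one FCR step: a simple max filter (`compensate`) stands in for adding padding values around boundary pixels, and the compensated skip features are concatenated channel-wise with upsampled global features.

```python
import numpy as np

def nearest_upsample(x, factor=2):
    """Nearest-neighbour upsampling of an (H, W, C) feature map."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def compensate(skip, pad=1):
    """Spread each boundary response to its pad-neighbourhood (a max filter),
    a stand-in for the paper's padding strategy around boundary pixels."""
    H, W, C = skip.shape
    padded = np.pad(skip, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(skip)
    for dy in range(2 * pad + 1):
        for dx in range(2 * pad + 1):
            out = np.maximum(out, padded[dy:dy + H, dx:dx + W, :])
    return out

def fcr_step(global_feat, skip_feat):
    up = nearest_upsample(global_feat)              # restore resolution
    comp = compensate(skip_feat)                    # boundary compensation
    return np.concatenate([up, comp], axis=-1)      # channel-wise fusion

g = np.zeros((8, 8, 64))     # coarse global features (illustrative shapes)
s = np.zeros((16, 16, 32))   # encoder skip features at the target scale
fused = fcr_step(g, s)
print(fused.shape)           # (16, 16, 96)
```

Three such steps, one per skip connection, would progressively recover the full-resolution prediction.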
2.4. Boundary-Aware Weighted Loss
For the final classification task of mask prediction, we combine the binary cross-entropy (BCE) loss with a Dice loss designed around the boundary area to handle class imbalance and instability. For $N$ samples, $y_i$ and $p_i$ denote the ground truth label probability and the predicted probability of sample $i$. The definition of the BCE loss ($L_{BCE}$) is as follows:

$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
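For illustration, the BCE term can be computed as follows; the clipping constant `eps` is a standard numerical-stability addition, not part of the paper's definition.

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Binary cross-entropy over ground truth y and predictions p."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1.0, 0.0, 1.0, 0.0])   # ground truth labels
p = np.array([0.9, 0.1, 0.8, 0.2])   # predicted probabilities
print(round(bce_loss(y, p), 4))
```

Confident predictions on the correct class drive the loss toward zero, while confident errors are penalized heavily.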
Inspired by the Dice loss function and recognizing the high demand for boundary segmentation in agricultural parcel extraction tasks, we have devised a Dice loss function based on the boundary area, termed Boundary Dice Loss ($L_{BDice}$). The Dice loss function quantifies the similarity between predicted and ground truth regions by evaluating the ratio of their intersection to their union, thereby assessing segmentation accuracy. This approach guides the model to better comprehend the characteristics of boundary areas in agricultural parcels, enhancing the precision of edge segmentation. As depicted in Figure 6a, the light purple area denotes the predicted region $P_i$ for class $i$, while the deep blue area represents the true region $G_i$. To avoid situations where both the numerator and denominator are 0, we introduce a small constant $\epsilon$:

$$L_{BDice} = 1 - \frac{2\,|P_i \cap G_i| + \epsilon}{|P_i| + |G_i| + \epsilon}$$
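A sketch of this boundary Dice term on small binary masks follows; the soft, product-based intersection (which also accepts probability maps) is an assumption.

```python
import numpy as np

def boundary_dice_loss(pred, true, eps=1e-7):
    """Dice loss over boundary masks, with eps guarding the empty-mask case."""
    inter = np.sum(pred * true)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(true) + eps))

pred = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]], dtype=float)  # predicted boundary
true = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=float)  # true boundary
# intersection = 2, |pred| = 3, |true| = 3  ->  loss = 1 - 4/6 = 1/3
print(round(boundary_dice_loss(pred, true), 4))  # 0.3333
```

When both masks are empty the epsilon terms make the ratio 1, so the loss is 0 rather than undefined.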
Finally, we introduce weight parameters $\alpha$ and $\beta$ to balance the two terms; after multiple tests, a specific setting of $\alpha$ and $\beta$ was confirmed to yield the best results. The final loss is calculated as follows:

$$L = \alpha L_{BCE} + \beta L_{BDice}$$
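Putting the two terms together, the weighted sum can be sketched as follows; the `alpha` and `beta` values here are placeholders, not the paper's tuned setting.

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def boundary_dice_loss(pred, true, eps=1e-7):
    inter = np.sum(pred * true)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(true) + eps))

def total_loss(y, p, b_true, b_pred, alpha=0.5, beta=0.5):
    """L = alpha * L_BCE + beta * L_BDice; alpha/beta are placeholder weights."""
    return alpha * bce_loss(y, p) + beta * boundary_dice_loss(b_pred, b_true)

y = np.array([1.0, 0.0])    # mask ground truth
p = np.array([0.9, 0.2])    # mask predictions
bt = np.array([1.0, 0.0])   # boundary ground truth
bp = np.array([0.8, 0.1])   # boundary predictions
loss = total_loss(y, p, bt, bp)
print(loss > 0)              # True
```

In training, the gradient of this sum pushes the network to classify interior pixels correctly (BCE term) while keeping predicted boundaries overlapping the true ones (boundary Dice term).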