1. Introduction
Rice is one of the most important staple foods in the world. The price of rice seeds varies greatly depending on variety, origin and quality, so the quality of rice seeds needs to be measured during marketing and processing. The traditional rice seed quality inspection approach is performed manually by experts, and it is time-consuming and easily influenced by subjective factors. In recent years, nondestructive rice seed inspection based on optical image processing has received widespread attention. Accurately segmenting rice seeds, which have similar physical characteristics, small particles and dense distributions, from their backgrounds is critical for accurately determining rice seed quality. Two types of traditional image segmentation methods are available: threshold-based methods and graph theory-based methods. Threshold-based segmentation methods calculate one or more thresholds based on the greyscale features of an image and assign pixels to appropriate classes based on these thresholds. Typical methods include adaptive thresholding [
1], maximum entropy thresholding [
2] and Otsu thresholding [
3], but these methods use only the greyscale information from a given image, which inevitably causes them to misclassify the noise in the background and thus affects their accuracy. Graph theory-based methods solve the segmentation problem by optimizing an objective function. Typical methods include the one-cut [
4], normalized cut [
5], min–max cut [
6] and graph cut [
7] approaches. These methods are insensitive to the shape of the target and are too computationally intensive to be suitable for rice seed segmentation.
In recent years, hyperspectral detection technology [
8,
9,
10,
11,
12,
13] has achieved good results in target segmentation and quality detection tasks by simultaneously using image information and spectral information from seeds. For example, in 2020, S. D. Fabiyi et al. [
14] combined hyperspectral image data and high-resolution RGB image data to segment rice seeds and more accurately acquire their spatial and spectral information to determine their categories, achieving a maximum recognition rate of 98.59% across six rice species. In 2021, Liu et al. [
15] used hyperspectral techniques to segment and classify four rice seed images with an average recognition rate of 98%. Jun Zhang et al. [
16] used hyperspectral techniques to obtain the spatial and spectral information for maize seeds with different degrees of frostbite, selected the best segmentation wavelength for the maize seeds and identified the degree of frostbite exhibited by them based on the segmentation information, achieving an accuracy of 97.5% corresponding to five degrees of frostbite. Although performing image segmentation with hyperspectral techniques yields high accuracy, it also has the disadvantages of requiring expensive acquisition devices and large volumes of image data, a slow acquisition speed and susceptibility to environmental influences, making it difficult to satisfy the demand for a large-scale, low-cost quality inspection method.
Instance segmentation [
17] based on deep learning technology [
18] can quickly and accurately solve complex problems and realize large-scale image data processing, so it has become one of the most popular research directions in the field of computer vision. Instance segmentation is widely used in many fields, such as industry, transportation, remote sensing images and agriculture. For example, in 2022, in the industrial field, Antwi-Bekoe et al. [
19] established an instance segmentation network for segmenting industrial insulator defects by exploiting cross-dimension information interactions, and the segmentation accuracy achieved for insulator defects reached 89.4%. In the field of transportation, de Carvalho et al. [
20] generated a vehicle dataset via semisupervised iterative learning and proposed a box-free instance segmentation method, which achieved 90% accuracy on the generated dataset. In 2023, Chen et al. [
21] proposed a large-scale building extraction framework in the field of remote sensing images based on super-resolution and instance segmentation to perform instance segmentation for large-scale buildings in cities, and the building segmentation accuracy achieved for satellite images reached 84%. In the agricultural field, Borrenpohl et al. [
22] used instance segmentation techniques to segment cherry trees and their dormant fruits in images captured under active and natural illumination, and the image segmentation accuracy of this approach reached 94%. The applications of instance segmentation in these fields offer significant advantages, including low cost, high speed, high accuracy and the capacity for large-scale intelligent quality inspection. Despite these advantages, a rice seed instance segmentation method has not yet been reported, mainly because the similar characteristics, small particles and dense distributions of rice seeds make it difficult to accurately detect their quality and segment them.
The existing instance segmentation methods include Mask-R-CNN [
23], MNC [
24], FCIS [
25], RetinaMask [
26], SOLOv2 [
27], Boxinst [
28], RTMDet [
29], Condinst [
30], etc. Mask-R-CNN is very influential in instance segmentation. Based on Faster-R-CNN [
31], Mask-R-CNN adds a new mask branch in the region of interest (ROI), and its classification branch and box branch are processed in parallel. Each ROI mask branch is a fully convolutional structure that predicts the mask at the pixel level. Another instance segmentation method, HTC [
32], was inspired by Mask-R-CNN and Cascade-R-CNN [
33]; it designs a new cascade structure on the basis of Cascade-R-CNN that fully exchanges feature map information, interleaves mask prediction with box prediction and introduces a mask information path from a newly added semantic segmentation branch. The semantic segmentation branch is integrated with the mask branch to obtain richer contextual information and to supplement the feature information of the existing mask branch. These methods improve the accuracy of target segmentation, but their inference speeds remain slow.
The Yolact [
34] method divides the instance segmentation task into two subtasks that are processed in parallel, which greatly improves the inference speed of the model, but its segmentation accuracy for small targets is limited. Detection difficulty varies with target size; according to the definition used by the COCO dataset [
35], targets with a pixel area of less than 32² are defined as small targets, targets with a pixel area between 32² and 96² are defined as medium targets and targets with a pixel area greater than 96² are defined as large targets. The above methods have addressed the segmentation accuracy and inference speed achievable for targets and have improved the accuracy attained for large and medium-sized targets; however, their instance segmentation performance on small targets, such as rice seeds, is worse, mainly because rice seeds have dense distributions, small particles and similar physical characteristics, making them difficult to distinguish.
To address the problem of precisely segmenting small rice seed targets in nondestructive rice inspection, a network called the Swin Yolact network (SY-net) is proposed for rice seed instance segmentation; it adopts a Swin-based backbone and Yolact's idea of parallelized subtasks. SY-net is built from four main modules that enhance its rice seed detection and segmentation performance: a feature extraction module, a feature pyramid fusion module, a prediction head module and a prototype mask generation module. SY-net's ability to learn the features of rice seeds is enhanced by the use of a Swin-T [
36] structure with a self-attentive mechanism [
37,
38] as a feature extractor in the feature extraction module. In the feature pyramid fusion module, a six-layer feature pyramid network is employed to fuse the feature maps from different stages, and smaller detection scales are used on the larger feature maps to enhance the ability of the network to detect rice seeds. Furthermore, a parallel prediction head structure consisting of fully convolutional layers is used in the prediction head module to make full use of the feature maps output by each convolutional layer, and a larger feature map is used in the prototype mask generation module to generate the prototype masks. The composition of this network structure is described in detail in Section 2; the training strategy, experimental results and analyses are presented in Section 3; and the effectiveness of the method for rice seed segmentation tasks is summarized in Section 4.
2. Materials and Methods
The overall architecture of SY-net, shown in
Figure 1, consists of six parts; namely, a transformer backbone, a feature pyramid network (FPN), a prediction head, a prototype mask net, a nonmaximum suppression (NMS) mechanism and a combination module. The transformer backbone is the feature extractor module, and the FPN is the feature pyramid fusion module. The prediction head and prototype mask network are the prediction head module and the prototype mask generation module, respectively, both of which are composed of fully convolutional modules. C1, C2, C3 and C4 are the feature map outputs of the transformer backbone at four stages (stages one to four), with sizes of 96 × 160 × 160 pixels, 192 × 80 × 80 pixels, 384 × 40 × 40 pixels and 768 × 20 × 20 pixels, respectively. P1–P6 are the feature map outputs of the FPN at six stages, and P4 comes from C4 and is downsampled to generate P5 and P6. P4 is upsampled and fused with C3 (this is the ⊕ operation in the figure, representing the feature fusion operation) to generate P3, P3 is upsampled and fused with C2 to generate P2 and P2 is upsampled and fused with C1 to generate P1. The sizes of P1–P6, which are the input feature maps of the prediction head, are 256 × 160 × 160 pixels, 256 × 80 × 80 pixels, 256 × 40 × 40 pixels, 256 × 20 × 20 pixels, 256 × 10 × 10 pixels and 256 × 5 × 5 pixels, and P1 is the input feature map of the prototype mask network. The prediction head outputs the prediction value, and the prototype mask network outputs the prototype mask. The NMS [
39] module filters the predicted values generated by the prediction head. The combination module combines the outputs of the prediction head and prototype mask network in a linear manner to generate the final instance mask.
Figure 1 shows the 32 prototype masks generated by the prototype mask generation network and the final instance mask obtained from the combination module.
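To make this data flow concrete, the following is a minimal sketch of the fusion path described above (layer types and the use of stride-2 convolutions for downsampling are illustrative assumptions, not the authors' implementation):

```python
# Minimal sketch of the Figure 1 fusion flow: P4 from C4, P5/P6 by
# downsampling P4, and P3-P1 by upsampling and fusing with C3, C2 and C1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_chs=(96, 192, 384, 768), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_chs])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in range(3)])
        self.down = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, c1, c2, c3, c4):
        p4 = self.lateral[3](c4)                                                       # 768 -> 256, 20x20
        p3 = self.smooth[2](self.lateral[2](c3) + F.interpolate(p4, scale_factor=2))   # 40x40
        p2 = self.smooth[1](self.lateral[1](c2) + F.interpolate(p3, scale_factor=2))   # 80x80
        p1 = self.smooth[0](self.lateral[0](c1) + F.interpolate(p2, scale_factor=2))   # 160x160
        p5 = self.down(p4)                                                             # 20x20 -> 10x10
        p6 = self.down(p5)                                                             # 10x10 -> 5x5
        return p1, p2, p3, p4, p5, p6

feats = [torch.randn(1, c, s, s) for c, s in zip((96, 192, 384, 768), (160, 80, 40, 20))]
p1, p2, p3, p4, p5, p6 = SimpleFPN()(*feats)   # six 256-channel maps, as listed above
```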
2.1. Dataset
In this study, one private rice seed segmentation dataset and two public datasets (Pascal SBD [
40] and MS COCO2017) were used to verify SY-net’s instance segmentation capability for small targets. The private rice seed segmentation dataset contains eight categories and 2263 images. As shown in
Figure 2, the categories of rice are 884, Changxiang, Danhan, Daohuaxiang, Yinuo No. 9, glutinous, Liannuo and Xiangsijingzhan rice. The images were captured with a mobile phone camera (Honour LLD-AL10, 1300 resolution; Shenzhen Zhixin New Information Technology Co., Ltd., Shenzhen, China), and the lighting conditions and image capture settings (i.e., the background, focal length, capture angle and distance from the camera to the sample) were varied randomly. The data were divided into a training set and a validation set at a ratio of 0.9:0.1. The Pascal SBD dataset contains 20 categories with 11,000 images, and MS COCO2017 contains 80 categories with 123,000 images.
2.2. Feature Extraction Module
The physical features of rice seeds are similar, and their details are not obvious, so it is difficult for existing feature extractors to focus on the key features of rice seeds. In this study, a feature extractor was therefore built using the transformer architecture. Its attention mechanism enhances the learning of the physical features of small targets such as rice seeds and improves the extraction of their fine details from different perspectives. The structure of the feature extraction network, the transformer backbone, is shown in
Figure 3. In this figure, the transformer backbone consists of three large modules; namely, a patch embedding module (bright cyan cuboid), a patch merging module (light green cuboid) and a basic transformer block (light yellow cuboid). Stages 1–4 correspond to C1–C4 in
Figure 1, respectively.
In feature extraction networks, the features of small targets gradually disappear under repeated convolution, and the existing mainstream feature extractors are not good at extracting the detailed physical features of small targets such as rice seeds. In this paper, a transformer-based feature extraction network with an attention mechanism is introduced that learns the texture, colour and edge information of rice seeds to enhance the extraction of their detailed physical features. SY-net constructs a transformer backbone by stacking the three main modules to extract and transfer target feature information. Compared with existing mainstream feature extraction networks, such as VGG [
41], ResNet50 and ResNet101 [
42], Darknet [
43], DenseNet [
44] and MobileNetV3 [
45], the proposed feature extractor can generate feature maps with richer and more effective feature information.
2.2.1. Patch Embedding
The patch embedding module (bright cyan cuboid in
Figure 3) is used to adjust the width, height and number of channels of the input image. Patch embedding downsamples the input image by a factor of 4 and outputs a feature map with a specified number of channels
C.
As shown in
Figure 4, the patch embedding module is composed of a convolutional layer and a layer normalization layer. The input image is downsampled by a convolutional layer with a kernel size of 4 × 4 and a stride of 4, which outputs a feature map with the specified number of channels (depth), and the height and width of the feature map are then flattened. Next, the layer normalization layer normalizes the map in the depth direction and, finally, the feature map is restored by a reshaping (view) operation. The specific process is shown in
Figure 5.
In
Figure 5, the image size used as an example is 16 × 16 × 3 pixels. First, the input image is downsampled by a factor of 4 to generate a feature map 4 × 4 × C pixels in size; this is followed by flattening and layer normalization processing in the depth direction of the generated feature map. Finally, the processed data are restored to a feature map 4 × 4 × C pixels in size by a reshaping (view) operation and used as the output feature map.
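A minimal sketch of this patch embedding step, assuming a standard convolution-plus-layer-normalization implementation (module and variable names are illustrative, not the authors' code):

```python
# Patch embedding sketch: 4x downsampling via a 4x4 stride-4 convolution,
# flattening, layer normalization in the depth direction, then restoring shape.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, in_ch=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=4, stride=4)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, C, H/4, W/4)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)         # flatten H and W: (B, H*W/16, C)
        x = self.norm(x)                         # layer normalization over the depth
        return x.transpose(1, 2).reshape(b, c, h, w)   # restore the feature map

out = PatchEmbedding()(torch.randn(1, 3, 640, 640))    # -> (1, 96, 160, 160), i.e. C1
```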
2.2.2. Patch Merging
The patch merging module (light green cuboid in
Figure 3) is used to adjust the widths, heights and numbers of channels for the input feature maps at different stages. Patch merging downsamples the input feature maps by a factor of 2, and the number of output channels is double the number of input channels.
As shown in
Figure 6, the patch merging module is composed of a layer normalization layer and a linear layer. First, the patch merging module performs equally spaced sampling on the feature maps, and the four feature maps generated by sampling are concatenated in the depth direction and processed by the layer normalization layer. Finally, they are mapped in the depth direction through the linear layer. The specific process is shown in
Figure 7.
In
Figure 7, a feature map 8 × 8 pixels in size is used as an example. First, the feature map is sampled at equal intervals (the sampling interval is 2) to generate four feature maps 4 × 4 pixels in size. Then, the four generated feature maps are connected in the depth direction. Next, layer normalization processing is performed, and linear mapping is executed by the linear layer to generate two feature maps with dimensions of 4 × 4 pixels as the output feature maps.
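A minimal sketch of this patch merging step, again under the assumption of a standard sampling-concatenation-normalization-linear implementation (names are illustrative, not the authors' code):

```python
# Patch merging sketch: sample at interval 2, concatenate the four maps in
# the depth direction, normalize, then map 4C channels down to 2C channels.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                            # x: (B, C, H, W), H and W even
        x0 = x[:, :, 0::2, 0::2]                     # four equally spaced samplings
        x1 = x[:, :, 1::2, 0::2]
        x2 = x[:, :, 0::2, 1::2]
        x3 = x[:, :, 1::2, 1::2]
        x = torch.cat([x0, x1, x2, x3], dim=1)       # (B, 4C, H/2, W/2)
        x = x.permute(0, 2, 3, 1)                    # channels last for LayerNorm/Linear
        x = self.reduction(self.norm(x))             # 4C -> 2C in the depth direction
        return x.permute(0, 3, 1, 2)                 # (B, 2C, H/2, W/2)

out = PatchMerging(96)(torch.randn(1, 96, 160, 160))   # -> (1, 192, 80, 80)
```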
2.2.3. Basic Transformer Block
The basic transformer block (light yellow cuboid in
Figure 3) is used to extract the feature information from the feature map. As shown in
Figure 8, the basic transformer block consists of a window multi-head self-attention (W-MSA) mechanism, a shifted-window multi-head self-attention (SW-MSA) mechanism, a multilayer perceptron (MLP), a drop-path layer and a layer normalization layer. The W-MSA and SW-MSA mechanisms are the core of the basic transformer block.
Traditional convolutional neural networks, which rely on convolution alone, are weak at extracting small-target features. Multi-head self-attention (MSA) is therefore used to extract the feature map information, as it learns features in greater detail and can generate high-quality feature maps. Two MSA modules, W-MSA and SW-MSA, are used in the basic transformer block to strengthen its ability to extract detailed features from small targets. MSA is based on self-attention (SA), which is calculated as shown in Formula (1). In Formula (1),
Attention (Q,K,V) is the feature output of the SA, and
Q,
K and
V are obtained through the operation of the input vector and the trainable parameter matrices
Wq, Wk and
Wv. The calculation formula is shown in Formula (2), where
X is the input feature vector, d is the dimensionality of
K and
B is the bias.
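Since the typeset equations are not reproduced here, the standard forms consistent with these definitions are given below as a reconstruction:

```latex
% Formula (1): scaled dot-product self-attention with bias term B
\mathrm{Attention}(Q,K,V)=\mathrm{SoftMax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d}}+B\right)V
% Formula (2): Q, K and V from the input X and the trainable matrices
Q = XW_{q},\qquad K = XW_{k},\qquad V = XW_{v}
```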
MSA splits the calculated
Q,
K and
V according to the number of heads and combines
Q,
K and
V after splitting to form the head. The calculation formula of MSA is shown in Formula (3).
In Formula (3), MultiHead(Q,K,V) is the feature output of the MSA. WiQ, WiK and WiV are used to split Q, K and V, respectively, and then the SA operation is performed on each head. Each head (the output of an SA operation) is then concatenated with the others, and Wo is used to fuse the concatenated feature maps.
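A standard form of Formula (3) consistent with this description (again a reconstruction, not the original typesetting) is:

```latex
% Formula (3): multi-head self-attention built from per-head SA operations
\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_{1},\ldots,\mathrm{head}_{h})\,W_{o},\qquad
\mathrm{head}_{i}=\mathrm{Attention}\!\left(QW_{i}^{Q},\,KW_{i}^{K},\,VW_{i}^{V}\right)
```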
W-MSA divides the input feature map into four windows of the same size, and MSA feature extraction is conducted for each window. SW-MSA is a more detailed window division process based on W-MSA. The feature map is divided into nine windows, and MSA feature extraction is conducted. As shown in
Figure 8, the basic transformer block in the transformer backbone uses W-MSA and SW-MSA alternately to realize feature learning and cross-window information exchange for the salient features of small targets and to build global feature relationships.
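As a rough illustration of the windowing idea, the sketch below partitions a feature map into non-overlapping windows for W-MSA and applies a cyclic shift, a common Swin-style realization of SW-MSA (the exact shift scheme used by the authors is not specified here):

```python
# Window partitioning for W-MSA and a torch.roll-based shift for SW-MSA.
import torch

def window_partition(x, win):
    """Split a (B, H, W, C) feature map into non-overlapping win x win windows."""
    b, h, w, c = x.shape
    x = x.view(b, h // win, win, w // win, win, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, c)   # (num_windows*B, win*win, C)

x = torch.randn(1, 8, 8, 96)
plain_windows = window_partition(x, win=4)              # W-MSA: four 4x4 windows
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))   # shift by half a window
shifted_windows = window_partition(shifted, win=4)      # SW-MSA attends within these windows
```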
2.3. Feature Pyramid Fusion Module
When the network extracts features from the feature map, the downsampling of the target loses important spatial features, resulting in difficulties when attempting to detect small targets and the inability to obtain accurate segmentation results. Therefore, in this study, the existing feature pyramid network [
46] was improved by using smaller detection scales on larger feature maps to resolve these detection difficulties. A six-layer feature pyramid network was designed to perform feature fusion on the output feature maps of the transformer backbone at different stages and to address the problem of spatial information loss. The feature fusion module makes full use of the feature map information contained in each feature map by performing up- and downsampling and feature fusion on the feature maps output by the feature extraction module at different stages, thus improving the mask segmentation capability of the network for small targets such as rice seeds. Four-stage feature maps obtained from the transformer backbone are used as the inputs of the feature pyramid fusion module, and six-stage feature maps are output.
The structure of the FPN is shown in
Figure 1. P4 is derived from C4 and is downsampled to generate P5 and P6; P4 is then upsampled and fused with C3 to generate P3, P3 with C2 to generate P2 and P2 with C1 to generate P1. Compared with existing feature fusion networks, the proposed FPN can generate larger feature maps with more abundant feature information. At the same time, six different scales are used for prediction, and three different proportions are applied at each of the six FPN levels. The prediction scales are [12, 24, 48, 96, 192, 384], and the proportions are [1, 1/2, 2/1]. On a larger feature map, a smaller prediction scale is used to strengthen the network's ability to detect small targets, and the three different proportions make the prediction and localization results more accurate.
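As an illustrative sketch of how these scales and proportions translate into per-level prior sizes (the exact anchor parameterization is an assumption, not the authors' code):

```python
# Per-level prior sizes from the listed scales and aspect-ratio proportions.
scales = [12, 24, 48, 96, 192, 384]      # one base scale per FPN level P1-P6
ratios = [1.0, 1.0 / 2.0, 2.0 / 1.0]     # the proportions [1, 1/2, 2/1]

def priors_for_level(scale):
    """Return (width, height) pairs for the three proportions at one scale."""
    return [(scale * r ** 0.5, scale / r ** 0.5) for r in ratios]

for s in scales:
    print(s, priors_for_level(s))         # smaller priors are used on larger feature maps
```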
2.4. Prediction Head Module
Many existing prediction heads predict categories and regression boxes independently and do not make full use of the predicted feature map outputs. In this paper, a parallel branch network is used, and the feature map output by each branch is fused with the features of the other branches. The prediction head adopts a parallel branch structure and a weight-sharing mechanism; this design has fewer weight parameters and a faster detection speed than other, more complex prediction heads. The input of the prediction head comes from the feature maps of the FPN, and predictions are made on all the feature maps generated by the FPN. In
Figure 1, the prediction head is the prediction head module, and the branch structure diagram of the network is shown in
Figure 9.
As shown in
Figure 9, two convolutional layers extract features from the input feature maps, and the other three convolutional layers are parallel branches that share the feature maps output by the first two convolutional layers; these branches predict the class, the box and the mask coefficients, respectively. The feature maps generated by each convolutional layer are fully utilized: the feature maps generated by the convolutional layer for the category output are fused with the feature maps for predicting the regression box, and the feature maps generated by the convolutional layer for predicting the regression box are then fused with the feature maps for generating the mask coefficients. Through such a feature fusion process, the feature output of each convolutional layer is utilized and a more refined prediction result can be produced.
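A rough sketch of such a parallel head with cross-branch fusion is given below (layer names, channel counts and the exact fusion points are assumptions, not the authors' implementation):

```python
# Parallel prediction head sketch: two shared convs, three parallel branches
# (class, box, mask coefficients) with cross-branch feature fusion. The same
# head, with shared weights, is applied to every FPN level P1-P6.
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, ch=256, num_anchors=3, num_classes=9, k=32):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.cls_conv = nn.Conv2d(ch, ch, 3, padding=1)    # class branch
        self.box_conv = nn.Conv2d(ch, ch, 3, padding=1)    # box branch
        self.coef_conv = nn.Conv2d(ch, ch, 3, padding=1)   # mask-coefficient branch
        self.cls_out = nn.Conv2d(ch, num_anchors * num_classes, 1)
        self.box_out = nn.Conv2d(ch, num_anchors * 4, 1)
        self.coef_out = nn.Conv2d(ch, num_anchors * k, 1)

    def forward(self, p):                                  # p: one FPN feature map
        f = self.shared(p)
        fc = self.cls_conv(f)                              # class features
        fb = self.box_conv(f) + fc                         # box features fused with class features
        fm = self.coef_conv(f) + fb                        # coefficient features fused with box features
        return self.cls_out(fc), self.box_out(fb), torch.tanh(self.coef_out(fm))
```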
In this paper, the class loss for a given target is computed using the softmax cross-entropy function over c + 1 classes (c object classes and 1 background class), and the ratio of positive to negative samples selected for training is 3:1. The smooth
L1 loss function is used to train the coordinate regression function of the prediction box because the smooth
L1 function has the advantages of good robustness, insensitivity to outliers and small gradient changes. The smooth
L1 loss function is shown in Formula (4).
where
tu is the coordinate predicted by the bounding-box regressor and
v is the bounding-box coordinate of the true label.
In Formula (5), x is the difference between the true label and the predicted label.
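For reference, the standard smooth L1 regression loss consistent with these definitions (a reconstruction following the common Fast R-CNN formulation) is:

```latex
% Formula (4): localization loss over the predicted box coordinates
L_{\mathrm{loc}}(t^{u},v)=\sum_{i\in\{x,y,w,h\}}\mathrm{smooth}_{L_{1}}\!\left(t_{i}^{u}-v_{i}\right)
% Formula (5): the smooth L1 function of the label-prediction difference x
\mathrm{smooth}_{L_{1}}(x)=
\begin{cases}
0.5x^{2}, & \text{if } |x|<1\\
|x|-0.5, & \text{otherwise}
\end{cases}
```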
2.5. Prototype Mask Generation Module
In the instance segmentation task, the mask generation network usually adopts a fully convolutional network, and the quality of the mask is closely related to the size of the feature map. Therefore, in this paper, a prototype mask generation network composed of five convolutional layers is used to double the height and width of the input feature map, and this larger feature map is used to generate the mask. The input of the prototype mask generation module is P1 in the FPN. This is because the masks generated by deep-level feature maps in deep neural networks have better robustness, and larger feature maps contain more semantic information, enabling the generation of masks with higher quality. Moreover, this approach has better mask generation performance for small targets. P1 is the largest and deepest feature map in the FPN. By upsampling P4, P3 and P2 in turn, the smallest feature map is enlarged and then fused with the features of the next layer so that the bottom feature map has more abundant feature information; this improves the segmentation performance achieved for rice seeds. In
Figure 1, the prototype mask network is the prototype mask generation module, and the branch structure diagram of the network is shown in
Figure 10.
As shown in
Figure 10, the fully convolutional network first applies two convolutional layers, each with 256 input channels, 256 output channels, a 3 × 3 kernel, a stride of 1 and padding of 1. Then, the height and width of the feature map are doubled through an upsampling layer. The feature map passes through two more convolutional layers (with the same parameters as the previous convolutional layers) and finally through a convolutional layer with 256 input channels, k output channels, a kernel size of 1 × 1, a stride of 1 and padding of 0, where k is the number of prototype masks generated. The rectified linear unit (ReLU) activation function is used to activate the prototype masks because it is a nonlinear function with the advantages of simple computation, fast convergence and the absence of a saturation region; using ReLU accelerates the inference speed of the network and avoids the vanishing gradient problem. After this series of convolution and upsampling operations, the width and height of the output feature map are double those of the input feature map, and performing feature extraction on larger feature maps produces finer masks and richer semantic information.
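A minimal sketch of this branch under the stated layer configuration (module names are assumptions, not the authors' code):

```python
# Prototype mask branch sketch: five convolutions with an upsampling layer
# that doubles height and width, ReLU activations and k output prototypes.
import torch
import torch.nn as nn

class ProtoNet(nn.Module):
    def __init__(self, ch=256, k=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, k, 1, stride=1, padding=0), nn.ReLU(inplace=True),
        )

    def forward(self, p1):                       # p1: (B, 256, 160, 160) from the FPN
        return self.net(p1)                      # (B, k, 320, 320) prototype masks

protos = ProtoNet()(torch.randn(1, 256, 160, 160))
```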
2.6. NMS Processing
During the inference process, the network generates many candidate regression bounding boxes and class confidences. The existing method adopts traditional NMS processing, which removes candidate boxes below the target confidence level while retaining those with confidence levels above the threshold. Although the traditional NMS approach can effectively deal with the candidate boxes, the shortcoming of its calculation process is that the candidate boxes must be sorted by confidence score and then removed one by one, so the calculations are executed sequentially, making this approach very slow. To improve the inference speed of the network, Fast NMS is used in this paper. Fast NMS is an NMS algorithm that can perform filtering quickly by carrying out parallel matrix calculations on the basis of traditional NMS. It first obtains the top
n detection targets in descending order of confidence for each class and then calculates the intersection over union (IOU) values among these
n detection targets to obtain an IOU matrix with a size of
c ×
n ×
n, where
c is the number of categories and
n ×
n is a symmetric matrix. Second, the values of the lower triangle and the diagonal of each class's IOU matrix are set to 0 (
Xkij = 0, ∀
k,
j,
i ≥
j), and then the maximum value of each column is obtained, as shown in Formula (6). Finally, the lower-confidence candidate boxes whose maximum overlap exceeds a certain threshold are filtered out, and the best candidate boxes for each category are retained. In Formula (6),
k represents the
kth category,
i and
j represent the index of the IOU matrix of that category,
Xkij is the upper triangular matrix of the
kth category and
Kkj is the column-wise maximum IOU used to decide whether the jth box is retained as one of the best regression boxes of that category.
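A compact sketch of Fast NMS along these lines (tensor layouts and the IOU helper from torchvision are assumptions; this follows the YOLACT-style algorithm rather than the authors' exact code):

```python
# Fast NMS sketch: per-class IOU matrix, upper-triangle masking, column-wise
# maximum (Formula (6)) and thresholding, all as parallel matrix operations.
import torch
from torchvision.ops import box_iou

def fast_nms(boxes, iou_thr=0.5):
    """boxes: (c, n, 4) per-class candidates already sorted by descending
    confidence. Returns a (c, n) boolean mask of the boxes to keep."""
    c, n, _ = boxes.shape
    # Per-class pairwise IOU matrix X of size c x n x n
    iou = torch.stack([box_iou(boxes[k], boxes[k]) for k in range(c)])
    # Zero the diagonal and the lower triangle, keeping only the upper triangle
    iou = iou.triu(diagonal=1)
    # Column-wise maximum: highest IOU of each box with any higher-confidence box
    max_iou, _ = iou.max(dim=1)
    # Discard boxes that overlap a higher-confidence box above the threshold
    return max_iou <= iou_thr
```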
2.7. Combination Module
To generate the final instance mask, the outputs of the prototype mask generation module and the prediction head module are linearly combined, and the nonlinear sigmoid activation function is then applied to generate the final mask. The linear combination of the two outputs can be realized as a matrix product. The calculation process is shown in Formula (7):
M = σ(PC^T) (7)
where M is the final mask matrix,
P is the matrix of the prototype mask of
h ×
w ×
k and
C is the matrix of the mask coefficient of size
n ×
k (
n is the number of candidate targets retained after NMS processing).
σ denotes the sigmoid activation.
T stands for the transpose of the matrix. This simple linear combination has the advantage of being very fast.
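A minimal sketch of this combination step under assumed shapes (not the authors' code):

```python
# Linear combination of Formula (7): prototype masks P and mask coefficients C.
import torch

h, w, k, n = 320, 320, 32, 5              # prototype size, k prototypes, n detections after NMS
P = torch.randn(h, w, k)                  # output of the prototype mask network
C = torch.randn(n, k)                     # mask coefficients from the prediction head
M = torch.sigmoid(P @ C.t())              # (h, w, n): one instance mask per retained detection
```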