1. Introduction
As attention to the diversity of fishery resources and marine ecosystems grows, an increasing number of research fields rely on underwater fish segmentation, which is necessary for marine biology research, marine ecological protection and fishery resource management. Research in these fields requires accurate measurements of the shape, size and number of fish as data for further analysis. However, because the technology was immature in the early days, morphological data on underwater fish were mainly acquired through traditional measurements performed after landing. The disadvantages of such traditional measurement methods are obvious: manual measurement is time-consuming and labor-intensive and becomes even less efficient when a fish population is large. With the popularization and application of information technology, morphological measurements of fish have begun to be performed by machines, but there is still room for further improvement in efficiency. Therefore, underwater fish segmentation is an important research topic in the era of intelligence.
However, existing segmentation methods face many challenges in these special underwater scenes. First, imaging equipment suffers from color distortion, noise pollution and insufficient light propagation in the underwater environment, which results in low recognizability and low contrast in the captured underwater images. Second, the variety of organisms in the underwater environment that are similar in shape and color to fish also interferes with fish segmentation. In addition, there is a wide variety of fish species underwater, and the fish captured by imaging equipment tend to vary in scale and pose. These problems make accurate underwater fish segmentation more difficult.
At present, fish segmentation methods are divided into traditional methods and deep-learning-based methods.
Traditional methods segment the target according to the edges and colors of the creature’s body. In 2000, Angelo Loy et al. [1] used Fourier analysis to detect the shapes of finfish, which is considered the best automated traditional method for detecting fish. In 2011, Meng-Che Chuang et al. [2] used histogram backprojection on dual local-threshold images to further ensure effective fish segmentation. In 2014, LAN Yongtian et al. [3] obtained binary images of fish movements by combining three-frame differencing with logical and mathematical morphological operations. In 2020, HE Qianhan et al. [4] developed a method to extract the contours of horny jaws using the Canny algorithm, contributing to easy access to biomorphological information. In 2021, Hitoshi Habe et al. [5] proposed a pose estimation method for fish based on a National Advisory Committee for Aeronautics (NACA) wing model to identify fish accurately. However, this method required extracting fish morphology from the dataset; moreover, if an image was blurred or disturbed by other factors, such as illumination, the extracted features were easily incomplete.
Semantic segmentation based on deep neural networks is a very advanced approach to underwater fish segmentation. Semantic segmentation belongs to the category of image classification, but it marks different image regions according to their semantic categories and classifies each pixel in an image, generating a fine-grained mapping from semantic labels to image content. In 2017, Alfonso B. Labao et al. [6] used the ResNet-FCN network model to semantically segment fish in underwater videos using only input features based on fish color. In 2020, Rafael Garcia et al. [7] used the Mask R-CNN architecture to locate and segment fish in images. In 2020, Fangfang Liu et al. [8] introduced an unsupervised color correction module (UCM) based on the DeepLabv3+ network and altered its upsampling layers, showing that their method improved segmentation accuracy. In 2021, Wenbo Zhang et al. [9] proposed a dual pooling-aggregated attention network to improve underwater fish segmentation accuracy using a pooling-aggregated positional attention module and a pooling-aggregated channel attention module. In 2022, Jinkang Wang et al. [10] proposed an underwater image semantic segmentation method to precisely segment targets; however, the first step of this method was to improve image quality through image enhancement operations based on multispatial transformation. In recent years, increasing numbers of researchers have begun to improve segmentation accuracy by integrating multiscale features of fish targets, such as multiscale CNN networks [11,12,13,14] and porous (atrous) GAN networks [15,16,17,18].
Each of the above deep-learning-based approaches has its advantages, but the blurred and distorted appearance of fish in underwater scenes and interference from the surrounding environment still pose challenges to accurate fish segmentation. In response to these challenges, this paper adopts PSPNet [19], which can integrate multiscale features, as the basic network and proposes an improved underwater fish segmentation algorithm based on it. The method improves segmentation accuracy for fish that are blurred, similar in color to the background or small in size. The main contributions of this article are as follows:
- (1)
We propose an improved PSPNet network model for underwater fish segmentation. In the feature extraction phase, an iAFF module is connected to each ResBlock. Through the MS-CAM module, iAFF fully perceives the multiscale characteristics and environmental information of a target, and global and local features are integrated through AFF. In addition, iAFF integrates more contextual information through its iterative nature, facilitating the overall understanding of fish in underwater images and thereby improving fish segmentation accuracy.
- (2)
To retain more fish feature information in the feature extraction stage, this article replaces the average pooling in the backbone ResNet50 network with SoftPool. Through its fast exponentially weighted calculation, SoftPool offsets the parameters and computations added by the iAFF modules, effectively improving inference speed and precision.
- (3)
To make fish features more distinctive in underwater environments, a triplet attention (TA) module is added after the different scale features in the pyramid pooling module to enable more detailed attention to fish features. The TA module captures richer feature information of fish targets through cross-dimensional interaction, which improves segmentation accuracy.
- (4)
When adding the TA module, we use a parameter-sharing strategy, which reduces the numbers of model parameters and calculations by sharing the parameter weights learned by the TA module across the different scale features.
Compared to other underwater fish segmentation methods, the proposed IST-PSPNet (iAFF + SoftPool + TA) method achieves better segmentation accuracy on the DeepFish dataset. In addition, the model's Params and GFLOPs do not increase compared to the baseline (PSPNet). The results also show that the proposed method improves the MPA and FPS, verifying its effectiveness.
The rest of this paper is arranged as follows. In the second section, the structure of the underwater fish segmentation IST-PSPNet method is introduced in detail. The third section gives the experimental results and analysis. The fourth section summarizes the work in this paper.
2. Proposed Method
2.1. Overall Network Structure
In this paper, a new method, IST-PSPNet, is proposed to solve the underwater fish segmentation problem. Its structure consists of an input image, a backbone network, an improved pyramid pooling module and an output image.
Figure 1 shows the overall structure of the IST-PSPNet network. In this model, the main improvements were made to the ResNet50 backbone: the iAFF module was designed and connected to the ResBlocks in ResNet50 to achieve iterative attention feature fusion; SoftPool replaced AvgPool after the last ResBlock in ResNet50 to reduce the numbers of parameters and computations; and the triplet attention (TA) module was added after the different scale features in the pyramid pooling module to focus on the specific locations of fish body features in each channel.
In the IST-PSPNet network structure, feature extraction on the input underwater fish images is carried out by the improved ResNet50. To mitigate the impact of scale changes in the feature maps and of smaller fish bodies in the feature extraction stage, the embedded iAFF module effectively integrates fish body features with inconsistent semantics and scales by aggregating contextual information from different receptive fields. After that, SoftPool allows the feature map to retain more feature information while reducing its size.
Then, the resulting global feature information is passed through the pyramid module, and feature maps at four different sizes are obtained using pooling operations of different degrees. After that, the feature information at the four sizes is passed into the TA module through convolution. With a negligible number of additional parameters, the TA module captures the interaction between the spatial and channel dimensions of the first-scale feature via cross-dimensional interaction. The weights learned by the TA module on the first-scale feature are then shared with the other scales through a parameter-sharing strategy, which reduces the number of parameters while improving the overall generalization ability.
Finally, the feature information obtained through the TA module is upsampled via interpolation and concatenated with the original feature map, so that the shallow features extracted by the backbone are fused with the deep features produced by the pyramid network to obtain global feature information. The fish prediction results are then obtained after decoding.
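To make this data flow concrete, the following is a minimal, runnable PyTorch sketch of the IST-PSPNet pipeline. It is not the authors' implementation: the pyramid bin sizes (1, 2, 3, 6), the number of classes and the use of a plain torchvision ResNet-50 as a stand-in backbone are assumptions, and nn.Identity marks where the iAFF, SoftPool and TA modules sketched in the following subsections would be inserted.

```python
# Minimal sketch of the IST-PSPNet data flow described above (assumptions noted in comments).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class PyramidPoolingWithTA(nn.Module):
    """PSPNet-style pyramid pooling; `attention` is a placeholder for the shared TA module."""

    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6), attention=None):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.attention = attention or nn.Identity()          # placeholder for triplet attention
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                      # pool the feature map to b x b
                nn.Conv2d(in_ch, out_ch, 1, bias=False),      # 1x1 conv to reduce channels
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x]
        for stage in self.stages:
            y = self.attention(stage(x))                      # the same attention attends to each scale
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False))
        return torch.cat(feats, dim=1)                        # fuse original and pyramid features


class ISTPSPNetSketch(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        backbone = resnet50(weights=None)
        # Backbone up to the last ResBlock; iAFF and SoftPool would be inserted inside it.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.ppm = PyramidPoolingWithTA(2048)
        self.head = nn.Conv2d(2048 * 2, num_classes, 1)       # simple decoder head

    def forward(self, x):
        size = x.shape[2:]
        feat = self.backbone(x)                               # deep feature map (1/32 of input size here)
        feat = self.ppm(feat)
        logits = self.head(feat)
        return F.interpolate(logits, size=size, mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = ISTPSPNetSketch(num_classes=2)
    print(model(torch.randn(1, 3, 224, 224)).shape)           # -> (1, 2, 224, 224)
```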
2.2. iAFF Module
During the fish feature extraction stage, the ResBlocks in ResNet50 use skip connections, a form of linear connection, to pass the input feature information directly to the output, so that the input is simply added to the learned residual. However, this approach does not fully perceive the context and does not resolve the semantic and scale inconsistencies between the input features. For special scenes, such as those underwater, the feature information needs to be more detailed. Therefore, we propose replacing the plain addition in ResBlock with the iterative attention feature fusion (iAFF) module [20], shown in Figure 2. Experiments show that embedding this module improves fish segmentation accuracy in underwater scenes.
To fully perceive the context, the initial integration of the input features is the key point. In this paper, two attentional feature fusion (AFF) stages are cascaded to fuse the input features, forming the iAFF module. The equations for AFF and iAFF are as follows:

$$Z = M(X \uplus Y) \otimes X + \left(1 - M(X \uplus Y)\right) \otimes Y$$

$$X \uplus Y = M(X + Y) \otimes X + \left(1 - M(X + Y)\right) \otimes Y$$

where $X$ is the identity mapping of the input features, $Y$ is the residual learned in ResNet, $Z$ is the output fusion feature, $\otimes$ denotes elementwise multiplication and $X \uplus Y$ is the initial integration of the input features. These equations describe the process of combining the different initial features $X$ and $Y$. The term $1 - M(X \uplus Y)$ corresponds to the dotted line in the iAFF structure; the fusion weight $M(X \uplus Y)$ consists of real numbers between 0 and 1, as does $1 - M(X \uplus Y)$, which enables the model to learn the weighting between $X$ and $Y$ via training.
M denotes the multiscale channel attention module (MS-CAM), which is the core component of both AFF and iAFF; its structure is shown in Figure 3. The key idea is to achieve channel attention at multiple scales by varying the size of the spatial pooling. MS-CAM aggregates more contextual information along the channel dimension, adding global average pooling as a global channel branch and selecting pointwise convolution (PWConv) as the context aggregator for the local channel branch. Compared to other channel attention modules, MS-CAM can simultaneously focus on larger objects with a more global distribution and smaller objects with a more local distribution. For this particular underwater scenario, the enhanced attention to the local features of smaller fish is undoubtedly critical.
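As a concrete illustration, the following is a minimal PyTorch sketch of MS-CAM and the two-stage iAFF fusion, following the equations above and the public AFF design [20]. It is not the authors' exact code, and the channel reduction ratio r = 4 is an assumption.

```python
# Sketch of MS-CAM and iAFF (assumed reduction ratio r = 4).
import torch
import torch.nn as nn


class MSCAM(nn.Module):
    """Multiscale channel attention: a global (GAP) branch plus a local pointwise-conv branch."""

    def __init__(self, channels, r=4):
        super().__init__()
        mid = max(channels // r, 1)
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global context along the channel dimension
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.sigmoid(self.local_att(x) + self.global_att(x))   # fusion weight M(x)


class iAFF(nn.Module):
    """Two cascaded attentional fusions: the first AFF builds the initial integration X ⊎ Y."""

    def __init__(self, channels, r=4):
        super().__init__()
        self.att1 = MSCAM(channels, r)
        self.att2 = MSCAM(channels, r)

    def forward(self, x, y):
        w1 = self.att1(x + y)               # first fusion weight from the naive sum
        xy = w1 * x + (1 - w1) * y          # initial integration X ⊎ Y
        w2 = self.att2(xy)                  # second fusion weight M(X ⊎ Y)
        return w2 * x + (1 - w2) * y        # Z = M(X⊎Y) ⊗ X + (1 − M(X⊎Y)) ⊗ Y


if __name__ == "__main__":
    fuse = iAFF(channels=256)
    x, y = torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32)
    print(fuse(x, y).shape)                 # -> (2, 256, 32, 32)
```

In the backbone described above, X and Y would correspond to the identity branch and the learned residual of a ResBlock.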
2.3. SoftPool
During the feature extraction phase, the ResNet50 used in this article reduces the size of the feature map via pooling, a process that is important for enlarging the receptive field and reducing the computational load. However, in underwater scenes, images are often affected by complex factors such as light, sediment and water quality, which lead to ambiguity and distortion of fish features. Therefore, a pooling operation that retains more feature information is the key to extracting fish edge features.
To preserve more feature information in the feature map during pooling, we replace AvgPool in ResNet50, the backbone of PSPNet, with SoftPool [21], a fast exponentially weighted pooling method validated through ablation experiments across a range of architectures and pooling methods, as shown in Figure 4. In the ImageNet-1K classification task, replacing the pooling layers in ResNet50 with SoftPool yielded improvements in accuracy and CUDA-based inference speed (FPS) and reduced computational complexity (GFLOPs) compared with the architectural baseline and other pooling methods.
SoftPool is a kernel-based pooling approach that provides a balance between max and average pooling by exponentially weighting each activation according to its strength within the pooled region of the feature map. In our experiments, we demonstrate that SoftPool retains more fish feature information in the feature maps of underwater scenes, which is directly reflected in improved segmentation accuracy as well as improved computational and memory efficiency.
As shown in Figure 4, SoftPool subsamples the fish feature region using a (2 × 2) kernel and outputs an exponentially weighted sum of the original activations. This greatly improves the representation of the high-contrast areas that exist at the edges of fish in underwater scenes. To simplify the notation, we ignore the channel dimension and let R be the index set of the features in the considered 2D spatial region. The weight $w_i$ is the ratio of the natural exponential of a feature to the sum of the natural exponentials of all the features in the region. SoftPool is calculated as follows:

$$w_i = \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}}, \qquad \tilde{a} = \sum_{i \in R} w_i \, a_i$$

where $\tilde{a}$ is the SoftPool output value, $a_i$ denotes each feature and $\sum_{j \in R} e^{a_j}$ is the sum of the natural exponentials of all the features.
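The formula above maps directly to a few lines of PyTorch. The sketch below is an illustrative implementation (not the official CUDA kernel from [21]); the 2 × 2 kernel and stride match the example in Figure 4.

```python
# Illustrative 2D SoftPool: output = sum_i(w_i * a_i) with w_i = exp(a_i) / sum_j exp(a_j).
import torch
import torch.nn.functional as F


def soft_pool2d(x, kernel_size=2, stride=2):
    w = torch.exp(x)                              # natural exponential of each activation
    # Average pooling of (w * x) and of w uses the same divisor, so their ratio
    # equals the exponentially weighted sum within each kernel region R.
    num = F.avg_pool2d(x * w, kernel_size, stride)
    den = F.avg_pool2d(w, kernel_size, stride)
    return num / den                              # (a max-subtraction could be added for numerical stability)


if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 32)
    print(soft_pool2d(feat).shape)                # -> (1, 64, 16, 16)
```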
2.4. TA Module
In underwater environments, light of different wavelengths travels different distances in water, so the water color varies across images. As a result, fish of different colors and water with different coloration can exhibit highly similar appearances. As shown in Figure 5a, the texture colors of the three tagged fish are highly similar to the background colors, which makes it difficult to identify fish body features. In addition, underwater environments typically contain ecological elements such as algae, vegetation and reefs, which also directly contribute to the low distinguishability of fish in such scenes. As shown in Figure 5b, reefs and vegetation interfere in the living environment of the three fish, which limits the network's attention to fish feature information. In the fish segmentation task, we believe that adding attention mechanisms after the different scale features can improve the adaptability and robustness of the network to the underwater environment, focusing more on detailed fish feature information and improving the ability to understand fish features.
In the feature extraction stage, the backbone network (ResNet50) uses the MS-CAM channel attention inside the iAFF module to continuously acquire global and local feature weights along the channel dimension of the feature maps, which enhances information exchange among channels. However, weighting in the spatial dimensions is neglected, which means that the model cannot accurately adjust the spatial feature responses at different locations as the layers deepen. Additionally, when calculating the channel weights, the global average pooling collapses the spatial information of the input feature map into a single pixel per channel. This causes a large loss of spatial information, and when attention is computed on these single-pixel channels, there is no interdependence between the channel and spatial dimensions. This may affect performance in fish segmentation tasks.
To address fish features being obscured by water quality and interference from other ecological information, a triplet attention (TA) module [22] is added after the different scale features in the pyramid pooling module (PPM). The TA module builds dependencies between the channel and spatial dimensions of fish features through rotation operations and residual transformations, encoding channel and spatial information with a negligible number of parameters. Specifically, for fish features, the TA module can focus spatial attention on specific locations in each channel by interacting across dimensions. The TA module compensates for the lack of attention to the spatial dimensions by suppressing background interference in underwater scenes and highlighting fish features (contours and details), making the different scale features in the pyramid pooling module more distinguishable before they are fused for more detailed results.
As Figure 6 shows, the input of the triplet attention module is a small-scale feature from the pyramid pooling module. The module consists of three parallel branches, in each of which Z-pool reduces the zeroth dimension of the tensor to two by concatenating the average-pooled and max-pooled features along that dimension. This allows the layer to retain a wealth of information from the original tensor while reducing its depth for further computation. Z-pool is expressed by the following equation:

$$\mathrm{Z\text{-}pool}(X) = \left[\mathrm{MaxPool}_{0d}(X),\ \mathrm{AvgPool}_{0d}(X)\right]$$

where $0d$ is the zeroth dimension over which the max and average pooling are performed. For example, applying Z-pool to a tensor of shape (C × H × W) produces a tensor of shape (2 × H × W).
In this module, the interaction between the height and channel dimensions is established in the top branch first: the input tensor X is rotated 90° counterclockwise along the height axis H to obtain the tensor $\hat{X}_1$; then $\hat{X}_1$ is reduced to $\hat{X}_1^{*}$ by Z-pool and passed through a standard convolution with a 7 × 7 kernel. The result is passed through a batch normalization layer and a sigmoid activation layer to generate the final attention weights, which are applied to $\hat{X}_1$; the weighted tensor is then rotated 90° clockwise along the height axis H to recover the original input shape. Similarly, in the second branch the input is rotated 90° counterclockwise along the width axis W to obtain the tensor $\hat{X}_2$, which is reduced to $\hat{X}_2^{*}$ by Z-pool and then undergoes the same series of operations as the first branch to obtain the final attention weights. In the last branch, the input tensor X passes through Z-pool to obtain $\hat{X}_3$, which passes through the 7 × 7 convolution, the normalization layer and the sigmoid layer to generate the final attention weights. The final output is then obtained by averaging the refined tensors of shape (C × H × W) produced by the three branches. The above process can be expressed by the following equation:

$$y = \frac{1}{3}\left(\overline{\hat{X}_1 \, \sigma\!\left(\psi_1\!\left(\hat{X}_1^{*}\right)\right)} + \overline{\hat{X}_2 \, \sigma\!\left(\psi_2\!\left(\hat{X}_2^{*}\right)\right)} + X \, \sigma\!\left(\psi_3\!\left(\hat{X}_3\right)\right)\right)$$

where $\sigma$ represents the sigmoid activation function; $\psi_1$, $\psi_2$ and $\psi_3$ represent the two-dimensional convolution layers with a kernel size of (7 × 7) in the three branches; and the overline denotes rotating the tensor back 90° to its original orientation.
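For illustration, the following is a minimal PyTorch sketch of Z-pool and the three triplet attention branches, written to mirror the description above and the public triplet attention design [22]. It is not the authors' exact implementation; the permute-based "rotations" and the batch normalization placement are assumptions consistent with that design.

```python
# Sketch of Z-pool and triplet attention (7 x 7 kernel as in the equation above).
import torch
import torch.nn as nn


class ZPool(nn.Module):
    """Concatenate max- and mean-pooled maps along dim 1 (the zeroth dimension of each (C, H, W) sample)."""

    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)               # (N, 2, H, W)


class AttentionGate(nn.Module):
    """Z-pool -> 7x7 conv -> BN -> sigmoid, producing one attention map per branch."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.compress = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        return x * torch.sigmoid(self.bn(self.conv(self.compress(x))))


class TripletAttention(nn.Module):
    """Three branches: channel-height interaction, channel-width interaction, and plain spatial attention."""

    def __init__(self):
        super().__init__()
        self.branch_ch = AttentionGate()   # rotation along H: captures (C, H) interaction
        self.branch_cw = AttentionGate()   # rotation along W: captures (C, W) interaction
        self.branch_hw = AttentionGate()   # no rotation: plain spatial (H, W) attention

    def forward(self, x):                  # x: (N, C, H, W)
        # "Rotate" by swapping axes, attend, then rotate back to (N, C, H, W).
        x_ch = self.branch_ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)     # (N, W, H, C) view
        x_cw = self.branch_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)     # (N, H, C, W) view
        x_hw = self.branch_hw(x)
        return (x_ch + x_cw + x_hw) / 3.0  # average the three refined tensors


if __name__ == "__main__":
    ta = TripletAttention()
    print(ta(torch.randn(2, 512, 16, 16)).shape)                             # -> (2, 512, 16, 16)
```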
In this paper, after adding the TA module to the different scale features of the pyramid pooling module, we also use a parameter-sharing strategy: the weights learned by the TA module on the small-scale feature are shared with the other scale features, improving the generalizability and robustness of the model. In summary, the addition of the triplet attention (TA) module aims to enhance the focus on fish characteristics within the pyramid pooling module without increasing the number of parameters, thereby improving the accuracy of the fish segmentation model, while the parameter-sharing strategy accelerates model training and inference, enhancing computational efficiency. A small sketch of this sharing pattern is given below.
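The snippet below illustrates the parameter-sharing idea using the TripletAttention sketch above: a single TA instance (one set of weights) is applied to every pyramid-scale feature instead of instantiating a separate module per scale. The bin sizes, channel count and feature size are assumptions for illustration only.

```python
# Parameter sharing: one TA instance is reused across all pyramid scales.
# Assumes the TripletAttention class from the sketch above is in scope.
import torch
import torch.nn as nn
import torch.nn.functional as F

shared_ta = TripletAttention()                                        # one set of TA weights
per_scale_ta = nn.ModuleList([TripletAttention() for _ in range(4)])  # unshared alternative

feat = torch.randn(2, 512, 60, 60)                                    # assumed backbone feature map
scales = [F.adaptive_avg_pool2d(feat, b) for b in (1, 2, 3, 6)]       # pyramid-scale features

shared_out = [shared_ta(s) for s in scales]                           # same weights attend to every scale

n_shared = sum(p.numel() for p in shared_ta.parameters())
n_unshared = sum(p.numel() for p in per_scale_ta.parameters())
print(n_shared, n_unshared)                                           # sharing stores 1/4 of the TA parameters
```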
4. Conclusions
Currently, the acquisition of underwater fish morphology data is still mostly based on traditional measurements performed after the fish are caught and brought ashore. To improve efficiency, most technologies are now moving toward real-time underwater fish segmentation, which makes it easier to obtain fish morphology data. However, underwater images are blurred and distorted due to water quality and many other types of ecological interference. To solve these problems, a high-precision segmentation method (IST-PSPNet) was proposed. Experiments showed that, compared with other semantic segmentation methods, this method improved the segmentation accuracy for small, blurred fish whose colors are similar to the background, achieving good segmentation accuracy in underwater fish segmentation.
- (1)
To fully relate the extracted features at different scales to their context in the feature extraction stage, we proposed an iterative attention feature fusion method based on the iAFF module. Through this method, we realized deep mining of feature information at different scales. Moreover, for this particular underwater scenario, the method effectively integrated local and global feature information to achieve full awareness of contextual information, and its iterative design also addressed the problem of how to initially integrate the input features.
- (2)
In an underwater environment, extracting more information about the characteristics of fish helps to segment them better. In this paper, the average pooling in the backbone ResNet50 network was replaced by SoftPool to address the loss of feature information caused by the pooling process. In addition, SoftPool computed features in a fast, exponentially weighted way. Compared with average pooling, this greatly reduced the numbers of parameters and calculations and accelerated the inference speed.
- (3)
To make the network model more suitable for blurred underwater scenes, we added a triplet attention (TA) module after the different scale features of the pyramid pooling module. Through its independent branches across the spatial dimensions, the TA module captured the specific positions of fish features in the channel dimension, realizing focused attention on fish features. The underwater fish segmentation performance was improved without increasing the number of parameters or calculations.
- (4)
In this paper, a parameter-sharing strategy was utilized when adding the TA module. This strategy enabled different scale features in the pyramid pooling module to share the same parameters. In this way, the numbers of model parameters and calculations were greatly reduced.
The method proposed in this paper may play an important role in promoting the development of intelligent fisheries and provide help in intelligently obtaining fish data. Future research should focus on further reducing the numbers of parameters and computations of network models for underwater fish segmentation to achieve lightweight processing.