1. Introduction
Interactive image segmentation (IIS) is a technique that leverages user input to isolate target objects and plays a significant role in image and video editing [1] and medical diagnosis [2]. IIS provides higher accuracy than automatic segmentation, which matters in fields where high-precision results are required or personal safety is at stake; given the limited accuracy of automatic segmentation models and the lack of human oversight, IIS is a necessity in such applications. Researchers have conducted extensive work in this field. Papadopoulos et al. [3] demonstrated a more efficient way of obtaining bounding boxes through extreme clicking: extreme clicks took an average of 7.2 s while producing high-quality bounding boxes comparable to those obtained by traditional annotation, which requires 34.5 s per object. The extreme points are the topmost, bottommost, leftmost, and rightmost points on the object boundary, from which the bounding box can be obtained directly. The deep extreme cut (DEXTR) method [4] leverages these extreme points to generate a Gaussian attention map, which is fused with the original red/green/blue (RGB) image as an additional channel and fed into the segmentation network. The network learns to map the four extreme points of the input to a mask of the target object. Additionally, to concentrate on the object of interest, the input is cropped using a bounding box derived from the extreme-point annotations; to retain context, the tight bounding box is expanded by a few pixels. After this preprocessing step, the input consists of a cropped RGB image containing the object plus its extreme points. The feature backpropagating refinement scheme (F-BRS) method [5,6] reformulates click-based refinement as an optimization over a small set of auxiliary parameters, so that forward and backward passes are run only through the last few layers of the network rather than the entire model, making refinement considerably more efficient.
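To make the extreme-click preprocessing described above concrete, the following Python sketch generates a Gaussian attention map from four clicked points and concatenates it with a cropped RGB image as an additional input channel. It is a minimal illustration of the general idea rather than the exact DEXTR implementation; the helper names, the 10-pixel box relaxation, and the Gaussian standard deviation are assumptions made for this example.

```python
import numpy as np

def gaussian_heatmap(shape, points, sigma=10.0):
    """Render one Gaussian blob per clicked point and take the pixel-wise maximum."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros((h, w), dtype=np.float32)
    for (px, py) in points:
        blob = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, blob)
    return heatmap

def crop_with_context(image, extreme_points, relax=10):
    """Crop around the extreme points, expanding the tight box by a few pixels."""
    xs = [p[0] for p in extreme_points]
    ys = [p[1] for p in extreme_points]
    h, w = image.shape[:2]
    x0, x1 = max(min(xs) - relax, 0), min(max(xs) + relax, w)
    y0, y1 = max(min(ys) - relax, 0), min(max(ys) + relax, h)
    crop = image[y0:y1, x0:x1]
    # Shift the clicks into the cropped coordinate frame.
    shifted = [(px - x0, py - y0) for (px, py) in extreme_points]
    return crop, shifted

# Example: four extreme clicks (x, y) on a 480x640 RGB image.
image = np.zeros((480, 640, 3), dtype=np.float32)
clicks = [(120, 300), (350, 290), (240, 150), (230, 420)]  # left, right, top, bottom
crop, shifted = crop_with_context(image, clicks)
attention = gaussian_heatmap(crop.shape[:2], shifted)
network_input = np.concatenate([crop, attention[..., None]], axis=2)  # RGB + guidance channel
```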
The terms “internal” and “external” guidance refer to the user-provided guidance points located inside and outside the object during the interaction process. Foreground points, which serve as internal guidance, help the model learn the internal characteristics of the object, while background points, which serve as external guidance, help the model distinguish the foreground from the background; both enhance the segmentation performance of the model. Semantic and instance segmentation have seen significant advances in domains such as general scenes, image editing, and medical diagnosis. However, creating the pixel-level training data needed to build successful segmentation models is time consuming, laborious, and expensive. Interactive segmentation offers an attractive and effective way to reduce the annotation workload by allowing human annotators to quickly extract objects of interest with limited input, such as bounding boxes or clicks. Recently, Maninis et al. [4] proposed using the extreme points of an object, i.e., its leftmost, rightmost, top, and bottom pixels, for IIS, enabling fast interactive annotation and high-quality segmentation. Additionally, Zhang et al. [7] explored an inner–outer guidance approach that combines an inner foreground click with outer background clicks: specifically, an inner point near the center of the object and two outer points at symmetrically opposite corners of a tight bounding box around the target object. The approach attains state-of-the-art results on various popular benchmarks and generalizes to diverse domains, including street scenes, aerial images, and medical images. However, the click process can be further optimized to reduce annotation time and effort.
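To illustrate how such inner and outer clicks can be turned into guidance channels, the sketch below renders one heatmap for the inner (foreground) point and one for the outer (background) points and stacks them with the RGB image. It is a schematic rendering of an IOG-style input under assumed click positions and kernel width, not the exact pipeline of Zhang et al. [7].

```python
import numpy as np

def point_heatmap(shape, points, sigma=10.0):
    """One Gaussian blob per click, merged with a pixel-wise maximum."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    maps = [np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2)) for px, py in points]
    return np.max(np.stack(maps), axis=0).astype(np.float32)

def iog_style_input(image, inner_point, outer_points, sigma=10.0):
    """Encode one inner (foreground) click and several outer (background) clicks
    as two separate guidance channels stacked onto the RGB image."""
    h, w = image.shape[:2]
    inner_map = point_heatmap((h, w), [inner_point], sigma)   # foreground cue
    outer_map = point_heatmap((h, w), outer_points, sigma)    # background cue
    return np.concatenate([image, inner_map[..., None], outer_map[..., None]], axis=2)

# Example: inner click near the object centre, outer clicks at two opposite corners
# of a (hypothetical) tight bounding box around the object.
image = np.zeros((480, 640, 3), dtype=np.float32)
guided = iog_style_input(image, inner_point=(240, 300),
                         outer_points=[(100, 140), (380, 450)])
print(guided.shape)  # (480, 640, 5)
```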
The proposed method incorporates several key components, including the selection of inner and outer guide points and the use of channel attention mechanisms, both inspired by prior research. Regarding the selection of inner and outer guide points, previous studies have demonstrated their effectiveness in various computer vision tasks. Maninis et al. [8] proposed a one-shot video object segmentation approach that leverages inner and outer boundary awareness for accurate segmentation. Similarly, Zheng et al. [9] introduced the concept of inner and outer frames to rethink semantic segmentation from a sequence-to-sequence perspective. These works highlight the importance of considering both inner and outer cues in achieving precise and robust segmentation results. Regarding channel attention mechanisms, researchers have explored different strategies to enhance the modeling capabilities of convolutional neural networks. Hu et al. [10] introduced squeeze-and-excitation networks (SENet), which adaptively recalibrate channel-wise feature responses to capture informative features. Another approach, proposed by Woo et al. [11], is the convolutional block attention module (CBAM), which incorporates both spatial and channel attention mechanisms. Additionally, Zhang et al. [12] presented self-attention generative adversarial networks (SAGANs), which employ self-attention mechanisms to model long-range dependencies in images.
The field of machine learning-based applications, such as object detection, scene segmentation, and salient object detection, relies heavily on understanding and analyzing 2D/3D sensor data. Interactive object segmentation is a crucial task in image editing and medical diagnosis, requiring the accurate separation of target objects from their backgrounds based on user annotation information [13,14,15]. However, existing methods struggle to effectively leverage user guidance information to guide segmentation models. This paper presents a novel interactive image segmentation technique for static images based on multi-level semantic fusion. The proposed method aims to utilize user guidance information both inside and outside the target object to achieve precise segmentation. It can be applied to both 2D and 3D sensor data, making it versatile for various applications. The main contributions of the proposed method can be summarized as follows:
(1) Multi-level semantic fusion: The method incorporates a cross-stage feature aggregation module that facilitates the effective propagation of multi-scale features from previous stages to the current stage. This module mitigates the loss of semantic information caused by multiple upsampling and downsampling operations, enabling the current stage to make more complete use of the semantic information from the previous stage. This multi-level fusion approach enhances the overall segmentation accuracy.
(2) Fine segmentation edges: To address the issue of rough network segmentation edges, the method includes a feature channel attention mechanism. This mechanism captures richer feature details at the channel level, resulting in finer segmentation edges. By emphasizing important features and suppressing less relevant ones, the proposed attention mechanism improves the overall quality of the segmented object boundaries.
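As an illustration of the first contribution, the following PyTorch-style sketch shows one possible form of cross-stage feature aggregation: multi-scale features from the previous stage are resized, projected, and added to the corresponding scales of the current stage so that semantic information survives repeated up- and downsampling. The channel counts, 1×1 projections, and additive fusion are assumptions made for illustration and do not reproduce the exact module used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossStageAggregation(nn.Module):
    """Fuse multi-scale features from a previous stage into the current stage."""

    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        # One 1x1 projection per scale to align the previous-stage features.
        self.project = nn.ModuleList(nn.Conv2d(c, c, kernel_size=1) for c in channels)

    def forward(self, current_feats, previous_feats):
        fused = []
        for proj, cur, prev in zip(self.project, current_feats, previous_feats):
            # Resize the previous-stage feature map to the current resolution,
            # project it, and add it as a residual cue.
            prev = F.interpolate(prev, size=cur.shape[-2:], mode="bilinear",
                                 align_corners=False)
            fused.append(cur + proj(prev))
        return fused

# Example with three scales (roughly 1/4, 1/8, and 1/16 of the input resolution).
cur = [torch.randn(1, 64, 56, 56), torch.randn(1, 128, 28, 28), torch.randn(1, 256, 14, 14)]
prev = [torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32), torch.randn(1, 256, 16, 16)]
outs = CrossStageAggregation()(cur, prev)
print([o.shape for o in outs])
```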
2. Problem Description
In natural image segmentation tasks, achieving fine segmentation of objects at different scales is crucial, since objects in natural images vary greatly in scale. The feature pyramid network (FPN) [16], a fundamental component for detecting objects of different scales, employs a top–down architecture with lateral connections to construct high-level semantic feature maps at all levels; this general-purpose feature extractor performs well in natural image segmentation tasks. Meanwhile, the pyramid scene parsing network (PSPNet) [17] uses a pyramid pooling module to fuse features from multiple scales, effectively exploiting global contextual information. However, the repeated up- and downsampling in such multi-level networks can cause a loss of information, which leads to blurred edges in the segmentation results. Reducing this information loss is therefore a valuable research direction. In addition, previous extreme-point-based IIS methods such as DEXTR [4] do not introduce a background channel as segmentation information, so the model cannot exploit background cues to assist segmentation. Incorporating such interaction information can capture the user’s true intent more effectively and improve both the precision and accuracy of the segmentation results.
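For reference, the multi-scale fusion performed by PSPNet can be sketched as follows: the feature map is pooled onto several grid sizes, each pooled map is projected and upsampled back to the original resolution, and the results are concatenated with the input features to inject global context. The bin sizes and channel reduction below are common defaults and should be read as illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """PSPNet-style pyramid pooling: pool at several scales and concatenate."""

    def __init__(self, in_channels=512, bins=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // len(bins)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),          # pool to a bin_size x bin_size grid
                nn.Conv2d(in_channels, reduced, 1, bias=False),
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        )

    def forward(self, x):
        size = x.shape[-2:]
        pooled = [
            F.interpolate(branch(x), size=size, mode="bilinear", align_corners=False)
            for branch in self.branches
        ]
        # Original features plus global context from every pooling scale.
        return torch.cat([x] + pooled, dim=1)

feats = torch.randn(1, 512, 60, 60)
print(PyramidPooling()(feats).shape)  # torch.Size([1, 1024, 60, 60])
```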
Increasing the network depth can significantly improve the representational capacity of the model, and batch normalization stabilizes the learning process in deep networks by adjusting the input distribution of each layer, resulting in smoother optimization. However, as segmentation networks grow deeper, how to better integrate spatial and channel information at each layer to construct informative features, and how to suppress feature channels that are irrelevant to the segmentation task, have become valuable research questions.
Based on the issues described above, this paper proposes an interactive image segmentation method that utilizes multi-level semantic fusion. The method employs a cross-stage feature fusion strategy to transfer multi-scale feature information from the previous stage to the current stage, so that the current stage can make full use of prior information to extract more discriminative representations. It also introduces channel attention to perceive the relationships between feature channels and assign corresponding weights to them, strengthening the model’s ability to exploit informative features. After training, the network can use global information to increase the weights of features that contribute to the final segmentation accuracy and decrease the weights of those that do not.
5. Discussion
The comparison between the method used in this paper and several currently popular methods on public datasets is shown in Table 4, Figure 9 and Figure 10. The results indicate that the method proposed in this paper achieves the highest IOU while requiring the fewest interactions. This improvement can be attributed to the combination of the cross-stage feature aggregation module and the channel attention module.
The cross-stage feature aggregation module enhances the flow of feature information into the current stage, facilitating the transfer of context and enriching the features available at that stage. The channel attention module, in turn, allows the network to perceive inter-channel relationships and leverages them to improve feature selection, effectively exploiting valuable feature information while disregarding irrelevant information.
This attention mechanism is lightweight and can enhance the network’s ability to select relevant features and redistribute weights without adding significant computational cost. The mechanism allows the network to re-learn the relationships between channels, improving the learning ability of relevant features for the final segmentation result while reducing the impact of irrelevant feature channels. Ultimately, this mechanism enhances the network’s ability to extract features.
Compared to other popular methods, the approach proposed in this paper requires only four clicks to reach an IOU of 85% on the PASCAL public dataset. This finding shows that the method is among the most efficient extreme-point-based methods in terms of the number of clicks required to achieve the same level of accuracy. In addition, the segmentation IOU of various methods was evaluated using only four clicks on the PASCAL public dataset. The results indicate that the method proposed in this paper achieves an IOU of 93.7%, which is 2.1% higher than that of the popular IOG method.
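For clarity, the two quantities reported here can be computed as in the following sketch: the IOU between a predicted and a ground-truth mask, and the number of clicks needed before the IOU first reaches a target such as 85%. The model_step callback is a hypothetical placeholder for whichever interaction strategy an IIS model uses to add clicks.

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def clicks_to_target(model_step, gt, target=0.85, max_clicks=20):
    """Count how many simulated clicks are needed before the IOU reaches `target`.

    `model_step(click_index)` is a hypothetical callback that adds one more click
    and returns the updated predicted mask.
    """
    for k in range(1, max_clicks + 1):
        pred = model_step(k)
        if iou(pred, gt) >= target:
            return k
    return max_clicks

# Toy example: the "model" recovers the ground truth after the fourth click.
gt = np.zeros((100, 100), dtype=np.uint8)
gt[20:80, 30:70] = 1
fake_model = lambda k: gt if k >= 4 else np.zeros_like(gt)
print(clicks_to_target(fake_model, gt))  # 4
```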
Based on Figures 3–10, it can be concluded that the method proposed in this paper achieves highly accurate object segmentation while minimizing user interaction costs. Creating a training set with pixel-level annotations is challenging because annotators must manually delineate target objects in images with pixel-level accuracy, which is especially difficult for irregular and rough edges, such as human hair or animal feathers. Therefore, reducing the cost of manual annotation is essential in interactive segmentation.
Figure 9 demonstrates that the method proposed in this paper requires the fewest user interaction clicks compared to other popular methods. Furthermore, the proposed method achieves a 2.1% increase in segmentation accuracy while reducing user interaction. These results validate the effectiveness of the proposed method in reducing manual interaction and improving segmentation accuracy. The results presented in Figure 11 demonstrate the high accuracy of the segmentation method proposed in this paper, which is capable of accurately segmenting objects of various scales and different parts of the same object. The resulting segmentation mask is both smooth and complete, effectively describing the original shape of the target object. These results highlight the effectiveness of the proposed cross-stage feature aggregation and channel attention modules, which improve the segmentation of multi-scale objects.
In our proposed method, we introduce an attention mechanism to enhance the modeling capabilities of our network. Specifically, we compare our attention module with the SE block used in SENet to demonstrate the effectiveness of our approach. The SE block, proposed by Hu et al. [20], focuses on recalibrating channel-wise feature responses in a network. It consists of a squeeze operation, which globally pools the input feature maps to capture channel-wise statistics, and an excitation operation, which learns a channel-wise excitation function to selectively emphasize informative features. The SE block has been widely adopted in various computer vision tasks and has shown improvements in performance. In our proposed method, we also incorporate a channel attention mechanism to assign appropriate weights to feature channels, allowing the network to focus on relevant and discriminative information. The method with the SE block of SENet achieved an IOU value of 90.2%, while ours achieved 93.7%, which is 3.5% higher.
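A minimal version of the SE block described above, used here purely as the comparison baseline rather than as the attention module proposed in this paper, can be written as follows; the reduction ratio of 16 is the common default and is assumed for this sketch.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling followed by a channel-wise gate."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        squeezed = x.mean(dim=(2, 3))                       # squeeze: global average pooling
        weights = self.excite(squeezed).view(b, c, 1, 1)    # excitation: per-channel weights
        return x * weights                                  # recalibrate feature channels

x = torch.randn(2, 256, 32, 32)
print(SEBlock(256)(x).shape)  # torch.Size([2, 256, 32, 32])
```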
By comparing our proposed attention mechanism with the SE block in SENet, we aim to demonstrate the effectiveness and superiority of our approach. (1) Capturing spatial and channel dependencies: while the SE block focuses on recalibrating channel-wise feature responses, our attention mechanism incorporates both spatial and channel dependencies. It allows our model to capture not only the importance of individual channels but also the spatial context within the feature maps. This comprehensive consideration of spatial and channel dependencies enables our method to better exploit the rich information present in the feature maps. (2) Integration with the network architecture: Our attention mechanism is seamlessly integrated into the network architecture, specifically in the FineNet subnet. It complements the existing structure and enhances its performance by providing refined feature representations. The attention module selectively highlights important features, enabling the network to focus on the most discriminative information for accurate segmentation. (3) Performance improvements: Through extensive experimental evaluations, we demonstrate that our attention mechanism yields superior performance compared to the SE block. The quantitative results show improved IOU, indicating the effectiveness of our approach in capturing and leveraging important features for the segmentation task. This study provides a comprehensive understanding of the benefits and advantages of our attention mechanism over the SE block. These improvements strengthen the credibility of our proposed method and highlight its potential for achieving state-of-the-art performance in interactive image segmentation tasks.
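To illustrate what jointly modeling channel and spatial dependencies can look like in practice, the sketch below applies a channel gate followed by a spatial gate in the style of CBAM. It is a generic example under assumed layer sizes, not the specific attention module integrated into the FineNet subnet of this paper.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Generic CBAM-style block: a channel gate followed by a spatial gate."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # The spatial gate sees the channel-wise average and maximum maps.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: weight each channel by its global statistics.
        cw = self.channel_gate(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        x = x * cw
        # Spatial attention: weight each location using pooled channel descriptors.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * self.spatial_gate(pooled)

x = torch.randn(2, 128, 64, 64)
print(ChannelSpatialAttention(128)(x).shape)  # torch.Size([2, 128, 64, 64])
```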
The qualitative contributions of this study are as follows. (1) The visual quality of segmentation results: We compare the visual quality of segmentation results between our proposed method and related techniques. Through qualitative examples and visual comparisons, we demonstrate how our method achieves more accurate and precise object boundaries. The segmentations produced by our method exhibit smoother and more coherent object contours, resulting in visually appealing results. Additionally, we highlight scenarios where our method effectively handles challenging cases, such as complex backgrounds, object occlusions, or fine details, and produces superior segmentation quality compared to existing methods. (2) Robustness to challenging scenarios: We emphasize the robustness of our proposed method to challenging image conditions. We discuss how our approach handles difficult cases, including objects with low contrast, irregular shapes, and varying lighting conditions. By presenting qualitative examples, we demonstrate that our method maintains segmentation accuracy and robustness across a wide range of challenging scenarios. We show instances where our method successfully handles partial occlusions, object deformations, or variations in object appearance, showcasing its robustness compared to related techniques. (3) Handling of various object classes and shapes: We highlight the versatility of our proposed method in handling different object classes and shapes. Through qualitative comparisons, we discuss examples where our method effectively segments objects of various sizes, aspect ratios, and object categories. We showcase instances where our method accurately captures object boundaries, even for objects with complex shapes or instances with intricate boundaries. This demonstrates the ability of our method to handle diverse object characteristics and highlights its flexibility in different application domains.
The proposed method holds great potential for enhancing medical imaging applications, particularly in tasks such as tumor segmentation, organ delineation, and lesion detection. By leveraging multi-level semantic fusion and user guidance information, the method can accurately separate target objects from their backgrounds in medical images, leading to improved accuracy and reliability in diagnosis and treatment planning. This can aid in the better understanding and analysis of medical data, enabling clinicians to make informed decisions and improve patient care. Additionally, the compatibility of the proposed method with other machine learning techniques for visual semantic analysis allows for seamless integration into existing medical imaging workflows.
Interactive image segmentation also plays a crucial role in perception and scene understanding for autonomous vehicles. The proposed method can be effectively utilized to segment and separate objects of interest from the surrounding environment, enabling more precise and reliable object detection, tracking, and recognition. By incorporating multi-level semantic fusion and user guidance information, the method can enhance the accuracy and robustness of object segmentation in complex driving scenarios. This, in turn, contributes to improved situational awareness, enabling autonomous vehicles to make more informed decisions and navigate safely. The compatibility of the proposed method with 3D sensor data further enhances its applicability in autonomous driving systems.
In both medical imaging and autonomous vehicles, the benefits of the proposed method lie in its ability to leverage user guidance information, both inside and outside the target objects, and its multi-level semantic fusion capabilities. These aspects enable more accurate and reliable object segmentation, leading to improved performance in various real-world applications.
In our study, the exact time needed for segmentation can vary depending on several factors, including the complexity of the images, the computational resources available, and the specific segmentation algorithm employed. As we conducted our experiments on a high-performance computing platform, we were able to achieve efficient segmentation times. However, it is important to note that the computational time may differ based on the hardware and software configurations used.
It is important to note that while we discussed medical imaging and autonomous vehicles as examples, the proposed method’s applicability extends to other domains where interactive image segmentation is crucial, such as robotics, augmented reality, and computer-aided design. Overall, the interactive image segmentation method based on multi-level semantic fusion has the potential to significantly enhance a wide range of real-world applications, enabling more accurate and reliable analysis of visual data and empowering advanced decision-making processes.