License: CC BY-SA 4.0
arXiv:2402.17370v1 [cs.CV] 27 Feb 2024

An Efficient MLP-based Point-guided Segmentation Network for Ore Images with Ambiguous Boundary

Guodong Sun, Yuting Peng, Le Cheng, Mengya Xu, An Wang, Bo Wu, Hongliang Ren, and Yang Zhang

Corresponding author: Yang Zhang.
G. Sun, Y. Peng, L. Cheng, and Y. Zhang are with the School of Mechanical Engineering, Hubei University of Technology, Wuhan 430068, China (e-mail: sunguodong@hbut.edu.cn; pyt181@hbut.edu.cn; cl@hbut.edu.cn; yzhangcst@hbut.edu.cn).
M. Xu and H. Ren are with the Department of Biomedical Engineering, National University of Singapore (NUS), Singapore 117575, Singapore (e-mail: mengya@u.nus.edu; hlren@ieee.org).
A. Wang, H. Ren, and Y. Zhang are also with the Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China (e-mail: wa09@link.cuhk.edu.hk).
B. Wu is with the Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China (e-mail: wubo@sari.ac.cn).

Abstract

The precise segmentation of ore images is critical to the successful execution of the beneficiation process. Because ores have a homogeneous appearance, which leads to low contrast and unclear boundaries, accurate segmentation and recognition are challenging. This paper proposes a lightweight framework based on the Multi-Layer Perceptron (MLP) that focuses on solving the problem of edge blurring. Specifically, we introduce a lightweight backbone better suited for efficiently extracting low-level features. In addition, we design a feature pyramid network consisting of two MLP structures that balance local and global information, thus enhancing detection accuracy. Furthermore, we propose a novel loss function that guides the prediction points to match the instance edge points to achieve clear object boundaries. We have conducted extensive experiments to validate the efficacy of our proposed method. Our approach achieves a processing speed of over 27 frames per second (FPS) with a model size of only 73 MB. Moreover, when tested on the ore image dataset, our method delivers consistently high accuracy compared with current state-of-the-art techniques, with scores of 60.4 and 48.9 in $AP_{50}^{box}$ and $AP_{50}^{mask}$, respectively. The source code will be released at https://github.com/MVME-HBUT/ORENEXT.

Index Terms

Instance segmentation, Point guidance, Edge processing, Local correlation, Ore image.

1 Introduction

Ore particle size analysis is an important part of ore processing tasks. Particle size information is an essential indicator to judge the effectiveness of the crusher, and it can serve as a guide for adjusting the process parameters of each process. Accurate ore segmentation is a significant prerequisite for particle size statistical analysis. The production site for beneficiation presents a complex environment with several challenging factors, such as ore stacking, ore adhesion resulting from dry-wet mixing, and variations in lighting conditions, as depicted in Fig. 1. These factors pose a significant challenge to achieving accurate ore segmentation. Moreover, the beneficiation scene is often arranged in the field, so the limitation of equipment resources is also a problem faced by the ore particle size analysis system.

Traditional image processing methods perform the segmentation of ore by setting thresholds, cluster analysis, and edge detection. With the development of convolutional neural networks (CNN), the algorithms based on deep learning have shown significant advancements in automatic feature extraction and generalization performance [1, 2, 3, 4, 5], so they have gradually gained a leading position in industrial image processing. Most existing instance segmentation methods are implemented on public datasets such as MS COCO [6]. These methods are not effective for ore image processing. It can be seen from Fig. 1(c) that the baseline framework is less capable of segmenting the ore edges. The processing of ore images is significantly challenged by complex working environments and the presence of diverse feature information.

Figure 1: Visualizations of the feature maps: (a) input, (b) ground truth, (c) baseline, (d) ours. The ore stacking in the input image makes the boundaries difficult to distinguish. The edge features of the feature maps obtained by the baseline are blurred, whereas our network yields consistent instances with clearer edges.

Aiming at the specific needs of ore segmentation, U-Net was used for the first time to segment broken stones. However, such complex CNN-based methods have high computational costs, large model sizes, and low accuracy. Moreover, in complex environments, existing algorithmic frameworks tend to overlook the edge blur caused by ore adhesion and shadowing in ore images. In this work, we focus on solving this problem and design an efficient network with less computational overhead, fewer parameters, faster inference, and better performance.

Recently, it has been found that MLP-based architectures can achieve results comparable to CNN and Transformer methods with less computation. MLP-Mixer [7] proposed token-mixing and channel-mixing MLPs to allow interaction between spatial locations and channels. ResMLP [8] used cross-patch and cross-channel sublayers as components. Inspired by these works, we propose OreNeXt, an instance segmentation model based on the MLP framework for the ore edge problem. A lightweight MLP backbone network is introduced for feature extraction, followed by a feature pyramid network that uses a SparseMLP module to enhance the semantic information. We then introduce an edge-guided loss function. Using MLP and a simple hybrid mechanism, we obtain a lightweight model suitable for deployment.

To implement this framework and solve the above problems, we use the two-stage detector PointRend [9] as the baseline for the ore task. For ore tasks, the low-level features of overlapping ore edges are crucial. Therefore, we propose a novel backbone network, StoneMLP, which incorporates shifting operations to extract local information corresponding to different axial shifts. A sparse feature pyramid network is proposed to strengthen the information of small targets, which are easily ignored or misjudged, while maintaining sufficient semantic information. Furthermore, we propose a loss that includes a detection loss, which predicts the foreground score of each point, and a mask loss, which performs edge guidance by dynamically adjusting vertex pairing. As illustrated in Fig. 1(d), our proposed framework is more likely to obtain clear ore segmentation results. Experiments on the ore image dataset demonstrate that our framework outperforms state-of-the-art methods with a smaller model size and faster inference speed. Our main contributions are summarized as follows.

  1) A lightweight image segmentation network, OreNeXt, is designed, which solves the problem of blurred edges in ores by guiding the boundary points.

  2) We propose a lightweight backbone, StoneMLP, for capturing local correlation and introduce a semantic enhancement network, SparseFPN.

  3) A loss function that matches edges by guided points is introduced, significantly improving the quality of predicted boundary details.

  4) The experimental results validate the efficacy of our method in enhancing the performance of ore image segmentation tasks, while simultaneously reducing the number of parameters and improving inference speed.

The remainder of this paper is structured as follows. Section 2 introduces an overview of existing mineral image segmentation methods and MLP architecture development. Section 3 presents our overall framework, including two new structures and a new loss function. To validate the effectiveness of our method, in Section 4, we conduct integrated experiments. The paper is concluded in Section 5.

2 Related works

2.1 Ore image segmentation

Accurate individual ore segmentation is a crucial prerequisite for granularity statistics and an essential component in ore processing tasks. With the development of CNNs, ore image processing has gradually shifted from traditional image processing techniques to deep learning-based segmentation algorithms. For instance, Sun et al. [10] proposed an efficient instance segmentation algorithm to split ore. Dipti et al. [11] proposed an image segmentation system for estimating the granularity of oil sandstones. Ore images on the conveyor belt often suffer from ore adhesion and occlusion caused by dust and soil. These factors blur multiple ore edges in the image or make them disappear completely, posing significant challenges to accurate individual ore segmentation. To address this edge problem, numerous models have been employed to tackle under-segmentation in ore processing [12]. Li et al. [13] proposed a model based on U-Net, which alleviated the problem of ore granularity detection by improving the loss function and using watershed technology. Liu et al. [14] developed RLPNet to obtain high-precision segmentation images by reducing the interference of complex textures and enhancing the expression of edge features. With the emergence of the Transformer, numerous Transformer-based designs have been proposed to effectively separate adhesive ore [15], but their network structures are often complex and not suitable for outdoor deployment. Although these methods have improved the segmentation of ore images, clear edge segmentation remains a problem.

2.2 Instance segmentation

Currently, instance segmentation methods can be divided into two categories: one-stage and two-stage. One-stage methods, such as YOLACT [16] and BlendMask [17], mainly use a simple fully convolutional network (FCN) architecture for mask prediction without region of interest (RoI) pooling, proposal generation, or feature re-pooling; they are fast but less accurate. Two-stage instance segmentation methods first detect bounding boxes and then perform segmentation within each RoI. Mask R-CNN [18] adds a mask branch to the Faster R-CNN [19] architecture. Mask-Refined R-CNN [20] refines the detailed information of the target and considers the relationship between the edge pixels of the object. PointRend [9] samples feature points with low confidence scores and further improves their labels using a shared MLP. Cascade Mask R-CNN [21] improves segmentation accuracy through cascading. Since most two-stage methods use RoI Align operations, spatial feature information is lost for large targets, especially at the edges. Therefore, rough target edge prediction is a common problem for two-stage instance segmentation algorithms.

Figure 2: A schematic overview of OreNeXt. The input image is fed into the lightweight backbone StoneMLP (Fig. 3) to produce feature maps. StoneMLP captures local dependencies and extracts edge information through horizontal and vertical shift operations (Fig. 4). Then, the feature maps enter the SparseFPN to generate multi-scale information-integrated feature maps. Our improved FPN structure adds two SparseMLP modules (Fig. 5), which are divided into three parallel branches for feature fusion through weighted summation. The addition of two sparsely connected SparseMLP modules allows both local and global features to be taken into account. Next, each layer's feature map is computationally fused by the region proposal network (RPN) to obtain the fused feature map $S_i$ (the $i$-th feature map). Finally, the point head uses interpolated features computed from the fine-grained features of the CNN feature maps ($S_i$) and the coarse prediction mask for subdivision prediction.

2.3 MLP-Based Models

With the success of the transformer in natural language processing, researchers have started exploring how to apply it to vision. ViT [22] pioneered the approach of splitting images into non-overlapping patches that serve as tokens, enabling the use of the transformer framework for image processing. Swin Transformer [23] proposed a local group self-attention mechanism that incorporates locality. Inspired by the elegant structure of ViT, MLPs with simpler network structures have also received a lot of attention in computer vision. MLP-Mixer [7], developed by Google, replaces the self-attention mechanism with MLPs to construct a pure MLP architecture. ResMLP [8] used cross-patch and cross-channel sublayers as components. However, most existing MLP-based methods cannot adapt to varying image resolution, making them difficult to apply to downstream tasks. AS-MLP [24] and CycleMLP [25] both use shifts of the feature maps to integrate local information, enabling pure MLP architectures to be used for downstream tasks. These MLP-based models have a similar overall structure but differ in the detailed design of their main modules. After MLP-Mixer [7] was proposed, researchers began to explore MLP architectures more widely. With increased computational capacity and the availability of larger datasets, the old and simple MLP structure can also achieve effective performance improvements in various vision tasks. By design, these MLP-based models no longer rely on prior knowledge of two-dimensional images but on long-range interactions over the input feature maps. However, this rarely makes full use of local information, and the extraction of low-level features is also important for many tasks.

3 Method

In this section, we first introduce the overall structure of the network. We then present the StoneMLP backbone, describe the lightweight SparseFPN, and explain the improvements to the detection head.

3.1 Overall Framework

In the ore image segmentation task, there is a problem of edge blur caused by the adhesiveness of ore images. Therefore, we propose a lightweight instance segmentation model with a point-based prediction strategy to address the edge problem, which utilizes a lightweight backbone and feature pyramid network (FPN) structure based on MLP to extract high-quality edge features. To further enhance the accuracy of edge prediction, we introduce an Edge Guidance Loss that dynamically aligns the predicted edge with the instance edge.

OreNeXt is a two-stage instance segmentation model whose network structure is shown in Fig. 2. The model mainly consists of a lightweight MLP backbone, a SparseFPN, and a fine detection head. To better capture low-level features, we design StoneMLP as the backbone for extracting features from ore images. By moving the feature information axially, information flows in different directions can be obtained, which helps to capture the local correlation between overlapping ores. The SparseFPN is used to extract hierarchical features. The feature maps are processed in the SparseMLP to efficiently obtain a global receptive field, and this global information helps to describe semantic information better. Furthermore, we use a lightweight segmentation head to generate coarse prediction features for each detected object. For the fine-grained features from the CNN feature maps ($S_i$), we perform bilinear interpolation at the coordinate points on the feature maps to compute the feature vectors. The fine-grained features, which provide object detail information, are then combined with the coarse prediction features, which provide global context information. The fused features are fed into the prediction head for point-wise segmentation.
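The point-wise feature construction described in this subsection can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `bilinear_sample` and `point_feature`, the channel-first nested-list layout, and the convention that integer coordinates fall exactly on grid points are assumptions made for the example, as is the simplification that the fine and coarse maps share one coordinate frame.

```python
import math

def bilinear_sample(fmap, x, y):
    """Sample a C x H x W feature map (nested lists) at continuous (x, y).

    Assumption: x indexes columns and y indexes rows, in pixel units,
    with integer coordinates falling exactly on grid points.
    """
    h, w = len(fmap[0]), len(fmap[0][0])
    # Clamp so the four neighbours always lie inside the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    out = []
    for ch in fmap:
        top = ch[y0][x0] * (1 - dx) + ch[y0][x1] * dx
        bottom = ch[y1][x0] * (1 - dx) + ch[y1][x1] * dx
        out.append(top * (1 - dy) + bottom * dy)
    return out

def point_feature(fine_map, coarse_map, x, y):
    """Concatenate fine-grained and coarse-prediction features at one point."""
    return bilinear_sample(fine_map, x, y) + bilinear_sample(coarse_map, x, y)

fine = [[[0.0, 1.0], [2.0, 3.0]]]      # 1 channel, 2 x 2 (hypothetical values)
coarse = [[[1.0, 1.0], [0.0, 0.0]]]    # 1 channel, 2 x 2 (hypothetical values)
feat = point_feature(fine, coarse, 0.5, 0.5)
```

Here `feat` holds one fine-grained value followed by one coarse value for the sampled point; in a real point head this concatenated vector would be the input to the shared MLP.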

3.2 StoneMLP

We propose a backbone network StoneMLP based on the architecture of MLP. Compared to convolution and attention networks, the MLP structure has a lower inductive bias. It allows the model to learn solely from raw data, thereby making the network more concise. We propose a spatial shift method that enables the network to obtain different directions of information flow, achieving the same field of view as CNN and completing feature extraction. Instance segmentation of ore images necessitates the utilization of low-level features, including contours, textures, colors, and shapes. Our proposed spatial movement method effectively captures local correlations by prioritizing the extraction of these low-level features, thereby yielding more comprehensive edge information.

Fig. 3 shows the architecture of our StoneMLP model. ResMLP [8], for example, has shown that self-attention is not the key factor behind the excellent performance of transformers. Further research has shown that self-attention modules can be replaced by MLP, convolution, or other layers, indicating that the success of the transformer may come from the overall architecture design. Therefore, we follow the Transformer [26] in designing the overall structural framework. StoneMLP takes RGB images $I \in \mathbb{R}^{3 \times H \times W}$ as input and slices them into patches of size $4 \times 4$. The tokens obtained at this stage have dimensions $48 \times \frac{H}{4} \times \frac{W}{4}$. The StoneMLP block contains four fully connected layers, which correspond to four channel projections for communicating specific location information. Channel projections, vertical movement, and horizontal movement are used to extract features in the Pixel Shift operation. The time and computational costs of the axial movement are minimal. The computational complexity is $\Omega(\mathrm{StoneMLP}) = 4HWC^{2}$.
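As a rough check of the token dimensions and the complexity figure quoted above, the following pure-Python sketch partitions a toy image into $4 \times 4$ patches and evaluates $4HWC^{2}$. The helper names `patch_tokens` and `stonemlp_block_cost` are hypothetical, and the channel ordering inside each token is an arbitrary choice for illustration.

```python
def patch_tokens(image, patch=4):
    """Split a C x H x W image (nested lists) into non-overlapping patches.

    Returns a (C * patch * patch) x (H / patch) x (W / patch) token map.
    Assumption: H and W are divisible by `patch`.
    """
    c, h, w = len(image), len(image[0]), len(image[0][0])
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    tokens = []
    for ch in range(c):
        for dy in range(patch):
            for dx in range(patch):
                # One output channel per (input channel, in-patch offset) pair.
                tokens.append([
                    [image[ch][py * patch + dy][px * patch + dx]
                     for px in range(w // patch)]
                    for py in range(h // patch)
                ])
    return tokens

def stonemlp_block_cost(h, w, c):
    """Cost of four C -> C channel projections over an H x W map: 4 * H * W * C^2."""
    return 4 * h * w * c * c

# Toy 3 x 8 x 8 "RGB" image with hypothetical pixel values.
image = [[[float(r * 8 + col) for col in range(8)] for r in range(8)]
         for _ in range(3)]
tokens = patch_tokens(image)
token_shape = (len(tokens), len(tokens[0]), len(tokens[0][0]))
```

For the 8 x 8 toy input, `token_shape` is 48 x 2 x 2, matching the $48 \times \frac{H}{4} \times \frac{W}{4}$ layout stated in the text.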

Fig. 4 illustrates the horizontal movement. The features extracted for the dashed box are used by the following channel projection. The vertical movement operation is very similar to the horizontal movement. Combining the horizontal and vertical movements, the features at different positions are realigned on a channel. During the next channel projection operation, information from both directions is integrated to obtain the result after local communication. Given the input $X$ and the displacement size, the output $Y_S$ is obtained as

$$Y_S = X^{h} W_c^{h} + X^{v} W_c^{v}, \tag{1}$$

where $W_c^{h}$ and $W_c^{v}$ denote the learnable weights of the channel projection in the horizontal and vertical directions, respectively, and $X^{h} = \mathrm{Concat}(X_1^{h}, \ldots, X_s^{h})$ and $X^{v} = \mathrm{Concat}(X_1^{v}, \ldots, X_s^{v})$ represent the axially shifted features.
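Eq. (1) can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the authors' code: the grouping of channels into `shift_size` contiguous groups, the per-group offset `g - shift_size // 2`, and the zero padding follow the convention of AS-MLP [24] and are assumed here; the function names are hypothetical.

```python
def axial_shift(x, shift_size, axis):
    """Shift channel groups of a C x H x W map along one axis with zero padding.

    Assumption: channels are split into `shift_size` contiguous groups and
    group g is displaced by (g - shift_size // 2) positions.
    axis = 1 shifts along height (vertical), axis = 2 along width (horizontal).
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    group = -(-c // shift_size)  # ceil(c / shift_size) channels per group
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for ch in range(c):
        offset = ch // group - shift_size // 2
        for i in range(h):
            for j in range(w):
                si, sj = (i - offset, j) if axis == 1 else (i, j - offset)
                if 0 <= si < h and 0 <= sj < w:
                    out[ch][i][j] = x[ch][si][sj]
    return out

def channel_projection(x, weight):
    """Apply a C_in x C_out weight matrix at every spatial position."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    c_out = len(weight[0])
    return [[[sum(x[k][i][j] * weight[k][o] for k in range(c_in))
              for j in range(w)] for i in range(h)] for o in range(c_out)]

def shift_mix(x, w_h, w_v, shift_size=3):
    """Y_S = X^h W_c^h + X^v W_c^v for a single C x H x W input."""
    y_h = channel_projection(axial_shift(x, shift_size, axis=2), w_h)
    y_v = channel_projection(axial_shift(x, shift_size, axis=1), w_v)
    return [[[a + b for a, b in zip(row_h, row_v)]
             for row_h, row_v in zip(ch_h, ch_v)]
            for ch_h, ch_v in zip(y_h, y_v)]

# Three channels, each holding the values 1..9 on a 3 x 3 grid.
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]] for _ in range(3)]
ones = [[1.0] for _ in range(3)]   # 3 -> 1 projection that sums the channels
y = shift_mix(x, ones, ones)
```

With the all-ones projection, the centre output is the sum of the left, centre, and right neighbours plus the upper, centre, and lower neighbours, which shows how a channel projection after the shifts mixes information from a cross-shaped local neighbourhood.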

Figure 3: Architecture of StoneMLP. The proposed StoneMLP block mainly includes Norm, Pixel Shift operation, MLP, channel projection, and residual connection.
Figure 4: (a) Horizontal shift and (b) vertical shift, where the arrows indicate the steps, and the number in each box is the index of the feature.

3.3 SparseFPN

We propose a new SparseFPN that improves inference speed while maintaining accuracy. Traditional FPN structures generate multi-scale integrated features, but because of the multi-level feature fusion, small targets are easily missed or misidentified. Therefore, we append two SparseMLP modules after the traditional FPN and use a $1 \times 1$ convolution to output features. MLPs can handle more complex nonlinear relationships and learn more abstract feature representations.

The structure of SparseMLP is shown in Fig. 5(a). It is an MLP-based module with a parallel structure composed of three parts: W channel mapping, H channel mapping, and identity mapping. In the horizontal mixing path, features are reshaped and information is mixed within each row. In the vertical mixing path, the same operation is applied within each column. The three parallel branches are concatenated along the channel dimension, and feature fusion is achieved by weighted summation and a $1 \times 1$ convolution, producing an output tensor $X^{out} = FC(\mathrm{concat}(X_H, X_W, X))$ with the same dimensions as the input tensor. SparseMLP modules avoid the overfitting problem that commonly affects the performance of MLP-based models. Each token interacts directly only with tokens in the same row or column, and each row and column can share the same projection weights. As shown in Fig. 5(b), one SparseMLP layer forms a cross-shaped receptive field, and two SparseMLP layers form a global receptive field. That is, if this module is repeated twice, each token can aggregate information from the entire two-dimensional space, improving accuracy while reducing the computational complexity to $\Omega(\mathrm{SparseMLP}) = HWC(H+W) + 3HWC^{2}$.
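The receptive-field argument can be checked with a small reachability sketch. It models only which positions can exchange information, not the learned weights; `sparse_mix_reach` and `sparse_mlp_cost` are hypothetical helper names, and the grid size is arbitrary.

```python
def sparse_mix_reach(reach, h, w):
    """Propagate reachability through one SparseMLP-style layer.

    `reach` is the set of (row, col) positions that already influence a token.
    One layer lets every position mix with all positions in its row and
    in its column (plus itself through the identity branch).
    """
    new = set(reach)
    for (r, c) in reach:
        new.update((r, j) for j in range(w))  # mixing along the row
        new.update((i, c) for i in range(h))  # mixing along the column
    return new

def sparse_mlp_cost(h, w, c):
    """Cost from the text: H*W*C*(H + W) + 3*H*W*C^2."""
    return h * w * c * (h + w) + 3 * h * w * c * c

H, W = 5, 7
one_layer = sparse_mix_reach({(2, 3)}, H, W)      # cross through (2, 3)
two_layers = sparse_mix_reach(one_layer, H, W)    # every position on the grid
```

After one layer the token at (2, 3) reaches the 11 positions of its row and column; after two layers it reaches all 35 positions, which is the global receptive field described above.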

Figure 5: SparseMLP. (a) The SparseMLP block consists of three branches: two mix information along the horizontal and vertical directions, respectively, and the third is an identity mapping. (b) The receptive fields generated by two consecutive SparseMLPs.

3.4 Edge Guidance Loss

To address the uncertainty resulting from the blurriness of edges in ore images, we use PointRend [9] as the baseline and perform a recomputation of boundary hard pixels. We adopt the point selection and point-wise representation strategies to prioritize uncertain edge points for optimization of image segmentation for object edges. Additionally, we propose the Edge Guidance Loss function to improve the model performance. The loss can be divided into the following four parts:

$$L_{EG} = \underbrace{L_{cls}^{b} + \alpha L_{ploc}^{b}}_{L_{det}} + \underbrace{\beta L_{coarse}^{m} + L_{pmat}^{m}}_{L_{mask}}, \tag{2}$$

where $\alpha$ and $\beta$ are the weights of $L_{ploc}^{b}$ and $L_{coarse}^{m}$, which are generally set to 0.5 and 1.0, respectively, in the experiments. To improve the success rate of edge detection and segmentation, in addition to the classification loss and the coarse mask loss, we introduce an Edge point Guidance match Loss and improve the target box offset loss to increase positioning accuracy.
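The way the four terms of Eq. (2) combine can be written as a short sketch. The function name `edge_guidance_loss` and the numeric inputs are hypothetical; only the default weights of 0.5 and 1.0 come from the text.

```python
def edge_guidance_loss(l_cls, l_ploc, l_coarse, l_pmat, alpha=0.5, beta=1.0):
    """Combine the four loss terms of Eq. (2).

    Returns (total, detection part, mask part) so that each group
    can be logged separately during training.
    """
    l_det = l_cls + alpha * l_ploc       # detection branch
    l_mask = beta * l_coarse + l_pmat    # mask branch
    return l_det + l_mask, l_det, l_mask

# Hypothetical per-batch loss values, chosen only to illustrate the weighting.
total, l_det, l_mask = edge_guidance_loss(0.5, 1.0, 0.25, 0.25)
```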

In our method, $L_{cls}^{b}$ and $L_{coarse}^{m}$ are defined as standard cross-entropy loss functions with softmax activation. We treat the foreground points as positive samples and the other points as negative samples. The new regression localization loss $L_{ploc}^{b}$ is responsible for predicting the foreground score of the prediction point:

$$L_{ploc}^{b}(R, R') = \frac{1}{n} \sum_{i=1}^{n} \left\| (x_i, y_i) - (x_i', y_i') \right\|_2. \tag{3}$$

This loss takes the predicted points and the real bounding box as input and measures point-to-point distances. The coordinates of the predicted points are first sorted along the x or y direction to obtain new coordinates. The distance between each point and the four borders is then calculated to give the boundary loss of each border point, and the final loss is obtained by weighted addition.
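A minimal sketch of the distance term in Eq. (3) is given below. It covers only the mean Euclidean distance between paired points and leaves out the coordinate sorting and the per-border weighting, whose details are not specified here; the name `point_loc_loss` and the sample coordinates are hypothetical.

```python
import math

def point_loc_loss(pred_points, target_points):
    """Mean Euclidean distance between paired predicted and target points."""
    if not pred_points or len(pred_points) != len(target_points):
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.hypot(px - tx, py - ty)
                for (px, py), (tx, ty) in zip(pred_points, target_points))
    return total / len(pred_points)

# Hypothetical points: the first is 5 units from its target, the second matches.
loss = point_loc_loss([(0.0, 0.0), (3.0, 4.0)], [(3.0, 4.0), (3.0, 4.0)])
```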

Figure 6: Edge point Guidance match Loss. The yellow points represent predicted vertices, while the blue points indicate labeled vertices. The arrows depict the path, representing the pairing relationship. Each prediction point is adjusted to the nearest point on the truth boundary.

The edge guidance match loss $L_{pmat}^{m}$ is the average of the traditional cross-entropy loss function $L_{pcls}$ and the edge point guidance match loss $L_{pg}$:

$$L_{pmat}^{m} = \frac{1}{2} \left( L_{pcls} + L_{pg} \right). \tag{4}$$

As shown in Fig. 6, $L_{pg}$ first calculates the distance between the prediction point $Pred_i^{in}$ and the real border points, finds the target point $gt_X^{ipt}$ nearest to the prediction point according to this distance, and interpolates the target points to a specified number of points using an interpolation function to obtain the offset difference $X_i^{*}$. This also increases the classification accuracy of the target and makes the target smoother.

$$X_i^{*} = \arg\min_{X} \left\| Pred_i^{in} - gt_X^{ipt} \right\|_2. \tag{5}$$

Then, the predicted points $Pred_{i}^{out}$ and the matched nearest interpolated label points $gt_{X_{i}^{*}}^{ipt}$ are dynamically matched using the smooth L1 loss to supervise the quality of boundary prediction. The edge point guidance match loss $L_{pg}$ is smooth for small errors and robust to large errors, so it is less affected by outliers.

$$L_{pg}=\frac{1}{N}\sum_{k=1}^{n}\left\|Pred_{i}^{out}-gt_{X_{i}^{*}}^{ipt}\right\|_{1}. \tag{6}$$
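
The matching in Eq. (5) and the loss in Eq. (6) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the released implementation: the function names (`interpolate_polygon`, `nearest_index`, `point_guidance_loss`), the uniform perimeter resampling used as the interpolation function, and the default of 32 interpolated points are all hypothetical. The sketch uses the plain L1 distance exactly as written in Eq. (6); the smooth L1 variant mentioned in the text would replace the absolute differences.

```python
import math

def interpolate_polygon(points, num_points):
    """Resample a closed border polygon to num_points points spaced evenly along its perimeter.
    (Assumed interpolation function; the paper does not specify which one is used.)"""
    n = len(points)
    # Length of every edge, including the closing edge back to the first vertex.
    seg = [math.dist(points[i], points[(i + 1) % n]) for i in range(n)]
    total = sum(seg)
    out = []
    edge, walked = 0, 0.0  # current edge index and perimeter length covered before it
    for k in range(num_points):
        target = total * k / num_points
        while edge < n - 1 and walked + seg[edge] < target:
            walked += seg[edge]
            edge += 1
        t = (target - walked) / seg[edge] if seg[edge] > 0 else 0.0
        (x0, y0), (x1, y1) = points[edge], points[(edge + 1) % n]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

def nearest_index(pred, targets):
    """Eq. (5): index of the interpolated target point closest to pred in L2 distance."""
    return min(range(len(targets)), key=lambda j: math.dist(pred, targets[j]))

def point_guidance_loss(pred_in, pred_out, gt_border, num_interp=32):
    """Eq. (6): mean L1 distance between refined predictions and their matched targets."""
    targets = interpolate_polygon(gt_border, num_interp)
    total = 0.0
    for p_in, p_out in zip(pred_in, pred_out):
        gx, gy = targets[nearest_index(p_in, targets)]  # match on the input point
        total += abs(p_out[0] - gx) + abs(p_out[1] - gy)  # penalize the output point
    return total / len(pred_in)

# Toy example: a 4x4 square border resampled to 8 target points.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
loss = point_guidance_loss([(2.1, 0.2)], [(2.0, 0.5)], square, num_interp=8)
```

In the toy example, the input point (2.1, 0.2) is matched to the interpolated target (2, 0), so the loss is the L1 distance from the output point (2.0, 0.5) to that target.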

4 Experiments

In this section, we introduce the hardware environment for ore processing tasks, the datasets, the evaluation metrics, and the implementation details. Next, we conduct ablation studies to evaluate the effectiveness of the design decisions. Finally, we compare OreNeXt with state-of-the-art instance segmentation methods on the ore dataset.

4.1 Hardware Environment

The actual scene of the ore processing task is depicted in Fig. 7, which shows significant adhesion between ores and blurred edges. The detection system identifies the size indexes of ores on the conveyor belt, providing insight into the effectiveness of the upstream processes and helping to identify over-specification ores. Ore detection on the production line is performed through sampling inspection. Traditional manual screening randomly selects a set of ores as a sample to assess their compliance with the required standards. The online detection method based on deep learning instead treats the complete ores visible on the surface of the conveyor belt, as captured by an industrial camera, as the representative sample, so we only need to segment independent and complete ores to obtain their particle size information. Therefore, precise segmentation of the surface ores is crucial, particularly accurate identification of the edges of overlapping ores. Fig. 7 also shows that ore processing is typically conducted outdoors, where hardware resources are limited and high-performance computing equipment is impractical. Consequently, model size becomes a critical factor in ore processing scenarios with constrained computing resources. On the ORE dataset, we verify that, compared with current segmentation methods, our method achieves optimal accuracy while ensuring minimal memory usage.

Figure 7: Conveyor belt at the mine production site: (a) production site; (b) conveyor belt. The site environment is simple and the computing resources are poor. There is serious adhesion between the ores, and the edges are blurred.

4.2 Experiments Setup

4.2.1 Datasets

In the experiments, we use ImageNet [27] to pre-train the backbone model and train the instance segmentation models on our own rock dataset. For the ORE image dataset, we use 4,060 images for training and 1,060 images for validation. We supplement the dataset in the following ways. First, we collect rock images at different scales using our experimental platform. Then, we vary the arrangement of the rocks at each scale, for example sparse or dense. In addition, we split large images into smaller ones that can be used to train networks, cropping them with a sliding window to further expand the dataset.
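
The sliding-window cropping described above can be sketched as follows. The window size and stride are not stated in the text, so the values in the example (320 and 160) and the function name `sliding_window_boxes` are purely illustrative assumptions. The sketch only computes crop coordinates, and the ore annotations would have to be clipped to each crop in the same way.

```python
def sliding_window_boxes(width, height, window, stride):
    """Return (left, top, right, bottom) crop boxes that cover a width x height image."""
    def starts(size):
        # Window start offsets along one axis.
        if size <= window:
            return [0]
        pos = list(range(0, size - window + 1, stride))
        if pos[-1] != size - window:
            pos.append(size - window)  # extra window flush with the far edge
        return pos

    return [
        (x, y, min(x + window, width), min(y + window, height))
        for y in starts(height)
        for x in starts(width)
    ]

# Hypothetical example: a 640x320 image, 320-pixel windows, 50% overlap.
boxes = sliding_window_boxes(640, 320, window=320, stride=160)
```

Overlapping windows (a stride smaller than the window) yield more training crops from one large image, at the cost of correlated samples.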

4.2.2 Evaluation Metrics

To confirm the efficiency of the proposed method, seven evaluation indicators are used: $AP^{box}$, $AP_{50}^{box}$, $AP^{mask}$, $AP_{50}^{mask}$, inference speed (FPS), inference memory consumption, and model size. The first four indicators evaluate accuracy. AP stands for Average Precision, and the number after AP is the IoU (Intersection over Union) threshold, where IoU measures the degree of overlap between the predicted box and the ground-truth box. In general, a higher $AP_{50}$ indicates better performance of the algorithm. The latter three metrics evaluate whether a model can be deployed on low-cost hardware. The computational cost of the model is quantified by its inference speed in FPS, which we aim to maximize. Memory consumption and model size reflect the model's reliance on the hardware system, and we strive to minimize them while maintaining accuracy.
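
As a minimal illustration of the IoU threshold behind $AP_{50}$, the sketch below computes the IoU of two axis-aligned boxes and checks it against 0.5. The helper names `box_iou` and `is_true_positive` are hypothetical. The full AP computation (ranking predictions by score and integrating the precision-recall curve) is not shown, and mask IoU is computed analogously over pixels rather than box areas.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero when boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a match at AP50 when its IoU with the ground truth is >= 0.5."""
    return box_iou(pred_box, gt_box) >= threshold

gt = (0, 0, 10, 10)
print(box_iou((2, 0, 12, 10), gt), is_true_positive((2, 0, 12, 10), gt))
print(box_iou((5, 0, 15, 10), gt), is_true_positive((5, 0, 15, 10), gt))
```

The first prediction overlaps the ground truth with IoU 80/120 and is accepted, while the second overlaps with IoU 50/150 and is rejected at the 0.5 threshold.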

4.2.3 Implementation Details

OreNeXt is trained via back-propagation with the SGD (stochastic gradient descent) optimizer. For the ORE image dataset, all models shown in Table 1 are trained for 12k iterations on an NVIDIA RTX2080Ti GPU. We set the initial learning rate to 0.001, reduce it to 1/10 of its value at 8k iterations, and use a batch size of 8 for all ablation studies. During training, the input images are resized to have short sides between 160 and 320 pixels and long sides of 320 pixels. The experiments on the MS COCO dataset, the pre-training of StoneMLP, and the SAM-related experiments are all run on an NVIDIA RTX3090Ti GPU. To ensure fairness and reproducibility, a fixed random seed is used, and the hyperparameters of all models are set according to the best values provided in their respective papers. The parameters that achieve the highest accuracy on the validation set are preserved for prediction.
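
The schedule and resizing described above can be written down as two small helper functions. This is a sketch under assumptions: the names `step_lr` and `resized_shape` are hypothetical, the decay is assumed to take effect from iteration 8000 onward, and the resize rule (scale the short side to a sampled target, then cap the long side at 320) is one common reading of the stated side lengths rather than a confirmed detail of the training code.

```python
def step_lr(iteration, base_lr=0.001, drop_at=8000, gamma=0.1):
    """Learning rate at a given iteration for a single-step decay schedule."""
    return base_lr * gamma if iteration >= drop_at else base_lr

def resized_shape(width, height, short_side, max_long=320):
    """Scale (width, height) so the short side equals short_side, capping the long side."""
    scale = short_side / min(width, height)
    if max(width, height) * scale > max_long:
        scale = max_long / max(width, height)  # long side would overflow: cap it instead
    return round(width * scale), round(height * scale)

# Learning rate before and after the drop in a 12k-iteration run.
print(step_lr(0), step_lr(7999), step_lr(8000), step_lr(11999))
# A 640x480 image with sampled short-side targets of 240 and 320 pixels.
print(resized_shape(640, 480, 240), resized_shape(640, 480, 320))
```

Under this reading, both short-side targets in the example produce a 320x240 image, because the long-side cap takes over once the requested short side would push the long side past 320 pixels.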

4.3 Comparison with State-of-the-Art Methods

4.3.1 Classic Segmentation

To further demonstrate the superiority of OreNeXt, we conduct a comparative analysis with state-of-the-art methods on the ORE dataset. Specifically, we compare the segmentation performance of typical two-stage segmentation frameworks using various CNN, Transformer, and MLP backbone networks. The results are given in Table 1, which compares segmentation accuracy, inference speed, model size, and GPU memory usage.

Table 1: Comparison of accuracy with the state-of-the-art methods on the ore image dataset. T-based: Transformer-based.

| Type | Method | Backbone | $AP^{box}$ ↑ | $AP_{50}^{box}$ ↑ | $AP^{mask}$ ↑ | $AP_{50}^{mask}$ ↑ | FPS ↑ | Model Size (MB) ↓ | Inf. Memory (MB) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| CNN-based | Mask RCNN [18] | ResNet101 | 43.0 | 51.7 | 12.1 | 22.1 | 13 | 480 | 1876 |
| CNN-based | PointRend [9] | ResNet101 | 37.4 | 54.0 | 21.0 | 43.8 | 17 | 489 | 3361 |
| CNN-based | YOLACT [16] | ResNet101 | 40.3 | 50.6 | 7.1 | 11.9 | 24 | 410 | 7758 |
| CNN-based | Mask RCNN [18] | ResNet50 | 42.8 | 51.7 | 12.4 | 22.6 | 17 | 334 | 1769 |
| CNN-based | PointRend [9] | ResNet50 | 36.3 | 54.0 | 25.8 | 47.7 | 21 | 344 | 2677 |
| CNN-based | Cascade Mask RCNN [21] | ResNet50 | 41.6 | 51.7 | 36.4 | 48.3 | 16 | 587 | 1947 |
| CNN-based | CondInst [28] | ResNet50 | 43.1 | 52.2 | 39.0 | 48.7 | 23 | 259 | 1713 |
| CNN-based | MS RCNN [29] | ResNet50 | 42.7 | 51.7 | 14.7 | 14.7 | 16 | 428 | 1749 |
| CNN-based | Blend Mask [17] | ResNet50 | 44.1 | 58.1 | 38.9 | 48.7 | 23 | 274 | 1641 |
| CNN-based | BoxInst [30] | ResNet50 | 44.3 | 56.9 | 34.8 | 48.7 | 23 | 261 | 1681 |
| CNN-based | SparseInst [31] | ResNet50 | - | - | 24.8 | 37.9 | 21 | 380 | 1711 |
| CNN-based | RTMDet [32] | CSPNeXt | 38.7 | 49.7 | 35.1 | 43.4 | 22 | 109 | 2013 |
| T-based | Mask RCNN [18] | Swin-T | 38.3 | 52.0 | 27.7 | 46.5 | 19 | 542 | 1995 |
| T-based | PointRend [9] | Swin-T | 30.6 | 55.4 | 32.8 | 44.7 | 16 | 371 | 2631 |
| T-based | Cascade Mask RCNN [21] | Swin-T | 42.0 | 50.7 | 38.5 | 48.5 | 14 | 920 | 2063 |
| T-based | Mask2Former [33] | Swin-T | 55.7 | 71.3 | 35.3 | 44.3 | 20 | 572 | 6098 |
| T-based | Mask RCNN [18] | PVTv2 | 39.6 | 51.4 | 27.3 | 44.8 | 24 | 263 | 1943 |
| T-based | PointRend [9] | PVTv2 | 39.8 | 51.8 | 16.1 | 43.0 | 19 | 185 | 2501 |
| T-based | Cascade Mask RCNN [21] | PVTv2 | 42.7 | 50.9 | 39.0 | 48.7 | 17 | 641 | 1907 |
| MLP-based | Mask RCNN [18] | ASMLP | 35.9 | 49.2 | 21.1 | 38.5 | 21 | 542 | 1811 |
| MLP-based | PointRend [9] | ASMLP | 36.6 | 52.6 | 27.0 | 41.1 | 20 | 371 | 2389 |
| MLP-based | Cascade Mask RCNN [21] | ASMLP | 42.4 | 50.7 | 38.7 | 48.5 | 19 | 613 | 2043 |
| MLP-based | Mask RCNN [18] | CycleMLP-B1 | 36.9 | 51.7 | 29.1 | 45.2 | 26 | 104 | 1673 |
| MLP-based | PointRend [9] | CycleMLP-B1 | 25.0 | 50.2 | 20.0 | 39.8 | 26 | 109 | 1843 |
| MLP-based | OreNeXt (Ours) | StoneMLP | 38.2 | 60.4 | 33.8 | 48.9 | 27 | 73 | 1931 |

First, the CNN-based networks have higher accuracy but are slower and have larger model sizes than ours. Since the transformer structure has the largest model capacity, the transformer-based networks have larger model sizes, while the MLP-based networks have smaller model sizes but worse accuracy. Second, our proposed OreNeXt exceeds the best CNN-based and Transformer-based algorithms by 0.2 on $AP_{50}^{mask}$ (48.9 vs. 48.7), while also achieving high accuracy on $AP_{50}^{box}$. We attribute this to the Sparse FPN, which balances local and global information with two MLP structures, and to the Edge Guidance Loss, which is based on a point processing strategy. Recent methods such as Mask2Former [33] based on Swin Transformer [23] achieve impressive results on the $AP^{box}$ metric, surpassing the average performance, but the corresponding model is more than $6\times$ larger than OreNeXt (572 MB vs. 73 MB). A small model size is crucial in ore processing tasks with limited computational resources. Therefore, models that prioritize box accuracy at the expense of model size are not suitable for field ore tasks. Furthermore, OreNeXt achieves the highest inference speed (27 FPS) and the smallest model size (73 MB) in Table 1, with an inference memory of 1931 MB, thanks to our lightweight backbone StoneMLP, which is even smaller than other lightweight MLP-based backbone networks such as CycleMLP. OreNeXt strikes a balance between accuracy and computational cost and is more suitable for ore segmentation tasks.
Finally, the performance of Swin-T and ASMLP used as the backbone network of PointRend is similar to that of OreNeXt in terms of accuracy and model size, which supports the idea that the key to the excellent performance of the transformer model is not its token mixer module but the entire transformer architecture. Therefore, replacing the most time-consuming component of the transformer, the attention, with the StoneMLP network can also achieve excellent performance.

4.3.2 Segment Anything Model

Table 2: Comparison of OreNeXt with SAM-based instance segmentation methods. $\star$ means validation only.

| Method | $AP_{50}^{box}$ | $AP_{50}^{mask}$ | FPS | Model Size (MB) |
|---|---|---|---|---|
| SAM [34] | 42.2 | 44.7 | 3 | 609 |
| RSPrompter [35] | 42.7 | 41.6 | 2 | 631 |
| FastSAM$^{\star}$ [36] | 38.0 | 37.2 | 7 | 150 |
| OreNeXt (Ours) | 60.4 | 48.9 | 27 | 73 |

Following the high performance of foundation models in the fields of NLP and CV, the Segment Anything Model (SAM) [34] proposed by Meta AI exhibits remarkable generalization capabilities. SAM is becoming a foundational step for many visual tasks, such as image editing and remote sensing image segmentation. For the ore processing task, we compare existing instance segmentation methods based on the SAM foundation model with our method, as shown in Table 2. As a category-agnostic instance segmentation method, SAM does not show its superiority in single-class ore image segmentation. Its segmentation accuracy is average, but its model size exceeds that of most traditional segmentation methods. In single-task industrial application scenarios, SAM cannot exploit its strong generalization ability, and its huge computational cost prevents wider application in industry. At present, traditional instance segmentation methods remain the best choice when balancing accuracy and computational complexity in industrial applications. In the future, the goal is to achieve high-performance compression of large models and to develop comprehensive industrial foundation models capable of addressing multiple scenarios and tasks.

The results of various instance segmentation methods on ore images are visualized in Fig. 8. Our method is clearly better than the other methods in mask quality and edge sharpness, and its missed detection rate is also low. Compared with the large foundation model, our method detects as many complete ores as possible. It can therefore obtain the granularity information required for industrial production and meet the needs of sampling inspection.

Figure 8: Visual results of ore images processed by different methods: (a) Input; (b) GroundTruth; (c) SegmentAnything; (d) PointRend; (e) YOLACT; (f) OreNeXt. It can be seen that our method (f) has the best mask quality, and the foundation model (c) gets too much information about incomplete ores, which is unsuitable for actual production.

4.4 Ablation Study

4.4.1 StoneMLP

To demonstrate the effectiveness of the proposed backbone network, we compare it with various alternative backbones. In Table 3, StoneMLP outperforms the other backbone networks on three indicators: $AP^{box}$, $AP^{mask}$, and $AP_{50}^{mask}$. Its $AP_{50}^{box}$, which evaluates the detection bounding box, is 60.4, slightly lower than the highest value of 61.9 achieved by ResNet101, while its model size of 73 MB is $6.5\times$ smaller than that of ResNet101 (474 MB). A small model size is also vital in ore processing tasks with limited computational resources. Since StoneMLP is designed to extract the low-level edge features of ores, it greatly outperforms the lightweight MLP-based backbone CycleMLP (99 MB) in accuracy across the board. To verify the general effectiveness of StoneMLP, we conduct experiments on the MS COCO [6] dataset, and the results in Table 4 show that the accuracy is significantly improved compared with the baseline.

Table 3: The segmentation results of different backbones in OreNeXt

| Backbone | $AP^{box}$ | $AP_{50}^{box}$ | $AP^{mask}$ | $AP_{50}^{mask}$ | Model Size (MB) |
|---|---|---|---|---|---|
| ResNet50 [37] | 36.8 | 60.8 | 31.1 | 47.4 | 329 |
| ResNet101 [37] | 34.0 | 61.9 | 29.9 | 48.3 | 474 |
| RegNet [38] | 36.1 | 57.1 | 29.2 | 47.4 | 255 |
| HRNet [39] | 36.7 | 61.8 | 25.7 | 25.1 | 229 |
| PVTv2 [40] | 36.1 | 61.3 | 22.5 | 47.7 | 178 |
| Swin-T [23] | 30.3 | 61.2 | 11.8 | 25.8 | 220 |
| ASMLP [24] | 36.4 | 58.9 | 30.5 | 46.8 | 214 |
| CycleMLP-B1 [25] | 26.6 | 59.3 | 22.4 | 43.8 | 99 |
| StoneMLP (ours) | 38.2 | 60.4 | 33.8 | 48.9 | 73 |

Table 4: Comparison of crucial components on the MS COCO dataset

| Component | $AP^{box}$ | $AP_{50}^{box}$ | $AP^{mask}$ | $AP_{50}^{mask}$ |
|---|---|---|---|---|
| Baseline | 21.6 | 37.4 | 20.3 | 34.5 |
| EGLoss | 19.9 (-1.7) | 39.3 (+1.9) | 21.1 (+0.8) | 36.8 (+2.3) |
| StoneMLP | 25.1 (+3.5) | 41.2 (+3.8) | 23.6 (+3.3) | 38.8 (+4.3) |

Table 5: Comparison of using different MLP in SparseFPN

| Method | $AP_{50}^{box}$ | $AP_{50}^{mask}$ | FPS | Model Size (MB) |
|---|---|---|---|---|
| MLP [23] | 51.6 | 39.5 | 25 | 84 |
| CycleMLP [25] | 50.5 | 37.7 | 23 | 83 |
| SparseMLP (ours) | 53.0 | 48.5 | 27 | 77 |

4.4.2 SparseFPN

To improve detection accuracy while keeping the model lightweight, we design an MLP module, SparseMLP, that enlarges the receptive field by mixing information and fusing features through channel mapping. The module can be implemented using a simple MLP structure. In Table 5, we compare the effectiveness of different MLP structures. The experimental results show that connecting two SparseMLP blocks after the FPN can enlarge the receptive field, reduce the parameter count to avoid overfitting, and promote multi-stage processing in the pyramid architecture. SparseFPN can strengthen the information of small objects that are easy to misjudge while maintaining sufficient semantic information.

4.4.3 Loss Function

To verify the robustness of EGLoss, we compared the baseline loss with the newly proposed edge guidance loss and found that the $AP_{50}^{box}$ and $AP_{50}^{mask}$ values were significantly improved while the model size remained constant. Additionally, we compared the accuracy stability before and after replacing the loss and observed that the $AP_{50}^{mask}$ value became relatively stable after the replacement. In Table 4, experiments on the public MS COCO [6] dataset show that EGLoss achieves good results on $AP_{50}^{mask}$, which is an indicator of mask quality.

Table 6: Effects of the components of the proposed method. SM: StoneMLP, SF: SparseFPN, EL: EGLoss

| Baseline | SM | SF | EL | $AP_{50}^{box}$ | $AP_{50}^{mask}$ | Model Size (MB) |
|---|---|---|---|---|---|---|
| ✓ | | | | 54 | 47.7 | 344 |
| ✓ | ✓ | | | 51.7 | 47.8 | 85 |
| ✓ | ✓ | ✓ | | 53.9 | 48.4 | 77 |
| ✓ | ✓ | ✓ | ✓ | 60.4 | 48.9 | 73 |

4.4.4 Module Breakdown Analysis

We conducted ablation studies, as presented in Table 6, to investigate the specific contribution of each module in OreNeXt. StoneMLP incorporates shifting operations to capture local information, resulting in a slight performance improvement while significantly reducing the number of parameters. The model size decreases from 344 MB to 95 MB. Next, the SparseMLP blocks are introduced into the FPN, helping to extract small features and improve accuracy. Finally, the loss function is replaced with EGLoss, leading to a significant improvement in accuracy from 53.9 to 60.4 in $AP_{50}^{box}$ and from 48.4 to 48.9 in $AP_{50}^{mask}$.

4.4.5 Hyperparameter

To explore the influence of hyperparameter settings on network performance, we conducted ablation experiments on the weight of the loss function and the number of channels in the neck. We reduce the model size of SparseFPN by changing the convolution size and reducing the number of channels. The results are given in Table 7, where the model size gradually decreases as the number of channels is reduced from 256 to 32. The highest accuracy is achieved with 64 channels and $1\times 1$ convolution: 54.7 in $AP_{50}^{box}$ and 47.6 in $AP_{50}^{mask}$. The channels of the feature map carry the feature information, and the number of channels should be chosen according to the feature complexity of the detected object. Reducing the number of channels can improve accuracy because the ORE dataset has fewer samples and features than public datasets. When the number of channels is reduced from 64 to 32, the accuracy does not increase, which indicates that feature information is lost when the number of channels is too small. Using 64 channels instead of 32 increases the model size only negligibly (80 MB vs. 77 MB with $1\times 1$ convolution). These results indicate that 64 channels with $1\times 1$ convolution is the optimal choice for ore segmentation.
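
The trend in model size follows from how convolution parameters scale with kernel size and channel count. The sketch below uses the standard parameter count of a single 2D convolution layer, $k \cdot k \cdot C_{in} \cdot C_{out}$ plus one bias per output channel. It is a generic illustration with a hypothetical helper name (`conv_params`) and is not meant to reproduce the model sizes in Table 7, which depend on the full architecture.

```python
def conv_params(c_in, c_out, kernel, bias=True):
    """Number of learnable parameters in one standard 2D convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

# Per-layer parameter counts for the channel widths and kernel sizes compared in the ablation.
for channels in (32, 64, 96, 128, 256):
    p1 = conv_params(channels, channels, kernel=1)
    p3 = conv_params(channels, channels, kernel=3)
    print(f"{channels:>3} channels: 1x1 -> {p1:>7,} params, 3x3 -> {p3:>7,} params")
```

Per layer, a 3x3 kernel carries roughly nine times the weights of a 1x1 kernel, and doubling the channel count roughly quadruples them, which is consistent with the gap between the two kernel sizes widening at larger channel counts.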

The parameter $\beta$ in Eq. (2) reflects the significance of $L_{coarse}^{m}$ in training. We perform ablation experiments on the MS COCO dataset and the ORE dataset, and the results are given in Table 8. For the single-category ore dataset, a smaller value ($\beta = 1$) achieves the highest detection accuracy, while for the MS COCO dataset with rich target categories, a larger weight ($\beta = 2.5$) is needed so that the model does not ignore the sample features of some categories.

Table 7: Comparisons of different channels in SparseFPN

| Channels | Convs. | $AP_{50}^{box}$ | $AP_{50}^{mask}$ | FPS | Model Size (MB) |
|---|---|---|---|---|---|
| 32 | 1x1 | 52.2 | 46.1 | 25 | 77 |
| 32 | 3x3 | 53.6 | 42.6 | 24 | 79 |
| 64 | 1x1 | 54.7 | 47.6 | 26 | 80 |
| 64 | 3x3 | 52.6 | 46.7 | 24 | 88 |
| 96 | 1x1 | 53.2 | 45.6 | 25 | 90 |
| 96 | 3x3 | 52.9 | 44.3 | 24 | 98 |
| 128 | 1x1 | 52.2 | 46.4 | 26 | 98 |
| 128 | 3x3 | 54.6 | 46.5 | 23 | 109 |
| 256 | 1x1 | 52.0 | 44.9 | 23 | 130 |
| 256 | 3x3 | 52.7 | 44.3 | 22 | 161 |

Table 8: Validation of $L_{coarse}^{m}$ on different datasets. NC means cannot converge.

| Dataset | $\beta$ | $AP_{50}^{mask}$ |
|---|---|---|
| ORE | 0.5 | 45.5 |
| ORE | 1 | 48.9 |
| ORE | 1.5 | 46.4 |
| MS COCO | 1.5 | 30.8 |
| MS COCO | 2.5 | 36.8 |
| MS COCO | 3 | NC |

5 Conclusion

This study presents OreNeXt, a lightweight and efficient instance segmentation network designed for ore image segmentation. The network specifically addresses the challenging issue of edge blurring that results from ore overlap. OreNeXt is an MLP-based architecture that leverages two essential components to enhance low-level edge features. Firstly, it incorporates a StoneMLP backbone network structure that facilitates spatial information interaction through shift operations. Secondly, it incorporates a SparseFPN that effectively balances global and local information to improve accuracy. Additionally, a point-guided loss function is employed to enhance the clarity of edge segmentation. When assessed using the ore dataset, our model demonstrated superior performance in ore segmentation compared to other existing methods while also exhibiting faster inference time, decreased computational complexity, and a reduction in the number of parameters required. For future work, we will focus on further improving edge accuracy and increasing the model’s speed. We will also explore edge segmentation algorithms that can be extended to other industrial areas.

References

  • [1] Y. Zhang, Y. Zhou, H. Pan, B. Wu, and G. Sun, “Visual fault detection of multiscale key components in freight trains,” IEEE Trans. Ind. Informat., vol. 19, no. 8, pp. 9082–9090, 2023.
  • [2] S. Xia, L. Chu, L. Pei, W. Yu, and R. C. Qiu, “A boundary consistency-aware multitask learning framework for joint activity segmentation and recognition with wearable sensors,” IEEE Trans. Ind. Informat., vol. 19, no. 3, pp. 2984–2996, 2023.
  • [3] Y. Zhang, M. Liu, Y. Yang, Y. Guo, and H. Zhang, “A unified light framework for real-time fault detection of freight train images,” IEEE Trans. Ind. Informat., vol. 17, no. 11, pp. 7423–7432, 2021.
  • [4] J. Chu, Z. Guo, and L. Leng, “Object detection based on multi-layer convolution feature fusion and online hard example mining,” IEEE Access, vol. 6, pp. 19959–19967, 2018.
  • [5] W. Lin, J. Chu, L. Leng, J. Miao, and L. Wang, “Feature disentanglement in one-stage object detection,” Pattern Recognition, vol. 145, p. 109878, 2024.
  • [6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Eur. Conf. Comput. Vis., pp. 740–755, 2014.
  • [7] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy, “Mlp-mixer: An all-mlp architecture for vision,” in Neural Information Processing Systems, vol. 34, pp. 24261–24272, 2021.
  • [8] H. Touvron, P. Bojanowski, M. Caron, M. Cord, A. El-Nouby, E. Grave, G. Izacard, A. Joulin, G. Synnaeve, J. Verbeek, and H. Jegou, “Resmlp: Feedforward networks for image classification with data-efficient training,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 5314–5321, 2023.
  • [9] A. Kirillov, Y. Wu, K. He, and R. Girshick, “Pointrend: Image segmentation as rendering,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 9796–9805, 2020.
  • [10] G. Sun, D. Huang, L. Cheng, J. Jia, C. Xiong, and Y. Zhang, “Efficient and lightweight framework for real-time ore image segmentation based on deep learning,” Minerals, vol. 12, no. 5, pp. 1–18, 2022.
  • [11] D. P. Mukherjee, Y. Potapovich, I. Levner, and H. Zhang, “Ore image segmentation by learning image and shape features,” Pattern Recognition Letters, vol. 30, no. 6, pp. 615–622, 2009.
  • [12] Y. Zhang, L. Cheng, Y. Peng, C. Xu, Y. Fu, B. Wu, and G. Sun, “Faster orefsdet: A lightweight and effective few-shot object detector for ore images,” Pattern Recognition, vol. 141, p. 109664, 2023.
  • [13] H. Li, C. Pan, Z. Chen, A. Wulamu, and A. Yang, “Ore image segmentation method based on u-net and watershed,” Computers, Materials and Continua, vol. 65, no. 1, pp. 563–578, 2020.
  • [14] J. Liu, Z. Jiang, W. Gui, and Z. Chen, “A novel particle size detection system based on rgb-laser fusion segmentation with feature dual-recalibration for blast furnace materials,” IEEE Transactions on Industrial Electronics, vol. 70, no. 10, pp. 10690–10699, 2023.
  • [15] Y. Liu, Z. Zhang, X. Liu, W. Lei, and X. Xia, “Deep learning based mineral image classification combined with visual attention mechanism,” IEEE Access, vol. 9, pp. 98091–98109, 2021.
  • [16] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT: real-time instance segmentation,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 9156–9165, 2019.
  • [17] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan, “BlendMask: Top-down meets bottom-up for instance segmentation,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 8570–8578, 2020.
  • [18] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 2980–2988, 2017.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
  • [20] Y. Zhang, J. Chu, L. Leng, and J. Miao, “Mask-refined R-CNN: A network for refining object details in instance segmentation,” Sensors, vol. 20, no. 4, 2020.
  • [21] Z. Cai and N. Vasconcelos, “Cascade R-CNN: high quality object detection and instance segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 5, pp. 1483–1498, 2021.
  • [22] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, pp. 1–21, 2021.
  • [23] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 9992–10002, 2021.
  • [24] D. Lian, Z. Yu, X. Sun, and S. Gao, “AS-MLP: An axial shifted MLP architecture for vision,” in International Conference on Learning Representations, pp. 1–19, 2022.
  • [25] S. Chen, E. Xie, C. Ge, R. Chen, D. Liang, and P. Luo, “CycleMLP: A MLP-like architecture for dense visual predictions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 12, pp. 14284–14300, 2023.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Neural Information Processing Systems, pp. 6000–6010, 2017.
  • [27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 248–255, 2009.
  • [28] Z. Tian, C. Shen, and H. Chen, “Conditional convolutions for instance segmentation,” in Eur. Conf. Comput. Vis., pp. 282–298, 2020.
  • [29] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring R-CNN,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 6402–6411, 2019.
  • [30] Z. Tian, C. Shen, X. Wang, and H. Chen, “BoxInst: High-performance instance segmentation with box annotations,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 5439–5448, 2021.
  • [31] T. Cheng, X. Wang, S. Chen, W. Zhang, Q. Zhang, C. Huang, Z. Zhang, and W. Liu, “Sparse instance activation for real-time instance segmentation,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4423–4432, 2022.
  • [32] C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, and K. Chen, “RTMDet: An empirical study of designing real-time object detectors,” arXiv:2212.07784, 2022.
  • [33] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 1280–1289, 2022.
  • [34] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 4015–4026, 2023.
  • [35] K. Chen, C. Liu, H. Chen, H. Zhang, W. Li, Z. Zou, and Z. X. Shi, “RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model,” arXiv:2306.16269, 2023.
  • [36] X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, and J. Wang, “Fast segment anything,” arXiv:2306.12156, 2023.
  • [37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 770–778, 2016.
  • [38] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Designing network design spaces,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 10425–10433, 2020.
  • [39] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 5686–5696, 2019.
  • [40] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 548–558, 2021.