An Efficient MLP-based Point-guided Segmentation Network for Ore Images with Ambiguous Boundary

Guodong Sun, Yuting Peng, Le Cheng, Mengya Xu, An Wang, Bo Wu, Hongliang Ren, and Yang Zhang

Corresponding author: Yang Zhang. G. Sun, Y. Peng, L. Cheng, and Y. Zhang are with the School of Mechanical Engineering, Hubei University of Technology, Wuhan 430068, China (e-mail: sunguodong@hbut.edu.cn; pyt181@hbut.edu.cn; cl@hbut.edu.cn; yzhangcst@hbut.edu.cn). M. Xu and H. Ren are with the Department of Biomedical Engineering, National University of Singapore (NUS), Singapore 117575, Singapore (e-mail: mengya@u.nus.edu; hlren@ieee.org). A. Wang, H. Ren, and Y. Zhang are also with the Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China (e-mail: wa09@link.cuhk.edu.hk). B. Wu is with the Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China (e-mail: wubo@sari.ac.cn).

Abstract

The precise segmentation of ore images is critical to the successful execution of the beneficiation process. The homogeneous appearance of ores leads to low contrast and unclear boundaries, which makes accurate segmentation and recognition challenging. This paper proposes a lightweight framework based on the Multi-Layer Perceptron (MLP) that focuses on solving the problem of edge blurring. Specifically, we introduce a lightweight backbone better suited to efficiently extracting low-level features. In addition, we design a feature pyramid network consisting of two MLP structures that balance local and global information, thus enhancing detection accuracy. Furthermore, we propose a novel loss function that guides the prediction points to match the instance edge points to achieve clear object boundaries. We have conducted extensive experiments to validate the efficacy of our proposed method. Our approach achieves a processing speed of over 27 frames per second (FPS) with a model size of only 73 MB. Moreover, compared with currently available state-of-the-art techniques on the ore image dataset, our method delivers consistently high accuracy, with scores of 60.4 and 48.9 in $AP_{50}^{box}$ and $AP_{50}^{mask}$, respectively. The source code will be released at https://github.com/MVME-HBUT/ORENEXT.

Index Terms

Instance segmentation, Point guidance, Edge processing, Local correlation, Ore image.

1 Introduction

Ore particle size analysis is an important part of ore processing tasks. Particle size is an essential indicator of the effectiveness of the crusher, and it can guide the adjustment of the parameters of each process stage. Accurate ore segmentation is a prerequisite for statistical particle size analysis. The production site for beneficiation presents a complex environment with several challenging factors, such as ore stacking, ore adhesion resulting from dry-wet mixing, and variations in lighting conditions, as depicted in Fig. 1. These factors make accurate ore segmentation difficult. Moreover, beneficiation is often carried out in the field, so limited equipment resources are a further constraint on the ore particle size analysis system.

Traditional image processing methods segment ore by thresholding, cluster analysis, and edge detection. With the development of convolutional neural networks (CNNs), algorithms based on deep learning have shown significant advances in automatic feature extraction and generalization performance [1, 2, 3, 4, 5], so they have gradually gained a leading position in industrial image processing. Most existing instance segmentation methods are developed on public datasets such as MS COCO [6] and are not effective for ore image processing. As can be seen from Fig. 1(c), the baseline framework is less capable of segmenting the ore edges. Complex working environments and diverse feature information make the processing of ore images particularly challenging.

To meet the specific needs of ore segmentation, U-Net was the first network used to segment broken stones. However, such complex CNN-based methods have high computational costs, large model sizes, and low accuracy. Moreover, in complex environments, existing algorithmic frameworks tend to overlook the edge blur caused by adhesion and shadowing in ore images. In this work, we focus on solving this problem by designing an efficient network with less computational overhead, fewer parameters, faster inference, and better performance.

Recently, it has been found that MLP-based architectures can achieve results comparable to CNN and Transformer methods with less computation. The MLP-Mixer [7] proposed token-mixing and channel-mixing MLPs to allow interaction between spatial locations and channels. The ResMLP [8] used cross-patch and cross-channel sublayers as components. Inspired by these works, we propose OreNeXt, an instance segmentation model based on the MLP framework for the ore edge problem. A lightweight MLP backbone network is introduced for feature extraction, followed by a feature pyramid network that uses a SparseMLP module to enhance the semantic information. We then introduce a loss function guided by the edge. Using MLP and a simple hybrid mechanism, we obtain a lightweight model suitable for deployment.

To implement this framework and solve the above problems, we use the two-stage detector PointRend [9] as the baseline for the ore task. For ore tasks, the low-level features of overlapping ore edges are crucial. Therefore, we propose a novel backbone network, StoneMLP, which incorporates shifting operations to extract local information corresponding to different axial shifts. A sparse feature pyramid network is proposed to strengthen the small-target information that is easily ignored or misjudged while maintaining sufficient semantic information. Furthermore, we propose a loss that includes a detection loss, which predicts the foreground score of each point, and a mask loss, which performs edge guidance by dynamically adjusting vertex pairing. As illustrated in Fig. 1(d), our proposed framework is more likely to obtain clear ore segmentation results. Experiments on the ORE image dataset demonstrate that our framework outperforms state-of-the-art methods with a smaller model size and faster inference speed. Our main contributions are summarized as follows.

1) A lightweight image segmentation network OreNeXt is designed, which solves the problem of blurred edges in ores by guiding the boundary points.
2) We propose a lightweight backbone StoneMLP for capturing local correlation and introduce a semantic enhancement SparseFPN network.
3) A loss function that matches edges by guided points is introduced, significantly improving the quality of predicted boundary details.
4) The experimental results validate the efficacy of our method in enhancing the performance of ore image segmentation tasks, while simultaneously reducing the number of parameters and improving inference speed.

The remainder of this paper is structured as follows. Section 2 introduces an overview of existing mineral image segmentation methods and MLP architecture development. Section 3 presents our overall framework, including two new structures and a new loss function. To validate the effectiveness of our method, in Section 4, we conduct integrated experiments. The paper is concluded in Section 5.

2 Related works

2.1 Ore image segmentation

Accurate individual ore segmentation is a crucial prerequisite for granularity statistics and an essential component of ore processing tasks. With the development of CNNs, ore image processing has gradually shifted from traditional image processing techniques to deep learning-based segmentation algorithms. For instance, Sun et al. [10] proposed an efficient instance segmentation algorithm to split ore. Dipti et al. [11] proposed an image segmentation system for estimating the granularity of oil sandstones. Ore images captured on the conveyor belt often suffer from ore adhesion and from occlusion caused by dust and soil. These factors blur multiple ore edges in the image or make them disappear completely, posing significant challenges to accurate individual ore segmentation. To address this edge problem, numerous models have been employed to tackle under-segmentation in ore processing [12]. Li et al. [13] proposed a model based on U-Net, which alleviated the problem of ore granularity detection by improving the loss function and using watershed technology. Liu et al. [14] developed the RLPNet to obtain high-precision segmentation images by reducing the interference of complex textures and enhancing the expression of edge features. With the emergence of the Transformer, numerous Transformer-based designs have been proposed to effectively separate adhesive ore [15], but their network structures are often complex and not suitable for outdoor deployment. Although these methods have improved the segmentation of ore images, segmenting clear edges remains a problem.

2.2 Instance segmentation

Currently, instance segmentation methods can be divided into two categories: one-stage and two-stage. One-stage methods, such as YOLACT [16] and BlendMask [17], mainly use a simple fully convolutional network (FCN) architecture for mask prediction without proposal generation, region of interest (RoI) pooling, or feature re-pooling, which makes them fast but less accurate. Two-stage instance segmentation methods first detect bounding boxes and then perform segmentation within each RoI. Mask RCNN [18] adds a mask branch to the Faster R-CNN [19] architecture. Mask-Refined R-CNN [20] refines the detailed information of the target and considers the relationship between the edge pixels of the object. PointRend [9] samples feature points with low confidence scores and further improves their labels using a shared MLP. Cascade Mask R-CNN [21] improves segmentation accuracy through cascading. Since most two-stage methods use RoI Align operations, spatial feature information is lost for large targets, especially at the edges. Therefore, rough target edge prediction is a common problem for two-stage instance segmentation algorithms.

2.3 MLP-Based Models

With the success of the Transformer in natural language processing, some researchers have started exploring how to apply it to vision. ViT [22] pioneered the approach of splitting images into non-overlapping blocks that serve as tokens, enabling the use of the Transformer framework for image processing. Swin Transformer [23] proposed a local group self-attention mechanism incorporating locality. Inspired by the elegant structure of ViT, MLPs with simpler network structures have also received a lot of attention in computer vision. MLP-Mixer [7], developed by Google, replaces the self-attention mechanism with MLPs to construct a pure MLP architecture. The ResMLP [8] used cross-patch and cross-channel sublayers as components. However, most existing MLP-based methods cannot adapt to varying image resolutions, making them difficult to apply to downstream tasks. ASMLP [24] and CycleMLP [25] both shift feature maps to integrate local information, enabling the pure MLP architecture to be used for downstream tasks. These MLP-based models have a similar overall structure but differ in the detailed design of the main modules. After MLP-Mixer [7] was proposed, researchers began to explore MLP architectures further. With increased computational capacity and the availability of larger datasets, the old and simple MLP structure can also achieve effective performance improvements in various vision tasks. By design, these MLP-based models no longer rely on prior knowledge of two-dimensional images but on long-range interactions over the input feature maps. However, they rarely make full use of local information, even though the extraction of low-level features is significant for many tasks.

3 Method

In this section, we first introduce the overall structure of the network. We then introduce the StoneMLP model as the backbone, describe a lightweight SparseFPN network, and present the improved detection head.

3.1 Overall Framework

In the ore image segmentation task, there is a problem of edge blur caused by the adhesiveness of ore images. Therefore, we propose a lightweight instance segmentation model with a point-based prediction strategy to address the edge problem, which utilizes a lightweight backbone and feature pyramid network (FPN) structure based on MLP to extract high-quality edge features. To further enhance the accuracy of edge prediction, we introduce an Edge Guidance Loss that dynamically aligns the predicted edge with the instance edge.

OreNeXt is a two-stage instance segmentation model whose network structure is shown in Fig. 2. The model mainly consists of a lightweight MLP backbone, a SparseFPN, and a fine detection head. To better capture the low-level features, we design StoneMLP as the backbone for extracting features of ore images. By shifting the feature information axially, information flows in different directions can be obtained, which helps to capture the local correlation between overlapping ores. The SparseFPN is used to extract hierarchical features. The feature maps are processed by the SparseMLP to efficiently obtain a global receptive field, and this global information helps to describe semantic information better. Furthermore, we use a lightweight segmentation head to generate coarse prediction features for each detected object. For the fine-grained features from the CNN feature maps ($S_{i}$), we perform bilinear interpolation at the coordinate points on the feature maps to compute the feature vector. The fine-grained features, which provide object detail information, are then combined with the coarse prediction features, which provide global context information. The fused feature is fed into the prediction head to make the point-wise segmentation.
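The point-wise feature construction described above can be illustrated with a minimal sketch. It assumes, for simplicity, that the fine-grained map and the coarse prediction map share the same grid and that points are given in pixel coordinates; the function names `bilinear_sample` and `point_features` are hypothetical and are not taken from the released code.

```python
import math

def bilinear_sample(feature, x, y):
    """Sample a 2D map (list of rows) at continuous pixel coordinates (x, y)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that points on or beyond the border stay well defined.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def point_features(fine_maps, coarse_map, x, y):
    """Concatenate fine-grained channel values with the coarse prediction at one point."""
    fine = [bilinear_sample(channel, x, y) for channel in fine_maps]
    coarse = [bilinear_sample(coarse_map, x, y)]
    return fine + coarse

fmap = [[0.0, 1.0], [2.0, 3.0]]        # one fine-grained channel (toy values)
cmap = [[0.0, 0.0], [1.0, 1.0]]        # coarse prediction map (toy values)
vec = point_features([fmap], cmap, 0.5, 0.5)
print(vec)  # [1.5, 0.5]
```

The resulting vector is what a point-wise prediction head would consume; in practice the two maps usually have different resolutions, in which case normalized coordinates would be used instead.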

3.2 StoneMLP

We propose a backbone network StoneMLP based on the architecture of MLP. Compared to convolution and attention networks, the MLP structure has a lower inductive bias. It allows the model to learn solely from raw data, thereby making the network more concise. We propose a spatial shift method that enables the network to obtain different directions of information flow, achieving the same field of view as CNN and completing feature extraction. Instance segmentation of ore images necessitates the utilization of low-level features, including contours, textures, colors, and shapes. Our proposed spatial movement method effectively captures local correlations by prioritizing the extraction of these low-level features, thereby yielding more comprehensive edge information.

Fig. 3 shows the architecture of our StoneMLP model. ResMLP [8] has shown that self-attention is not the key factor behind the excellent performance of transformers. Further research has shown that self-attention modules can be replaced by MLP, convolution, or other layers, indicating that the success of the transformer may come from the entire architecture design. Therefore, we follow the Transformer [26] in designing the overall structural framework. The model takes RGB images $I\in\mathbb{R}^{3\times H\times W}$ as input and slices them into patches of size $4\times 4$, so the tokens obtained at this stage have shape $48\times\frac{H}{4}\times\frac{W}{4}$. Each StoneMLP block contains four fully connected layers, which correspond to four channel projections for communicating position-specific information. Channel projections, vertical movement, and horizontal movement are used to extract features in the Pixel Shift operation. The time and computation costs of the axial movement are minimal. The computational complexity is $\mathrm{\Omega(StoneMLP)}=4HWC^{2}$.
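As a quick check of the shapes and the complexity stated above, the following sketch computes the token shape after $4\times 4$ patch slicing and the $4HWC^{2}$ cost of one block. The input size of 320 and the channel width of 96 are illustrative assumptions, and the helper names are hypothetical.

```python
def patch_embed_shape(h, w, patch=4, in_ch=3):
    """Shape (channels, height, width) after slicing into non-overlapping patches."""
    if h % patch or w % patch:
        raise ValueError("H and W are assumed to be divisible by the patch size")
    return (in_ch * patch * patch, h // patch, w // patch)

def stonemlp_block_cost(h, w, c):
    """Multiply-accumulate count of four C-to-C channel projections: 4*H*W*C^2."""
    return 4 * h * w * c * c

shape = patch_embed_shape(320, 320)
print(shape)                            # (48, 80, 80)
print(stonemlp_block_cost(80, 80, 96))  # 235929600
```

The cost grows linearly with the number of spatial positions $HW$, which is consistent with the shift operations themselves adding no multiplications.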

Fig. 4 illustrates the horizontal movement. The subsequent channel projection uses the features extracted within the dashed box. The vertical movement operation is very similar to the horizontal movement. Combining the horizontal and vertical movements realigns the features from different positions on a channel. During the next channel projection operation, information from both directions is integrated to obtain the result after local communication. Given the input $X$ and the displacement size, the output $Y_{S}$ is obtained as

$$Y_{S}=X^{h}W_{c}^{h}+X^{v}W_{c}^{v}, \tag{1}$$

where $W_{c}^{h}$ and $W_{c}^{v}$ denote the learnable weights of the channel projections in the horizontal and vertical directions, and $X^{h}=Concat(X_{1}^{h},\ldots,X_{s}^{h})$ and $X^{v}=Concat(X_{1}^{v},\ldots,X_{s}^{v})$ represent the axially displaced features.
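A minimal sketch of Eq. (1) on nested lists is given below. It assumes that the channels are split into $s$ contiguous groups, that group $g$ is shifted by $g-\lfloor s/2\rfloor$ positions along one axis, and that vacated positions are zero-padded; this grouping follows the axial-shift idea of ASMLP [24] and is an assumption, not a description of the released implementation. All function names are hypothetical.

```python
def axial_shift(x, shift_size, axis):
    """Shift channel groups of x[C][H][W] along one axis with zero padding.

    axis="h" moves features along the width, axis="v" along the height.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for ch in range(c):
        # Assumed grouping: contiguous channel chunks share one offset.
        offset = ch * shift_size // c - shift_size // 2
        for i in range(h):
            for j in range(w):
                si, sj = (i, j - offset) if axis == "h" else (i - offset, j)
                if 0 <= si < h and 0 <= sj < w:
                    out[ch][i][j] = x[ch][si][sj]
    return out

def channel_projection(x, weight):
    """Per-pixel linear map: out[o][i][j] = sum_k x[k][i][j] * weight[k][o]."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[sum(x[k][i][j] * weight[k][o] for k in range(c))
              for j in range(w)] for i in range(h)]
            for o in range(len(weight[0]))]

def pixel_shift(x, w_h, w_v, shift_size=3):
    """Y_S = X^h W_c^h + X^v W_c^v, as in Eq. (1)."""
    yh = channel_projection(axial_shift(x, shift_size, "h"), w_h)
    yv = channel_projection(axial_shift(x, shift_size, "v"), w_v)
    return [[[a + b for a, b in zip(rh, rv)] for rh, rv in zip(ph, pv)]
            for ph, pv in zip(yh, yv)]

x = [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]], [[7.0, 8.0, 9.0]]]  # C=3, H=1, W=3
ones = [[1.0], [1.0], [1.0]]                                   # 3 -> 1 projection
print(axial_shift(x, 3, "h"))      # [[[2.0, 3.0, 0.0]], [[4.0, 5.0, 6.0]], [[0.0, 7.0, 8.0]]]
print(pixel_shift(x, ones, ones))  # [[[10.0, 20.0, 20.0]]]
```

After the horizontal shift, the middle pixel holds values that originally sat at its left and right neighbors, so a plain channel projection at that pixel already mixes local spatial information.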

3.3 SparseFPN

We propose a new SparseFPN for improving inference speed while maintaining accuracy. Traditional FPN structures generate multi-scale integrated features, but because of the multi-level feature fusion, small targets are easily missed or misidentified. Therefore, we append two SparseMLP modules after the traditional FPN and use a $1\times 1$ convolution to output features. MLP can deal with more complex nonlinear relationships and learn more abstract feature representations.

The structure of SparseMLP is shown in Fig. 5(a). It is an MLP-based module with a parallel structure composed of three parts: W channel mapping, H channel mapping, and identity mapping. In the horizontal mixing path, features are reshaped and information is mixed within each row. In the vertical mixing path, similar operations are applied to each column. The three parallel branches are concatenated along the channel dimension, and feature fusion is achieved by weighted summation and $1\times 1$ convolution, producing an output tensor $X^{out}=FC(concat(X_{H},X_{W},X))$ with the same dimensions as the input tensor. SparseMLP modules avoid the common overfitting problem that affects the performance of MLP-based models. Each token only interacts directly with tokens in the same row or column, and each row and column can share the same projection weight. As shown in Fig. 5(b), one layer of SparseMLP forms a cross-shaped receptive field, and two SparseMLP layers form a global receptive field. That is to say, if this module is repeated twice, each token can accumulate information throughout the two-dimensional space, improving accuracy while keeping the computational complexity at $\mathrm{\Omega(SparseMLP)}=HWC(H+W)+3HWC^{2}$.
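The receptive-field argument can be verified with a small reachability sketch: after one layer a token sees its own row and column, and after two layers it sees every position. The grid size and the function names below are illustrative assumptions; the cost function simply evaluates the complexity expression given above.

```python
def sparse_mlp_reach(h, w, start, layers):
    """Positions whose information can reach `start` after `layers` row/column mixings."""
    reach = {start}
    for _ in range(layers):
        nxt = set(reach)
        for (i, j) in reach:
            nxt.update((i, k) for k in range(w))  # same row (W mixing)
            nxt.update((k, j) for k in range(h))  # same column (H mixing)
        reach = nxt
    return reach

def sparse_mlp_cost(h, w, c):
    """Evaluate HWC(H+W) + 3HWC^2 for one SparseMLP module."""
    return h * w * c * (h + w) + 3 * h * w * c * c

one = sparse_mlp_reach(5, 7, (2, 3), layers=1)
two = sparse_mlp_reach(5, 7, (2, 3), layers=2)
print(len(one))  # 11 -> cross-shaped field: 5 + 7 - 1 positions
print(len(two))  # 35 -> the full 5 x 7 grid
```

This is why two stacked SparseMLP modules suffice for global context, while each individual mixing step only touches $H+W$ tokens per position instead of $HW$.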

3.4 Edge Guidance Loss

To address the uncertainty resulting from the blurriness of edges in ore images, we use PointRend [9] as the baseline and recompute the hard boundary pixels. We adopt the point selection and point-wise representation strategies to prioritize uncertain edge points when optimizing the segmentation of object edges. Additionally, we propose the Edge Guidance Loss function to improve the model performance. The loss can be divided into the following four parts:

$$L_{EG}=\underbrace{L_{cls}^{b}+\alpha L_{ploc}^{b}}_{L_{det}}+\underbrace{\beta L_{coarse}^{m}+L_{pmat}^{m}}_{L_{mask}}. \tag{2}$$

where $\alpha$ and $\beta$ are the weights of $L_{ploc}^{b}$ and $L_{coarse}^{m}$, which are set to 0.5 and 1.0, respectively, in the experiments. To improve the success rate of edge detection and segmentation, in addition to the classification loss and the coarse mask loss, we introduce an edge point guidance match loss and improve the target box offset loss to increase the positioning accuracy.
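The weighting in Eq. (2) can be written as a short sketch. The four input values are placeholder scalars standing in for the individual loss terms; only the weights 0.5 and 1.0 come from the text, and the function name is hypothetical.

```python
def edge_guidance_loss(l_cls, l_ploc, l_coarse, l_pmat, alpha=0.5, beta=1.0):
    """Combine the four terms of Eq. (2); returns (L_EG, L_det, L_mask)."""
    l_det = l_cls + alpha * l_ploc        # detection part
    l_mask = beta * l_coarse + l_pmat     # mask part
    return l_det + l_mask, l_det, l_mask

# Placeholder loss values, chosen only to illustrate the arithmetic.
total, l_det, l_mask = edge_guidance_loss(0.8, 0.4, 0.6, 0.3)
print(round(total, 6), round(l_det, 6), round(l_mask, 6))  # 1.9 1.0 0.9
```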

In our method, $L_{cls}^{b}$ and $L_{coarse}^{m}$ are defined as a standard cross entropy loss function with softmax activation. We treat the foreground points as positive samples and the other points as negative samples. The new regression localization loss $L_{ploc}^{b}$ is responsible for predicting the foreground score of the prediction point:

$$L_{ploc}^{b}(R,R^{\prime})=\frac{1}{n}\sum\limits_{i=1}^{n}\parallel(x_{i},y_{i})-(x_{i}^{\prime},y_{i}^{\prime})\parallel_{2}. \tag{3}$$

This loss takes the predicted points and the real bounding box as input and uses point-to-point distance prediction. It sorts the coordinates of the test points in the x or y direction to obtain new coordinates. The distance between each point and the four borders is then calculated to give the boundary loss of each border point, and the final loss is obtained by weighted addition.
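The distance term in Eq. (3) amounts to a mean Euclidean distance over paired points. The sketch below assumes that the predicted and target points have already been paired index by index (for example, after the coordinate sorting described above); it does not reproduce the per-border weighting, and the function name is hypothetical.

```python
import math

def point_loc_loss(pred_pts, target_pts):
    """Mean Euclidean distance between paired 2D points, as in Eq. (3)."""
    if not pred_pts or len(pred_pts) != len(target_pts):
        raise ValueError("expected two non-empty point lists of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred_pts, target_pts)) / len(pred_pts)

pred = [(0.0, 0.0), (3.0, 4.0)]
target = [(3.0, 4.0), (3.0, 4.0)]
print(point_loc_loss(pred, target))  # 2.5
```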

The edge guidance match loss $L_{pmat}^{m}$ is the average of the traditional cross-entropy loss $L_{pcls}$ function and the edge point guidance match loss $L_{pg}$ :

$$L_{pmat}^{m}=\frac{1}{2}(L_{pcls}+L_{pg}). \tag{4}$$

As illustrated in Fig. 6, $L_{pg}$ first calculates the distance between the prediction point $Pred_{i}^{in}$ and the real border points, finds the nearest target point $gt_{X}^{ipt}$ to the prediction point according to the distance, and interpolates the target points to a specified number of points using an interpolation function to obtain the offset difference $X_{i}^{*}$. It also increases the classification accuracy of the target and makes the target smoother.

$$X_{i}^{*}=\arg\min_{X}\parallel Pred_{i}^{in}-gt_{X}^{ipt}\parallel_{2}. \tag{5}$$

Then, the predicted points ${Pred}_{i}^{out}$ and the matched nearest interpolated label points ${gt}_{X_{i}^{*}}^{ipt}$ are dynamically matched using the smooth L1 loss to supervise the quality of boundary prediction. The edge point guidance match loss $L_{pg}$ is smooth for small errors and robust for large errors, so it is not dominated by outliers.

$$L_{pg}=\frac{1}{n}\sum_{i=1}^{n}\parallel Pred_{i}^{out}-gt_{X_{i}^{*}}^{ipt}\parallel_{1}. \tag{6}$$
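Eqs. (4) to (6) can be sketched together as follows. The sketch assumes that the ground-truth edge points passed in have already been interpolated to the desired number of points, and it uses the plain L1 norm exactly as written in Eq. (6), whereas the text mentions a smooth L1 variant. The function names and the toy coordinates are hypothetical.

```python
import math

def nearest_gt_index(pred_in, gt_pts):
    """Eq. (5): index of the interpolated ground-truth point closest to pred_in."""
    return min(range(len(gt_pts)), key=lambda k: math.dist(pred_in, gt_pts[k]))

def point_guidance_loss(pred_in, pred_out, gt_pts):
    """Eq. (6): mean L1 distance between each output point and its matched target."""
    total = 0.0
    for p_in, p_out in zip(pred_in, pred_out):
        target = gt_pts[nearest_gt_index(p_in, gt_pts)]
        total += sum(abs(a - b) for a, b in zip(p_out, target))
    return total / len(pred_in)

def point_match_loss(l_pcls, l_pg):
    """Eq. (4): average of the point classification loss and the guidance loss."""
    return 0.5 * (l_pcls + l_pg)

gt = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]   # interpolated edge points (toy)
pred_in = [(1.0, 1.0), (9.0, 8.0)]
pred_out = [(0.5, 0.5), (10.0, 9.0)]
l_pg = point_guidance_loss(pred_in, pred_out, gt)
print(l_pg)                         # 1.0
print(point_match_loss(0.4, l_pg))  # 0.7
```

Because the matching is recomputed from the current predictions, the pairing between predicted and target points changes as training progresses, which is what makes the guidance dynamic.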

4 Experiments

In this section, we introduce the hardware environment for ore processing tasks, the datasets, the evaluation metrics, and the implementation details. Next, we conduct ablation studies to evaluate the effectiveness of the design decisions. Finally, we compare OreNeXt with the state-of-the-art methods for instance segmentation on the ore dataset.

4.1 Hardware Environment

The actual scene of the ore processing tasks is depicted in Fig. 7, revealing significant adhesion of the ore and blurred edges. The detection system is responsible for identifying the size indexes of ores on the conveyor belt, providing insights into the effectiveness of the upstream processes, and facilitating the identification of over-specification ores. Ore detection on the production line is performed through sampling inspection. The traditional manual screening involves the random selection of a set of ores as a sample to assess their compliance with required standards. The online detection method based on deep learning considers the complete ores present on the surface of the conveyor belt, as captured by an industrial camera, as a representative sample, so we only expect to segment independent and complete ore individuals to obtain their particle size information. Therefore, precise segmentation of the surface ores is crucial, particularly in accurately identifying the edges of overlapping ores. It can be seen from Fig. 7 that ore processing tasks are typically conducted outdoors, where hardware resources are limited, making the use of high-performance computing equipment impractical. Consequently, the model size becomes a critical factor in ore processing scenarios with constrained computing resources. We verify on the ORE dataset that compared with the current segmentation methods, our method achieves optimal accuracy while ensuring minimal memory usage.

4.2 Experiments Setup

4.2.1 Datasets

In the experiments, we use ImageNet [27] to pre-train the backbone model and train the instance segmentation models on our own rock dataset. For the ORE image dataset, we use 4,060 images for training and 1,060 images for validation. We supplement the dataset in the following ways. First, we collect rock images at different scales using our experimental platform. Then, we vary the arrangement of the rocks at each scale, for example sparse or dense. In addition, we break up large images into smaller ones that can be used to train networks, cropping them with a sliding window to further expand the dataset.
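The sliding-window expansion can be sketched as below. The window size of 320 is chosen to match the training resolution mentioned in the implementation details, while the stride of 160 and the image size are purely illustrative assumptions; the function name is hypothetical.

```python
def sliding_windows(height, width, win, stride):
    """Return (top, left, bottom, right) crops; the last window is snapped to the border."""
    def starts(size):
        if size <= win:
            return [0]
        positions = list(range(0, size - win + 1, stride))
        if positions[-1] != size - win:
            positions.append(size - win)  # cover the remaining border strip
        return positions

    return [(t, l, min(t + win, height), min(l + win, width))
            for t in starts(height) for l in starts(width)]

crops = sliding_windows(640, 960, win=320, stride=160)
print(len(crops))  # 15
print(crops[0])    # (0, 0, 320, 320)
print(crops[-1])   # (320, 640, 640, 960)
```

Overlapping windows let ores that are cut by one crop boundary appear complete in a neighboring crop.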

4.2.2 Evaluation Metrics

To confirm the efficiency of the proposed method, seven evaluation indicators are used: $AP^{box}$, $AP_{50}^{box}$, $AP^{mask}$, $AP_{50}^{mask}$, inference speed (FPS), inference memory consumption, and model size. The first four indicators evaluate accuracy. AP stands for Average Precision, and the number after AP is the IoU (Intersection over Union) threshold, where IoU measures the degree of overlap between the prediction and the ground truth. In general, a higher $AP_{50}$ indicates better performance of the algorithm. The latter three metrics evaluate whether a model can be deployed on low-cost hardware. The computational cost of the model is reflected by its inference speed in FPS, which we aim to maximize. The memory consumption and the model size reflect the model's reliance on the hardware system, and we strive to minimize them while maintaining accuracy.
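For reference, the IoU test underlying $AP_{50}^{box}$ can be sketched as follows. Boxes are assumed to be given as `(x1, y1, x2, y2)` corners; the full AP computation (ranking by score and integrating the precision-recall curve) is not reproduced here, and the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def matches_at_threshold(pred, gt, threshold=0.5):
    """A prediction counts as a match for AP50 when its IoU is at least 0.5."""
    return box_iou(pred, gt) >= threshold

gt_box = (0.0, 0.0, 10.0, 10.0)
print(matches_at_threshold((5.0, 0.0, 15.0, 10.0), gt_box))  # False (IoU = 1/3)
print(matches_at_threshold((2.0, 0.0, 12.0, 10.0), gt_box))  # True  (IoU = 2/3)
```

The mask variant applies the same ratio to pixel sets instead of rectangles.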

4.2.3 Implementation Details

OreNeXt is trained via back-propagation with the SGD (stochastic gradient descent) optimizer. For the ORE image dataset, all models shown in Table 1 are trained for 12k iterations on an NVIDIA RTX2080Ti GPU. We set the initial learning rate to 0.001, reduce it to 1/10 of its value at 8k iterations, and use a batch size of 8 for all ablation studies. During training, the input images are resized to have short sides between 160 and 320 pixels and long sides of 320 pixels. The experiments on the MS COCO dataset, the pre-training of StoneMLP, and the SAM-related experiments are all run on an NVIDIA RTX3090Ti GPU. To ensure fairness and reproducibility, a fixed random seed is used, and the hyperparameters of all models are set according to the best values provided in their respective papers. The parameters that achieve the highest accuracy on the validation set are kept for prediction.
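The training schedule can be summarized in a short sketch. It assumes that the decay at 8k iterations multiplies the learning rate by 0.1 and that no warm-up is used, neither of which is stated explicitly; the function name is hypothetical.

```python
def learning_rate(iteration, base_lr=0.001, step=8000, gamma=0.1, max_iter=12000):
    """Step schedule: base_lr until `step`, then base_lr * gamma until `max_iter`."""
    if not 0 <= iteration < max_iter:
        raise ValueError("iteration must lie in [0, max_iter)")
    return base_lr * gamma if iteration >= step else base_lr

for it in (0, 7999, 8000, 11999):
    print(it, learning_rate(it))
```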

4.3 Comparison with State-of-the-Art Methods

4.3.1 Classic Segmentation

To further demonstrate the superiority of OreNeXt, we conduct a comparative analysis with state-of-the-art methods on the ORE dataset. Specifically, we compare the segmentation performance of typical two-stage segmentation frameworks using various CNN, Transformer, and MLP backbone networks. The results can be found in Table 1, which compares the segmentation accuracy, inference speed, model size, and GPU memory usage.

Table 1: Comparison of accuracy with the state-of-the-art methods on ore image datasets. T-based: Transformer-based.

Method		Backbone	$AP^{box}\uparrow$	$AP^{box}_{50}\uparrow$	$AP^{mask}\uparrow$	$AP^{mask}_{50}\uparrow$	FPS $\uparrow$	Model Size (MB) $\downarrow$	Inf.Memory (MB) $\downarrow$
CNN-based	Mask RCNN[18]	ResNet101	43.0	51.7	12.1	22.1	13	480	1876
	PointRend[9]	ResNet101	37.4	54.0	21.0	43.8	17	489	3361
	YOLACT[16]	ResNet101	40.3	50.6	7.1	11.9	24	410	7758
	Mask RCNN[18]	ResNet50	42.8	51.7	12.4	22.6	17	334	1769
	PointRend[9]	ResNet50	36.3	54.0	25.8	47.7	21	344	2677
	Cascade Mask RCNN[21]	ResNet50	41.6	51.7	36.4	48.3	16	587	1947
	CondInst[28]	ResNet50	43.1	52.2	39.0	48.7	23	259	1713
	MS RCNN[29]	ResNet50	42.7	51.7	14.7	14.7	16	428	1749
	Blend Mask[17]	ResNet50	44.1	58.1	38.9	48.7	23	274	1641
	BoxInst[30]	ResNet50	44.3	56.9	34.8	48.7	23	261	1681
	SparseInst[31]	ResNet50	-	-	24.8	37.9	21	380	1711
	RTMDet[32]	CSPNeXt	38.7	49.7	35.1	43.4	22	109	2013
T-based	Mask RCNN[18]	Swin-T	38.3	52.0	27.7	46.5	19	542	1995
	PointRend[9]	Swin-T	30.6	55.4	32.8	44.7	16	371	2631
	Cascade Mask RCNN[21]	Swin-T	42.0	50.7	38.5	48.5	14	920	2063
	Mask2Former[33]	Swin-T	55.7	71.3	35.3	44.3	20	572	6098
	Mask RCNN[18]	PVTv2	39.6	51.4	27.3	44.8	24	263	1943
	PointRend[9]	PVTv2	39.8	51.8	16.1	43.0	19	185	2501
	Cascade Mask RCNN[21]	PVTv2	42.7	50.9	39.0	48.7	17	641	1907
MLP-based	Mask RCNN[18]	ASMLP	35.9	49.2	21.1	38.5	21	542	1811
	PointRend[9]	ASMLP	36.6	52.6	27.0	41.1	20	371	2389
	Cascade Mask RCNN[21]	ASMLP	42.4	50.7	38.7	48.5	19	613	2043
	Mask RCNN[18]	CycleMLP-B1	36.9	51.7	29.1	45.2	26	104	1673
	PointRend[9]	CycleMLP-B1	25.0	50.2	20.0	39.8	26	109	1843
	OreNeXt(Ours)	StoneMLP	38.2	60.4	33.8	48.9	27	73	1931

First, the CNN-based networks have higher accuracy but are slower and have larger model sizes than ours. Since the Transformer structure has the largest model capacity, the Transformer-based networks have larger model sizes, while the MLP-based networks have smaller model sizes but worse accuracy. Second, our proposed OreNeXt exceeds both the best CNN-based and the best Transformer-based algorithms by 0.2 on $AP_{50}^{mask}$, while also achieving high accuracy on $AP_{50}^{box}$. We attribute this to the SparseFPN, which balances local and global information with two MLP structures, and to the Edge Guidance Loss, which is based on a point processing strategy. Recent methods such as Mask2Former [33] with a Swin Transformer [23] backbone achieve impressive results on the $AP^{box}$ metric, well above the other methods in Table 1, but the corresponding model size (572 MB) is about 7.8$\times$ that of OreNeXt (73 MB). A small model size is crucial in ore processing tasks with limited computational resources. Therefore, models that prioritize box accuracy at the expense of model size are not suitable for field ore tasks. Furthermore, OreNeXt compares favorably with other lightweight frameworks in inference speed and model size while keeping GPU memory usage comparable, thanks to our lightweight backbone StoneMLP, which is even smaller than other lightweight MLP-based backbone networks such as CycleMLP. OreNeXt strikes a balance between accuracy and computational cost and is more suitable for ore segmentation tasks. Finally, when Swin-T and ASMLP are used as the backbone network of PointRend, their performance is similar to that of OreNeXt in terms of accuracy and model size, which supports the idea that the key to the excellent performance of the transformer is not its token mixer module but the entire architecture. Therefore, replacing the most time-consuming attention of the transformer with the StoneMLP network can also achieve excellent performance.

4.3.2 Segment Anything Model

Table 2: Comparison of OreNeXt with SAM-based instance segmentation methods. $\star$ means validation only.

Method	${AP^{box}_{50}}$	${AP^{mask}_{50}}$	FPS	Model Size (MB)
SAM[34]	42.2	44.7	3	609
RSPrompter[35]	42.7	41.6	2	631
$FastSAM^{\star}$ [36]	38.0	37.2	7	150
OreNeXt(Ours)	60.4	48.9	27	73

Among the foundation models that exhibit high performance in NLP and CV, the Segment Anything Model (SAM) [34] proposed by Meta AI shows remarkable generalization capabilities. SAM is becoming a foundational step for many visual tasks, such as image editing and remote sensing image segmentation. For the ore processing task, we compare existing instance segmentation methods based on the SAM foundation model with our method, as shown in Table 2. As a category-agnostic instance segmentation method, SAM does not show its superiority in single-class ore image segmentation. Its segmentation accuracy is average, but its model size exceeds that of most traditional segmentation methods. In single-task industrial application scenarios, SAM cannot exploit its strong generalization ability, and its huge computation cost prevents wider application in industry. At present, the traditional instance segmentation method is still the best choice for balancing accuracy and computational complexity in industrial applications. In the future, the goal is to achieve high-performance compression of large models and to develop comprehensive industrial foundation models capable of addressing multiple scenarios and tasks.

The results of the various instance segmentation methods on ore images are visualized in Fig. 8. Our method is significantly better than the other methods in mask quality and edge sharpness, and its missed detection rate is also low. Compared with the large foundation model, our method detects as many complete ores as possible. It can therefore provide the granularity information required for industrial production and meet the needs of sampling inspection.

4.4 Ablation Study

4.4.1 StoneMLP

To demonstrate the effectiveness of our proposed backbone network, we compare it with various alternative backbones. In Table 3, StoneMLP outperforms the other backbone networks on three indicators: $AP^{box}$, $AP^{mask}$, and $AP_{50}^{mask}$. Its $AP_{50}^{box}$, which evaluates the detection bounding box, is 60.4, slightly lower than the highest value of 61.9 obtained by ResNet101, while its model size of 73 MB is 6.5$\times$ smaller than that of ResNet101. A small model size is vital in ore processing tasks with limited computational resources. Since StoneMLP is designed to extract low-level edge information of ores, it also greatly outperforms the MLP-based lightweight backbone CycleMLP (99 MB) in accuracy across the board. To verify the general effectiveness of StoneMLP, we conduct experiments on the MS COCO [6] dataset, and the results in Table 4 show that the accuracy is significantly improved compared with the baseline.
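As a quick check of the size comparison, using only the Model Size column of Table 3 (ResNet101: 474 MB, CycleMLP-B1: 99 MB, StoneMLP: 73 MB):

$$\frac{474}{73} \approx 6.5, \qquad \frac{99}{73} \approx 1.36$$

So StoneMLP is about 6.5 times smaller than ResNet101 and about 1.4 times smaller than CycleMLP-B1.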

Table 3: The segmentation results of different backbones in OreNeXt

Backbone	${AP^{box}}$	${AP^{box}_{50}}$	${AP^{mask}}$	${AP^{mask}_{50}}$	Model Size (MB)
ResNet50[37]	36.8	60.8	31.1	47.4	329
ResNet101[37]	34.0	61.9	29.9	48.3	474
RegNet[38]	36.1	57.1	29.2	47.4	255
HRNet[39]	36.7	61.8	25.7	25.1	229
PVTv2[40]	36.1	61.3	22.5	47.7	178
Swin-T[23]	30.3	61.2	11.8	25.8	220
ASMLP[24]	36.4	58.9	30.5	46.8	214
CycleMLP-B1[25]	26.6	59.3	22.4	43.8	99
StoneMLP(ours)	38.2	60.4	33.8	48.9	73

Table 4: Comparison of crucial components on the MS COCO dataset

	${AP^{box}}$	${AP^{box}_{50}}$	${AP^{mask}}$	${AP^{mask}_{50}}$
Baseline	21.6	37.4	20.3	34.5
EGLoss	$19.9_{-1.7}$	$39.3_{+1.9}$	$21.1_{+0.8}$	$36.8_{+2.3}$
StoneMLP	$25.1_{+3.5}$	$41.2_{+3.8}$	$23.6_{+3.3}$	$38.8_{+4.3}$

Table 5: Comparison of using different MLP in SparseFPN

Method	${AP^{box}_{50}}$	${AP^{mask}_{50}}$	FPS	Model Size (MB)
MLP[23]	51.6	39.5	25	84
CycleMLP[25]	50.5	37.7	23	83
SparseMLP(ours)	53.0	48.5	27	77

4.4.2 SparseFPN

To improve detection accuracy while keeping the network lightweight, we designed an MLP module, SparseMLP, which enlarges the receptive field by mixing information and fusing features through channel mapping. The module can be implemented using a simple MLP structure. In Table 5, we compare the effectiveness of different MLP structures. Experimental results show that connecting two SparseMLP blocks after the FPN can enlarge the receptive field, reduce the parameter count to avoid overfitting, and promote multi-stage processing in the pyramid architecture. The SparseFPN can strengthen the information of small objects that are easily misjudged while maintaining sufficient semantic information.
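The idea of mixing information and then fusing features through channel mapping can be illustrated with a small NumPy sketch. It is an assumption-laden illustration in the style of axis-wise sparse MLPs, not the authors' SparseMLP: the function name `sparse_mlp_block`, the three-branch layout, and all weight shapes are hypothetical.

```python
import numpy as np


def sparse_mlp_block(x, w_h, w_w, w_fuse):
    """Axis-wise token mixing followed by channel fusion (illustrative only).

    x:      feature map of shape (C, H, W)
    w_h:    (H, H) weights that mix information along the height axis
    w_w:    (W, W) weights that mix information along the width axis
    w_fuse: (C, 3C) channel mapping that fuses the three branches
    """
    mixed_h = np.einsum("ij,cjw->ciw", w_h, x)  # each pixel sees its column
    mixed_w = np.einsum("ij,chj->chi", w_w, x)  # each pixel sees its row
    stacked = np.concatenate([x, mixed_h, mixed_w], axis=0)  # (3C, H, W)
    # Pointwise channel mapping, equivalent to a 1x1 convolution.
    return np.einsum("oc,chw->ohw", w_fuse, stacked)


# Usage with random weights: the output keeps the input shape.
rng = np.random.default_rng(0)
C, H, W = 4, 5, 6
feat = rng.standard_normal((C, H, W))
out = sparse_mlp_block(
    feat,
    rng.standard_normal((H, H)),
    rng.standard_normal((W, W)),
    rng.standard_normal((C, 3 * C)),
)
print(out.shape)
```

In this sketch a single block lets every position aggregate information from its whole row and column, so stacking two such blocks would let every position reach the entire feature map. The mixing weights are shared across channels, which keeps the parameter count small.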

4.4.3 Loss Function

To verify the robustness of EGLoss, we compared the baseline loss with the newly proposed edge guidance loss and found that the $AP_{50}^{box}$ and $AP_{50}^{mask}$ values were significantly improved while the model size remained constant. Additionally, we compared the accuracy stability before and after replacing the loss and observed that the $AP_{50}^{mask}$ value became relatively stable after replacing the loss. In Table 4, experiments on the public dataset MS COCO [6] show that EGLoss achieves good results on $AP_{50}^{mask}$, which is an indicator for evaluating mask quality.
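For readers less familiar with the metric, $AP_{50}^{mask}$ counts a predicted mask as a correct detection when its intersection over union (IoU) with a ground-truth mask is at least 0.5. The sketch below shows only this matching criterion on binary masks; the full AP computation additionally ranks predictions by confidence and integrates the precision-recall curve. The function names are illustrative.

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty: treat as no overlap
    return float(np.logical_and(pred, gt).sum() / union)


def is_match_at_50(pred, gt, threshold=0.5):
    """True when the predicted mask overlaps the ground truth enough."""
    return mask_iou(pred, gt) >= threshold


# Toy example: a 2x2 ground-truth ore and two candidate predictions.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True

loose = np.zeros((4, 4), dtype=bool)
loose[1:3, 1:4] = True  # covers the ore plus one extra column

shifted = np.zeros((4, 4), dtype=bool)
shifted[2:4, 2:4] = True  # overlaps the ore in a single pixel

print(is_match_at_50(loose, gt), is_match_at_50(shifted, gt))
```

Because the IoU is computed on pixels, blurred or leaking boundaries between adjacent ores lower the overlap directly, which is why sharper edges tend to show up in this metric.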

Table 6: Effects of the components of the proposed method. SM: StoneMLP, SF: SparseFPN, EL:EGLoss

Baseline	SM	SF	EL	${AP^{box}_{50}}$	${AP^{mask}_{50}}$	Model Size (MB)
✓				54	47.7	344
✓	✓			51.7	47.8	85
✓	✓	✓		53.9	48.4	77
✓	✓	✓	✓	60.4	48.9	73

4.4.4 Module Breakdown Analysis

We conducted ablation studies, as presented in Table 6, to investigate the specific contributions of each module in OreNeXt. StoneMLP incorporates shifting operations to capture local information, resulting in a slight performance improvement while significantly reducing the number of parameters. The model size decreases from 344M to 95M. Next, the SparseMLP blocks are introduced into the FPN, helping to extract small features and improve accuracy. Finally, the loss function is replaced with EGLoss, leading to a significant improvement in accuracy from 53.9 to 60.4 in $AP^{box}_{50}$ and from 48.4 to 48.9 in $AP^{mask}_{50}$.
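The step-by-step contribution of each module can be read off Table 6 by differencing consecutive rows. The following minimal sketch does this with the values copied from Table 6; the row labels and the helper name `stepwise_deltas` are illustrative.

```python
# Rows copied from Table 6: (configuration, AP50 box, AP50 mask, size in MB).
table6 = [
    ("Baseline", 54.0, 47.7, 344),
    ("+ StoneMLP", 51.7, 47.8, 85),
    ("+ SparseFPN", 53.9, 48.4, 77),
    ("+ EGLoss", 60.4, 48.9, 73),
]


def stepwise_deltas(rows):
    """Change in each metric when one more component is added."""
    deltas = []
    for prev, cur in zip(rows, rows[1:]):
        deltas.append(
            (
                cur[0],
                round(cur[1] - prev[1], 1),  # change in AP50 box
                round(cur[2] - prev[2], 1),  # change in AP50 mask
                cur[3] - prev[3],            # change in model size (MB)
            )
        )
    return deltas


deltas = stepwise_deltas(table6)
for name, d_box, d_mask, d_size in deltas:
    print(f"{name}: box {d_box:+.1f}, mask {d_mask:+.1f}, size {d_size:+d} MB")
```

Read this way, the largest reduction in model size in Table 6 comes from the backbone change, while the largest gain in $AP^{box}_{50}$ comes from EGLoss.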

4.4.5 Hyperparameter

To explore the influence of hyperparameter settings on network performance, we conducted ablation experiments on the weight of the loss function and the number of channels in the neck. We reduce the model size of SparseFPN by changing the convolution size and reducing the number of channels. The results are shown in Table 7: the model size decreases gradually as the number of channels is reduced from 256 to 32. The highest accuracy is achieved with 64 channels, namely 54.7 in $AP_{50}^{box}$ and 47.6 in $AP_{50}^{mask}$. The channels of the feature map carry the feature information, and the number of channels should be chosen according to the feature complexity of the detected objects. Reducing the number of channels can improve accuracy because the ORE dataset has fewer shots and features than the public dataset. When the number of channels is reduced from 64 to 32, the accuracy no longer increases, which indicates that feature information is lost when the number of channels is too small. The increase in model size from 32 to 64 channels is almost negligible. These results indicate that 64 channels with $1\times 1$ convolution is the optimal choice for ore segmentation.
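Why both the kernel size and the channel count affect model size follows from the parameter count of a standard 2D convolution, $k^2 C_{in} C_{out}$ weights plus $C_{out}$ biases. The sketch below evaluates this formula for a single layer with equal input and output channels. It is only meant to show the scaling: the model sizes in Table 7 cover the whole network and therefore do not scale by the same factors.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in one standard 2D convolution."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# One neck layer with equal input and output channels (an assumption).
for channels in (32, 64, 256):
    p1 = conv2d_params(channels, channels, 1)
    p3 = conv2d_params(channels, channels, 3)
    print(f"{channels:>3} channels: 1x1 -> {p1:>7,} params, 3x3 -> {p3:>7,} params")
```

A $3\times 3$ kernel needs about nine times the weights of a $1\times 1$ kernel at the same width, and doubling the channel count roughly quadruples the weights of a layer.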

Parameter $\beta$ in (2) reflects the significance of $L_{coarse}^{m}$ in training. We perform ablation experiments on the MS COCO dataset and the ORE dataset, and the results are reported in Table 8. A smaller $\beta$ achieves the highest detection accuracy on the ore dataset because it has a single category, while the MS COCO dataset, with its rich target categories, needs a larger weight so that the model does not ignore the sample features of some categories.
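The selection of $\beta$ can be sketched as follows. The Table 8 values are copied as reported, with `None` standing in for "NC". The helper `weighted_mask_loss` assumes that $\beta$ enters as a plain multiplicative weight on $L_{coarse}^{m}$ in a sum with one other term; this form and the argument name `other_loss` are assumptions made for illustration, and the exact terms are those of Eq. (2).

```python
# Mask AP50 per beta, copied from Table 8. None marks "NC" (cannot converge).
table8 = {
    "ORE": {0.5: 45.5, 1.0: 48.9, 1.5: 46.4},
    "MS COCO": {1.5: 30.8, 2.5: 36.8, 3.0: None},
}


def best_beta(results):
    """Return the beta with the highest mask AP50 among converged runs."""
    converged = {beta: ap for beta, ap in results.items() if ap is not None}
    return max(converged, key=converged.get)


def weighted_mask_loss(other_loss, coarse_loss, beta):
    """Assumed form: beta scales the coarse mask term in a weighted sum."""
    return other_loss + beta * coarse_loss


for dataset, results in table8.items():
    print(f"{dataset}: best beta = {best_beta(results)}")
```

On these values the best setting is $\beta = 1$ for the ORE dataset and $\beta = 2.5$ for MS COCO, with training failing to converge at $\beta = 3$.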

Table 7: Comparisons of different channels in SparseFPN

Channels	Convs.	${AP^{box}_{50}}$	${AP^{mask}_{50}}$	FPS	Model Size (MB)
32	1x1	52.2	46.1	25	77
	3x3	53.6	42.6	24	79
64	1x1	54.7	47.6	26	80
	3x3	52.6	46.7	24	88
96	1x1	53.2	45.6	25	90
	3x3	52.9	44.3	24	98
128	1x1	52.2	46.4	26	98
	3x3	54.6	46.5	23	109
256	1x1	52.0	44.9	23	130
	3x3	52.7	44.3	22	161

Table 8: Validation of $L_{coarse}^{m}$ in the different datasets. NC means cannot converge

	ORE			MS COCO
$\beta$	0.5	1	1.5	1.5	2.5	3
${AP^{mask}_{50}}$	45.5	48.9	46.4	30.8	36.8	NC

5 Conclusion

This study presents OreNeXt, a lightweight and efficient instance segmentation network designed for ore image segmentation. The network specifically addresses the challenging issue of edge blurring that results from ore overlap. OreNeXt is an MLP-based architecture that leverages two essential components to enhance low-level edge features. Firstly, it incorporates a StoneMLP backbone network structure that facilitates spatial information interaction through shift operations. Secondly, it incorporates a SparseFPN that effectively balances global and local information to improve accuracy. Additionally, a point-guided loss function is employed to enhance the clarity of edge segmentation. When assessed using the ore dataset, our model demonstrated superior performance in ore segmentation compared to other existing methods while also exhibiting faster inference time, decreased computational complexity, and a reduction in the number of parameters required. For future work, we will focus on further improving edge accuracy and increasing the model’s speed. We will also explore edge segmentation algorithms that can be extended to other industrial areas.

References

[1] Y. Zhang, Y. Zhou, H. Pan, B. Wu, and G. Sun, “Visual fault detection of multiscale key components in freight trains,” IEEE Trans. Ind. Informat., vol. 19, no. 8, pp. 9082–9090, 2023.
[2] S. Xia, L. Chu, L. Pei, W. Yu, and R. C. Qiu, “A boundary consistency-aware multitask learning framework for joint activity segmentation and recognition with wearable sensors,” IEEE Trans. Ind. Informat., vol. 19, no. 3, pp. 2984–2996, 2023.
[3] Y. Zhang, M. Liu, Y. Yang, Y. Guo, and H. Zhang, “A unified light framework for real-time fault detection of freight train images,” IEEE Trans. Ind. Informat., vol. 17, no. 11, pp. 7423–7432, 2021.
[4] J. Chu, Z. Guo, and L. Leng, “Object detection based on multi-layer convolution feature fusion and online hard example mining,” IEEE Access, vol. 6, pp. 19959–19967, 2018.
[5] W. Lin, J. Chu, L. Leng, J. Miao, and L. Wang, “Feature disentanglement in one-stage object detection,” Pattern Recognition, vol. 145, p. 109878, 2024.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Eur. Conf. Comput. Vis., pp. 740–755, 2014.
[7] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy, “Mlp-mixer: An all-mlp architecture for vision,” in Neural Information Processing Systems, vol. 34, pp. 24261–24272, 2021.
[8] H. Touvron, P. Bojanowski, M. Caron, M. Cord, A. El-Nouby, E. Grave, G. Izacard, A. Joulin, G. Synnaeve, J. Verbeek, and H. Jegou, “Resmlp: Feedforward networks for image classification with data-efficient training,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 5314–5321, 2023.
[9] A. Kirillov, Y. Wu, K. He, and R. Girshick, “Pointrend: Image segmentation as rendering,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 9796–9805, 2020.
[10] G. Sun, D. Huang, L. Cheng, J. Jia, C. Xiong, and Y. Zhang, “Efficient and lightweight framework for real-time ore image segmentation based on deep learning,” Minerals, vol. 12, no. 5, pp. 1–18, 2022.
[11] D. P. Mukherjee, Y. Potapovich, I. Levner, and H. Zhang, “Ore image segmentation by learning image and shape features,” Pattern Recognition Letters, vol. 30, no. 6, pp. 615–622, 2009.
[12] Y. Zhang, L. Cheng, Y. Peng, C. Xu, Y. Fu, B. Wu, and G. Sun, “Faster orefsdet: A lightweight and effective few-shot object detector for ore images,” Pattern Recognition, vol. 141, p. 109664, 2023.
[13] H. Li, C. Pan, Z. Chen, A. Wulamu, and A. Yang, “Ore image segmentation method based on u-net and watershed,” Computers, Materials and Continua, vol. 65, no. 1, pp. 563–578, 2020.
[14] J. Liu, Z. Jiang, W. Gui, and Z. Chen, “A novel particle size detection system based on rgb-laser fusion segmentation with feature dual-recalibration for blast furnace materials,” IEEE Transactions on Industrial Electronics, vol. 70, no. 10, pp. 10690–10699, 2023.
[15] Y. Liu, Z. Zhang, X. Liu, W. Lei, and X. Xia, “Deep learning based mineral image classification combined with visual attention mechanism,” IEEE Access, vol. 9, pp. 98091–98109, 2021.
[16] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT: real-time instance segmentation,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 9156–9165, 2019.
[17] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan, “Blendmask: Top-down meets bottom-up for instance segmentation,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 8570–8578, 2020.
[18] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 2980–2988, 2017.
[19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
[20] Y. Zhang, J. Chu, L. Leng, and J. Miao, “Mask-refined r-cnn: A network for refining object details in instance segmentation,” Sensors, vol. 20, no. 4, 2020.
[21] Z. Cai and N. Vasconcelos, “Cascade R-CNN: high quality object detection and instance segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 5, pp. 1483–1498, 2021.
[22] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, pp. 1–21, 2021.
[23] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 9992–10002, 2021.
[24] D. Lian, Z. Yu, X. Sun, and S. Gao, “AS-MLP: An axial shifted MLP architecture for vision,” in International Conference on Learning Representations, pp. 1–19, 2022.
[25] S. Chen, E. Xie, C. Ge, R. Chen, D. Liang, and P. Luo, “Cyclemlp: A mlp-like architecture for dense visual predictions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 12, pp. 14284–14300, 2023.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Neural Information Processing Systems, pp. 6000–6010, 2017.
[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 248–255, 2009.
[28] Z. Tian, C. Shen, and H. Chen, “Conditional convolutions for instance segmentation,” in Eur. Conf. Comput. Vis., pp. 282–298, 2020.
[29] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring R-CNN,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 6402–6411, 2019.
[30] Z. Tian, C. Shen, X. Wang, and H. Chen, “Boxinst: High-performance instance segmentation with box annotations,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 5439–5448, 2021.
[31] T. Cheng, X. Wang, S. Chen, W. Zhang, Q. Zhang, C. Huang, Z. Zhang, and W. Liu, “Sparse instance activation for real-time instance segmentation,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4423–4432, 2022.
[32] C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, and K. Chen, “Rtmdet: An empirical study of designing real-time object detectors,” arXiv:2212.0778, 2022.
[33] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 1280–1289, 2022.
[34] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollar, and R. Girshick, “Segment anything,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 4015–4026, October 2023.
[35] K. Chen, C. Liu, H. Chen, H. Zhang, W. Li, Z. Zou, and Z. X. Shi, “Rsprompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model,” arXiv:2306.16269, 2023.
[36] X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, and J. Wang, “Fast segment anything,” arXiv:2306.12156, 2023.
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 770–778, 2016.
[38] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, “Designing Network Design Spaces,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 10425–10433, 2020.
[39] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 5686–5696, 2019.
[40] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in IEEE/CVF Int. Conf. Comput. Vis., pp. 548–558, 2021.