Surface Defect Detection of Industrial Parts Based on YOLOv5
ABSTRACT Industrial product quality inspection is a key procedure in industrial production and is essential for ensuring product yield. Safety and quality inspections on industrial assembly lines are still predominantly manual, and safe, dependable automated inspection techniques remain scarce. To improve the quality inspection of industrial production parts, an improved surface defect detection approach based on YOLOv5 is proposed for the problem of surface flaws in industrial components. To improve dense object detection, image features extracted by the convolutional network are enhanced with coordinate attention. A BiFPN is utilized to fuse multi-scale features in order to lower the missed-detection and false-detection rates for small target samples. For the difficult problem of fine-grained detection, detectors with a Transformer structure are added to improve the prediction of challenging instances. Experimental results on an industrial parts defect dataset show that the proposed network increases the recall of the original algorithm on abnormal classes by 5.3%, reaching 91.6%, while its inference speed can approach 95 FPS, indicating good real-time detection performance.
INDEX TERMS Defect detection, YOLOv5, transformer, deep learning, fine-grained detection.
of industrial products. Yang et al. [9] locate various defect positions through the object detection method and distinguish defect categories using an improved classification network. Zhao et al. [10] extract defect information by means of the instance segmentation method, output the prediction results through a subsequent network, and enrich the training data with weakly supervised learning. Still, real-time detection cannot be guaranteed because of the slow recognition speed. The industrial field requires real-time detection performance, ideally even on small embedded devices, so this application scenario calls for a lightweight model with a high detection frame rate.

To solve the problems of low detection accuracy and the inability to perform real-time detection in traditional methods [11], [12], a mechanical product defect detection system for industrial assembly lines is proposed in this paper based on the object detection method. Parts with defects such as deformation and contamination are marked when their appearance is detected to be defective, which facilitates subsequent early warning and rejection of defective products. Different from other object detection methods for defect detection of industrial parts, this paper detects different abnormal states of the same category at the instance level, which belongs to fine-grained detection and is characterized by the distinction between different abnormal categories. Because the differences between samples in different states are small, it is more difficult to identify samples correctly. Using a lightweight model with high real-time performance, the proposed method can provide a computer vision-based solution for current industrial production through deployment on embedded devices, so as to enhance the quality of products on industrial assembly lines. On the premise of ensuring real-time detection performance, and given the difficulty the existing YOLOv5s method has in classifying defects for small samples with low recall rates and slight sample differences, the model is improved in this paper to make it more suitable for tiny targets and difficult samples. The main contributions are:
1) Coordinate attention is added to the feature extraction module, which significantly improves the detection performance of the model with minimal computational overhead.
2) A bidirectional multi-scale fusion module is used to optimize the model hierarchy and fuse additional layers of features without extra computation. It also enhances the feature fusion ability of the network and raises the recall rate for small target samples.
3) Aiming at the missed detection of fine-grained samples in the dataset, a detector with a Transformer structure is proposed to enhance the feature extraction capability of the model and effectively increase the recognition accuracy of difficult target samples.
The remainder of this paper is organized as follows. In Section 2, we introduce related work on object detection in recent years. Section 3 presents the details of the proposed method. The implementation of the proposed method and its comparison with previous methods are presented in Section 4. Section 5 summarizes the conclusions of this work and suggests future research directions.

II. RELATED WORK
Object detection includes two parts, classification and localization, and its application fields are broad, including face detection, pedestrian detection, and vehicle detection [13]. Traditional object detection algorithms adopt sliding windows to detect objects without any pertinence, which is inefficient and inaccurate, and the manually selected features are less robust to irregular objects with different shapes [14]. With the advancement of deep learning technology, image feature extraction by convolutional neural networks has become a common approach [15], [16], [17]. Meanwhile, object detection, as one of the hot spots in machine vision research, has stimulated the appearance of numerous excellent algorithms [18], [19], [20]. The emergence of abundant networks has played a critical role in promoting the development of deep learning. For example, ResNet [21] proposed the concept of residual blocks, which significantly increased the practical depth of networks. New feature extraction methods have also been provided for image detection with the help of the attention mechanism [22]. Methods such as the assemblable attention module proposed by SENet [23] bring accuracy improvements to convolutional networks. DETR [24] uses a classical convolution structure to encode the extracted image features and completes both classification and localization through the Transformer structure; for detection, an innovative Hungarian loss function is used to match the decoded target classes, rather than the initial anchor design of general detection networks. Similar to natural language processing, ViT [25] encodes segmented and serialized image patches as input to the Transformer and directly obtains the coordinate positions and categories of targets through encoding and decoding. Swin Transformer [26] improves on ViT and addresses its enormous computational cost through hierarchical feature mapping and shifted-window attention.

So far, two main branches of deep learning-based object detection methods have emerged: two-stage object detection models based on a region proposal network and one-stage object detection models that directly perform position regression [27]. YOLOv5 is an efficient and stable one-stage object detection method with greatly enhanced speed and accuracy, and it can quickly adapt to new tasks after transfer learning. The input of YOLOv5 is an RGB image with a size of 640*640. Its overall network design is divided into a backbone network based on CSPNet [28], a multi-scale feature fusion module based on the FPN [29]+PAN [30] structure, and detectors for output classification and bounding box regression.

The backbone of YOLOv5 includes Focus, BottleneckCSP and SPP. The first two components mainly undertake image feature extraction.
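As background for the baseline described above, the sketch below runs the stock YOLOv5s model through the public Ultralytics torch.hub entry point at the 640*640 input size. It is illustrative only: the image path and thresholds are placeholders, and the paper's improved network is not reproduced here.

```python
import torch

# Load the stock YOLOv5s model from the public Ultralytics hub entry point
# (illustrative baseline only; not the improved network proposed in the paper).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Confidence and NMS IoU thresholds are attributes on the hub wrapper;
# the values below are placeholders, not the paper's settings.
model.conf = 0.25
model.iou = 0.45

# Run inference on one RGB image; the wrapper letterboxes it to 640x640.
results = model("part_image.jpg", size=640)
results.print()          # summary of detections
boxes = results.xyxy[0]  # tensor rows: [x1, y1, x2, y2, confidence, class]
```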
FIGURE 1. Structure of the proposed network. In the inference stage, the input is an RGB image from the camera, and the output prediction is the original picture with marker boxes. CSP_CA represents the CSP module with Coordinate Attention.
III. METHODOLOGY
The proposed network structure is shown in Figure 1. Some
deep-level features are extracted by adding the CSP unit
with the CA module. The BiFPN is utilized to integrate the
features, simplify a portion of the network structure, and pay close attention to features at different levels. To locate and classify targets, the fused features are transmitted to the corresponding detectors according to their resolutions.

FIGURE 2. Structure of Coordinate Attention. It applies average pooling along the horizontal and vertical directions, transforms the pooled features to encode spatial information, and finally fuses the spatial information by weighting on the channels.
A. COORDINATE ATTENTION
Images of industrial parts and the mechanical components they contain are usually accompanied by a complex background environment. The YOLOv5 network stacks multiple CSP residual modules for feature extraction, which can continuously accumulate redundant information during network iteration and reduce the detection accuracy. In view of the confusion of targets during dense detection, this paper optimizes the overall feature extraction ability of the model by adding Coordinate Attention (CA) [31] into the CSP structure, embedding position information into the attention module.

Attention mechanisms in computer vision, which aim to mimic the human visual system, can efficiently capture salient regions in complex scenes and have brought progress to multiple vision tasks. Through the attention mechanism, the input image features can be dynamically weighted. SENet improves the recognition performance of the convolutional network by using attention to optimize the feature extraction capability of the model at the level of feature channels and spatial information. However, the attention modules in methods such as SENet and CBAM [32] only consider internal channel information and ignore the importance of location information. It is undeniable that the spatial structure of objects in vision is of great significance. Based on CBAM, coordinate attention is simplified, as shown in Figure 2. Given an input X, a pooling window of size (H, 1) or (1, W) is set along the horizontal and vertical coordinates. By using the two parallel one-dimensional feature encodings obtained for each channel, spatial coordinate information is integrated efficiently, and the coordinate attention acquired through the subsequent convolution structure is used to re-weight the input features. This ensures that the network's feature extraction ability is enhanced with little computational overhead, while obtaining more receptive field information.

The combination of upsampling and downsampling on the feature pyramid for multi-scale feature fusion can obtain deeper semantic information. However, the shallow features of the neck will be diluted, hindering the full combination of image features between deep layers and shallow layers. Considering the many small instances in the defect detection dataset of industrial parts, and the difficulty of distinguishing features at the deep level, shallow features are combined with in-depth features by the BiFPN [33]. The attention computation enhances the flow of shallow feature information, making the model assign weights with a bias toward small target samples rather than using direct summation, as in PANet.
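To make the pooling-and-reweighting flow described for coordinate attention concrete, here is a minimal PyTorch sketch of a coordinate-attention block in the spirit of [31]. It is a simplified illustration rather than the authors' released module; the class name and reduction ratio are our own choices.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Simplified coordinate attention: pool along H and W separately,
    encode the two directions jointly, then re-weight the input."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        # Separate 1x1 convs produce the horizontal and vertical attention maps.
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # (H, 1) and (1, W) average pooling -> two direction-aware descriptors.
        x_h = x.mean(dim=3, keepdim=True)                       # N x C x H x 1
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # N x C x W x 1
        y = torch.cat([x_h, x_w], dim=2)                        # N x C x (H+W) x 1
        y = self.act(self.bn1(self.conv1(y)))                   # shared transform
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                             # N x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))         # N x C x 1 x W
        return x * a_h * a_w   # position- and channel-wise reweighting of the input
```

In the proposed network such a block would sit inside the CSP unit (the CSP_CA module of Figure 1); for example, `CoordinateAttention(256)(torch.randn(1, 256, 40, 40))` keeps the feature-map shape unchanged.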
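Likewise, the following small sketch shows the learnable weighted fusion that BiFPN-style necks use in place of PANet's plain summation. The epsilon-normalized "fast fusion" follows the idea in [33]; the class and variable names are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse same-resolution feature maps with learnable, normalized weights
    (fast normalized fusion, as in BiFPN) instead of plain addition."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        # Keep the weights non-negative, then normalize so they sum to ~1.
        w = torch.relu(self.weights)
        w = w / (w.sum() + self.eps)
        return sum(w[i] * feat for i, feat in enumerate(inputs))

# Example: fuse a shallow feature map with an upsampled deep one of equal shape.
fuse = WeightedFusion(num_inputs=2)
p_shallow = torch.randn(1, 256, 80, 80)
p_deep_up = torch.randn(1, 256, 80, 80)
fused = fuse([p_shallow, p_deep_up])   # shape: 1 x 256 x 80 x 80
```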
FIGURE 5. The Transformer module. The top part is the standard convolutional prediction head; the bottom part is the Transformer prediction head, which consists of Multi-Head Attention, MLP and other modules. L represents the Linear layer.
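As a rough illustration of the kind of Transformer prediction head sketched in Figure 5 (multi-head self-attention followed by an MLP over flattened feature-map tokens), here is a minimal PyTorch encoder block. It follows the general idea used in TPH-YOLOv5 [34] and is not the authors' exact module; the head count and MLP ratio are placeholders.

```python
import torch
import torch.nn as nn

class TransformerHeadBlock(nn.Module):
    """One encoder block applied to a CNN feature map: the H*W positions are
    treated as tokens and passed through multi-head self-attention and an MLP."""
    def __init__(self, channels: int, num_heads: int = 4, mlp_ratio: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio),
            nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                         # N x (H*W) x C
        t = self.norm1(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]   # attention + residual
        tokens = tokens + self.mlp(self.norm2(tokens))                # MLP + residual
        return tokens.transpose(1, 2).reshape(n, c, h, w)

# The block preserves the feature-map shape, so the usual 1x1 detection conv
# (class scores + box offsets) can follow it unchanged.
out = TransformerHeadBlock(256)(torch.randn(1, 256, 20, 20))   # 1 x 256 x 20 x 20
```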
TABLE 3. Detection results of the different proposed structures on an industrial part defect dataset. Recall(abnormal) denotes the recall rate over all classes except the normal class.
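For clarity, the short sketch below shows one plausible way to compute a Recall(abnormal)-style metric from per-class true positives and false negatives by pooling every class except the normal one; the paper does not spell out whether it pools or macro-averages, and the counts and class names here are hypothetical.

```python
def abnormal_recall(tp: dict, fn: dict, normal_class: str = "normal") -> float:
    """Recall over the abnormal classes only: TP / (TP + FN),
    pooled across every class except the normal one (one possible convention)."""
    tp_sum = sum(v for k, v in tp.items() if k != normal_class)
    fn_sum = sum(v for k, v in fn.items() if k != normal_class)
    return tp_sum / (tp_sum + fn_sum) if (tp_sum + fn_sum) else 0.0

# Hypothetical counts for illustration only (not the paper's data).
tp = {"normal": 480, "twist": 88, "dirty": 91, "incomplete": 85}
fn = {"normal": 12, "twist": 10, "dirty": 7, "incomplete": 9}
print(f"Recall(abnormal) = {abnormal_recall(tp, fn):.3f}")
```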
FIGURE 8. Validation curves over training epochs. It compares the validation mAP of RetinaNet, YOLOv5, EfficientDet-D0 and our method during the training stage.

As can be seen from Figure 8, the proposed model achieves higher detection accuracy than YOLOv5s, and the mAP reaches 0.756, which is 2.2 percentage points higher than the original network. It can be seen from Table 2 that the proposed method achieves almost the highest detection accuracy among methods of the same type, reaching 93.6%, while its detection speed is also at a high level. The inference speed on an A100 graphics card reaches 95 FPS, which still meets the needs of real-time detection. The experiments demonstrate that our model remains competitive on the dataset.

To verify the optimization effect of each proposed module in the network, ablation experiments are carried out on the proposed method. The experimental results are summarized in Table 3. Recall(abnormal) is the primary evaluation indicator for the abnormal detection of industrial parts. After adding the coordinate attention module, the average precision of the model is increased by 0.5%. After using the bidirectional multi-scale fusion module for feature integration, the abnormal recall rate of the detection model is increased by 2.3%, indicating that the prediction accuracy of the method for small target samples is significantly improved. The addition of the Transformer detector brings a further clear improvement in both recall and precision. Figure 9 shows the precision-recall curve of the detection performance of the proposed model. The detection speed decreases due to the increased number of parameters and computation brought by its structure. Overall, the results show that the proposed detection model is superior to the original YOLOv5.

FIGURE 10. Feature dimensionality reduction visualization in t-SNE. The features are taken from the feature maps of the penultimate layer of the network and reduced to two dimensions by PCA. Different colored dots represent different categories.

D. ANALYSIS
Figure 10 presents a comparison of the prototype distribution of classification features learned by the original and proposed models. It indicates that the model with the bidirectional multi-scale fusion module still faces a small amount of sample confusion after network fine-tuning, but its classification interval is more apparent. The model with the added Transformer detector clearly distinguishes the vast majority of samples, reduces the overlap between categories, and realizes a more balanced overall spacing of features, proving that the proposed method can improve the representation ability of the feature space for effective object detection.

Specifically, to compare the results of the proposed network more intuitively, some pictures in the test dataset and real pictures were selected for testing. For a clearer comparison, the confidence thresholds of the two networks were set to 0.45, and the non-maximum suppression IoU threshold was set to 0.3.

FIGURE 11. Prediction results with marker boxes. The left side of each predicted picture is from YOLOv5, and the right side is from the proposed model. Part (a) compares results in a small-target detection situation; part (b) compares detection results in a dense situation.

Figure 11 shows the detection results of the YOLOv5s model and the proposed model on the left and right sides, respectively. In Figure 11(a), owing to the long distance from the detection target to the image acquisition device, the detected targets tend to be tiny overall; on the left the confidence is lower and there are some false detections, while small target objects are detected more accurately on the right. In Figure 11(b), dense targets make some of the prediction boxes on the left inaccurate or missed, while the detection results on the right are improved.

Experiments are also carried out to verify the effectiveness of the proposed modules in the actual production environment. We set up control groups in the factory based on different environments. Each control group contains 20 batches of samples from the abnormal category, with four different abnormal samples in each batch. On an assembly line, samples from the same batch were photographed in different environments. In the normal control group, the camera is situated between 65 and 85 cm away from the object, the indoor illumination is around 170 lx, and the camera is brand-new. The samples were also placed in two experimental distance settings, with camera-to-object distances of 100 cm and 120 cm. The illuminance comparison group was set up with two distinct illuminance environments, namely a low-light group with an illuminance of approximately 100 lx and a high-light group with an illuminance of about 220 lx. A dirty-camera control group was established that shot the samples with the same model of camera, one that had been in use in the factory for about 14 months. Within each environment, every batch was shot five times with fine-tuning of the shooting angle, and a total of 400 samples were collected. The dataset was then enlarged using data augmentation methods including horizontal flip, vertical flip, and random cropping. The resulting 1600 samples were summarized and sorted into an industrial parts environmental comparison dataset.
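The flip-and-crop augmentation described above is standard; a minimal torchvision sketch of such a pipeline is shown below. The probabilities and crop size are placeholders rather than the paper's settings, and in a detection setting the bounding boxes would have to be transformed together with the image (e.g., with a box-aware augmentation library), which this image-only sketch does not do.

```python
from torchvision import transforms

# Illustrative augmentation pipeline: horizontal flip, vertical flip, random crop.
# Probabilities and the crop size are placeholders, not the paper's settings.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomResizedCrop(size=640, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

# Applying the pipeline several times to each photo enlarges the dataset,
# e.g. 400 collected images -> 1600 augmented samples as described above.
```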
TABLE 4. Prediction results in different environments.

The prediction results of the proposed model on this dataset are shown in Table 4. It can be seen that the prediction recall rate of the model is affected to different degrees in the different environments. Among them, the Twist samples are more significantly affected when the camera is far away or dirty, with a maximum drop of about 5 percentage points. In low-light conditions, the Dirty class samples are more affected, and their recall rate decreases by 6.7 percentage points. The recall rate of the remaining categories is only slightly influenced by the environment. Additionally, it was observed during the experiment that samples from the Twist and Incomplete categories were marginally impacted by the shooting angle. According to the comparative experiments above, the proposed method loses a small amount of recall on abnormal samples when the illumination and camera height of the real production environment change slightly, but it can still meet the detection requirements.

The proposed model is also suitable for use with embedded devices. After converting the model to ONNX, we migrate it to an NVIDIA Jetson NX and build a detection system on it. Due to the limited computing power of the device, the detection speed on the Jetson NX after porting is about 35 FPS. When the detection system uses monitors to output the detection videos, the detection speed of the model decreases, because rendering the videos takes up part of the computation; it declines to 31 FPS in our experimental environment. The above data are measured under the condition that the detection accuracy is unaffected and the model is ported without quantization. The model can be quantized to reduce the number of parameters and calculations and to speed up inference on embedded devices for real-time detection, but the quantization operation will result in some recall loss that is related to the model compression rate.
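A minimal sketch of the ONNX conversion step mentioned above, using PyTorch's standard exporter; the checkpoint and file names, input size, and opset version are illustrative, and the YOLOv5 repository also ships its own export script that wraps this step.

```python
import torch

# `model` is assumed to be the trained detection network in eval mode;
# loading it from a checkpoint path like this is illustrative only.
model = torch.load("improved_yolov5s.pt", map_location="cpu").float().eval()

dummy = torch.zeros(1, 3, 640, 640)   # one 640x640 RGB input, as used by the network
torch.onnx.export(
    model, dummy, "improved_yolov5s.onnx",
    opset_version=12,
    input_names=["images"],
    output_names=["predictions"],
)
# The resulting .onnx file can then be run on the Jetson NX, for example through
# ONNX Runtime or after conversion to a TensorRT engine.
```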
V. CONCLUSION
In this study, we propose an end-to-end lightweight defect detection model for industrial parts based on an improved YOLOv5. The detector achieves excellent detection accuracy and real-time detection on edge computing devices. Our contributions concentrate on three aspects: applying coordinate attention to the feature extraction module to improve the detection performance of the model, optimizing the model hierarchy through the BiFPN to reduce the false detection rate and missed detection of small target samples, and adding the Transformer detector to increase the recognition accuracy of difficult samples. The experimental results demonstrate that the proposed algorithm improves the performance of defect detection for industrial parts under the premise of real-time detection and can help improve the yield in industrial production, transportation, and other scenarios. Currently, the algorithm still falls slightly short in detecting the defects of parts with occlusion under a fixed shooting angle. Future research will further adjust the structure and investigate how to improve the recognition accuracy through multi-angle collaborative detection to achieve better detection performance.

A. ABBREVIATIONS
AP                Averaged AP at IoUs from 0.5 to 0.95 with an interval of 0.05
AP50              AP at IoU threshold 0.5
AP75              AP at IoU threshold 0.75
BiFPN             Bi-directional feature pyramid network
CA                Coordinate attention
CBAM              Convolutional block attention module
end-to-end        The input is the original data, and the output is the final result
FLOPs             Floating-point operations per second
FPN               Feature pyramid network
IoU               Intersection over union
lx                Lux, the unit of illuminance
Recall(abnormal)  Recall rate of abnormal samples
SSD               Single Shot multibox Detector
YOLO              You Only Look Once

REFERENCES
[1] H. Wang, J. Wang, G. Zhang, X. Ouyang, and F. Luo, "Improved FPN's mask R-CNN for industrial surface defect detection," Manuf. Automat., vol. 42, no. 12, pp. 35-40 and 97, Dec. 2020.
[2] Y. Chen, Y. Ding, F. Zhao, E. Zhang, Z. Wu, and L. Shao, "Surface defect detection methods for industrial products: A review," Appl. Sci., vol. 11, no. 16, p. 7657, Aug. 2021.
[3] H.-Y. Lee and T.-E. Lee, "Scheduling single-armed cluster tools with reentrant wafer flows," IEEE Trans. Semicond. Manuf., vol. 19, no. 2, pp. 226-240, May 2006.
[4] D. V. Slavov and V. D. Hristov, "3D machine vision system for defect inspection and robot guidance," in Proc. 57th Int. Sci. Conf. Inf., Commun. Energy Syst. Technol. (ICEST), Jun. 2022, pp. 1-5.
[5] M. Foumani, M. Y. Ibrahim, and I. Gunawan, "Scheduling dual gripper robotic cells with a hub machine," in Proc. IEEE Int. Symp. Ind. Electron., May 2013, pp. 1-6.
[6] T. Czimmermann, G. Ciuti, M. Milazzo, M. Chiurazzi, S. Roccella, C. M. Oddo, and P. Dario, "Visual-based defect detection and classification approaches for industrial applications—A survey," Sensors, vol. 20, no. 5, p. 1459, Mar. 2020.
[7] H. Chang, J. Gou, and X. Li, "Application of faster R-CNN in image defect detection of industrial CT," J. Image Graph., vol. 23, no. 7, pp. 1061-1071, 2018.
[8] X. Yue, Q. Wang, L. He, Y. Li, and D. Tang, "Research on tiny target detection technology of fabric defects based on improved Yolo," Appl. Sci., vol. 12, no. 13, p. 6823, Jul. 2022.
[9] Z. Li, X. Tian, X. Liu, Y. Liu, and X. Shi, "A two-stage industrial defect detection framework based on improved-YOLOv5 and optimized-inception-ResNetV2 models," Appl. Sci., vol. 12, no. 2, p. 834, Jan. 2022.
[10] J. Božič, D. Tabernik, and D. Skočaj, "Mixed supervision for surface-defect detection: From weakly to fully supervised learning," Comput. Ind., vol. 129, Aug. 2021, Art. no. 103459.
[11] Q. Luo, X. Fang, L. Liu, C. Yang, and Y. Sun, "Automated visual defect detection for flat steel surface: A survey," IEEE Trans. Instrum. Meas., vol. 69, no. 3, pp. 626-644, Mar. 2020.
[12] J. Yang, S. Li, Z. Wang, and G. Yang, "Real-time tiny part defect detection system in manufacturing using deep learning," IEEE Access, vol. 7, pp. 89278-89291, 2019.
[13] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 3212-3232, Nov. 2019.
[14] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 17-24.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 2, pp. 84-90, Jun. 2012.
[16] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1-9.
[18] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[19] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21-37.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1-11.
[23] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132-7141.
[24] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 213-229.
[25] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[26] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012-10022.
[27] H. T. Lu and Q. C. Zhang, "Applications of deep convolutional neural network in computer vision," J. Data Acquisition Process., vol. 31, no. 1, pp. 1-17, 2016.
[28] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 390-391.
[29] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117-2125.
[30] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8759-8768.
[31] Q. Hou, D. Zhou, and J. Feng, "Coordinate attention for efficient mobile network design," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13713-13722.
[32] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3-19.
[33] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10781-10790.
[34] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, "TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 2778-2788.

HAI FENG LE received the bachelor's degree from the College of Urban Rail Transit and Logistics, Beijing Union University, in 2019. He is currently pursuing a graduate degree with the School of Robotics, Beijing Union University. His research interests include deep learning and its applications, and computer graphics.

LU JIA ZHANG was born in Beijing, China, in 1996. She received the bachelor's degree in computer science and technology from the Smart City College, Beijing Union University, in 2019. She is currently pursuing a graduate degree with the School of Robotics, Beijing Union University. Her research interests include image recognition, deep learning, and their applications.

YAN XIA LIU received the Ph.D. degree from the School of Automation and Electrical Engineering, University of Science and Technology Beijing, in 2013. She is a Professor with the College of Urban Rail Transit and Logistics, Beijing Union University. Her current research interests include pattern recognition, computer vision, deep learning, and intelligent instruments.