YOLOv5 Based Object Detection in Reel Package X-Ray Images of Semiconductor Component
Procedia Computer Science 00 (2023) 1–22
Jinwoo Park1,2 , Jaeheoung Lee3 , Jongyeol Lim4 , Yoonsung Jeon5 , Junik Hwang6 , Jongpil Jeong1,∗
1 Department of Smart Factory Convergence, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon 16419, Gyeonggi-do, Korea; lifee40@gmail.com
2 UCT AI Research Lab
3 Department of Smart Factory Convergence, Sungkyunkwan University
4 Department of Mechanical Engineering, Sungkyunkwan University
5 Department of Advanced Materials Science & Engineering, Sungkyunkwan University
6 Department of Chemical Engineering, Sungkyunkwan University
Abstract
With the development of artificial intelligence (AI) technology, companies are rationalizing production-site facilities for smart factories and applying AI to production and inspection processes. In manufacturing, AI-based computer vision can replace existing rule-based systems and add competitiveness to industrial sites. To respond to the innovative business management paradigm in manufacturing, the construction of smart factories in the domestic manufacturing industry is advancing, and the introduction of manufacturing execution systems (MES) that apply AI to industrial sites is becoming important. Small object detection using deep learning is a useful technology in the production field and can be applied to many other fields. YOLOv5 is a fast, high-performance, one-stage object detector; it is implemented in PyTorch, is lighter than existing models, and is easily accessible to users. To recognize small objects, we propose an improved model, the UCT model, which adds one detection layer to the existing three layers of YOLOv5 and achieves superior accuracy and speed. X-ray images of semiconductor parts were taken, and excellent performance was obtained after training with the UCT model. For the advancement of MES, the method can be applied in-line to the inspection process in the surface mount technology industry, which handles many small devices. The mAP of the UCT model was 0.622, much improved over the 0.349 of the YOLOv5 model. The accuracy of the UCT model was 0.865, much improved over the YOLOv5 model's 0.552, so that objects in reel units can be detected during inference and the method can be applied in the field.
1. Introduction
Parts manufacturing companies once had little difficulty with material management because the number of items and production quantities were small and production was done manually. However, with the development of industrial society, the number and quantity of items produced increased, and the complexity of material management
∗ Corresponding author
email address: jpjeong@skku.edu; Tel.: +82-31-299-4267, Department of Smart Factory Convergence, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon 16419, Gyeonggi-do, Korea
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4339978
increased as small-quantity production of various types grew, resulting in difficulties in making material purchase decisions. With the development of Fourth Industrial Revolution technology, smart factories have been introduced into the manufacturing process and MES adoption is being promoted. Smart factories include factory automation, in which various machines perform processes automatically. With the introduction of smart factories, field workers demand accuracy in material management, which involves the receipt and issue of required materials and parts. Small object recognition is a field of research in which a computer analyzes and infers visual information obtained in modern industrial settings. The technique has been applied to a wide range of industries, with applications in image recognition, face recognition, IoT, autonomous driving, production and manufacturing, and the defense industry. As demand for small object recognition has increased, object recognition research has become active, driven by benchmarks ranging from PASCAL and ImageNet [1] to, more recently, MS COCO [2], and technological advances continue. In the past, the approach used in object recognition research was to find an object by
designing and detecting its features, using algorithms such as the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) [3]. On the exterior, semiconductors are rectangular in shape, with rounded or angular corners, and their thickness can be characterized using contours. In a deformable part-based model [4], a feature map is constructed by dividing an object into small parts and connecting each part appropriately, and recognition was developed by linking the resulting structure to a learning method such as a support vector machine (SVM) [5]. However, convolutional neural networks (CNNs) showed overwhelmingly positive results in ImageNet 2012, so interest in deep learning increased. Using a CNN, LeCun demonstrated strong performance in recognizing handwritten digits [6]. CNNs made it possible to extract local information from an image, which was not possible with conventional neural networks. After that, research built networks more deeply through CNN hidden layers: ZFNet [7], VGG [8], ResNet [9], GoogleNet [10],
and DenseNet [11]. CNNs have been used successfully to find objects in images, but accurately locating an object within an image remained a challenge, and research on methods for extracting object locations emerged. Region-based convolutional neural networks (R-CNNs) applied deep learning to this task, but R-CNNs are slow, so Fast R-CNN was developed to compensate for the slow detection speed. Even so, finding candidate regions of objects remained difficult due to time cost and system quality. In Faster R-CNN, selective search was removed, regions of interest (ROIs) were calculated through a region proposal network (RPN), and the learning efficiency was increased through the GPU. Afterwards, region-based fully convolutional network (R-FCN) models were developed, in which the RPN creates region proposals over fully convolutional layers. Although the speed of deep learning detectors greatly improved through these developments, applications such as robots and autonomous driving demanded real-time processing, and further methods were proposed to meet that demand. Recently, studies showing fast detection speed on mobile devices, such as the single shot multibox detector (SSD), are also in progress. In recent years, as 3D scanners have spread and 3D model data have been converted into databases, deep learning methods using 3D images have also been actively progressing. This study intends to replace the existing rule-based X-ray image identification method with an improved version of YOLOv5, which has recently shown excellent performance, in the hope that the improved method will be applied to industrial sites to help
inventory management and quantity identification. The contributions of this study include:
• Proposal of a specialized model that contributes to computer vision as an efficient object recognition method by improving various methods for small object recognition.
• By recognizing and quickly digitizing images of semiconductor parts, the results are linked to the MES to add value as information for operating the inventory management system.
• By extending horizontal application to production sites in various fields, such as PCB defect and LED display defect inspection, AI technology can be added to the entire production process to advance smart factories.
• Bill of materials (BOM) information is calculated so that receipts and issues can be recorded quickly, and the MES composes the process accurately to increase the company's productivity and reduce the lead time of production activities, helping improve productivity through the smooth supply of necessary materials.
Section 2 of this paper describes the semiconductor reel package, X-ray images, object detection, and YOLOv5 before explaining the proposed system. Section 3 describes the proposed improved YOLOv5 model and the composition and role of the process architecture. Section 4 presents the implementation, training on X-ray image data collected from an actual factory, the calculation of results, and the evaluation of the model in comparison with other models. Finally, Section 5 summarizes the proposed architecture, implementation, and test results, and describes future research directions.
2. Related research
2.1. Semiconductor Reel Package

Tape-and-reel packaging allows semiconductor components to be stored, carried, and transferred easily. Semiconductor components are packed in a tape-and-reel configuration, which protects die units during shipment, storage, and handling. During automated component placement by a surface mount technology (SMT) machine, placement machines can pick and place thousands of components per hour with a very high degree of accuracy [12]. As shown in Figure 1, the packing configuration consists of a carrier tape (cavity tape) for storing semiconductor chips, a cover tape for sealing and protection, and a reel for carrying the sealed material.
Before supply, the semiconductor chip goes through various inspection processes; after these are complete, the chip is packaged in a carrier such as a plastic tray or tape-and-reel and then put into the PCB mounting process. The packaged semiconductor chips are inserted into a carrier tape at regular intervals; after the inspected chip is placed in the carrier tape, cover tape is placed on the cavity tape and pressed at a certain temperature and pressure using a heating blade unit for sealing. If the temperature of the heating block unit is too high, the cover tape may not withstand it and may be damaged, and if it is too low, bonding may not be performed properly [13].
Carrier tape is made from a multi-layer polystyrene (PS) sheet, which is extruded and laminated. The carrier tape carries the device, and a cover tape covers the carrier tape to protect the chips from external impact; the taped components are supplied in roll form. The pocket design of the carrier tape provides maximum protection against transportation hazards for electronic devices while accommodating efficient PCB assembly requirements [14]. Cover tape consists of three film layers: the top layer is a polyester base film, the middle layer is an olefin film, and the bottom layer is a heat-activated adhesive coating [14]. Most applications use a heat- and pressure-sensitive adhesive to ensure a consistent seal to the carrier tape. Lastly, reels that contain the sealed carrier tape are constructed from PS [15]. During the tape-and-reel packing process, the taper machine seals the carrier tape and cover tape together tightly to keep the sealed devices on the reel. In this study, objects are detected in reel package X-ray images using YOLOv5.
2.2. X-ray Images
Wilhelm Röntgen discovered X-rays in 1895 and received the first Nobel Prize in Physics in 1901, and the penetrating power of X-rays has been used for radiography ever since. Currently, diffraction, fluorescence, and total reflection are widely used to reveal the properties and structures of materials. Electromagnetic waves with wavelengths between 10⁻¹¹ and 10⁻⁹ m have strong penetrating power, and in particular, transmittance varies according to the density of the material. This principle is used in medical equipment and in industrial non-destructive testing equipment that image the inside of an object. In addition, semiconductor reels can be imaged and counted, and we aim to improve this counting using deep learning. Because strong penetrating power is used to observe the inside, destructive pre-processing techniques such as cutting are unnecessary and the inspection process is very simple, so X-ray images are very easy to use [16].
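The density-dependent transmittance described above follows the Beer-Lambert attenuation law. As a rough sketch (the attenuation coefficients and thickness below are illustrative values, not measurements from the paper's equipment):

```python
import math

def transmitted_intensity(i0, mu, thickness_cm):
    """Beer-Lambert law: X-ray intensity after passing through a material
    with linear attenuation coefficient mu (1/cm) of the given thickness."""
    return i0 * math.exp(-mu * thickness_cm)

# A denser material (higher mu) transmits less and appears darker on the
# detector -- this contrast is what separates chips from empty pockets.
i_chip = transmitted_intensity(1.0, mu=2.0, thickness_cm=0.05)  # silicon die
i_tape = transmitted_intensity(1.0, mu=0.2, thickness_cm=0.05)  # PS carrier tape
```

Because the chip attenuates more than the surrounding polystyrene, each occupied pocket shows up as a dark spot that a detector can localize.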
Figure 1. Semiconductor Reel
Table 1. X-ray Specifications
Parameter               Specification
X-ray source            55 kV, 110 W, 400 µm focal spot size
Image detection system  17 inch FPXD / 140 µm pixel size
Inspection area         Max. 380 mm reel / min. 180 mm reel, 0201 chip
Utility                 Power: 220 VAC
Dimensions, weight      (W) 900 mm (D) 1,579 mm (H) 1,828 mm, 660 kg
ETC                     Barcode scanner, label printer
Table 2. X-ray System Components

Component        Function
X-ray tube       Generates X-rays
Table            Moves the test object within the radiation device
Detector         Receives transmitted X-rays and converts them into visible light
CCD camera       Converts the visible light from the detector into digital data
Controller       Image processing and overall system control
Shield cabinet   Radiation shielding
2.3. Object Detection

In R-CNN, region proposal and classification are separated, and the CNN must be run on thousands of candidate regions. To compensate for this, Fast R-CNN [18] runs one CNN per input image. The feature map generated by the trained CNN [19] is pooled to obtain a fixed-size feature. In addition, the training step is simplified by training with the sum of the classifier loss and the box regression loss. As the classifier, Softmax was used instead of the existing SVM, and Fast R-CNN showed that performance was better with Softmax. Through these improvements, Fast R-CNN achieved a higher mAP than R-CNN and significantly reduced training time.
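The joint objective above can be sketched as the sum of a softmax classification loss and a smooth-L1 box regression loss. This is a simplified stand-in for the exact Fast R-CNN formulation; the λ weight and the toy inputs are illustrative:

```python
import math

def softmax_cross_entropy(logits, true_class):
    # Softmax classification loss over raw class scores (numerically stable).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[true_class]

def smooth_l1(pred, target):
    # Smooth L1 box-regression loss, summed over the (x, y, w, h) offsets:
    # quadratic for small errors, linear for large ones.
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def multitask_loss(logits, true_class, box_pred, box_target, lam=1.0):
    # Fast R-CNN trains both heads jointly by summing the two losses.
    return softmax_cross_entropy(logits, true_class) + lam * smooth_l1(box_pred, box_target)
```

Summing the two terms is what lets a single backward pass update both the classifier and the box regressor, which is the simplification over R-CNN's separate training stages.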
In Fast R-CNN, the algorithm for generating candidate regions still runs outside the CNN. This structure is inefficient in terms of speed, and the proposal algorithm cannot be trained. Faster R-CNN [20] does not use the selective search algorithm to generate candidate regions; instead, a region proposal network (RPN) [21], a separate CNN, creates candidate regions from the last layer of the CNN that extracts feature maps. The RPN receives the feature map output by the CNN, estimates the locations of objects, and outputs candidate regions. The feature map extracted from the CNN is cropped to the candidate regions estimated by the RPN to find the objects. In this way, the feature-extracting CNN and the candidate region generation were configured as a single network, and training time was reduced by about 10 times compared to that of Fast R-CNN under the same conditions.
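The RPN's assignment of anchors to objects relies on intersection-over-union (IoU) against ground-truth boxes; a minimal sketch (the 0.7/0.3 thresholds mentioned in the docstring follow the Faster R-CNN paper's convention):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    An RPN labels an anchor positive when its IoU with a ground-truth box
    is high (e.g. > 0.7) and negative when it is low (e.g. < 0.3)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

The same measure underlies the mAP numbers reported later: a prediction counts as correct only when its IoU with a labelled box exceeds the chosen threshold.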
R-FCN [22] finds the location of an object accurately and efficiently by using score maps that include location information. Classification results are obtained for each specific location, and the results are combined to determine the accurate location. If a specific location contains the object in question, the response of the score map increases; if it does not, the response decreases. In R-FCN, no training is required for the score maps' location information. SSD [23] recognizes objects using feature maps of various sizes without separately training an RPN to generate candidate regions. The feature map obtained from the CNN model decreases in size as the convolutional layers progress, and SSD uses all the feature maps extracted in this process during inference to recognize objects.
Small objects can be detected using the large feature maps extracted at shallow depths, and large objects can be detected using the small feature maps extracted at deep depths. SSD improved training speed over Faster R-CNN by eliminating the RPN.
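The shrinking feature-map pyramid that SSD exploits can be sketched in a few lines; the stride-2 stages and the 300×300 input size are illustrative assumptions (SSD's actual stage strides vary):

```python
def feature_map_sizes(input_size, num_stages, stride=2):
    """Spatial size of the feature map after each stage of stride-2
    downsampling. Shallow (large) maps serve small objects, deep (small)
    maps serve large ones -- the multi-scale idea SSD builds on."""
    sizes, s = [], input_size
    for _ in range(num_stages):
        s = s // stride
        sizes.append(s)
    return sizes

# A 300x300 SSD-style input shrinks roughly as:
print(feature_map_sizes(300, 5))  # [150, 75, 37, 18, 9]
```

This is also why small-part detection in the reel images benefits from the shallow, high-resolution maps: a 0201-sized chip occupies only a few cells of the deepest map.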
AlexNet [24] is a CNN structure with eight layers, larger and deeper than previous CNNs. AlexNet consists of five convolutional layers and three fully connected layers split across two parallel streams, and two graphics processing units (GPUs) were used for fast computation. As the activation function, rectified linear units (ReLU) were used instead of the previously common hyperbolic tangent or logistic function, improving training speed by about six times. To prevent overfitting, a data augmentation method was used that randomly cropped the input image or adjusted pixel brightness, and dropout [45] was applied to the fully connected layers.
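The two ingredients named above, ReLU activation and dropout, can be sketched in plain Python. The scaling convention below is the common "inverted dropout" variant, an assumption for illustration rather than AlexNet's exact recipe:

```python
import random

def relu(x):
    # ReLU replaces tanh/sigmoid; its gradient is 1 for every x > 0,
    # which is what sped up AlexNet's training roughly six-fold.
    return x if x > 0.0 else 0.0

def dropout(values, p=0.5, training=True, rng=random.Random(0)):
    # Training: each unit is zeroed with probability p and survivors are
    # scaled by 1/(1-p). Inference: the layer is simply an identity.
    if not training:
        return list(values)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]
```

Zeroing random units forces the fully connected layers not to co-adapt, which is the overfitting defense the paragraph refers to.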
Since AlexNet, CNN models with deep structures that improve image classification accuracy have appeared, including ZFNet. To visualize one layer inside a CNN, the three processes a layer comprises (convolution, activation, and pooling) are reversed for each layer and mapped back to the size of the input image. Among the five convolutional layers of ZFNet, simple features such as lines and shapes are extracted in the shallow layers, and features close to the shape of the object are extracted in the deep layers. Using the information obtained by visualizing the convolutional layers, the initial layers of AlexNet were modified to construct ZFNet [25], which showed higher classification accuracy than AlexNet.
VGG [8] was proposed through a study of how performance changes with CNN layer depth. In the model structure, five pooling stages were used, all conditions except layer depth were equalized, and every filter was set to size 3. A total of five models were used, ranging in depth from 11 to 19 layers. The VGG work emphasized that repeatedly applying filters of size 3 can act as a larger filter, and many later models use a filter size of 3.
Models for image object recognition such as ResNet [26] have improved performance by implementing deeper network structures. However, as the depth increases, accuracy can decrease. To solve this problem, a method called residual learning was applied: a layer is trained to react sensitively to small changes by learning the difference between its input and output rather than the output itself. Learning this difference is implemented only by addition, so no additional parameters are required and computational efficiency is maintained. ResNet applies the concept of residual learning to a VGG-style network composed of 34 layers. The plain 34-layer network without residual learning was less accurate than its 18-layer counterpart, whereas with residual learning, accuracy increased as layers were added.
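The residual idea reduces to adding the input back onto the learned transform. A minimal vector sketch, where the lambda stands in for a learned convolutional transform F(x):

```python
def residual_block(x, transform):
    """Residual learning: the layer learns only the difference F(x)
    between input and desired output, and the block returns F(x) + x.
    The identity path is plain addition, so it adds no parameters."""
    return [f + xi for f, xi in zip(transform(x), x)]

# If the transform learns to output zeros, the block is an identity --
# which is why stacking many residual blocks does not degrade accuracy
# the way stacking plain layers does.
identity = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
```

The ease of learning the zero function is the intuition behind "react sensitively to small changes": the block only has to model the small correction, not the whole mapping.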
2.4. YOLOv5
YOLOv5 has the structure shown in Figure 4. YOLO object detection is a method of marking the location of an object for verification within an image and automatically classifying its type. Object detection algorithms can be classified into two-stage detection algorithms, which perform region proposal and object classification separately, and one-stage detection algorithms, which perform both at the same time. Describing the research methods in order of increasing computation speed, the field started with two-stage object detection, whose most representative algorithms are R-CNN, Fast R-CNN, and Faster R-CNN. The two-stage detection algorithm proposes a region of interest where an object to be detected can be located, extracts object features, and learns to mark and classify the object's bounding box. One-stage detection corresponds to "you only look once" (YOLO) and SSD, which perform bounding box proposal and classification simultaneously to save time and reduce computation and inference costs. In the case of a two-stage detector, calculation and inference are performed in two stages at the start of object detection, so the calculation speed needs to be improved. Although speed has improved with the development of one-stage object detection, the relatively low performance is a disadvantage. A modified feature extraction backbone that extends the receptive field was used to detect small objects in the YOLOv2 algorithm, an early modified model of YOLO; connecting convolutional layers that can include more domain information with regular convolutional layers showed improved performance over previous models. One-stage research based on the YOLO model achieves object detection performance equivalent to two-stage detection, but accurate detection of some objects, especially small objects, is still lacking.
We considered the characteristics of real-time detection of target objects through X-ray scans and inferred that
slow two-stage detectors are unfit for search and detection of semiconductor reel images. Model development and
research on YOLO, the fastest network among one-stage detectors, were then performed. We modified YOLOv5, the latest YOLO model, and developed the UCT model, a new model with enhanced performance. YOLO has relatively poor performance in detecting small objects because it prioritizes detection speed. Research is continuously conducted in the
field of aviation to detect small objects that are difficult to identify with the naked eye, and to detect cars and people in ground videos taken by drones. Generative adversarial networks have been used for data augmentation to generate low-resolution video, and object detection models have been trained on low-resolution satellite-captured video to improve the detection of small objects. In spite of these efforts, however, previous studies on small object detection have struggled to maintain performance and still require improvement on data containing objects of various sizes.
In particular, previous studies on X-ray object detection have not yielded uniform performance because of the difficulty of analyzing large data sets and ensuring proper data labeling. Therefore, in this study, we group X-ray images by object size, secure product feature information through labeling, and add one CSP and SPP layer to obtain more specific image characteristics than existing models. By connecting these to the head, we increase the number of anchor box scales by one over the existing three to raise the recognition rate for small objects, and we study how to recognize objects in X-ray images of small parts more accurately.
The YOLO algorithm divides the image involved in detection into S × S grid cells, each of which has its own detection task. The entire network consists of 24 convolutional layers and 2 fully connected layers. A tensor of S × S × (B × 5 + C) is output after the fully connected layers, where B represents the number of predicted boxes in each grid cell and C represents the number of categories. The final detection result is obtained by regressing the detection box position and determining the category probability from the tensor data. The YOLO algorithm can detect targets quickly, but it struggles with small targets: without finer grid division, multiple targets tend to fall in the same grid cell. To compensate for these shortcomings, CSP, which improves detection speed, was adopted in YOLOv5 (Figure 5). The YOLOv5 algorithm sends each batch of training data through the data loader and enhances the training data; the data loader can perform three types of data enhancement: scaling, color space adjustment, and mosaic enhancement. In addition, the anchor mechanism of Faster R-CNN is utilized, together with a multi-scale mechanism in the image detection process, to enhance YOLOv5's ability to detect small targets. This also gives the YOLOv5 algorithm high adaptability to images of different sizes, so the system can quickly detect and recognize small targets in remote sensing images with high accuracy.
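The S × S × (B × 5 + C) output described above can be checked with a one-liner; the S = 7, B = 2, C = 20 setting is the classic YOLOv1 configuration, shown here only as an illustration:

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of the original YOLO output tensor: each of the S x S grid
    cells predicts B boxes (x, y, w, h, confidence) plus C class scores."""
    s, b, c = grid_size, boxes_per_cell, num_classes
    return (s, s, b * 5 + c)

# Classic YOLOv1 setting: S=7, B=2, C=20 -> (7, 7, 30)
print(yolo_output_shape(7, 2, 20))
```

The "5" is the per-box payload (four box coordinates plus a confidence score), which is why cells that contain several small parts at once overwhelm a coarse grid.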
Following the theory of the YOLO algorithm, SSD increases the number of feature maps of different sizes by removing the network's fully connected layers, and performs multi-scale target detection on the augmented features in remote sensing images. Meanwhile, the anchor mechanism supplements feature detection for the target image. The SSD algorithm is faster than Fast R-CNN in image feature detection and has higher accuracy than the YOLO algorithm. However, SSD has certain limitations: when using multi-layer feature maps to detect small objects, correlations between layers are often neglected. Therefore, DSSD (deconvolution SSD) was proposed by optimizing SSD. Through the deconvolution operation, the DSSD algorithm concatenates multi-layer feature maps using a cross-layer concatenation method to make them more expressive, ultimately achieving better detection accuracy on small and medium targets in remote sensing images.
The YOLO algorithm [27] is a deep learning algorithm for object detection. In particular, the YOLO algorithm is suitable for real-time applications because of its high frame rate. Among its variants, the most recently proposed YOLOv5 architecture (Figure 7) has excellent performance. The model backbone, which extracts important features from the input image, combines BottleNeck and CSPNet [28] and improves learning ability by using the C3 module with SiLU [29] as the activation function.
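SiLU itself is a one-line function, x · σ(x):

```python
import math

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x). Smooth everywhere and slightly
    # negative for small negative inputs, unlike ReLU's hard zero.
    return x * (1.0 / (1.0 + math.exp(-x)))
```

Its smoothness near zero is commonly cited as the reason it trains slightly better than ReLU in deep backbones such as YOLOv5's C3 blocks.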
The model neck, which fuses the features formed in the backbone, uses PA-Net [30], a feature pyramid scheme. The model head, which performs object detection, follows the structure of the existing YOLOv3 [31] model. Average precision (AP) and FPS results for each model on the COCO dataset are shown in Table 3.
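Table 3's accuracy/size trade-off is the usual basis for choosing a variant. As a small sketch using the table's own figures (the parameter-budget selection rule is our illustration, not part of the paper):

```python
# Figures copied from Table 3 (mAP@0.5 and parameters in millions).
YOLOV5_VARIANTS = {
    "yolov5s": {"map50": 55.4, "params_m": 7.3},
    "yolov5m": {"map50": 63.3, "params_m": 21.4},
    "yolov5l": {"map50": 66.9, "params_m": 47.0},
    "yolov5x": {"map50": 66.8, "params_m": 87.7},
}

def pick_variant(max_params_m):
    """Choose the most accurate variant that fits a parameter budget --
    the usual accuracy/speed trade-off when deploying in-line."""
    fitting = {k: v for k, v in YOLOV5_VARIANTS.items()
               if v["params_m"] <= max_params_m}
    return max(fitting, key=lambda k: fitting[k]["map50"])
```

Note that by these numbers yolov5l edges out yolov5x on mAP@0.5 despite having roughly half the parameters, so bigger is not automatically better at this IoU threshold.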
Object detection is the process of marking the position of an object for verification within an image and automatically classifying its type [32]. Object detection algorithms can be classified into two types: two-stage detector algorithms, which perform region proposal and object classification separately, and one-stage detector algorithms, which perform these two processes simultaneously [33]. The former can be converted to the latter with greatly improved
Figure 4. YOLOv5 MODEL Architecture
Backbone : Extracts features from multiple resolutions
Neck : Performs multi-resolution feature aggregation
Head : Generates final predictions based on object resolution
Table 3. The performance comparison of different models of YOLOv5
Model pixels mAP(0.5) mAP(0.5:0.95) Params(M) FLOPs(G)
YOLOv5s 640x640 55.4 36.7 7.3 17.0
YOLOv5m 640x640 63.3 44.5 21.4 51.3
YOLOv5l 640x640 66.9 48.2 47.0 115.4
YOLOv5x 640x640 66.8 50.4 87.7 218.8
speed and lower computational cost. The most representative algorithms of the two-stage detector type, listed in order of increasing speed, are region-based convolutional neural networks (R-CNNs), Fast R-CNNs, and Faster R-CNNs. Specifically, a two-stage detector proposes a region of interest where an object to be detected may be located, extracts object features, and learns to mark and classify the object's bounding box. One-stage detectors can be classified into YOLO and SSD algorithms; they perform bounding box proposal and classification simultaneously, saving time and reducing computational and inference costs. In the case of a two-stage detector, the calculation speed must be increased because the calculation and inference are performed in two stages at the start of object detection. Although speed has improved with the development of one-stage object detection, the relatively low performance remains a disadvantage. Therefore, to improve the performance, researchers extended the receptive field to detect small objects in the YOLOv2 algorithm, an early modified model of YOLO. Improved performance over the previous model was achieved by applying a modified feature extraction model (backbone) that connects a convolutional layer that can contain more domain information with a regular layer. According to a case study of a one-stage detector based on the YOLO model, object detection performance was equivalent to that of a two-stage detector [34]. However, the one-stage detector's ability to accurately detect some objects, especially small ones, is still limited [35]. Model development and research for
YOLO, the fastest network among one-stage detectors, are being carried out; the latest YOLO model, YOLOv5, has been modified and improved into a new model with improved performance, the UCT model.
YOLO has various versions, and each higher version is known to overcome the disadvantages of the previous one. In this paper, we propose an improved model based on YOLOv5 [36]. A disadvantage of CNN-based object detection algorithms is that they do not perform well in detecting small objects; this shortcoming is corrected in the network to improve small object detection performance.
YOLO uses an object detection bounding box called an anchor box [37] to detect the size of an object. In order
to respond to various parts such as size change, the number of anchor boxes is adjusted and the size of the anchor
tn
boxes is also modified according to the collected data to propose a more improved model. First, various colors of
parts and slight rotation changes Collect and process images to have robust classification characteristics for Second,
in order to improve the classification performance, the boundary area of the object is determined in consideration of
the distinguishing characteristics between parts. Third, by selecting YOLOv5, which is a model that simultaneously
detects and classifies objects and is dependent on size change, an anchor box suitable for the shape of the part is
rin
created and the network is modified and improved to improve the classification performance of the system so that
even small parts can be detected.
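For illustration, anchor matching of the kind described here can be sketched by picking the anchor whose shape gives the highest IoU against a ground-truth box; the function names and the example anchor sizes below are our own, not values from the paper:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def best_anchor(gt_wh, anchor_whs):
    """Index of the anchor whose shape best overlaps a ground-truth box
    of size gt_wh, comparing both shapes anchored at the origin."""
    gt = (0, 0, gt_wh[0], gt_wh[1])
    scores = [iou(gt, (0, 0, w, h)) for w, h in anchor_whs]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical small-object anchors, similar in spirit to those in the text.
anchors = [(2, 2), (2, 4), (5, 4)]
print(best_anchor((4, 4), anchors))  # index 2: the 5x4 anchor fits a 4x4 box best
```

Adjusting the anchor set to the collected part sizes amounts to choosing anchor shapes so that this best-match IoU stays high for every part in the dataset.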
This approach is applied to the semiconductor reel image. The existing X-ray machine counted the number of parts with a rule-based method and displayed the result, but when the count disagreed with the actual product, or when a new semiconductor chip was stored in the in-house warehouse, the extracted image, checked against the actual quantity, yielded a smaller number than was actually present. To improve this, the original X-ray image (3072×3072) is used with an added network.
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4339978

Recently, various artificial intelligence technologies for object detection are being developed in computer vision and medical imaging. Representatively, a deep learning-based CNN algorithm [38] detects a suspected semiconductor part in an image and displays a bounding box centered on the part, which can provide information for diagnosis. However, existing deep learning-based CNN algorithms are limited for real-time detection by their slow processing time and suffer from decreased detection accuracy. On the other hand, the recently developed YOLOv5s (You Only Look Once version 5 small) model improved the speed and accuracy of object detection by using a bottleneck CSP layer and a skip-connection function.
In this study, we use the YOLOv5s model, which can overcome the disadvantages of existing deep learning-based CNN algorithms, to perform semiconductor reel detection on X-ray images and evaluate its performance. The hyperparameters required for training a deep learning model [39] include the learning rate [40], the optimizer function [41], the activation function [42], the loss function [43], and the number of training epochs [44], and they determine the performance of the trained model. The performance of the learning model can therefore be optimized by changing the hyperparameters, and appropriate hyperparameters must be applied according to the characteristics of the object or part to be detected. In this study, training was conducted while varying the hyperparameters of the YOLOv5s model: the activation function, optimization function, loss function, and number of epochs were changed, and precision and mAP were measured to evaluate the trained model.
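Such a hyperparameter search can be sketched as a simple grid over the options; the concrete values below are illustrative placeholders, not the settings used in the paper:

```python
from itertools import product

# Hypothetical search space; each key is one of the hyperparameters
# named in the text, each list is an illustrative set of candidates.
space = {
    "lr": [0.01, 0.001],
    "optimizer": ["SGD", "Adam"],
    "activation": ["LeakyReLU", "SiLU"],
    "epochs": [300, 3000],
}

def grid(space):
    """Enumerate every hyperparameter combination as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid(space))
print(len(configs))  # 2 * 2 * 2 * 2 = 16 combinations
```

Each config would then be passed to one training run, and precision/mAP of the resulting model compared across runs.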
The structure of the YOLOv5s model consists of backbone, neck, and head modules, as shown in Figure 4. The backbone module extracts the features of the input image and includes four convolution layers with a 3 × 3 kernel size, three bottleneck cross stage partial (CSP) layers [45], and one spatial pyramid pooling (SPP) layer [46]. The bottleneck CSP layer returns part of the input value as an output value through a residual connection when the convolution function is applied, and the SPP layer extracts diverse features of the input using kernels of different sizes. The neck module restores the input received from the backbone module to its original size and extracts features of the input that the backbone module could not. To obtain improved results compared to the basic model, the UCT model is proposed to increase the detection of small objects. An advantage of YOLOv5 is its use of CSPDenseNet [47]: since the accuracy of conventional CNNs drops sharply after lightweighting, the CNN's learning ability is strengthened to keep sufficient accuracy while remaining lightweight. This reduces DenseNet's unnecessary computation, halves the computational bottleneck, and effectively reduces memory cost, increasing computation speed.
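The SPP idea — pooling the same feature map with kernels of several sizes and concatenating the results — can be sketched on a single-channel toy grid. The kernel sizes 5, 9, and 13 match YOLOv5's usual SPP configuration; the plain-Python form and function names are ours (the real layer operates on tensors):

```python
def maxpool_same(grid, k):
    """Stride-1 max pooling with 'same' output size: each cell becomes
    the max over a k x k window clamped to the grid borders."""
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [grid[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            out[i][j] = max(window)
    return out

def spp(grid, kernels=(5, 9, 13)):
    """Spatial pyramid pooling: concatenate the input with max-pooled
    copies at several kernel sizes, as extra 'channels'."""
    return [grid] + [maxpool_same(grid, k) for k in kernels]

feature = [[(i * 7 + j * 3) % 10 for j in range(8)] for i in range(8)]
channels = spp(feature)
print(len(channels))  # 1 input map + 3 pooled maps
```

Because all pooled maps keep the input's spatial size, they can be concatenated channel-wise, which is what lets SPP mix receptive fields of different sizes without changing the feature-map resolution.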
The neck module used in this study has six bottleneck CSP layers and six convolution layers with 1 × 1 and 3 × 3 kernel sizes, as shown in Figure 8; it contains three up-sampling layers that double the size of the input values and six concatenate layers for residual concatenation. In the output module, the bounding box for the X-ray semiconductor image is displayed using the data obtained from the last four bottleneck CSP layers of the neck module. X-ray reel image detection differs between large and small objects depending on the size and shape of the semiconductor package. While the detection performance for a semiconductor chip with a large image is adequate to a certain extent, the detection rate for a small image, such as a small-signal chip resistor or capacitor, has relatively low accuracy. The proposed UCT model increases the probability of accurate object detection by adding layers.
The backbone network of the UCT model finds spatial information and object locations, and semantic information is found in the neck. Building on YOLOv5, a CBL (network No. 9) and a CSP (network No. 10) are added in the backbone, and a CBL (No. 14), an Upsample (No. 15), and a Concat (No. 16) are added in the neck, along with a CSP (No. 32), a Concat (No. 33), and a CSP (No. 34). The reason for adding one anchor to the existing three is that a feature map is created through convolution, and the added anchor includes as much semantic information as possible from the information obtained by the last upsampling. The anchor information comes from the added backbone layers. The neck at the same scale is connected to help the network find more characteristics and spatial information of small objects, and one anchor is added to the existing three, for a total of four anchors, to obtain meaningful information.
In the head, the size of the original bounding boxes is halved. The anchor sizes are set to (2×2, 2×4, 5×4) for network No. 25, (3×4, 4×8, 9×6) for No. 28, (5×7, 8×15, 17×12) for No. 31, and (10×13, 16×30, 33×23) for No. 34, and the corresponding outputs are 80×80×256 (No. 25), 40×40×512 (No. 28), 20×20×1024 (No. 31), and 10×10×2048 (No. 34).
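The four output resolutions quoted above are consistent with a 640×640 network input downsampled by strides of 8, 16, 32, and 64. A quick check of that arithmetic — the stride-to-channel relation 32·s is our own inference from the quoted numbers, not something the paper states:

```python
def head_grids(img_size, strides=(8, 16, 32, 64)):
    """Grid size and channel count of each detection head for a square
    input image, assuming channels grow as 32 * stride."""
    return [(img_size // s, img_size // s, 32 * s) for s in strides]

# Reproduces the four (height, width, channels) outputs listed in the text.
print(head_grids(640))
```

The same relation shows why a 10×10 grid appears only once a fourth scale is present: three strides (8, 16, 32) stop at 20×20 for a 640-pixel input.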
As a result, to the modules composing the existing network (CBL, SPP, SPPF), a CSP is added in the backbone, an Upsample and a Concat in the neck, and an anchor in the head. By adding and operating this one additional network, the accuracy of small-device detection is increased and mAP is further improved by 0.3. Regression learning in the head is further extended by using smaller anchor boxes for small objects such as small-package semiconductors, resistors, and capacitors, improving small object detection performance. The modified structure of YOLOv5 is shown in Figure 9.

Figure 7. Performance of YOLOv5
In the backbone, a higher depth multiple value produces a deeper model because more BottleneckCSP modules (layers) are repeated. The width multiple is a layer channel multiplier: as it increases, the number of Conv filters in each layer increases. Depth and width are set to 0.3 and 0.5, respectively, by default. During training, the epoch limit is set to 3000 by default, and training stops automatically when the mAP does not change for more than 100 epochs. The weights [48] (best.pt) are obtained by training X-ray images on YOLOv5s. Starting from the COCO128 training data [49], the weights were updated with the class set to CHIP. The image size was set to 640, the batch size to 16, and Leaky ReLU was used as the activation function to add stability by partially reflecting negative values. In the detect.py module, the internal default limit of 1000 detections per image was raised to 3000 so that all objects in the image can be recognized.
3.2.1. Labeling
As shown in Figure 11, adjacent pixel values are grouped and numbered to identify each object in the binarized image. The black areas in an image can be selected and labeled individually. When each region was carefully labeled according to the part, labeling gave good results for the entire part. To apply the X-ray counter, each image was checked and corrected to ensure accurate labeling. When an image is input, a meaningful value can be obtained only if the labeling is accurate, since the image area must be recognized and the image processed.
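The grouping-and-numbering step described above corresponds to connected-component labeling of the binary image. A minimal flood-fill sketch (4-connectivity assumed; the function names are ours, not the authors' actual tooling):

```python
from collections import deque

def label_components(binary):
    """Number each 4-connected group of foreground (1) pixels in a
    binarized image; returns the label map and the object count."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    current = 0
    for i in range(h):
        for j in range(w):
            if binary[i][j] == 1 and labels[i][j] == 0:
                current += 1  # start a new object
                queue = deque([(i, j)])
                labels[i][j] = current
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return labels, current

img = [[1, 1, 0, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 1]]
_, count = label_components(img)
print(count)  # three separate objects
```

Each numbered component can then be checked and, if needed, corrected by hand, which is the verification step the text describes for the X-ray counter.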
Runtime: 24 hours.
The sizes of all semiconductor components were classified into classes CHIP 1 to CHIP 10, as shown in Table 4, and then divided again into three categories, systematized for detection based on chip size. Capacitors and resistors, the smallest images, form the small class; medium-sized transistors and diodes form the middle class; and ICs and harnesses form the large class.
The X-ray images are captured at 3072×3072, as in Figure 10, and datasets for each size are used. The type of data and the characteristics of each dataset are analyzed to build a part library. The parts used in this paper include capacitors, diodes, resistors, and transistors, and each type is subdivided according to part size. Figure 10 provides X-ray images of each type of part; the shape and size of a part vary depending on its type. Features such as shape and size are characteristics that a deep learning algorithm can use to distinguish the type of each part, so the number of leads of each type and the width × length × height shape appear in one dimension, while part color appears only in black and white. The ROI area is determined by analysis based on the characteristics of the image. Because the X-ray image is black and white, the markings and color of a part do not appear, and a part is recognized by characterizing its shape and size.

Figure 11. By class Chip X-ray Images (Class 1 to Class 10 from the left)

Table 5. Performance of UCT Model.

Model Precision Recall mAP(0.5) mAP(0.5:0.95)
YOLOv5s 0.552 0.323 0.349 0.112
YOLOv5m 0.685 0.342 0.367 0.125
YOLOv5l 0.669 0.489 0.387 0.164
YOLOv5x 0.673 0.504 0.395 0.167
UCT 0.865 0.475 0.622 0.309
The confusion matrix has the following four categories. True Positive (TP) means that a positive case is classified as positive. True Negative (TN) means that a negative case is classified as negative. False Positive (FP) means that a negative case is classified as positive. Finally, False Negative (FN) means that a positive case is classified as negative.
Precision is often used together with recall. It indicates how accurate the predicted results are: since it is the ratio of correct answers among the detected items, it shows how reliable the detection results are. Recall refers to how many correct answers were found out of the ground truth (GT), i.e., the ratio of properly detected objects among the objects that should be detected. The average precision is calculated by increasing recall in units of 0.1 from 0 to 1 (11 values in total); as recall increases, precision inevitably decreases. The precision value is calculated at each recall level and averaged, so the average of the precision values at the 11 recall levels is called AP. The AP value can be calculated for each class, and averaging the AP over all classes gives the mAP.
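The quantities above can be written out in a short sketch. The 11-point AP here uses the common interpolated form, taking the best precision at or beyond each recall level; the function names are ours:

```python
def precision_recall(tp, fp, fn):
    """Precision: correct detections among all detections.
    Recall: correct detections among all ground-truth objects."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def ap_11_point(pr_pairs):
    """11-point interpolated AP: for each recall level 0.0, 0.1, ..., 1.0
    take the maximum precision at any recall >= that level, then average."""
    total = 0.0
    for k in range(11):
        level = k / 10.0
        candidates = [p for p, r in pr_pairs if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11.0

print(precision_recall(tp=8, fp=2, fn=8))  # (0.8, 0.5)
print(ap_11_point([(1.0, 0.2), (0.8, 0.5), (0.5, 0.9)]))
```

mAP is then the mean of this per-class AP over all classes, and mAP(0.5:0.95) additionally averages over IoU thresholds from 0.5 to 0.95.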
3.4. Train
As a result of examining the performance of the UCT model, as shown in Table 5, the average overall performance improved to 0.622 for mAP [26], 0.309 for mAP(0.5:0.95), and 0.865 for precision. This model improves on the previous model, YOLOv5s, by more than 0.3 in mAP. For small objects, performance improved as the size of the input image increased; that is, the resolution of the image remains a valid signal when the input is resized to 640 [50]. A tiling method was used in preprocessing. Tiling divides an image into tiles according to real-time constraints. The accuracy improvement obtained by the tiling approach is four times greater than that of conventional methods. Experiments show that emulating inference tiling in the training phase is also beneficial: providing training data with a similar image resolution distribution leads to a better representation of the network-learning image space. Using tiles as an additional data augmentation method in the training phase also significantly improves small object detection performance, by 20%. Training networks on high-resolution images with larger feature maps increases the computational and memory requirements. The proposed tiling approach increases computation time linearly while keeping memory requirements fixed thanks to sequential tile processing. Efficiency can therefore be improved by supplying tiles in batches to keep computation and memory within bounds. Tiling can also be used as a parameter in network design depending on the target platform [51].
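The tiling step can be sketched as follows; the 640-pixel tile size matches the network input discussed earlier, but the 64-pixel overlap and the function names are illustrative assumptions, not values from the paper:

```python
def tile_origins(size, tile, overlap):
    """Top-left offsets covering a `size`-pixel axis with `tile`-pixel
    tiles stepping by tile - overlap, snapping the last tile to the edge."""
    step = tile - overlap
    origins = list(range(0, size - tile + 1, step))
    if origins[-1] != size - tile:
        origins.append(size - tile)  # final tile flush with the border
    return origins

def tiles(width, height, tile=640, overlap=64):
    """All (x1, y1, x2, y2) tile windows for an image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]

# A 3072x3072 X-ray image cut into 640-pixel tiles with overlap so that
# objects lying on a tile border appear whole in at least one tile.
windows = tiles(3072, 3072, 640, 64)
print(len(windows))  # 6 x 6 = 36 tiles
```

Processing the tiles sequentially is what keeps memory fixed while computation grows linearly with the tile count, as noted above.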
Among the objects, there are about 1,200 instances smaller than three pixels. To increase the detection of small images, anchor boxes were added in the head: a (2,2 2,4 5,4) anchor box for the values extracted through network No. 32 (CBL), No. 33 (Concat) [52], and No. 34 (CSP); a (3,4 4,8 9,6) anchor box for the values from No. 29 (CBL), No. 30 (Concat), and No. 31 (CSP); a (5,7 8,15 17,12) anchor box for the values from No. 26 (CBL), No. 27 (Concat), and No. 28 (CSP); and a (10,13 16,30 33,23) anchor box for the value from No. 25 (CSP). Medium and large semiconductor image objects are maintained without reduction or, in some cases, increase in size. The addition
Figure 12. Confusion matrix
Table 6. Performance Comparison of YOLOv5s vs UCT Model by group.
Model Precision Recall mAP(0.5) mAP(0.5:0.95)
YOLOv5s 0.552 0.323 0.349 0.112
CHIP1,3,4 0.541 0.315 0.342 0.121
CHIP2,5,6,7 0.569 0.358 0.348 0.134
CHIP 9,10 0.572 0.368 0.375 0.147
UCT 0.865 0.475 0.622 0.309
CHIP1,3,4 0.847 0.504 0.415 0.167
CHIP2,5,6,7 0.869 0.513 0.487 0.174
CHIP9,10 0.876 0.525 0.637 0.315
of the network No. 34 detection layer dedicated to small object detection greatly improved the overall performance. An advantage of YOLOv5 is that the amount of computation is reduced through convolution and the execution speed is greatly improved. In the backbone, the BottleneckCSP structure basically creates four convolution layers: convolution is performed in conv1 and conv4, and convolution plus batch normalization in conv2 and conv3 [53]. The deeper and wider a neural network, the stronger its effect; however, expanding the architecture increases the amount of computation, and with large amounts of data it becomes difficult to train tasks such as object detection. The purpose of CSPNet is to reduce the amount of computation while creating richer gradient combinations, which is realized by splitting the base layer into two parts and combining them in the last cross-stage layer. Using CSPNet to enhance learning ability in this way achieves weight reduction while maintaining accuracy; applied to ResNet, ResNeXt, and DenseNet, it reduced operations by 10 to 20 percent. By removing the computation bottleneck, the computational load of each layer can be distributed equally, improving the computation utilization of the layers. Using CSPNet, memory cost can be effectively reduced, and cross-channel pooling compresses the feature pyramid work.

In this study, semiconductor chip detection was performed using the semiconductor reel X-ray image and the UCT model, an improved YOLOv5s model in which one new head was added to the CSP and CBL structures. The trained UCT model displays a bounding box for each predicted part recognized as a semiconductor chip in the X-ray image, and precision, recall, mAP(0.5), and mAP(0.5:0.95) were recorded so that performance could be compared. The maximum mAP(0.5) and precision of the UCT model trained in this study were 0.622 and 0.865, respectively, as shown in Table 5, which are excellent results compared to the other models. In this process, a CBL and an SPP were added in the backbone, and 1024 channels were concatenated in network No. 16 to increase the context of the data; small images were then detected through concatenation in network No. 33, and the performance was 0.273 mAP higher than that of the basic YOLOv5s, demonstrating the model's effectiveness.
In the case of the UCT model, an improvement of the YOLOv5s model used in this study, the accuracy of chip detection increased as the image size increased from 512 to 640 to 3072 (Figure 18). Excellent chip counting results, such as those shown in Figures 16, 17, and 18, were obtained. On the other hand, once the number of training epochs reaches the optimum, training stops if there is no change for 100 epochs, due to the nature of YOLO. Therefore, the size of the input image is important when training the model for accurate detection and should be treated as the largest factor before proceeding with training. Through this study, it was confirmed that accuracy and training efficiency can be improved by applying optimal hyperparameters when training the UCT model. If the effects of training variables such as the amount of training data, the learning rate, and the number and structure of layers in each module are also analyzed, in addition to the hyperparameters used in this study, there is potential to further improve the model's performance in counting parts by finding features in the image of the semiconductor reel. In addition, the performance of the model improved from YOLOv5s could be demonstrated through comparison of object detection time with existing deep learning-based CNN algorithms.
Figure 16. CHIP CLASS 1, 3, 4 chip counting detection results
4. Conclusion
In this paper, characteristic classes of parts are defined for classifying reel parts. The YOLOv5 model, which has relatively higher accuracy than other YOLO models, is improved so that parts of different sizes can be classified into different part types. By adding one layer each to the backbone, neck, and head of the existing model, the detection processing of small device images was extended by more than 25%, and by adding nine bounding boxes to the number and size of anchor boxes, we proposed a network that can recognize even smaller parts by enabling smaller image data processing per reel. There are a total of ten classes of parts in this paper, but parts of the same size that require height information are excluded, and classes of similar shape and size are grouped into the same class. Parts with few data samples were excluded for the reliability of training, and the classification results of three classes were presented. The experiment was conducted with 3,524 training images and 468 validation images. While the existing number of anchor boxes is 27, in this paper the number is increased to 36 in consideration of the numbers of parts of different sizes and the characteristics of diodes and transistors. The size of the anchor boxes was also changed based on the actual sizes of the parts, and experiments confirmed that modifying the anchor boxes gave high accuracy for large parts and found the ROI more accurately. Although modifying the anchor boxes brings a relatively large number of false detections for small sizes, the network was corrected and supplemented, and the performance was confirmed through experiments. The final YOLO program is a model with modified anchor boxes and network; in the performance evaluation it correctly finds 3,048 of the 3,524 data in total, a very good performance with an accuracy of 86.5%. In addition, only the four part types mainly used and easy to collect — capacitors, diodes, resistors, and transistors — were covered in this paper; as data in more diverse categories are obtained, a classifier that can classify the remaining parts will be created.
There were limitations, considering the lack of time during the research process, the limitations of labeling the training data, and the amount of classification, training time, and system resources to be invested, since semiconductor parts come in images of various shapes. As suitable modeling for object detection creates more demand in the industrial field, research on modeling is still necessary. Models such as YOLOvX are continuously being updated, so we are interested in setting the direction of research and improving the modeling.
A characteristic of applying the improved model is that the best.pt file produced by training is very light compared to other models, making it fast and accurate. In particular, the CSP layer returns part of the input value as an output value through a residual connection when the convolution function is applied, and the SPP layer extracts diverse characteristics of the input using kernels of different sizes to ensure accuracy. The neck module restores the input received from the backbone module to its original size and extracts the features of the input that the backbone module could not.
References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Communications of the ACM
60 (6) (2017) 84–90.
[2] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in:
European conference on computer vision, Springer, 2014, pp. 740–755.
[3] H. Bay, T. Tuytelaars, L. V. Gool, Surf: Speeded up robust features, in: European conference on computer vision, Springer, 2006, pp.
404–417.
[4] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: 2008 IEEE conference on
computer vision and pattern recognition, Ieee, 2008, pp. 1–8.
[5] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local svm approach, in: Proceedings of the 17th International Conference on
Pattern Recognition, 2004. ICPR 2004., Vol. 3, IEEE, 2004, pp. 32–36.
[6] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998)
2278–2324.
[7] S. S. Kaddoun, Y. Aberni, L. Boubchir, M. Raddadi, B. Daachi, Convolutional neural algorithm for palm vein recognition using zfnet
architecture, in: 2021 4th International Conference on Bio-Engineering for Smart Technologies (BioSMART), IEEE, 2021, pp. 1–4.
[8] A. Sengupta, Y. Ye, R. Wang, C. Liu, K. Roy, Going deeper in spiking neural networks: Vgg and residual architectures, Frontiers in neuro-
science 13 (2019) 95.
[9] Z. Wu, C. Shen, A. Van Den Hengel, Wider or deeper: Revisiting the resnet model for visual recognition, Pattern Recognition 90 (2019)
119–133.
[10] P. Ballester, R. M. Araujo, On the performance of googlenet and alexnet applied to sketches, in: Thirtieth AAAI conference on artificial
intelligence, 2016.
[11] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, K. Keutzer, Densenet: Implementing efficient convnet descriptor pyramids,
arXiv preprint arXiv:1404.1869.
[12] L. Khine, J. C. Alimagno, Die sticking quality issue of tape-and-reel packaging for wlcsp (2019) 676–678.
[13] J. K. Kim, Analysis of reel tape packing process conditions using doe, Journal of the Semiconductor Display Technology 19.2 (2020)
pp.105–109.
[14] B. O. C. Troxtell, Semiconductor packing methodology, Journal of the Semiconductor Display Technology SZZA021C (2005) pp.10–19.
[15] T. L. R. S. Qiao, L. Q. Tao, Z. L. Liu, Tape reel single side peel force test verification, 17th International Conference on Electronic Packaging
Technology (ICEPT) SZZA021C (2016) pp.1483–1486.
[16] R. Novelline, The power of tiling for small object detection 5th edition (ISBN 0-674-83339-2.). doi:HarvardUniversityPress.
[17] J. R. Uijlings, K. E. Van De Sande, T. Gevers, A. W. Smeulders, Selective search for object recognition, International journal of computer
vision 104 (2) (2013) 154–171.
[18] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[19] B. Graham, Fractional max-pooling, arXiv preprint arXiv:1412.6071.
[20] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[21] Q. Fan, W. Zhuo, C.-K. Tang, Y.-W. Tai, Few-shot object detection with attention-rpn and multi-relation detector, in: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4013–4022.
[22] S. Park, S. Han, Robust-tracking control for robot manipulator with deadzone and friction using backstepping and rfnn controller, IET control
theory & applications 5 (12) (2011) 1397–1417.
[23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector (2016) 21–37.
[24] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, B. C. Van Esesn, A. A. S. Awwal, V. K. Asari, The history began
from alexnet: A comprehensive survey on deep learning approaches, arXiv preprint arXiv:1803.01164.
[25] L. Fu, Y. Feng, Y. Majeed, X. Zhang, J. Zhang, M. Karkee, Q. Zhang, Kiwifruit detection in field images using faster r-cnn with zfnet,
IFAC-PapersOnLine 51 (17) (2018) 45–50.
[26] S. Targ, D. Almeida, K. Lyman, Resnet in resnet: Generalizing residual architectures, arXiv preprint arXiv:1603.08029.
[27] P. Jiang, D. Ergu, F. Liu, Y. Cai, B. Ma, A review of yolo algorithm developments, Procedia Computer Science 199 (2022) 1066–1073.
[28] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, I.-H. Yeh, Cspnet: A new backbone that can enhance learning capability of
cnn, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 390–391.
[29] S. Sharma, S. Sharma, A. Athaiya, Activation functions in neural networks, towards data science 6 (12) (2017) 310–316.
[30] K. Wang, S. Dong, N. Liu, J. Yang, T. Li, Q. Hu, Pa-net: Learning local features using by pose attention for short-term person re-identification,
Information Sciences 565 (2021) 196–209.
[31] J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767.
[32] Z.-Q. Zhao, P. Zheng, S.-t. Xu, X. Wu, Object detection with deep learning: A review, IEEE transactions on neural networks and learning
systems 30 (11) (2019) 3212–3232.
[33] C. G. Roehrborn, J. D. McConnell, Analysis of factors contributing to success or failure of 1-stage urethroplasty for urethral stricture disease,
The Journal of urology 151 (4) (1994) 869–874.
[34] J.-S. Kang, S.-E. Shim, S.-M. Jo, K. Chung, Yolo based light source object detection for traffic image big data processing, Journal of
Convergence for Information Technology 10 (8) (2020) 40–46.
[35] B. Li, Y. Liu, X. Wang, Gradient harmonized single-stage detector, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 33,
2019, pp. 8577–8584.
[36] W. Wu, H. Liu, L. Li, Y. Long, X. Wang, Z. Wang, J. Li, Y. Chang, Application of local fully convolutional neural network combined with
yolo v5 algorithm in small target detection of remote sensing image, PloS one 16 (10) (2021) e0259283.
[37] Y. Zhong, J. Wang, J. Peng, L. Zhang, Anchor box optimization for object detection, in: Proceedings of the IEEE/CVF Winter Conference
on Applications of Computer Vision, 2020, pp. 1286–1294.
[38] L. O. Chua, T. Roska, The cnn paradigm, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 40 (3) (1993)
147–156.
[39] J. Fu, H. Luo, J. Feng, K. H. Low, T.-S. Chua, Drmad: distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks.
[43] F. Schorfheide, Loss function-based evaluation of dsge models, Journal of Applied Econometrics 15 (6) (2000) 645–670.
[44] M. Sozzi, S. Cantalamessa, A. Cogato, A. Kayad, F. Marinello, Automatic bunch detection in white grape varieties using yolov3, yolov4, and yolov5 deep learning algorithms, Agronomy 12 (2) (2022) 319.
[45] Z. Zhang, X. Lu, G. Cao, Y. Yang, L. Jiao, F. Liu, Vit-yolo: Transformer-based yolo for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2799–2808.
[46] Z. Huang, J. Wang, X. Fu, T. Yu, Y. Guo, R. Wang, Dc-spp-yolo: Dense connection and spatial pyramid pooling based yolo for object
detection, Information Sciences 522 (2020) 241–258.
[47] C. Wang, H. M. Liao, I. Yeh, Y. Wu, P. Chen, J. Hsieh, Cspnet: A new backbone that can enhance learning capability of cnn, arXiv preprint arXiv:1911.11929.
[48] C. Liu, Y. Wu, J. Liu, J. Han, Mti-yolo: a light-weight and real-time deep neural network for insulator detection in complex aerial images, Energies 14 (5) (2021) 1426.
[49] W. Chen, Z. Liqiang, Y. Tianpeng, J. Tao, J. Yijing, L. Zhihao, Research on the state detection of the secondary panel of the switchgear based
on the yolov5 network model, in: Journal of Physics: Conference Series, Vol. 1994, IOP Publishing, 2021, p. 012030.
[50] M. Li, Z. Zhang, L. Lei, X. Wang, X. Guo, Agricultural greenhouses detection in high-resolution satellite images based on convolutional
neural networks: Comparison of faster r-cnn, yolo v3 and ssd, Sensors 20 (17) (2020) 4938.
[51] F. Ünel, B. O. Özkalayci, C. Çiğla, The power of tiling for small object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 582–591. doi:10.1109/CVPRW.2019.00084.
[52] R. Guan, K. L. Man, H. Zhao, R. Zhang, S. Yao, J. Smith, E. G. Lim, Y. Yue, Man and cat: mix attention to nn and concatenate attention to
yolo, The Journal of Supercomputing (2022) 1–29.
[53] J. Frankle, D. J. Schwab, A. S. Morcos, Training batchnorm and only batchnorm: On the expressive power of random features in cnns, arXiv
preprint arXiv:2003.00152.
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4339978