To select the most accurate baseline, we compared the detection results of current mainstream object detection models on real images containing small targets against complex backgrounds, as shown in Figure 2. Figure 2a shows the result of the SSD model, where the confidence for person is 0.69 and the confidence for hat is 0.70. Figure 2b shows the result of the Fast R-CNN model, where the confidence for person is 0.65 and the confidence for hat is 0.68. Figure 2c shows the result of the Faster R-CNN model, where the confidence for person is 0.71 and the confidence for hat is 0.72. Figure 2d shows the result of the YOLOv5s model, where the confidence for person is 0.73 and the confidence for hat is 0.73. These results indicate that YOLOv5s achieves higher detection confidence than the SSD, Fast R-CNN, and Faster R-CNN models while remaining lightweight.
To detect small objects effectively, especially helmets, this paper proposes an improved network model based on YOLOv5s. The method strengthens the extraction of helmet target features.
YOLOv5 is a real-time object detection framework that offers four network models of varying depth: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The lightweight YOLOv5s network structure, shown in Figure 3, is made up of four parts: input (Input), backbone network (Backbone), neck network (Neck), and detection head (Prediction). This research focuses on enhancing this structure. The YOLOv5s backbone consists of the Focus module, the CBL convolutional layer, and the CSP1_X module. A 640 × 640 × 3 image is fed into the Focus structure, where a slicing operation followed by a convolution yields a 320 × 320 × 64 feature map. The CBL convolutional layers and CSP1_X modules then produce feature maps rich in semantic information. The neck network uses CSP2_X modules and an FPN+PAN structure with two upsampling operations to combine shallow and high-level semantic features, fusing multi-scale receptive fields and strengthening feature fusion. The prediction head combines regression and classification on three grids of different sizes, 80 × 80, 40 × 40, and 20 × 20, which detect small, medium, and large targets, respectively. At the input, YOLOv5 applies adaptive image scaling, adaptive anchor box computation, and Mosaic data augmentation. The backbone uses the Focus and CSP structures, whereas the neck uses the FPN+PAN structure [27]. At the output, GIoU_Loss serves as the bounding box loss function, and non-maximum suppression (NMS) is applied to remove redundant boxes. Compared with conventional two-stage detection approaches, YOLOv5s not only increases detection accuracy but also significantly reduces training time.
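The slicing step and the three grid sizes described above can be illustrated with a minimal pure-Python sketch. It assumes the usual space-to-depth form of the Focus slice (each 2 × 2 pixel block is stacked along the channel axis, so a 640 × 640 × 3 input becomes 320 × 320 × 12 before the convolution) and downsampling strides of 8, 16, and 32. The function names `focus_slice` and `grid_sizes` are hypothetical and introduced only for illustration.

```python
def focus_slice(image):
    """Space-to-depth slicing: (H, W, C) nested lists -> (H/2, W/2, 4C)."""
    h, w = len(image), len(image[0])
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            # Stack the four pixels of each 2x2 block along the channel axis.
            row.append(image[y][x] + image[y + 1][x]
                       + image[y][x + 1] + image[y + 1][x + 1])
        out.append(row)
    return out


def grid_sizes(input_size=640, strides=(8, 16, 32)):
    """Side length of each prediction grid (strides are an assumption)."""
    return [input_size // s for s in strides]


# Tiny 4 x 4 RGB image whose pixel value encodes its position.
image = [[[10 * y + x] * 3 for x in range(4)] for y in range(4)]
sliced = focus_slice(image)
print(len(sliced), len(sliced[0]), len(sliced[0][0]))  # 2 2 12
print(grid_sizes())  # [80, 40, 20]
```

The finest grid (80 × 80) has the smallest cells and is therefore the one responsible for small targets such as helmets.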
The optimization of the YOLOv5s algorithm for helmet detection can be divided into several aspects:
2.1. Improvement of Adaptive Anchor Frame Mechanism in YOLOv5 Based on K-means++ Algorithm
Object detection, a core problem in computer vision, entails locating and classifying items inside an image using bounding boxes. Selecting appropriate prior bounding boxes during training helps increase the accuracy of detection models. The YOLOv5 model incorporates the concept of the anchor box, an initial bounding box with a predefined size and aspect ratio. During training, the model adjusts the predicted bounding box according to how close the anchor box is to the ground-truth bounding box. The original YOLOv5 model updates the anchor boxes using K-means and a genetic algorithm [28], with the Euclidean distance as the metric function. However, for samples of different sizes, the Euclidean distance can produce poor clusters, because large boxes contribute larger errors than small boxes. To solve this problem, we cluster the anchor boxes using the K-means++ algorithm with the intersection-over-union (IOU) distance metric. This yields prior bounding boxes with a higher IOU value, improving detection accuracy.
The dimensions of the two boxes are represented by (w1, h1) and (w2, h2), respectively, as illustrated in Figure 4. The region highlighted in red denotes the intersection of the two boxes, with dimensions (w, h), and is defined as:
The union of the two boxes is represented by the “blue + red + gray” region, and its area is calculated as:
The IOU can be obtained based on Equations (1) and (2):
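A sketch of Equations (1) to (3), reconstructed from the definitions above. It assumes that the two boxes are aligned at a common corner, as is usual in anchor clustering, so that w = min(w1, w2) and h = min(h1, h2). The symbols S_∩ and S_∪ for the intersection and union areas are labels introduced here for illustration.

```latex
\begin{align}
S_{\cap} &= w \times h, \qquad w = \min(w_1, w_2),\quad h = \min(h_1, h_2) \tag{1} \\
S_{\cup} &= w_1 h_1 + w_2 h_2 - w h \tag{2} \\
IOU &= \frac{S_{\cap}}{S_{\cup}} = \frac{w h}{w_1 h_1 + w_2 h_2 - w h} \tag{3}
\end{align}
```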
The IOU measures the overlap between two boxes on a scale from 0 to 1. A value of 0 means that the two boxes do not overlap, while a value of 1 indicates that the two boxes are identical. A higher IOU value indicates that the two boxes match better. A distance metric, however, should be small when the similarity is high, so the IOU is subtracted from 1. This gives rise to Equation (4), which defines the distance metric between two boxes:
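In other words, the distance is d = 1 − IOU. A minimal Python sketch of this metric follows, assuming boxes are given only by their width and height and are aligned at a common corner, as in anchor clustering. The function names `iou_wh` and `iou_distance` are hypothetical.

```python
def iou_wh(box_a, box_b):
    """IOU of two (width, height) boxes aligned at a common corner."""
    w1, h1 = box_a
    w2, h2 = box_b
    inter = min(w1, w2) * min(h1, h2)      # intersection area
    union = w1 * h1 + w2 * h2 - inter      # union area
    return inter / union


def iou_distance(box_a, box_b):
    """Distance that shrinks as the boxes become more similar."""
    return 1.0 - iou_wh(box_a, box_b)


print(iou_distance((10, 20), (10, 20)))  # 0.0 for identical boxes
print(iou_distance((10, 10), (20, 20)))  # 0.75
```

Unlike the Euclidean distance, this value depends only on the relative overlap, so a small box and a large box with the same proportional mismatch are treated alike.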
In this paper, we first applied the K-means-based YOLOv5 procedure with the Euclidean distance measure to derive 9 prior boxes. These prior boxes correspond to feature maps of different scales and have a matching degree of 0.8553. The prior boxes for the feature maps of different scales are presented in Table 1.
The K-means clustering method has several benefits, including computational complexity that grows linearly and fast convergence. However, the initial cluster centers must be specified in advance, and different initial centers may produce different clustering results. To address this problem, we used the K-means++ technique to calculate the anchor boxes in our detection model. K-means++ chooses the first cluster center at random and then selects the remaining centers so that the initial centers are as far apart as feasible: once n initial cluster centers (0 < n < K) have been chosen, the (n + 1)-th center is sampled with a higher probability from points farther from the existing n centers. This reduces the sensitivity of the anchor boxes to the choice of initial cluster centers and improves accuracy and robustness in object detection.
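The seeding step described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes the 1 − IOU distance for (width, height) boxes and the standard K-means++ choice of sampling in proportion to the squared distance to the nearest chosen center. The name `kmeanspp_init` and the sample boxes are hypothetical.

```python
import random


def iou_distance(a, b):
    """1 - IOU for two (width, height) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return 1.0 - inter / union


def kmeanspp_init(boxes, k, seed=0):
    """Pick k initial cluster centers with K-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(boxes)]          # first center: uniform at random
    while len(centers) < k:
        # Distance from every box to its nearest already-chosen center.
        nearest = [min(iou_distance(b, c) for c in centers) for b in boxes]
        weights = [d * d for d in nearest]  # farther boxes are more likely
        if sum(weights) == 0:               # every box coincides with a center
            break
        centers.append(rng.choices(boxes, weights=weights, k=1)[0])
    return centers


# Hypothetical (width, height) samples, for illustration only.
boxes = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45), (59, 119)]
print(kmeanspp_init(boxes, k=3))
```

Boxes that are already centers have zero weight, so they cannot be selected twice. The regular K-means update loop would then run from these seeds.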
To create prior boxes for feature maps of different scales, we combined the IOU distance metric with the K-means++ algorithm. This approach yielded a matching degree of 0.8689 for the prior boxes, which is higher than the 0.8553 achieved by clustering with the K-means algorithm. The resulting prior box distribution is presented in Table 2.
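The text does not define the matching degree. One common way to score a set of prior boxes, assumed here purely for illustration, is the mean over all labeled boxes of the best IOU achieved by any prior box. The function name `matching_degree` and the sample values are hypothetical.

```python
def iou_wh(a, b):
    """IOU of two (width, height) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def matching_degree(boxes, priors):
    """Mean best IOU between each labeled box and the prior boxes (assumed definition)."""
    return sum(max(iou_wh(b, p) for p in priors) for b in boxes) / len(boxes)


boxes = [(10, 10), (20, 20)]
print(matching_degree(boxes, [(10, 10)]))            # 0.625
print(matching_degree(boxes, [(10, 10), (20, 20)]))  # 1.0
```

Under this definition, a value closer to 1 means the prior boxes cover the size distribution of the dataset more tightly.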
2.2. Improvement of Network Structure
The attention mechanism seeks to identify relevant information and disregard irrelevant information, thereby enhancing the efficiency of neural networks. By attending to detailed information and suppressing unnecessary data, it improves the network’s performance [29,30]. To this end, we propose a fusion approach that embeds the convolutional block attention module (CBAM) into the cross-stage partial (CSP) module and combines it with the global attention mechanism (GAM). Our method aims to improve the model’s overall performance by strengthening its feature extraction capabilities.
(1) CBAM attention mechanism
The convolutional block attention module (CBAM) [31] is a compact and adaptable module for strengthening neural networks. In this study, the CBAM module is embedded in the last layer of the cross-stage partial (CSP) modules in the backbone and neck of YOLOv5s. This integration enhances the model’s ability to extract features while also lowering computational complexity.
The CBAM module consists of two sub-modules, the channel attention module and the spatial attention module, which are applied in succession. Intermediate feature maps are first obtained from the deep network, and the CBAM module then adaptively refines them at each convolutional block: attention maps are inferred sequentially along the channel and spatial dimensions, and each attention map is multiplied by its input feature map to accomplish adaptive feature refinement. Figure 5 shows the detailed CBAM attention module of the proposed method.
The intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$ is the input for the CBAM module. The module then sequentially infers a 1D channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. The mathematical representations of the attention process are given as:
where the symbol ⊗ denotes element-wise multiplication. During multiplication, the channel attention values are broadcast along the spatial dimension, and the spatial attention values are broadcast along the channel dimension. F″ represents the refined output.
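For reference, the two-step refinement in the original CBAM formulation [31] can be written as follows, where F′ is the intermediate feature map after channel attention. It is assumed here that the missing expressions follow that formulation.

```latex
\begin{align}
F'  &= M_c(F) \otimes F \\
F'' &= M_s(F') \otimes F'
\end{align}
```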
Notably, the feature map is compressed along the spatial dimension by the channel attention mechanism to produce a one-dimensional vector. The corresponding calculation for the channel attention is expressed as:
As shown in Figure 4, the channel attention sub-module passes both the max-pooled and the average-pooled outputs through a shared network to generate an attention map. Two spatial context descriptors, $F^c_{avg}$ and $F^c_{max}$, are produced by aggregating the spatial information of the feature map using average pooling and max pooling, and they represent the average-pooled and max-pooled features, respectively. The channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ is created by feeding both descriptors into a shared multi-layer perceptron (MLP). The hidden activation size is set to $\mathbb{R}^{C/r \times 1 \times 1}$, where r is the reduction ratio, and σ denotes the sigmoid function. The MLP weights $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are shared by both inputs, and the ReLU activation function follows $W_0$.
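In the original CBAM formulation [31], this computation is M_c(F) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max))). A minimal pure-Python sketch follows, assuming a feature map stored as nested C × H × W lists and hand-picked toy weights. The function names and values are hypothetical and only illustrate the data flow.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def shared_mlp(vector, w0, w1):
    hidden = [max(0.0, h) for h in matvec(w0, vector)]  # ReLU follows W0
    return matvec(w1, hidden)


def channel_attention(feature, w0, w1):
    """feature: C x H x W nested lists -> C attention weights in (0, 1)."""
    flat = [[x for row in channel for x in row] for channel in feature]
    f_avg = [sum(ch) / len(ch) for ch in flat]   # average pooling over H x W
    f_max = [max(ch) for ch in flat]             # max pooling over H x W
    a = shared_mlp(f_avg, w0, w1)                # the same weights are used
    m = shared_mlp(f_max, w0, w1)                # for both descriptors
    return [sigmoid(x + y) for x, y in zip(a, m)]


# Toy example with C = 2 channels and reduction ratio r = 2 (hidden size 1).
feature = [[[1, 1], [1, 1]], [[0, 2], [0, 2]]]
w0 = [[1.0, 1.0]]        # shape (C/r) x C
w1 = [[0.5], [-0.5]]     # shape C x (C/r)
print(channel_attention(feature, w0, w1))
```

Each output weight then scales its whole channel, which is the broadcast along the spatial dimension mentioned earlier.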
The spatial attention mechanism compresses the channel by employing average pooling and maximum pooling in the channel dimension, which is formulated as:
To aggregate the feature information of a feature map, max pooling and average pooling are applied along the channel dimension, each producing one value per spatial location (H × W values per map). This yields two 2D feature maps, $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$, which represent the average-pooled and max-pooled features across the channels, respectively. The two maps are concatenated into a dual-channel feature map and convolved by a standard convolutional layer to produce the 2D spatial attention map. The convolution operation, denoted $f^{7 \times 7}$, uses a filter size of 7 × 7.
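For reference, the spatial attention map in the original CBAM formulation [31] can be written as follows, where [ · ; · ] denotes concatenation along the channel axis and σ is the sigmoid function. It is assumed here that the missing expression follows that formulation.

```latex
\begin{equation}
M_s(F) = \sigma\!\left(f^{7 \times 7}\!\left(\left[\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)\right]\right)\right)
       = \sigma\!\left(f^{7 \times 7}\!\left(\left[F^{s}_{avg};\, F^{s}_{max}\right]\right)\right)
\end{equation}
```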
The feature map is compressed along the spatial dimension by the channel attention mechanism to produce a one-dimensional vector. The corresponding calculation for the channel attention is expressed in Equation (7).
(2) Global Attention Mechanism
The goal of the global attention mechanism (GAM) is to improve neural network performance by reducing the loss of useful information and boosting the representation of global interactions. To accomplish this, it introduces a three-dimensional channel attention sub-module with a multi-layer perceptron and a convolutional spatial attention sub-module. As shown in Figure 5, the GAM applies the channel attention mechanism and the spatial attention mechanism in sequence, similar to the CBAM approach [32]. Given an input feature map $F_1 \in \mathbb{R}^{C \times H \times W}$, an intermediate state $F_2$ and an output $F_3$ are defined as follows:
where $M_c$ and $M_s$ denote the channel attention map and the spatial attention map, respectively, and ⊗ denotes element-wise multiplication.
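For reference, the original GAM formulation [32] defines these two states as follows. It is assumed here that the missing expressions follow that formulation.

```latex
\begin{align}
F_2 &= M_c(F_1) \otimes F_1 \\
F_3 &= M_s(F_2) \otimes F_2
\end{align}
```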
To preserve information across the three dimensions, the channel attention submodule uses a 3D permutation. It then uses a two-layer multilayer perceptron (MLP) to strengthen the spatial and cross-dimensional relationships. The MLP is built using a compression ratio of r, and Figure 6 shows the channel attention submodule.
To fuse the spatial information, the spatial attention sub-module uses two convolutional layers and maintains the same compression ratio r as the channel attention sub-module.
By reducing feature loss and magnifying the representation of global interactions, the GAM attention mechanism improves the performance of the neural network; it consists of a three-dimensional channel attention submodule with a multi-layer perceptron and a convolutional spatial attention submodule. The CBAM module is embedded into the last layer of the CSP modules of the backbone and neck, so that the feature map of each convolutional block of the deep network is adaptively refined. This process reduces the model’s computational complexity and establishes correlations among high-dimensional spatial features, thereby facilitating the extraction of relevant features. The network structure incorporating the GAM and CBAM attention mechanisms is listed in Table 3.
In Table 3, the “from” column gives the index of the input source layer, the “params” column gives the number of parameters, and the “module” column gives the name of the module. The “arguments” column lists the number of input and output channels, the convolution kernel size, the stride, and other relevant specifics.
As shown in Figure 7, we used Grad-CAM to display the model’s heat maps. The visualization shows that the feature extraction of the original YOLOv5s model needs to be more focused and better suited to small targets. After incorporating the combined attention mechanism, the model concentrates on critical information, pays less attention to irrelevant details, and markedly improves feature extraction for small objects.
2.3. Bounding Box Loss Function
The accuracy and effectiveness of object detection rely heavily on the loss function employed. Traditional bounding box loss functions aggregate regression metrics such as the distance between the ground-truth box and the predicted box, their overlap area, and their aspect ratio. However, losses such as GIoU and CIoU, which are used in YOLOv5, do not take into consideration the mismatch in direction between the ground-truth box and the predicted box, resulting in slower convergence and poorer model performance [33]. In contrast, the SCYLLA-IoU (SIoU) loss [34] considers the angle of the vector between the ground-truth box and the predicted box, resulting in increased detection precision.
The SIoU loss function improves considerably on conventional bounding box losses because it considers not only the distance between the regressed boxes but also the angle between them, which addresses the orientation mismatch between the predicted and ground-truth boxes. This improves training efficiency and makes bounding box regression more stable, resulting in a more accurate model. The SIoU loss function is made up of the angle cost, distance cost, shape cost, and IoU cost.
(1) Angle cost
An extra term, LF, in the SIoU loss function integrates an adaptive angle adjustment function and greatly lowers the number of variables linked to distance. As seen in Figure 8, the model first aligns the predicted box with either the X or Y axis (whichever is closest), and then it optimizes the distance along that axis.
When α ≤ π/4, the optimization minimizes α; when α > π/4, it minimizes the complementary angle β = π/2 − α instead. LF is then defined as:
where
(2) Distance cost
Based on the redefined angle cost, SIoU defines the distance cost:
where
Equations (12) to (16) show that the effect of the distance cost on the output decreases noticeably as α approaches 0. Conversely, as α approaches π/4, the impact of the distance cost on the output becomes more significant.
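For reference, the angle and distance costs in the original SIoU formulation [34] take the following form. The symbols follow that formulation and are assumed to correspond to the expressions referred to above: Λ is the angle cost (written LF in this text), Δ is the distance cost, (b_cx, b_cy) and (b_cx^gt, b_cy^gt) are the centers of the predicted and ground-truth boxes, c_h is the vertical offset between the two centers, and C_w and C_h are the width and height of the smallest box enclosing both boxes.

```latex
\begin{align}
\Lambda &= 1 - 2\sin^{2}\!\left(\arcsin\!\left(\frac{c_h}{\sigma}\right) - \frac{\pi}{4}\right),
  \qquad \frac{c_h}{\sigma} = \sin\alpha \\
\sigma &= \sqrt{\left(b_{c_x}^{gt} - b_{c_x}\right)^{2} + \left(b_{c_y}^{gt} - b_{c_y}\right)^{2}} \\
c_h &= \max\!\left(b_{c_y}^{gt},\, b_{c_y}\right) - \min\!\left(b_{c_y}^{gt},\, b_{c_y}\right) \\
\Delta &= \sum_{t = x, y}\left(1 - e^{-\gamma \rho_t}\right), \qquad \gamma = 2 - \Lambda \\
\rho_x &= \left(\frac{b_{c_x}^{gt} - b_{c_x}}{C_w}\right)^{2},
  \qquad \rho_y = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{C_h}\right)^{2}
\end{align}
```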
(3) Shape cost
A definition of the shape cost function is:
where
The value of θ controls how much weight the shape cost receives, and its best value differs from dataset to dataset. A genetic algorithm is used during training to determine a suitable value of θ for each dataset.
(4) IoU cost
The IoU cost is described as:
The Lbox regression loss function is formulated as:
The total loss function is constructed as:
Here, Lcls represents the focal loss, while Wbox and Wcls are the weights of the box regression loss and the classification loss, respectively. We used a genetic algorithm to determine the values of Wbox, Wcls, and θ. To do so, we chose a small subset of the training set and computed these values iteratively until the fitness value fell below a threshold or the maximum number of iterations was reached, at which point the iterations were terminated.
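The four costs can be combined in a short pure-Python sketch of the box loss. It follows the original SIoU formulation [34], in which Lbox = 1 − IoU + (Δ + Ω)/2 with the shape cost Ω = Σ(1 − e^(−ω_t))^θ over width and height. It is an illustrative sketch rather than the authors' implementation. Boxes are assumed to be given as (cx, cy, w, h), the default θ = 4 is an illustrative choice, and the function name `siou_loss` is hypothetical.

```python
import math


def siou_loss(pred, gt, theta=4.0, eps=1e-9):
    """SIoU box loss for two boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # IoU cost: overlap of the two boxes.
    ix = max(0.0, min(px + pw / 2, gx + gw / 2) - max(px - pw / 2, gx - gw / 2))
    iy = max(0.0, min(py + ph / 2, gy + gh / 2) - max(py - ph / 2, gy - gh / 2))
    inter = ix * iy
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2)
    ch = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2)

    # Angle cost: depends on the direction of the center-to-center vector.
    dx, dy = gx - px, gy - py
    sigma = math.hypot(dx, dy) + eps
    sin_alpha = min(1.0, abs(dy) / sigma)
    angle = 1.0 - 2.0 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost, modulated by the angle cost through gamma.
    gamma = 2.0 - angle
    rho_x = (dx / (cw + eps)) ** 2
    rho_y = (dy / (ch + eps)) ** 2
    distance = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape cost: relative mismatch in width and height.
    omega_w = abs(pw - gw) / max(pw, gw)
    omega_h = abs(ph - gh) / max(ph, gh)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1.0 - iou + (distance + shape) / 2.0


print(siou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # close to 0 for identical boxes
print(siou_loss((0, 0, 2, 2), (1, 0, 2, 2)))   # partial overlap
print(siou_loss((0, 0, 2, 2), (10, 0, 2, 2)))  # disjoint boxes, loss above 1
```

Unlike a plain 1 − IoU loss, the value keeps growing as disjoint boxes move farther apart, so the regression still receives a useful signal when the boxes do not overlap.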
2.4. Knowledge Distillation
Knowledge distillation is a technique used to extract the knowledge of a large teacher model and condense it into a small student model. It can be understood as a large teacher neural network transferring its knowledge to a small student network [35,36,37].
Knowledge is transferred from the teacher network to the student network. The teacher network is generally large and over-parameterized, whereas the student network is relatively small, so distillation yields a lightweight network model. Knowledge distillation adopts the teacher–student mode, in which the teacher is the provider of “knowledge” and the student is its receiver [38].
The teacher has a strong learning ability and can transfer the learned knowledge to the student model, which has a lower learning ability, so as to improve the generalization ability of the student model. The teacher model, which is complex and cumbersome but performs well, is not deployed; it serves purely as a tutor, and the simple and flexible student model is what is deployed in practice. The knowledge distillation process is shown in Figure 9 below.
First, the logits of a deeper teacher network with better feature extraction ability are obtained and softened at temperature T. The Softmax layer then converts them into a class probability distribution, which serves as the soft targets. The logits of the student network are softened at the same temperature T, and the resulting Softmax class probability distribution is compared with the soft targets to obtain the loss function Lsoft. Its expression is:
where Cj is the true label value of the j-th class.
Finally, Lhard and Lsoft are weighted and summed to obtain the final loss function L. Because Lhard compares the student output with the real label, this loss function can prevent wrong information from the teacher network from being transmitted to the student network. In this study, the improved YOLOv5s model was used as the teacher network, and the YOLOv5s model with the large target detection layer removed by structural pruning was used as the student network. Knowledge distillation then yields the final model, reducing the amount of computation and the number of parameters of the improved network model.
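The distillation loss described above can be sketched for a single classification output as follows. The sketch assumes the common form in which Lsoft is the cross-entropy between the teacher and student distributions at temperature T, Lhard is the cross-entropy between the student distribution at T = 1 and the true label, and the two are mixed with a weight `alpha`. The values of `temperature` and `alpha` and all function names are hypothetical, not taken from this study.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits softened by the given temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    # Soft loss: student vs. teacher, both softened at the same temperature.
    soft_targets = softmax(teacher_logits, temperature)
    soft_pred = softmax(student_logits, temperature)
    l_soft = cross_entropy(soft_targets, soft_pred)
    # Hard loss: student at T = 1 vs. the one-hot true label.
    one_hot = [1.0 if j == label else 0.0 for j in range(len(student_logits))]
    l_hard = cross_entropy(one_hot, softmax(student_logits))
    # Weighted sum of the two terms (alpha is an assumed mixing weight).
    return alpha * l_soft + (1.0 - alpha) * l_hard


teacher = [5.0, 1.0, 0.0]
print(distillation_loss([5.0, 1.0, 0.0], teacher, label=0))  # student agrees
print(distillation_loss([0.0, 1.0, 5.0], teacher, label=0))  # student disagrees
```

A higher temperature flattens the teacher distribution, so the student also learns from the relative scores of the non-target classes.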