1. Introduction
Since the beginning of the 21st century, the growth of the automobile industry has gradually changed people’s daily travel patterns. Despite the convenience brought by automobile technology, traffic safety issues have become increasingly serious. Studies have shown that driver behaviors such as inattention, slow reaction, and failure to yield to pedestrians can easily cause avoidable casualties in traffic accidents, thereby posing a serious threat to human life and property [1,2,3,4,5]. With the substantial improvement of modern control and automotive technology, smart cars can assist or even completely replace drivers in performing the main driving operations, thereby providing a solution to traffic safety problems [6,7]. Pedestrian detection is an important aspect of the development of intelligent vehicles and directly affects the driver’s judgment of road conditions. Smart cars obtain road information around the vehicle in real time through the vehicle-mounted camera and then use pedestrian detection technology to detect pedestrians that appear in front of the vehicle, so that timely feedback and warning can be provided to the driver, who can then take the correct driving operation to avoid the pedestrians. This helps to ensure road safety and greatly reduce the traffic accident rate [8,9,10,11]. Therefore, this subject deserves further in-depth study.
Pedestrian detection refers to automatically detecting the presence of pedestrians in a collected image or video sequence and accurately locating the pedestrian regions. However, as pedestrians are non-rigid objects, complex backgrounds, different postures, changing light, and varying degrees of occlusion in actual road scenarios pose challenges to accurate pedestrian detection [12,13,14,15]. With the rapid development of computer science and artificial intelligence technology, pedestrian detection, as an important branch of computer vision, has attracted considerable research attention. Pedestrian detection methods are generally divided into two categories, namely, traditional and deep learning-based detection methods. Traditional pedestrian detection methods are mostly implemented step by step based on statistical learning. First, features are extracted from the candidate regions of the detection image; the features are then input to a classifier for discrimination; finally, the results are output in combination with the detection model [16,17,18]. Dollar et al. [19] proposed a multi-scale pedestrian detection method using fast feature pyramids based on aggregated channel features (ACF). This method first calculated the features of the detection image by channel and then obtained the final feature vector by integral histogram, which gave a good detection effect on most visible light images. Gaikwad et al. [20] proposed a pedestrian detection method based on edge features, which effectively reduced the computational complexity of the feature classifier. However, the detection effect was poor when the edge features of pedestrians in the detected image were not obvious or clear. Liu et al. [21] effectively combined the linear kernel function with the two heterogeneous features of histogram of oriented gradients and local binary pattern. The resulting multi-view-pose part ensemble detector enhanced the expressive ability of pedestrian features and exhibited robust properties. Baek et al. [22] used a kernel support vector machine (SVM) as a feature classifier for pedestrian detection and trained and optimized it with a genetic algorithm to obtain higher detection accuracy. In summary, traditional pedestrian detection methods have a good detection effect under a simple background. However, when the actual road scenario becomes complex, the detection image is blurry, or the pedestrian is in motion, the detection accuracy of these methods decreases and is easily affected by environmental factors [23,24,25].
In recent years, deep learning models represented by the convolutional neural network (CNN) have been successfully applied to the field of computer vision. As deep learning has the significant advantage of learning pedestrian characteristics automatically, pedestrian detection methods based on deep learning have developed rapidly. Chen et al. [26] extracted the gradient features of pedestrians in a detection image based on a deep CNN, input them to an SVM classifier for detection, and achieved highly satisfactory detection results. Li et al. [27] adopted a scale-adaptive Fast RCNN framework that can effectively integrate large and small subnets and had good adaptability to pedestrian detection at different scales. Ouyang et al. [28] proposed a joint deep learning framework for pedestrian detection, focusing on deformation and occlusion processing, and realized automatic interaction between related components, thereby showing competitive advantages in detection accuracy. Hou et al. [29] proposed a multispectral pedestrian detection algorithm that combines a single-shot detector framework with multispectral pixel-level image fusion methods, which further improved the detection performance. Chu et al. [30] proposed a Syncretic-NMS algorithm for instance segmentation in object detection. Based on the traditional NMS algorithm, each bounding box was merged with its strongly related neighboring boxes, and the experimental results showed that the Syncretic-NMS algorithm can effectively improve the accuracy of instance segmentation and adapt to different application scenarios. Currently, pedestrian detection methods based on deep learning are mainly divided into two-stage and one-stage detection methods. RCNN, SPP, Faster RCNN, and Mask RCNN are typical two-stage detection networks that have high detection accuracy. However, due to their high algorithmic complexity, long calculation time, and poor real-time performance, these networks cannot be effectively applied to actual road scenarios [31,32,33]. OverFeat, SSD, and YOLO are typical one-stage detection networks. Although these methods have high detection speed, they sacrifice a certain degree of detection accuracy and cannot effectively solve the problem of large network model parameters [34,35,36]. In general, many methods have achieved positive results in pedestrian detection, but different algorithms have their own advantages and disadvantages, and the detection performance is uneven. At present, there is no deep learning-based pedestrian detection method that achieves both accuracy and real-time performance when applied to complex road scenarios. Different algorithms still have limitations of varying degrees, which is not conducive to the further development of intelligent vehicle driving assistance technology. Therefore, improving the pedestrian detection algorithm in view of the above problems is necessary.
In this study, a pedestrian detection algorithm for intelligent vehicles in complex scenarios is proposed. First, the basic principle of YOLOv3 is elaborated and analyzed to determine its limitations in pedestrian detection. Then, on the basis of the original YOLOv3 network model, several improvements are made, including modifying the grid cell size, adopting an improved k-means clustering algorithm, improving the multi-scale bounding box prediction based on the receptive field, and using the Soft-NMS algorithm. Finally, based on the INRIA person and PASCAL VOC 2012 datasets, pedestrian detection experiments are conducted to test the performance of the algorithm in various complex scenarios. The proposed algorithm is evaluated by comparing its detection performance with that of other algorithms.
The rest of this paper is organized as follows: In Section 2, based on the basic principle of YOLOv3, the limitations of its application to pedestrian detection are determined. In Section 3, the original YOLOv3 network model is improved. In Section 4, pedestrian detection experiments are conducted based on relevant datasets, the detection effect is observed, and the performances of the algorithms are compared. Section 5 summarizes the conclusions and provides directions for future work.
3. Improved YOLOv3 Network Model
3.1. Improved Grid Cell Size
In the original YOLOv3 network model, the detection image is evenly divided into grid cells with size of 7 × 7. The grid cell where the center of the pedestrian object is located is responsible for predicting the pedestrian object. Appropriately increasing the division density of grid cells can help improve the detection accuracy of the network model and reduce the probability of missed detection of pedestrian objects. However, if the division density of grid cells is too high, it becomes counterproductive. Therefore, to determine an appropriate grid cell size, a set of controlled-variable comparison experiments is conducted in this paper, with repeated training and testing on the INRIA person dataset. With all other conditions kept the same, the pedestrian detection performance of YOLOv3 under different grid cell sizes is observed, and the experimental results are shown in Table 2.
Table 2 shows that when the grid cell size is the original 7 × 7, the average processing time of each frame is the shortest, while the corresponding mean Average Precision (mAP) value is the lowest. When the grid cell size is 14 × 14, the corresponding mAP value is the highest, while the average processing time of each frame is the longest. When the grid cell size is 10 × 10, the mAP value obtained is only slightly lower than the highest value, and the detection time is not extended much, so the running speed remains fast. Therefore, this study chooses 10 × 10 as a compromise for the division size of the detection image, so that the pedestrian detection algorithm retains its detection efficiency while improving the detection accuracy.
Figure 2 shows an example of the improved grid cell size.
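To make the grid-cell assignment concrete, the following minimal Python sketch locates the cell that would be responsible for a pedestrian whose box center is known. The function name `responsible_cell`, the top-left pixel coordinate convention, and the 416 × 416 image size used in the usage lines are illustrative assumptions and are not taken from the original model code.

```python
def responsible_cell(cx, cy, img_w, img_h, grid=10):
    """Return (row, col) of the grid cell that contains the box center.

    Hypothetical helper: assumes pixel coordinates with the origin at the
    top-left corner and an evenly divided grid x grid image.
    """
    if not (0 <= cx < img_w and 0 <= cy < img_h):
        raise ValueError("box center lies outside the image")
    col = int(cx * grid / img_w)   # horizontal cell index
    row = int(cy * grid / img_h)   # vertical cell index
    return row, col


# Usage: the same pedestrian center falls into different cells
# when the division changes from 7 x 7 to 10 x 10.
print(responsible_cell(208, 104, 416, 416, grid=7))
print(responsible_cell(208, 104, 416, 416, grid=10))
```

A finer grid means that two nearby pedestrians are less likely to share one responsible cell, which is the intuition behind the reduced missed detections described above.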
3.2. Improved k-Means Clustering Algorithm
The original YOLOv3 network model uses the k-means clustering algorithm to perform unsupervised learning of the prior boxes. This iterative algorithm uses the Euclidean distance as the evaluation index of object similarity in the clustering process. However, the real boxes in the collected dataset may differ in size, and during iterative updating a larger real box produces a larger error than a smaller one. Therefore, using the Euclidean distance as the evaluation index of object similarity in the unsupervised learning of prior boxes is inaccurate [39]. As the ultimate goal of the unsupervised learning of prior boxes is to make the size of the detection box as close as possible to the size of the real box, this study selects the intersection over union (IOU) as the evaluation index to describe the distance between the real box and the cluster center. IOU is a commonly used metric that refers to the ratio of the intersection area to the union area of the detection box and the real box [40], which can be expressed as
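$$\mathrm{IOU} = \frac{\mathrm{area}(G_t \cap D_r)}{\mathrm{area}(G_t \cup D_r)} \tag{2}$$

where $G_t$ (ground truth) represents the real box of the object; $D_r$ (detection result) represents the detection box of the object; $G_t \cap D_r$ represents the intersection of the real box and detection box; and $G_t \cup D_r$ represents the union of the real box and detection box.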
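As a small illustration of the IOU definition, the sketch below computes it for two axis-aligned boxes. The corner format `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for this example only.

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero when the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Usage: a real box and a detection box shifted by one unit in x and y.
# The intersection is 1 and the union is 4 + 4 - 1 = 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```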
The distance between the real box and cluster center can be expressed as follows:
$$d(\mathrm{box}, \mathrm{centroid}) = 1 - \mathrm{IOU}(\mathrm{box}, \mathrm{centroid}) \tag{3}$$

where box represents the real box and centroid represents the cluster center.
Equation (3) shows that the larger the IOU value between the real box and cluster center, the smaller is the distance between them. The k-means clustering algorithm based on IOU value can effectively reduce the error caused by the size of the real box, which is conducive to obtaining a more accurate cluster center value.
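The effect of replacing the Euclidean distance with the IOU-based distance can be sketched as follows. The example assumes, as is common practice in anchor clustering but not stated explicitly in the text, that boxes are compared only by width and height after aligning them at a common corner. The helper names are hypothetical.

```python
import math


def wh_iou(wh_a, wh_b):
    """IOU of two boxes given as (width, height), aligned at a common corner."""
    (wa, ha), (wb, hb) = wh_a, wh_b
    inter = min(wa, wb) * min(ha, hb)
    union = wa * ha + wb * hb - inter
    return inter / union


def iou_distance(box_wh, centroid_wh):
    """Distance of Equation (3): d = 1 - IOU(box, centroid)."""
    return 1.0 - wh_iou(box_wh, centroid_wh)


def euclidean_distance(box_wh, centroid_wh):
    """Plain Euclidean distance between two (width, height) pairs."""
    return math.hypot(box_wh[0] - centroid_wh[0], box_wh[1] - centroid_wh[1])


# Usage: the same shape mismatch at two scales.
small = ((10, 20), (5, 10))
large = ((100, 200), (50, 100))
print(iou_distance(*small), iou_distance(*large))              # identical values
print(euclidean_distance(*small), euclidean_distance(*large))  # grows with box size
```

In this sketch the IOU-based distance is the same for both pairs, whereas the Euclidean distance is ten times larger for the larger boxes, which is the size-dependent error that the IOU-based index avoids.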
Before using the k-means clustering algorithm to cluster the pedestrian dataset, the invalid annotation data in the dataset needs to be cleared. In this study, the width and height of the real box are taken as an important reference in the data filtering process. If points or lines exist in the dataset, then the corresponding width or height of the real box is 0, and the data are considered invalid. As the pedestrian objects are mostly thin and tall, the data are also considered invalid if the aspect ratio of the real box is greater than 3.
The basic steps of the improved k-means clustering algorithm are as follows:
- (1) The invalid annotation data in the training dataset are eliminated.
- (1a) Coordinate data are read from the annotation file of the training dataset into an array.
- (1b) The array data are read in sequence. For each annotation box, the projection coordinates of the lower-left vertex on the $x$ and $y$ axes are defined as $x_{\min}$ and $y_{\min}$, and those of the upper-right vertex as $x_{\max}$ and $y_{\max}$.
- (1c) The difference between $x_{\max}$ and $x_{\min}$ is calculated and recorded as $\Delta x$, and the difference between $y_{\max}$ and $y_{\min}$ is calculated and recorded as $\Delta y$. If $\Delta x = 0$ or $\Delta y = 0$, then the annotation data corresponding to $(x_{\min}, y_{\min})$ and $(x_{\max}, y_{\max})$ are invalid; otherwise, they are valid.
- (1d) The quotient of $\Delta x$ and $\Delta y$ is calculated and recorded as $Q$. If $Q > 3$, then the annotation data corresponding to $(x_{\min}, y_{\min})$ and $(x_{\max}, y_{\max})$ are invalid; otherwise, they are valid.
- (1e) All valid annotation data in the training dataset are obtained.
- (2) The valid annotation data are clustered.
- (2a) The number of clusters k is chosen manually, and k initial cluster centers are randomly selected.
- (2b) The IOU values between all valid annotation data and the cluster centers are calculated.
- (2c) Each data point is assigned to the cluster whose center gives the largest IOU value.
- (2d) The center of all data points in each cluster is selected as the new cluster center.
- (2e) Steps (2b)–(2d) are repeated until the cluster centers no longer move.
- (3) The final clustering result is used as the prior boxes obtained by unsupervised learning for the YOLOv3 network model.
The improved k-means clustering algorithm can almost completely eliminate the effect of invalid annotation data on the cluster centers, greatly improving the matching degree between the prior boxes and pedestrian objects. This not only helps to reduce the complexity of network training and shorten the training time but also helps to improve the detection accuracy of the YOLOv3 network model.
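The filtering and clustering steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: boxes are `(x_min, y_min, x_max, y_max)` tuples, the aspect ratio is taken as width divided by height, boxes are compared by width and height only, the new cluster center is the mean width and height of its members, and the names `filter_annotations` and `iou_kmeans` as well as the `seed` and `max_iter` parameters are hypothetical.

```python
import random


def filter_annotations(boxes, max_ratio=3.0):
    """Step (1): return (width, height) of every valid annotation box."""
    valid = []
    for x_min, y_min, x_max, y_max in boxes:
        dx, dy = x_max - x_min, y_max - y_min
        if dx <= 0 or dy <= 0:        # point or line (also rejects inverted boxes)
            continue
        if dx / dy > max_ratio:       # far too wide to be a pedestrian
            continue
        valid.append((dx, dy))
    return valid


def wh_iou(a, b):
    """IOU of two (width, height) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def iou_kmeans(wh_list, k, seed=0, max_iter=100):
    """Step (2): k-means in which similarity is measured by IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(wh_list, k)            # (2a) random initial centers
    assignment = None
    for _ in range(max_iter):
        # (2b)-(2c): assign each box to the center with the largest IOU.
        new_assignment = [
            max(range(k), key=lambda j: wh_iou(wh, centroids[j]))
            for wh in wh_list
        ]
        if new_assignment == assignment:          # (2e) nothing moved: stop
            break
        assignment = new_assignment
        # (2d): the new center is the mean width and height of the members.
        for j in range(k):
            members = [wh for wh, a in zip(wh_list, assignment) if a == j]
            if members:
                centroids[j] = (
                    sum(w for w, _ in members) / len(members),
                    sum(h for _, h in members) / len(members),
                )
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Usage with made-up annotations: one line-shaped box and one very wide box
# are discarded before clustering.
boxes = [(0, 0, 10, 30), (0, 0, 11, 31), (0, 0, 9, 29),
         (0, 0, 40, 100), (0, 0, 42, 104), (0, 0, 38, 96),
         (5, 5, 5, 20), (0, 0, 90, 10)]
print(iou_kmeans(filter_annotations(boxes), k=2))
```

The returned centers play the role of the prior boxes in step (3). In practice the number of clusters and the handling of empty clusters would need to follow the actual training setup.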
3.3. Improved Multi-Scale Bounding Box Prediction Based on Receptive Field
The original YOLOv3 network model uses a deep convolutional feature extraction network. However, during network training, as the network deepens, the information on small-scale pedestrian objects is increasingly lost [41]. Therefore, the receptive field of the deep convolutional layers needs to be expanded as much as possible to improve the feature recognition ability of the network model for pedestrian objects of different scales. The receptive field refers to the size of the region of the original input image onto which a pixel of the feature map output by a given CNN layer is mapped. As convolutional kernels of 1 × 1 and 3 × 3 are widely used in the down-sampling process of YOLOv3, the receptive field increases gradually as the network depth increases. It is a relative concept, and the calculation formula can be expressed as follows:
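$$RF_{i-1} = (RF_i - 1) \times s + k \tag{4}$$

where $RF$ represents the size of the receptive field; $s$ is the convolution step size; $k$ is the size of the convolution kernel; and $i$ and $i-1$ are the indices of adjacent convolutional layers. The recursion starts from a receptive field of 1 at the layer of interest and is applied layer by layer back to the input image.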
In the original multi-scale bounding box prediction, the last-layer feature maps of size 13 × 13, 26 × 26, and 52 × 52 are fused with the same-size feature maps obtained by up-sampling, and the detection results of the feature layer of each size are obtained. According to Equation (4), the receptive field sizes of the last feature layers of the relevant sizes in the original YOLOv3 are reported in Table 3.
To fully utilize the rich semantic information of high-level features and the detailed information of low-level features, this study improves the multi-scale bounding box prediction. On the basis of the original three-scale detection module, the feature map with size of 52 × 52 is up-sampled to obtain a feature map with size of 104 × 104. This map is fused with the feature map of the same size in the shallow network, and the fourth detection result is obtained after a convolution operation and entry into the detection layer. As the number of convolutional layers with 104 × 104 size in the shallow network is small, six convolutional layers with 104 × 104 size are added to the shallow network of the original Darknet-53 to achieve an improved detection effect. Table 4 presents the receptive field sizes of the last feature layers of the relevant sizes in the improved YOLOv3.
A comparison of Table 3 and Table 4 shows that in the improved YOLOv3 network model, the receptive field sizes of the last feature layers of all four sizes have increased significantly. In particular, for the feature layer with 104 × 104 size, the corresponding receptive field size has increased from 29 × 29 to 77 × 77. Improving the multi-scale bounding box prediction based on the receptive field enhances the network’s ability to attend to global information, helps it identify the scale features of pedestrian objects, improves the detection accuracy of the network model for small-scale pedestrian objects, and greatly reduces missed detections.
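The receptive field values quoted for the 104 × 104 feature layer can be checked with a short sketch. It assumes the commonly used recursion in which the receptive field is initialized to 1 at the layer of interest and updated as (RF − 1) × stride + kernel for each layer on the way back to the input. The layer list below is an assumption based on the usual Darknet-53 front end, and the six added layers are assumed to be 3 × 3 convolutions with stride 1; neither detail is specified in the text.

```python
def receptive_field(layers):
    """Receptive field of the last layer in `layers`.

    `layers` lists (kernel_size, stride) from the input to the layer of
    interest; the recursion runs from that layer back to the input.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed Darknet-53 layers up to the last 104 x 104 feature layer.
DARKNET53_TO_104 = [
    (3, 1),            # 3x3 convolution, stride 1
    (3, 2),            # 3x3 convolution, stride 2 (down-sampling)
    (1, 1), (3, 1),    # one residual block
    (3, 2),            # 3x3 convolution, stride 2 (down-sampling)
    (1, 1), (3, 1),    # first of two residual blocks
    (1, 1), (3, 1),    # second of two residual blocks
]

# Assumed form of the six added layers: 3x3 convolutions with stride 1.
EXTRA_104_LAYERS = [(3, 1)] * 6

print(receptive_field(DARKNET53_TO_104))
print(receptive_field(DARKNET53_TO_104 + EXTRA_104_LAYERS))
```

Under these assumptions the two results agree with the 29 × 29 and 77 × 77 receptive fields stated for the original and improved networks.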
3.4. Soft-NMS Algorithm
The traditional NMS algorithm searches for local maxima and suppresses non-maximum elements, operating on the confidence of the detection boxes and the overlap between them. Two main problems exist in the traditional NMS algorithm. First, when two detection boxes are relatively close, a valid detection box with a slightly lower confidence is deleted only because of its large overlapping area. Second, the overlap threshold must be set manually; if it is set too large, false detections occur, and if it is set too small, missed detections occur. Therefore, the NMS algorithm relies completely on the confidence of the detection boxes and simply deletes all other detection boxes whose overlap exceeds the threshold, which cannot achieve the ideal pedestrian detection effect.
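The first problem can be seen in a minimal sketch of the traditional procedure. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the default threshold of 0.5 are assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, overlap_threshold=0.5):
    """Traditional NMS: return the indices of the boxes that are kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)                  # box with the current maximum confidence
        keep.append(m)
        # Delete every remaining box that overlaps box m too strongly.
        order = [i for i in order if iou(boxes[m], boxes[i]) < overlap_threshold]
    return keep


# Usage: boxes 0 and 1 could be two pedestrians standing close together.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

In this example box 1 is discarded even though its confidence of 0.8 is high, purely because it overlaps box 0.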
Considering the shortcomings of the traditional NMS algorithm, this study uses the Soft-NMS algorithm as the detection box selection scheme. This algorithm does not directly delete all detection boxes whose IOU is larger than the threshold, but reduces their confidence. The larger the IOU between the detection box to be processed and the detection box with the current maximum confidence, the faster is the decrease in the confidence of the detection box to be processed. According to the actual situation, the Soft-NMS algorithm selects one of two penalty functions, linear and Gaussian, to attenuate the confidence of the detection box.
The linear penalty function is defined as follows:
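$$s_i = \begin{cases} s_i, & \mathrm{IOU}(M, b_i) < N_t \\ s_i \left(1 - \mathrm{IOU}(M, b_i)\right), & \mathrm{IOU}(M, b_i) \ge N_t \end{cases} \tag{5}$$

where $s_i$ is the confidence of the detection box to be processed, $M$ represents the detection box with the current maximum confidence, $b_i$ represents the detection box to be processed, and $N_t$ is the overlap threshold.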
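The Gaussian penalty function is defined as follows:

$$s_i = s_i \, e^{-\frac{\mathrm{IOU}(M, b_i)^2}{\sigma}} \tag{6}$$

where $\sigma$ is the parameter that controls the width of the Gaussian penalty.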
The Soft-NMS algorithm is a more general non-maximum suppression algorithm that does not require retraining of the network model, is easy to implement, and can be effectively applied to the improved pedestrian detection algorithm. Compared with the traditional NMS algorithm, the Soft-NMS algorithm improves the positioning accuracy of pedestrian objects and adapts well to scenarios with densely packed objects, which helps to further improve the detection performance of the network model.
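The confidence decay can be sketched as below for both penalty functions. This is an illustrative sketch, not the implementation used in the experiments: the box format, the function names, and the default values of `overlap_threshold`, `sigma`, and `score_threshold` (the cut-off below which a decayed box is finally dropped) are assumptions.

```python
import math


def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, method="linear", overlap_threshold=0.3,
             sigma=0.5, score_threshold=0.001):
    """Soft-NMS: return (index, decayed confidence) pairs in selection order."""
    if method not in ("linear", "gaussian"):
        raise ValueError("method must be 'linear' or 'gaussian'")
    scores = list(scores)                     # do not modify the caller's list
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])   # current maximum confidence
        remaining.remove(m)
        keep.append(m)
        for i in remaining:
            overlap = iou(boxes[m], boxes[i])
            if method == "linear":
                if overlap >= overlap_threshold:      # linear penalty
                    scores[i] *= 1.0 - overlap
            else:                                     # Gaussian penalty
                scores[i] *= math.exp(-(overlap ** 2) / sigma)
        # Boxes whose confidence has decayed to almost zero are dropped.
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return [(i, scores[i]) for i in keep]


# Usage: the overlapping box 1 is kept with a reduced confidence.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(soft_nms(boxes, scores, method="linear"))
print(soft_nms(boxes, scores, method="gaussian"))
```

Because the overlapping box survives with a lower confidence, a later confidence threshold can still decide whether to report it, which is what makes the scheme better suited to crowded scenes.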
5. Conclusions
In this study, a pedestrian detection algorithm for intelligent vehicles in complex scenarios is proposed. First, the basic principle of YOLOv3 is elaborated and analyzed to determine its limitations in pedestrian detection. Then, on the basis of the original YOLOv3 network model, several improvements are made, including modifying the grid cell size, adopting an improved k-means clustering algorithm, improving the multi-scale bounding box prediction based on the receptive field, and using the Soft-NMS algorithm. Finally, based on the INRIA person and PASCAL VOC 2012 datasets, pedestrian detection experiments are conducted to test the performance of the algorithm in various complex scenarios. The experimental results show that the mAP value reaches 90.42%, and the average processing time of each frame is 9.6 ms. Compared with other detection algorithms, the proposed algorithm achieves both accuracy and real-time performance and exhibits good robustness and anti-interference ability in complex scenarios, strong generalization ability, and high network stability. Both detection accuracy and detection speed are markedly improved.
In terms of pedestrian detection accuracy and operating efficiency, the proposed algorithm has clear advantages and meets the accuracy and real-time requirements of pedestrian detection for smart cars in actual road scenarios. These advantages are also important for protecting the road safety of pedestrians and supporting the steady development of intelligent vehicle driving assistance technology. In the future, pedestrian detection under severe working conditions and the porting of the algorithm to hardware can be studied in depth, so as to improve the overall performance and practical application value of the algorithm.