6.1. Experimental Process and Dataset
To validate the performance of our improved Faster R-CNN model, this section analyzes the model from the perspectives of the vehicle dataset, the experimental evaluation indicators, and the improvement effects. Firstly, if the IoU between the object bounding box G generated by Faster R-CNN and the ground truth Ggt marked in the test set satisfies Formula (3), the object is judged to be the target; otherwise, it is judged not to be the target. Here, Tr is the fixed threshold, and Tr ≥ 0.5 in most experiments. For the ground truth Ggt in an image of size m × n, the fixed threshold Tr should satisfy Formula (4).
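Formula (3) itself is not reproduced here; as stated above, a detection is accepted as the target when its IoU with the ground truth reaches the fixed threshold Tr. The following is a minimal illustrative sketch of this decision rule (not the authors' code); the (x1, y1, x2, y2) box format and the helper names are our own assumptions.

```python
# Hypothetical helpers illustrating the IoU-based acceptance criterion.
# Boxes are assumed to be axis-aligned, in (x1, y1, x2, y2) pixel coordinates.

def iou(box_g, box_gt):
    """Intersection-over-Union between a predicted box G and a ground truth Ggt."""
    x1 = max(box_g[0], box_gt[0])
    y1 = max(box_g[1], box_gt[1])
    x2 = min(box_g[2], box_gt[2])
    y2 = min(box_g[3], box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_g + area_gt - inter
    return inter / union if union > 0 else 0.0

def is_target(box_g, box_gt, tr=0.5):
    """Judge a detection as the target when IoU >= Tr (Tr >= 0.5, as in the text)."""
    return iou(box_g, box_gt) >= tr
```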
The recall rates of the different models are denoted by rr and the precision rates by pr, which are obtained from Equations (5) and (6), where TP (true positive) and FP (false positive) are the numbers of bounding boxes recognized correctly and incorrectly, respectively, and Ma is the number of ground truths in the images for vehicle type a.
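Equations (5) and (6) are referenced rather than reproduced above; a minimal sketch consistent with the definitions in the text (TP and FP counted over the emitted bounding boxes, Ma the number of ground truths for vehicle type a) would be:

```python
def recall_rate(tp, m_a):
    """rr: fraction of the Ma ground truths of type a that are correctly detected."""
    return tp / m_a if m_a > 0 else 0.0

def precision_rate(tp, fp):
    """pr: fraction of the emitted bounding boxes that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0
```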
Then, the AP of each vehicle type is calculated by taking 11 positions on the recall range [0, 1] at intervals of 0.1, treating the precision of the vehicle type as a piecewise function of the corresponding recall, and computing the area under this function curve. Finally, the mAP of the whole test set is the mean of the per-type AP values.
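A short sketch of this 11-point evaluation is given below. It follows the usual 11-point interpolation (precision at each recall position taken as the maximum precision at recall ≥ that position); this is an illustration, not the authors' evaluation script, and the variable names are assumptions.

```python
# `recalls` and `precisions` are the paired recall/precision values of one
# vehicle type, accumulated over the ranked detections of the test set.

def ap_11_point(recalls, precisions):
    """Average precision from the 11 recall positions 0.0, 0.1, ..., 1.0."""
    ap = 0.0
    for t in [i / 10.0 for i in range(11)]:
        # Interpolated precision: maximum precision among points with recall >= t.
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap

def mean_ap(per_type_ap):
    """mAP of the test set: mean of the per-type AP values (cars, SUVs, vans)."""
    return sum(per_type_ap) / len(per_type_ap)
```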
In this paper, the MIT DriveSeg Dataset [28], the Caltech Pedestrian Detection Benchmark [29], and 1000 vehicle images from the internet were used as the dataset, covering three types of vehicle samples (cars, SUVs, and vans). The MIT DriveSeg Dataset and the Caltech Pedestrian Detection Benchmark are often used as base datasets for vehicle recognition methods and yield good recognition results. The MIT DriveSeg Dataset is a large-scale traffic scene segmentation dataset that annotates road objects at the pixel level across 5000 video frames; its images contain information about various vehicles on city streets. The Caltech dataset consists of approximately 10 h of video taken from vehicles traveling through regular traffic in an urban environment, and its images contain a variety of vehicles traveling on urban roads. We used a total of 12,000 images as the training set and 1916 images of the three vehicle types as the test set. The process of model training and testing is shown in Figure 6. Firstly, ground truths are manually added to the various vehicles in the dataset, and the marked images are divided into the training set and the test set. The training set is used to train the classification model and the region proposal generation model, and the test set is used to evaluate the recognition effect. During the experiment, region proposals of vehicle types are extracted from the original image, and then the type and position information of the different vehicles in the whole image are obtained.
6.2. Experimental Results and Analysis for the Recognition Accuracy
In the Faster R-CNN model, region proposals need to be generated first, that is, regions where an object may appear in the original image are found, and then local features are extracted. Therefore, an effective way to improve the accuracy of object recognition is to improve the quality of the generated region proposals: if the proposals cover all vehicle objects while their number is kept as small as possible, the recall rates can be improved without increasing the number of generated region proposals. In this paper, the output features of multiple convolutional layers were combined to obtain multi-layer features, which were then used as the input of the Faster R-CNN model to improve the quality of the generated region proposals (a sketch of this combination is given below). We used two experimental groups to test whether the multi-layer feature combination improves the quality of the generated region proposals. The images in the dataset were used as the original samples for generating region proposals, and both groups calculated the recall rates of the generated region proposals under different IoU values. The difference is that the first experimental group generated 2000 region proposals for each training image, while the second experimental group generated only 300.
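The sketch below illustrates the general idea of feeding the RPN with a combination of features from two convolutional stages. The choice of stages, the use of max pooling for spatial alignment, and the assumption that the later stage halves the spatial size are illustrative assumptions rather than the exact configuration of the paper (the specific variants compared are described with Table 3).

```python
import torch
import torch.nn as nn

class MultiLayerFeatures(nn.Module):
    """Combine the outputs of two backbone stages into one multi-layer feature map."""

    def __init__(self, stage3, stage4):
        super().__init__()
        self.stage3 = stage3                                # earlier, higher-resolution stage
        self.stage4 = stage4                                # later, more semantic stage
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)   # align stage-3 map to stage-4 size

    def forward(self, x):
        f3 = self.stage3(x)
        f4 = self.stage4(f3)                 # assumed to be half the spatial size of f3
        f3_down = self.pool(f3)              # downsample so the spatial sizes match
        return torch.cat([f3_down, f4], dim=1)  # channel-wise combination fed to the RPN
```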
In this experiment, Selective Search, the traditional VGG16 network with the RPN (denoted as VGG16), and an improved VGG16 network with the RPN and multi-layer features (denoted as VGG16 + Mf) are used to generate and analyze region proposals of the images. In the case of generating 2000 region proposals, the experimental results are shown in Figure 7. When the IoU is 0.5, the three models have relatively good recall rates, all of which are above 0.95.
Table 1 records the recall rates when 2000 region proposals are generated with IoU values of 0.5, 0.7, and 0.9, respectively. As can be seen from Table 1, in the case of the same IoU, the improved model based on multi-layer feature combination has a certain degree of improvement in recall rates compared with the traditional VGG16 network.
Then, for the second experimental group, the number of generated region proposals is reduced from 2000 to 300, and the other parameters remain unchanged. The experimental results are shown in Figure 8. It can be seen from the figure that the recall rates of Selective Search drop greatly due to the reduction in the number of generated region proposals, which also greatly affects the recognition performance. However, the other two models based on the VGG16 network are less affected by the number of generated region proposals. Table 2 records the recall rates when 300 region proposals are generated with IoU values of 0.5, 0.7, and 0.9, respectively. As can be seen from the table, compared with the traditional VGG16 network, the improved model based on the multi-layer feature combination proposed in this paper achieves a certain degree of improvement in the recall rates, and the improvement is greater than that in the case of 2000 region proposals.
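For reference, the proposal recall reported in Tables 1 and 2 can be measured as follows: a ground truth counts as recalled if at least one generated proposal overlaps it with IoU at or above the chosen threshold. The helper below is a hypothetical sketch (reusing the iou helper sketched in Section 6.1), not the authors' evaluation code.

```python
def proposal_recall(proposals, ground_truths, iou_thr):
    """Fraction of ground truths covered by at least one proposal at the given IoU."""
    recalled = 0
    for gt in ground_truths:
        if any(iou(p, gt) >= iou_thr for p in proposals):
            recalled += 1
    return recalled / len(ground_truths) if ground_truths else 0.0

# Example usage: compare 300 vs. 2000 proposals at IoU = 0.5
# proposal_recall(props_300, gts, 0.5), proposal_recall(props_2000, gts, 0.5)
```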
To further verify the efficiency of the multi-layer feature combination in the Faster R-CNN model, we adopted the ResNet50 network to design the following three comparative experimental groups. The first group is trained and tested using the original ResNet50 network (denoted as R0). The second and third groups add the multi-layer feature combination to the ResNet50 network. In the second group, the features output by the third convolutional layer are combined with those output by the fourth convolutional layer after maximum pooling (denoted as P1). The third group adopts a convolutional layer with a 3 × 3 kernel and a stride of 2 for the pooling operation (denoted as P2). The final results are shown in Table 3. It can be seen from the table that the different multi-layer feature combination methods also have an impact on the recognition AP. The mAP of the second experimental group is increased by 0.29% compared with the first experimental group, and the mAP of the third experimental group is increased by 0.70% compared with the first experimental group.
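The difference between P1 and P2 is only the downsampling operation used before the channel-wise combination. The sketch below contrasts the two variants; the tensor shapes and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CombineP1(nn.Module):
    """P1: downsample the stage-3 features with max pooling, then concatenate."""

    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, f3, f4):                       # f3: stage-3 map, f4: stage-4 map
        return torch.cat([self.pool(f3), f4], dim=1)

class CombineP2(nn.Module):
    """P2: a 3x3 convolution with stride 2 acts as a learnable pooling operation."""

    def __init__(self, c3):
        super().__init__()
        # padding=1 keeps the halved spatial size aligned with the stage-4 map
        self.conv = nn.Conv2d(c3, c3, kernel_size=3, stride=2, padding=1)

    def forward(self, f3, f4):
        return torch.cat([self.conv(f3), f4], dim=1)
```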
Then, based on the original Faster R-CNN model (denoted as O1), the three models with the backbone networks VGG16, ResNet50, and ResNet101 are trained and tested after adding the multi-layer feature combination used in model P2 (denoted as M1). The results are shown in Table 4. In the VGG16 network, the mAP of vehicle-type recognition is increased by 0.57% after adopting the multi-layer feature combination. In the ResNet50 network, the mAP is increased by 0.71%, and in the ResNet101 network, the mAP is increased by 0.61%. The improvement in mAP shows that the multi-layer feature combination proposed in this paper can improve the vehicle-type recognition precision of the Faster R-CNN model.
Faster R-CNN with the ResNet50 network is used as the basic model for this experimental comparison. The first experimental group adopts the Faster R-CNN model with the ResNet50 network (denoted as O2) to train and test on the dataset. The second experimental group adds the contextual features on top of the first experimental group (denoted as N1), and the third experimental group adds the bounding box optimization on top of the second experimental group (denoted as N2). The AP of vehicle-type recognition in the three experimental groups is shown in Table 5. It can be concluded that the mAP is increased by 0.31% after adding the contextual features, and by 0.50% after adding both the contextual features and the bounding box optimization.
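The exact designs of the contextual features (N1) and the bounding box optimization (N2) are given in the method sections of this paper; purely as an illustration of the contextual-feature idea, a common realization is to pool a second, enlarged window around each proposal and concatenate it with the RoI features before the detection head (the bounding-box refinement step is omitted here). The enlargement factor, pooled size, and feature scale below are assumptions.

```python
import torch
import torchvision.ops as ops

def roi_with_context(feature_map, rois, enlarge=1.5, out_size=7, scale=1.0 / 16):
    """Pool both the proposal and an enlarged context window, then concatenate.

    feature_map: backbone feature map, Tensor[N, C, H, W]
    rois: Tensor[K, 5] rows of (batch_index, x1, y1, x2, y2) on the image scale
    """
    ctx = rois.clone()
    cx = (ctx[:, 1] + ctx[:, 3]) / 2
    cy = (ctx[:, 2] + ctx[:, 4]) / 2
    hw = (ctx[:, 3] - ctx[:, 1]) * enlarge / 2
    hh = (ctx[:, 4] - ctx[:, 2]) * enlarge / 2
    ctx[:, 1], ctx[:, 3] = cx - hw, cx + hw
    ctx[:, 2], ctx[:, 4] = cy - hh, cy + hh
    local = ops.roi_align(feature_map, rois, out_size, spatial_scale=scale)
    context = ops.roi_align(feature_map, ctx, out_size, spatial_scale=scale)
    return torch.cat([local, context], dim=1)   # fed to the detection head
```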
Finally, based on the multi-layer feature combination (M1), we added the contextual features and the bounding box optimization to the Faster R-CNN model for experiments (denoted as M2); the results of the vehicle-type recognition tests are shown in Table 6. Compared with the results in Table 4, the mAP of the VGG16 network is increased by 0.73% after adding the multi-layer feature combination, contextual features, and object bounding box optimization (compare M2 + VGG16 with O1 + VGG16), and by 0.16% after adding only the contextual features and bounding box optimization (compare M2 + VGG16 with M1 + VGG16). The mAP of the ResNet50 network is increased by 0.98% (compare M2 + ResNet50 with O1 + ResNet50) and by 0.27% (compare M2 + ResNet50 with M1 + ResNet50). The mAP of the ResNet101 network is increased by 1.72% (compare M2 + ResNet101 with O1 + ResNet101) and by 1.12% (compare M2 + ResNet101 with M1 + ResNet101). Some of the vehicle-type recognition results of the improved Faster R-CNN model are shown in Figure 9. It can be seen from the figure that the improved model proposed in this paper, based on Faster R-CNN, can better identify the vehicle types in images of actual complex traffic scenes.