5.2.1. Experimental Setup
Since most of the traditional VMMR datasets contain a limited number of largely populated vehicle model classes, in a first experiment we propose to generate a set of small-scale VMMR training datasets, denoted S–51, where each dataset collects the most populated Top-51 vehicle model classes from one of the following datasets: Compcar [1], VMMRDB [4], and the new DVMM dataset; i.e., Compcar-51, VMMRDB-51, and DVMM-51. The neural network models are trained using the most representative classes from each dataset. The Compcar and VMMRDB datasets are rearranged to a make-and-model classification level.
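Selecting the Top-51 most populated classes amounts to a simple frequency count over the class labels. The sketch below is illustrative only; the helper name and its interface are ours, not taken from the paper.

```python
from collections import Counter

def top_k_classes(labels, k=51):
    """Return the k most populated vehicle-model classes, as used to
    build the S-51 subsets from a full dataset's label list."""
    return [cls for cls, _ in Counter(labels).most_common(k)]
```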
In a second experiment, we propose to evaluate the performance of the proposed 2B–2S framework when the neural network models are trained using large-scale VMMR datasets (containing hundreds of model classes), such as the traditional VMMRDB-495 dataset and the new DVMM dataset.
The neural network models are trained by randomly splitting the datasets into Training and Testing sets. For a fair comparison with the new DVMM dataset, vehicle models containing fewer than 100 image samples are also removed from the VMMRDB-495 Test set.
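The class filtering and random split described above can be sketched as follows. This is an illustrative reconstruction, assuming an in-memory list of (image path, model label) pairs; the function name and the 0.8 train ratio are placeholders (the paper's actual Training–Testing ratio is given in the original text, not reproduced here).

```python
import random
from collections import defaultdict

def filter_and_split(samples, min_samples=100, train_ratio=0.8, seed=0):
    """Drop vehicle-model classes with fewer than `min_samples` images
    (as done for the VMMRDB-495 Test set), then randomly split the
    remaining classes into Training and Testing subsets."""
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)

    rng = random.Random(seed)
    train, test = [], []
    for label, paths in by_class.items():
        if len(paths) < min_samples:
            continue  # under-populated class: removed entirely
        rng.shuffle(paths)
        cut = int(len(paths) * train_ratio)
        train += [(p, label) for p in paths[:cut]]
        test += [(p, label) for p in paths[cut:]]
    return train, test
```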
Table 2 shows the specifications of all the VMMR datasets used in this paper to carry out experiments. One can note that Compcar-51 [1], VMMRDB-51 [4], and DVMM-51 are small- and medium-scale datasets, while VMMRDB-495 [4] and the new DVMM are large-scale datasets.
The proposed 2B–2S framework is designed so that the make recogniser branch always employs a DenseNet201 backbone [17], while the model recogniser branch employs one of the AlexNet, ResNet50, or DenseNet201 backbones. The ResNet50 and DenseNet201 backbones are initialized using weights trained on the ImageNet dataset [33] for image classification, while AlexNet [15] is trained from scratch. The training image samples are resized to a fixed resolution and augmented using flipping and rotation operations. The AlexNet backbone, the ResNet50 backbone, and the Decision Module are trained using a batch size of 32 samples; due to the GPU memory limitation, the DenseNet201 backbone is trained using a batch size of 16 samples.
The proposed method is implemented in TensorFlow and optimized with the Adam optimizer [34]. In Table 3, we list the detailed training schema of the model trained on DVMM. At Stage 1, all the learning rates (lr) are multiplied by the corresponding weight factor after epoch 20; at Stage 2, the model learns with learning rates decayed after epoch 5. The experiments are performed on a machine equipped with an NVIDIA RTX 2080 Ti GPU.
In this work, the performance of the following methods is compared: (i) the proposed 2B–2S framework; and (ii) the corresponding one-branch model recogniser, 1B. The VMMR results of each method are reported at the following classification levels: (i) make; (ii) model. The accuracy, F1-score, G-score, and BDCM metrics are used to evaluate the performance of the VMMR solutions.
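Of these metrics, accuracy and the macro-averaged F1-score can be sketched directly (the G-score and BDCM are defined earlier in the paper and are not reproduced here). The implementation below is a minimal illustrative version, not the paper's evaluation code.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal
    weight, which does not let large classes dominate the score."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```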
5.2.2. Numerical Results and Analysis
In the proposed 2B–2S framework, the make recogniser branch always employs a DenseNet201 [17] backbone to assist the model recogniser branch, as DenseNet201 achieves the best make recognition results on the challenging datasets, as shown in Table 4.
Table 5 shows the vehicle make recognition results for the 1B method when employing a DenseNet201 [17] backbone trained on: (i) the three small- and medium-scale datasets, Compcar-51 [1], VMMRDB-51 [4], and the new DVMM-51; and (ii) the large-scale datasets, that is, VMMRDB-495 [4] and our new DVMM. One notes that, here, the network models are trained for the simpler vehicle make classification task rather than the vehicle model classification task, as seen from the number of vehicle classes given in Table 2. In Figure 5, we plot the accuracy of the make recogniser, the model recogniser, and the corresponding 2B–2S during the training process.
Table 6, Table 7 and Table 8 show the VMMR results on Compcar-51 [1], VMMRDB-51 [4], and the new DVMM-51, respectively. Table 9 and Table 10 show the VMMR results on VMMRDB-495 [4] and the new DVMM, respectively. Here, we study the performance of the model recogniser branch designed based on one of the following backbones: AlexNet, ResNet50, and DenseNet201. One can note that the vehicle model recogniser provides both the vehicle's make and model; hence, for both the 1B and 2B–2S solutions, the vehicle make classifications are also extracted from the vehicle model classifications.
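Extracting the make prediction from a model prediction is a deterministic lookup, since each fine-grained model class belongs to exactly one make. The mapping below is a hypothetical illustration (the class names are ours, not from the datasets).

```python
# Hypothetical model-to-make lookup; in practice this table is built
# from the dataset's make/model class hierarchy.
MODEL_TO_MAKE = {
    "honda_civic": "honda",
    "honda_accord": "honda",
    "ford_focus": "ford",
}

def make_from_model(model_preds):
    """Vehicle make labels implied by vehicle model predictions,
    as used to score 1B and 2B-2S at the make level."""
    return [MODEL_TO_MAKE[m] for m in model_preds]
```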
Table 11 indicates the model complexity of 1B and 2B–2S when using different model recogniser backbones. In all tables, the best results at the vehicle model classification level are marked using bold font, while the best results at the vehicle make classification level are marked using underlines.
Similarly, Figure 6, Figure 7 and Figure 8 show the BDCM visual results on the Compcar-51 [1], VMMRDB-51 [4], and new DVMM-51 datasets, respectively. Figure 9 and Figure 10 show the BDCM visual results on the VMMRDB-495 [4] and new DVMM datasets, respectively.
In Table 6, one can note that the proposed 2B–2S framework outperforms the corresponding 1B approach on the traditional Compcar-51 [1] dataset, for both vehicle make classification and vehicle model classification, and when employing any type of backbone for model recognition. Moreover, when the AlexNet backbone is employed, the make recognition accuracy based on the model recogniser is improved, the F1-score is also improved, and the G-score shows that the proposed 2B–2S framework reduces the classification confusion at both the make and model recognition levels. Similar results are also achieved when the ResNet50 [16] or DenseNet201 [17] backbone is employed. Overall, the best VMMR results on this dataset, in terms of both accuracy and F1-score, are obtained when both the make and model recogniser branches employ a DenseNet201 [17] backbone.
Figure 6 presents the visualization of the BDCM matrix, where one can note that the number of cases where 2B–2S provides a better result is much higher than the number of cases where 1B provides a better result, i.e., the number of blue dots is always larger than the number of red dots. This demonstrates the efficiency of the proposed 2B–2S framework, which introduces less inter-make ambiguity on the traditional Compcar-51 dataset. A similar conclusion can be drawn based on the VMMR results on the traditional VMMRDB-51 dataset and the new DVMM-51 dataset, which, however, cover the same, but limited, number of vehicle models as Compcar-51.
In Table 7 and Table 8, one can note that the same improvement trend is obtained on the traditional VMMRDB-51 dataset and the new DVMM-51 dataset, irrespective of the type of backbone employed for model recognition. The results show that: (a) the make recognition accuracy based on the model recogniser is improved; (b) the F1-score of the proposed 2B–2S framework is larger than the F1-score of 1B; (c) the G-score shows that the inter-make ambiguity is highly decreased for the proposed 2B–2S framework. Table 2 shows that both the DVMM-51 and VMMRDB-51 datasets contain many more samples than Compcar-51. Since, for each vehicle model class, more samples are available for training, the neural networks provide better results on these datasets than on the Compcar-51 dataset.
Figure 7 and Figure 8 present the visualization of the BDCM matrix, where one can note that the proposed 2B–2S framework continues to provide a better overall performance compared with 1B.
Table 9 shows the VMMR results on the traditional large-scale VMMRDB-495 [4] dataset. One can note that, based on the AlexNet [15] backbone, both the accuracy and the F1-score of the model recogniser are improved; similarly, the make recognition accuracy based on the model recogniser is also improved. The ResNet50 [16] and DenseNet201 [17] backbones show the same improvement trend. We have similar observations for Table 10, which reports results on the new DVMM dataset.
Compared with the three small-scale S–51 datasets, both VMMRDB-495 and DVMM are much larger and contain many more vehicle models. One can note that VMMRDB-495 and DVMM are much more complex and more diverse than the S–51 datasets. Therefore, when working with such datasets, the VMMR problem becomes more complex, as the network models are tested on more diverse data. Nevertheless, the G-score results over such datasets (reported in Table 9 and Table 10) demonstrate that the proposed 2B–2S framework outperforms the corresponding 1B model recogniser and reduces the ambiguity at both the make and model classification levels.
In Figure 9 and Figure 10, the proposed 2B–2S framework continues to provide an overall better inter-make ambiguity reduction compared to 1B, as the number of blue dots is always larger than the number of red dots. In Figure 11, we visualize the vehicle model confusion of 1B and the corresponding results of 2B–2S, which are corrected by fusing the accurate make information from the make recogniser branch. Therefore, our proposed 2B–2S method is able to improve the performance of the VMMR task by significantly reducing the classification confusion.
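The correction-by-fusion idea can be sketched as restricting the model recogniser's scores to the model classes consistent with the make branch's prediction. This is a simplified view under our own assumptions, not the paper's exact Decision Module; all names below are illustrative.

```python
import numpy as np

def fuse_make_model(model_scores, make_pred, model_makes):
    """Mask out model classes whose make disagrees with the
    make-branch prediction, then take the argmax over the rest.
    model_scores: per-model-class scores from the model branch.
    model_makes:  make label of each model class, same order."""
    model_scores = np.asarray(model_scores, dtype=float)
    mask = np.array([m == make_pred for m in model_makes])
    masked = np.where(mask, model_scores, -np.inf)
    return int(np.argmax(masked))
```

With this scheme, a confident make prediction can override an inter-make confusion of the model branch, which matches the reduction in classification confusion observed above.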
One can observe from Table 11 that the proposed two-branch architecture is more complex than the single-branch one. However, as the dataset diversity increases, the difference in performance between 1B and 2B–2S decreases; the latter observation motivates future research on 2B–2S, as detailed in Section 6.