A comparative analysis of three algorithms is conducted using the matched datasets. The first two approaches use object detection techniques to determine the scene class: a probabilistic classifier provides both the object bounding-box class and the scene class for each image. In contrast, the third approach is an end-to-end scene classifier that focuses solely on scene classification and does not rely on a bounding-box dataset. All datasets used in this study share consistent scene class labels for the same image frames. While the evaluation of the detection algorithms is confined to datasets A and B, the classification evaluation covers all three datasets. Overall, this study aims to compare the effectiveness of the three algorithms in scene classification.
5.3.1. Detection Algorithm Evaluation for Special Traffic Object Extraction
We evaluated two bounding-box datasets using the YOLOv5 object detection algorithm: one with separate bounding boxes (dataset A) and one with merged bounding boxes (dataset B). Although these datasets serve as an intermediate step, they are crucial to solving the final classification problem.
When detecting objects via bounding-box detection, we follow standard object detection practice in self-driving applications [22,23]. To confirm a true positive detection, we require a significant intersection-over-union (IoU) with the ground truth, using a threshold of 0.5. We deliberately set a relatively small IoU threshold because an overly precise bounding box is not necessary for scene classification. Our approach to bounding box detection optimizes network parameters for the highest precision while accepting sub-optimal recall, because a large number of false positives would harm scene classification by supplying incorrect prior information. It is therefore generally better to miss an uncertain but valid bounding-box cue than to detect it incorrectly, in order to minimize misclassification.
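The IoU-based matching criterion described above can be sketched as follows. This is an illustrative implementation only, not the exact code used in our pipeline; the function names and box format (corner coordinates) are our assumptions:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a true positive when IoU with the ground truth
    meets the threshold (0.5 here; 0.3 for the merged-box dataset)."""
    return iou(pred_box, gt_box) >= threshold
```

A detection whose IoU with every ground-truth box falls below the threshold is counted as a false positive, which is exactly the case the precision-oriented tuning above tries to suppress.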
As illustrated in
Figure 6 and
Table 4, the object detection algorithm performs better on the separate bounding box dataset (dataset A) for several reasons. First, this dataset permits clear instance identification and ensures that the bounding boxes are well aligned with the objects of interest. In contrast, the merged bounding boxes, particularly those encompassing both vehicles and pedestrians, exhibit notably poor IoU values; consequently, a lower IoU threshold of 0.3 is necessary for effective classification.
Moreover, merging the bounding boxes of associated objects into a single bounding-box dataset makes it difficult to achieve an even class distribution. This can degrade detection performance, especially for instances of the “Issued Vehicle + Pedestrian” class. Furthermore, the merged bounding box datasets primarily capture larger-scale target scenes, such as congested vehicles and issued vehicles accompanied by pedestrians.
5.3.2. Evaluation on Special Traffic Issue Classification
We conducted a comparative analysis of three distinct methods in terms of classification accuracy: separate bounding box inference (Algorithm A), merged bounding box inference (Algorithm B), and end-to-end classification-based inference (Algorithm C). To clearly differentiate these algorithms, we adopted the nomenclature of datasets A, B, and C, respectively. The accuracy of each module is presented in
Table 5.
The end-to-end module (dataset C) shows visibly different characteristics from the object-detection-based modules (datasets A and B). The end-to-end scene classification method achieves the highest average accuracy, 87.1%. Using a simple ResNet-34 backbone network and a classification head, the problem can be solved with greater accuracy and a simpler implementation. Moreover, its computational cost is lower than that of any object-detection-based algorithm. Replacing the backbone with ResNet-18 reduces the computational cost by roughly 14% but sacrifices 7% of classification accuracy. The end-to-end approach (dataset C) is well suited to C-ITS systems that require only scene class information and whose control tower can operate effectively at a throughput of one or two frames per second (fps). In situations where precise object recognition and positioning are not crucial, but delivering accurate alerts to the traffic control and management team is paramount, the end-to-end approach is the recommended choice.
In this section, no visual figure is provided for dataset C because it mainly involves image classification using numerical labels ranging from C1 to C4. “C” stands for “Class”, which corresponds to classes shown in
Table 2. It is important to note that the algorithm’s performance on dataset C, as shown in
Table 6, has already demonstrated a significantly high level of effectiveness.
On the other hand, if more specific information is required, such as the precise location of the target objects, object-detection-based algorithms are preferable because they both localize the target and provide the corresponding class label. Consequently, if the algorithm needs to operate in real time at video rates exceeding 10 fps, the YOLOv5-based object detection algorithm is a significantly better option. Datasets A and B encompass the same number of classes and employ the same probabilistic inference model, resulting in comparable testing times. Beyond average accuracy and processing time,
Table 6 offers a comprehensive breakdown of road condition issue cases, focusing on the classification results of special traffic issues. The table presents the confusion matrix for all dataset cases, with class labels ranging from class one to class four, and provides valuable insight into various road condition scenarios.
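The per-class breakdown reported in the confusion matrix can be reproduced with a simple routine such as the following illustrative sketch; the helper names are ours, and class labels C1 to C4 are assumed to be mapped to indices 0 to 3:

```python
def confusion_matrix(y_true, y_pred, num_classes=4):
    """Build a confusion matrix: rows are ground-truth classes,
    columns are predicted classes (C1..C4 mapped to indices 0..3)."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_accuracy(cm):
    """Recall per ground-truth class: diagonal count over row total."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(cm)]
```

Off-diagonal entries of `cm` correspond directly to the misclassification cases discussed below, such as normal vehicles confused with suspect vehicles.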
This study shows that the separate bounding box inference method performs well in traffic-congested scenarios and is less prone to misclassifying normal and suspect vehicles. Emergency vehicles are a strong cue that helps classify issued vehicles more accurately. This is demonstrated empirically in
Figure 7. On the other hand, the merged bounding box approach provides additional information about the overall traffic situation, including the presence of congested vehicles and drivers.
Moreover, the separate bounding box inference method (dataset A) performs best in situations involving emergency vehicles, work zones, and debris. As intended, this approach relies more heavily than the merged bounding box approach on the second-stage classifier, which operates on the output of the bounding box detection. The separate bounding box approach also has greater potential to compensate for object detection failures. For example, although POD detection achieves a relatively low score of 0.48 on small FODs because of their small size, a known weakness of YOLO-based methods, the scattered PODs compensate for one another, and dataset A demonstrates the best performance for this class.
An intriguing observation can be made from
Figure 8, which indicates satisfactory performance in the context of total roll-over conditions using the merged bounding box-based method (dataset B). The superior adaptability of the merged bounding box-based method can be attributed to its wider range of observation within the accident scene, which allows for a more detailed response to unforeseen scenarios. In contrast, the separate bounding box method (dataset A) is unable to accurately estimate the traffic event in cases wherein the target or supporting agents are not detected.
An interesting finding can be observed from
Figure 9, where the scene on the right is correctly classified despite the false detection of an accident event by the merged-object detection method (dataset B). Although the method’s understanding of the objects involved is incorrect, it still classifies the scene correctly because of the false detection of other accident-related objects. However, this classification is not reliable, as it rests on fortunate circumstances rather than accurate detection. Such coincidental false detections leading to correct classifications make a fair evaluation of the merged object-based algorithm difficult.