
Complementary Learning for Real-World Model Failure Detection

Daniel Bogdoll1,2∗, Finn Sartoris1,2∗, Vincent Geppert1,2∗, Svetlana Pavlitska1,2, and J. Marius Zöllner1,2
∗ Authors contributed equally
1 FZI Research Center for Information Technology, Germany, bogdoll@fzi.de
2 KIT Karlsruhe Institute of Technology, Germany
Abstract

In real-world autonomous driving, deep learning models can experience performance degradation due to distributional shifts between the training data and the driving conditions encountered. As is typical in machine learning, it is difficult to acquire a large and representative labeled test set to validate models before deployment in the wild. In this work, we introduce complementary learning, where we use learned characteristics from different training paradigms to detect model errors. We demonstrate our approach by learning semantic and predictive motion labels in point clouds in a supervised and self-supervised manner and subsequently detecting and classifying model discrepancies. We perform a large-scale qualitative analysis and present LidarCODA, the first dataset with labeled anomalies in lidar point clouds, for an extensive quantitative analysis.

I Introduction

In autonomous driving, analyzing anomalies in the environment surrounding the vehicle is a well-established research field [1, 2, 3, 4, 5, 6, 7]. However, many issues are not induced by rare events in the environment, but by model failures, which can also occur in seemingly normal situations. For such cases, Heidecker et al. introduced method-layer corner cases [3], and other data-driven perspectives followed [8, 9, 10]. Model failures in autonomous driving are rarely detected during evaluation, as labeled validation and test splits are typically very small and not representative. However, large-scale unlabeled fleet data recordings are often available.

In order to detect failure modes, there are many active research areas. Active learning [11] is concerned with continuously enriching the training data by querying samples from a set of unlabeled data points. Discrepancies between different sensor systems can also be used to query samples [12]. In error estimation, many approaches try to utilize unlabeled test sets for the evaluation of models [13]. Label refinement compares given labels, e.g., generated by an auto-labeling process, with new proposals [14]. All of these methods have in common that they utilize or compare two or more different results on the same task. However, we are unaware of approaches that leverage different training paradigms in order to utilize different data characteristics. Our main contributions are:

  • Introduction and demonstration of complementary learning, leveraging complementary training paradigms to detect model failures

  • LidarCODA, the first real-world anomaly dataset for autonomous driving with labeled lidar data

Figure 1: Model Failure Detection. The top left point cloud shows a supervised and the top right a self-supervised motion segmentation. The supervised model falsely classifies the pedestrian in the front left as static. Our approach exposes this model failure, as highlighted in red in the bottom image.

II Related Work

Figure 2: Method Overview. Given point clouds, we derive semantic motion labels in a supervised (sv) fashion. In a complementary manner, we perform ground segmentation and derive predictive motion labels in a self-supervised (ssv) fashion. Subsequently, we perform point-wise discrepancy detection and classify potential model failures with an oracle.

The concept of comparing the outputs of two or more neural networks was already introduced in 1994 by Cohn et al., where they queried samples for active learning based on the disagreement between neural networks [15]. Since then, the variability in model predictions has been widely used to detect anomalies or errors. Ensemble diversity is especially well studied, as it was shown to lead to better performance [16], robustness [17], uncertainty quantification [18], and detection of outliers or distribution shifts [19, 20, 21]. While no uniform metric for ensemble diversity exists, measures like disagreement of models, double fault measure, or output correlation are widely used [22]. Ensemble diversity can be implicitly enhanced via random initialization [18], noise injection or dropout, or explicitly via bagging, boosting, or stacking. Compared to ensembles, mixtures of experts [23] include a learnable gate component for dynamic input routing. They enforce higher model specialization and thus more component diversity, leading to better detection of anomalous or out-of-distribution data [24].

The approaches mentioned above involve a combination of several neural networks with similar or identical architectures. Active learning is another research field interested in the detection of model failures. Here, uncertainty derived from ensembles is resource-intensive and thus only rarely used as part of a querying strategy [25, 26, 27]. Similarly, disagreements in a query-by-committee setting can be used to select samples [28]. In autonomous driving, contradicting detections from different sensors can also be used as triggers, e.g., when radar and camera detections do not match [12]. Discrepancies between teacher and student models, typically known from knowledge distillation, can also be utilized [29]. As test sets are often small and not representative, directly estimating the accuracy of a model with only unlabeled data is of high interest [13, 30, 31, 32]. Here, we often see simple classification tasks or approaches that estimate an overall error that cannot be applied to individual samples. In some cases, generated pseudo-labels are utilized for further training steps [33, 34].

Disagreements can also be used for detecting erroneous labels. Ground truth labels in large vision datasets are often error-prone when auto-labeling processes based on large models are employed [35]. Detecting label errors with disagreements can be done by predicting a novel or refined label, and uncertainties can be generated by predicting multiple such labels [36, 14, 37]. This way, noisy labels introduced by human error can also be detected [38].

Robustness during deployment is often achieved with sensor fusion, which, quite differently, purposefully aims to complement the weaknesses of one sensor with the strengths of another. Thus, disagreements are both typical and wanted, with the aim of resolving them [39]. However, data from a single sensor can also be split into multiple streams to increase robustness, e.g., by performing appearance- and geometric-based object detection [40]. In performance monitoring [41, 42], but also in outlier or anomaly detection [43], typically, a primary model performing a regular task is accompanied by a learned or model-based module that provides some sort of uncertainty for the results of the regular task.

Research Gap. Many of the analyzed works utilizing disagreements deal with toy problems and only analyze classification tasks, which are not sufficient to truly understand the shortcomings of a model that is designed for the complex task of autonomous driving. Many works analyze model outputs of the same architecture, leveraging differences during training. However, this way, the same data characteristics are being used during training. In autonomous driving, we see the largest potential for disagreement-based approaches in designing triggers for active learning [12] and increasing robustness during deployment [40], but these industry demonstrations are not accompanied by scientific works. Finally, to the best of our knowledge, no work exists that utilizes different training paradigms in order to detect model failures through disagreements.

III Method

We leverage complementary training paradigms on the same task in order to detect model failures and classify challenging scenarios with an oracle. The ability to detect model failures is based on the intuition that different training paradigms leverage different data characteristics from the same training data set. In this work, we demonstrate this approach with the motion segmentation of point clouds in the context of autonomous driving. As shown in Figure 2, we first derive motion labels from point clouds in a supervised and a self-supervised fashion. Here, the first paradigm leverages human knowledge, given only the context of static scenes. The second paradigm, in contrast, leverages temporal information inherent in the data. Typically, these paradigms are combined either in a pre-training concept [44] or with a combined loss during learning [45]. In our demonstrated use case, we examine motion labels in point clouds and focus on model disagreements. Based on a point-wise comparison, we detect discrepancies and cluster them for better interpretation. Finally, an oracle examines and classifies the model failures to better understand challenging situations. As one use case, our method, deployed within a query strategy in an active learning loop, can find informative samples to improve a model under test. All code is available on GitHub.

III-A Semantic Motion Labels

To derive semantic motion labels in point clouds, we first leverage a supervised semantic segmentation model [46] to determine whether a point belongs to a static or dynamic class. However, some classes do not provide clear information about the motion state of the points. For example, points assigned to the class bicyclist at a traffic light may be static in the case of a red light and dynamic in the case of a green light. We refer to such classes as potentially dynamic classes. By also performing supervised motion segmentation [47], we are able to further subdivide the existing classes. For example, the class bicyclist can be broken down into static bicyclist and dynamic bicyclist. Semantic classes that are static by definition remain unchanged. We refer to the semantic labels that are further subdivided based on their motion as semantic motion labels. An example of the resulting label fusion of semantic and motion labels to form semantic motion labels is shown in Figure 3.
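
The following minimal sketch illustrates this fusion step; the set of potentially dynamic classes and all function names are illustrative and simplified:

```python
import numpy as np

# Illustrative set of potentially dynamic semantic classes
POTENTIALLY_DYNAMIC = {"car", "truck", "other-vehicle", "person", "bicyclist", "motorcyclist"}

def fuse_semantic_motion_labels(semantic: np.ndarray, moving: np.ndarray) -> np.ndarray:
    """semantic: (N,) array of class names; moving: (N,) boolean motion mask."""
    fused = semantic.astype(object)
    for i, (cls, is_moving) in enumerate(zip(semantic, moving)):
        if cls in POTENTIALLY_DYNAMIC:
            fused[i] = ("dynamic " if is_moving else "static ") + cls
        # classes that are static by definition (road, building, ...) stay unchanged
    return fused
```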

Figure 3: Supervised Semantic Motion Labels. The left semantic segmentation allows no distinction between the parked car at the bottom left and the moving car at the top right. The middle image shows a motion segmentation, where the parked car was classified as static, and the moving car as dynamic. Finally, the right image shows the fused semantic motion labels. Adapted from [48].

III-B Predictive Motion Labels

In order to predict motion labels for a given point cloud, we first filter out the ground [49] to focus on objects in the scene, a common pre-processing step of scene flow models [50, 51, 52, 53, 54]. Based on self-supervised flow prediction [53] of the remaining points, we aim to derive motion labels, indicating whether a point is static or dynamic. The model takes consecutive point clouds as input and predicts the future motion for each lidar point in the form of a 3D displacement vector. The scene flow model does not distinguish between a point's own motion and the observer's ego-motion; it represents the overall motion of a point between two consecutive frames. In order to derive relative displacements, we need to correct for the ego-motion. This can be done by leveraging or learning odometry information [55]. After predicting the future point cloud $\hat{X}_{t+1} = X_t + f_t$, we apply the learned rigid body transformation $T_{t+1 \rightarrow t}$ of an odometry model, transforming the predicted point cloud back into the coordinate system of $X_t$. This gives the future point cloud $\tilde{X}_{t+1}$, which contains only the predicted relative motion without the ego-motion, as described by the following formula:

$$\tilde{X}_{t+1} = \left( T_{t+1 \rightarrow t} \cdot \begin{bmatrix} (X_t + f_t)^{T} \\ \mathbf{1} \end{bmatrix} \right)^{T} \qquad (1)$$

As a result, static objects line up closely with the original data of $X_t$, and only dynamic objects show a predicted displacement, as shown in Figure 4 a).
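
A minimal sketch of the ego-motion compensation in Equation (1), assuming a 4×4 homogeneous transformation from the odometry model (variable and function names are illustrative):

```python
import numpy as np

def compensate_ego_motion(X_t: np.ndarray, f_t: np.ndarray, T: np.ndarray) -> np.ndarray:
    """X_t, f_t: (N, 3) points and flow; T: 4x4 rigid transform from frame t+1 to t."""
    X_hat = X_t + f_t                                         # predicted future point cloud
    homog = np.hstack([X_hat, np.ones((X_hat.shape[0], 1))])  # (N, 4) homogeneous coordinates
    X_tilde = (T @ homog.T).T[:, :3]                          # apply T_{t+1 -> t}, drop w
    return X_tilde

# Static points in X_tilde line up closely with X_t; the residuals
# np.linalg.norm(X_tilde - X_t, axis=1) are large only for dynamic points.
```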

Two-Stage Clustering. An analysis of the velocity values of the flow predictions showed that separating static from dynamic points in a point-wise fashion is infeasible, as the velocity distributions overlap strongly. What we found, however, is a significant difference when considering instance-wise normalized standard deviations, as shown in Figure 5. While we performed this analysis with ground-truth labels, for unlabeled data, instance clusters first need to be formed.

To obtain such instances, we utilize DBSCAN [56] to spatially cluster the point clouds, as shown in Figure 4 b). A cluster is classified as potentially dynamic if the normalized standard deviation of the cluster's velocity is less than 0.12. This threshold was identified through a grid search, leading to the highest IoU for dynamic points using ground truth from SemanticKITTI [57]. In order to further reduce the number of false positives, in a second stage, we cluster the potentially dynamic points based on their flow vectors. This way, points that have a similar flow are clustered, causing fewer static clusters to be incorrectly classified as dynamic. Spatially separated points can now be grouped together based on their flow. An example of this step can be seen in Figure 4 c). Here, the points on the left and right edges belonging to static objects now form a cluster, which changes the distribution of flow vectors within this cluster, causing it to be classified as static. Finally, the newly found clusters are classified as dynamic if the speed of the cluster is above 4 km/h, based on typical velocity profiles of pedestrians, as visible in Figure 4 d).
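
The two-stage clustering can be sketched as follows; the thresholds (0.12 and 4 km/h) follow the description above, while the DBSCAN parameters, the assumed frame rate, and all names are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

DT = 0.1  # assumed lidar frame interval in seconds (10 Hz)

def predictive_motion_labels(points, flow, std_thresh=0.12, speed_thresh_kmh=4.0):
    """points, flow: (N, 3) arrays; returns a boolean mask of dynamic points."""
    speed = np.linalg.norm(flow, axis=1) / DT * 3.6       # per-point speed in km/h
    dynamic = np.zeros(len(points), dtype=bool)

    # Stage 1: spatial clustering; keep clusters whose normalized velocity
    # standard deviation lies below the threshold as potentially dynamic
    spatial = DBSCAN(eps=0.7, min_samples=10).fit_predict(points)
    candidates = np.zeros(len(points), dtype=bool)
    for cid in set(spatial) - {-1}:
        mask = spatial == cid
        norm_std = speed[mask].std() / (speed[mask].mean() + 1e-6)
        if norm_std < std_thresh:
            candidates |= mask

    # Stage 2: re-cluster the candidates by their flow vectors, then keep
    # clusters whose mean speed exceeds the threshold as dynamic
    if candidates.any():
        cand_idx = np.flatnonzero(candidates)
        flow_clusters = DBSCAN(eps=0.1, min_samples=10).fit_predict(flow[candidates])
        for cid in set(flow_clusters) - {-1}:
            idx = cand_idx[flow_clusters == cid]
            if speed[idx].mean() > speed_thresh_kmh:
                dynamic[idx] = True
    return dynamic
```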

Figure 4: Self-Supervised Predictive Motion Labels. The first image shows the original point cloud in green and the point cloud transformed by the scene flow model and compensated by the ego-motion using the odometry estimation in red. The second and third images show the result of the spatial and flow-based clustering, respectively. The fourth image shows the final predictive motion labels, with dynamic points in red and static points in green. Reprinted from [48].

III-C Discrepancy Detection and Failure Classification

After obtaining motion labels from both the supervised and the self-supervised stream, we are now interested in detecting contradictions between the labels, as shown in Figure 2. Given a semantic and a predictive motion label for each lidar point, there exist four different categories, to each of which we assign a color for visual inspection. Points that both models deem static are colored green, and points that both models deem dynamic blue. For contradictory points, we color those red where the supervised stream predicts a static point and the self-supervised stream a dynamic one, and those yellow where the opposite is the case. To help a subsequent human oracle better understand a scene, we provide a visual inspection tool that maps the lidar points onto the corresponding RGB image for improved scene understanding, as shown in Figure 6. Finally, we cluster instances with contradicting labels, so a human oracle can classify complete scenes as well as single instances.
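
A minimal sketch of this point-wise categorization; the color mapping follows the description above, while function and variable names are illustrative:

```python
import numpy as np

def discrepancy_categories(sv_dynamic: np.ndarray, ssv_dynamic: np.ndarray) -> np.ndarray:
    """Both inputs are (N,) boolean masks: True where the stream labels a point as dynamic."""
    categories = np.empty(len(sv_dynamic), dtype=object)
    categories[~sv_dynamic & ~ssv_dynamic] = "green"   # both static: agreement
    categories[sv_dynamic & ssv_dynamic] = "blue"      # both dynamic: agreement
    categories[~sv_dynamic & ssv_dynamic] = "red"      # sv static, ssv dynamic: discrepancy
    categories[sv_dynamic & ~ssv_dynamic] = "yellow"   # sv dynamic, ssv static: discrepancy
    return categories
```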

III-D Implementation Details

For all models shown in Figure 2, we utilized available architectures from the literature. We trained the supervised semantic segmentation model SalsaNext [46] and the supervised motion segmentation model of Chen et al. [47] on the KITTI-360 dataset [58], as it is a large dataset that contains semantic labels, motion labels, and odometry data. However, KITTI-360 only provides ground truth for accumulated point clouds and not for raw scans. To obtain ground truth labels for the raw scans, the labels for the raw point clouds were recovered with a nearest neighbor search. Here, we improved the work of Sanchez [59] in order to also recover labels of dynamic traffic participants. As a result, instead of one billion labels of the accumulated point clouds, about 6.9 billion labels of the raw scans were used for training. The training was performed on an NVIDIA RTX A6000. Hyperparameters were taken from the original papers [46, 47]. More details can be found in [48].
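
The label recovery can be sketched as a nearest-neighbor lookup from the labeled accumulated point cloud to each raw scan; the distance threshold and all names below are illustrative and assume both clouds share the same world coordinate frame:

```python
import numpy as np
from scipy.spatial import cKDTree

def recover_raw_scan_labels(accumulated_xyz: np.ndarray,
                            accumulated_labels: np.ndarray,
                            raw_scan_xyz: np.ndarray,
                            max_dist: float = 0.2) -> np.ndarray:
    """Assign each raw-scan point the label of its nearest accumulated point."""
    tree = cKDTree(accumulated_xyz)
    dist, idx = tree.query(raw_scan_xyz, k=1)
    labels = accumulated_labels[idx]
    labels[dist > max_dist] = -1   # no sufficiently close labeled point: leave unlabeled
    return labels
```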

For the remaining three models, we used available pre-trained model weights. For ground segmentation, we deployed GndNet [49]. We chose FlowStep3D [53] as the self-supervised scene flow model because, at the time of implementation, it had the lowest outlier rate among self-supervised scene flow models on the KITTI scene flow dataset [60, 61]. The pre-trained model was trained on synthetic data of the FlyingThings3D [62] dataset with 8,192 points per point cloud. To minimize the domain shift mentioned by Baur et al. [63] between this synthetic training data with 8,192 points and the inference data with raw scans of about 120,000 points from a Velodyne HDL-64E scanner, the following preprocessing steps were necessary. First, only the lidar points within the field of view of the forward-looking camera were considered. Then, we removed points farther away than 35 m and excluded points classified as ground. For the self-supervised odometry model, we deployed DeLORA [55].
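
These preprocessing steps can be sketched as follows, assuming a 3×4 lidar-to-image projection matrix and a boolean ground mask from the ground segmentation; image dimensions, parameters, and names are illustrative:

```python
import numpy as np

def preprocess_for_scene_flow(points, P, ground_mask,
                              img_w=1242, img_h=375, max_range=35.0):
    """Keep non-ground points within 35 m that fall into the camera field of view."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    uvw = (P @ homog.T).T                     # project lidar points into the image plane
    in_front = uvw[:, 2] > 0
    z = np.where(in_front, uvw[:, 2], 1.0)    # avoid division by zero behind the camera
    u, v = uvw[:, 0] / z, uvw[:, 1] / z
    in_fov = in_front & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    in_range = np.linalg.norm(points, axis=1) <= max_range
    keep = in_fov & in_range & ~ground_mask
    return points[keep]
```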

Figure 5: Self-Supervised Label Generation. The left graph shows that the magnitude of point-wise flow vectors is not sufficient to distinguish between dynamic and static points. The boxplot on the right shows that the normalized standard deviation per instance is significantly lower for dynamic instances than for static instances. Ground truth from SemanticKITTI [57] was used to classify the points into dynamic and static. Reprinted from [48].
Figure 6: Supervised Model Failures. Based on our visual inspection tool, a human oracle can analyze contradicting model outputs and classify failure cases. In these exemplary images, we show model failures of the supervised stream, which can be detected due to contradicting outputs of the self-supervised model. Adapted from [48].

IV Evaluation

In the first part of the evaluation, we incorporate a human oracle for model failure classification. In the second part, we analyze the sensitivity of our approach towards anomalies in the environment.

IV-A Model Failure Classification

We conduct an extensive analysis of our approach by examining the outputs of the models. Instead of analyzing only scenes with detected discrepancies, we manually examined over 20,000 frames to better understand the model behaviors and the scenes they react to.

Evaluation Dataset. To minimize perceptual failures due to a domain shift, it is important to choose an evaluation dataset similar to the training dataset. Both different weather conditions and differences between datasets from different countries can lead to domain shifts [3]. However, classical evaluation splits are too small to truly understand the performance of our approach. Thus, we chose KITTI Odometry [64] for evaluation. KITTI-360 [58], the dataset we trained on, and KITTI Odometry are closely related, as both were captured in Karlsruhe, Germany, with a Velodyne HDL-64E lidar. Since we used KITTI-360 for training, and DeLORA was trained on sequences 00-08 of the KITTI Odometry dataset, we used the remaining sequences 09-21 for the evaluation. We used sequences 09 and 10 to quantitatively examine the motion segmentation performance of each stream separately, with SemanticKITTI [57] labels as ground truth. To investigate model failures, sequences 11-21 were used for qualitative analysis. As shown in Figure 7, we observe that the majority of points are predicted as static by both models and around 5 % of the points show model contradictions. Due to the pre-processing for the scene flow model, the self-supervised stream predicts a label for significantly fewer points than the supervised stream, which labels all points in a lidar scan. In both analyses, only the lidar points per frame for which both streams predicted a label were considered.

Oracle. Sequences 11-21 of the KITTI Odometry dataset were used for the qualitative analysis, comprising a total of 20,350 frames. Based on the visual inspection tool introduced in Section III-C, we present representative examples of detected model failures. In most cases, the models were correctly consistent, especially due to the high number of static points. In these areas, no improvement is necessary. Next, we observe model failures of the supervised stream, which is the model under test in most cases. Here, the two streams contradict each other, and we were able to detect the failure through the correct prediction of the self-supervised stream. We show representative examples in Figure 6. Scene 1 shows a turning car and two moving bicyclists, where one bicyclist is wrongly labeled as static by the supervised stream. Scene 2 contains two walking pedestrians that are wrongly classified as static by the supervised stream. Scene 3 shows a parked car misclassified as dynamic by the supervised stream. Scenes 4 and 5 show a car moving slowly and a car moving backward, respectively. These rare cases also lead to model failures. These cases effectively demonstrate that our approach enables the detection of both regular and rare, challenging scenarios that lead to model failures, which could not have been detected in a small test split. This way, new samples can be collected for labeling to further improve the model under test. We found various weak points in each stream, characterized by repeated occurrence. Specifically, the supervised model under test has weaknesses in distinguishing between dynamic and static objects in specific situations, e.g., at red lights or when a car is parked directly in front of the ego vehicle. Examples of such situations are given in scenes 6 and 7 of Figure 6.

Figure 7: Discrepancy Detection. Given mostly normal data in the KITTI Odometry dataset, the two streams agree on 95 % of the data (green, blue). In light of anomalies, as provided by the CODA datasets, many more discrepancies (red, yellow) are detected.

Next, we show scenarios where the self-supervised stream showed model failures, detected by correct predictions of the supervised stream. Figure 8 shows representative scenes of this category. Scene 1 contains two distant walking pedestrians, wrongly classified as static by the self-supervised stream. In scene 2, a parked car is misclassified as dynamic by the self-supervised stream. Scenes 3, 4, and 5 show walking pedestrians or moving cars incorrectly classified as static. These cases demonstrate that our approach enables the detection of challenging temporal scenarios. Often, models under test are fine-tuned variants of models trained in a self-supervised fashion. This way, additional training data can be collected for improved pre-training, and no labels are required. The self-supervised stream classifies an above-average number of objects as dynamic when the ego-vehicle turns or goes over a speed bump. An example is shown in Figure 8, where in scene 6, the vehicle turns, and in scene 8, it drives over a speed bump. Another weak point is fast oncoming vehicles on highways, which are often classified as static, as can be seen in scene 7. Finally, a common weakness is small clusters on the right or left edge that are incorrectly classified as dynamic, as in scene 9, where points of a window are classified as dynamic.

Finally, also cases occur where the models are incorrectly consistent. For this category, the two streams agree, but the label is incorrect in both cases. We show examples in Figure 10. Here, the left scene shows two walking pedestrians that are incorrectly classified as static and the right scene shows a parked car classified as dynamic.

Figure 8: Self-Supervised Model Failures. Based on our visual inspection tool, a human oracle can analyze contradicting model outputs and classify failure cases. In these exemplary images, we show model failures of the self-supervised model, which can be detected due to contradicting outputs of the supervised stream. Adapted from [48].

IV-B Sensitivity towards Anomalies

Generally, we are interested in scenarios where models fail. It is well known that perception models often struggle most with anomalies in the environment. Thus, we also examine the sensitivity of our approach towards object-level anomalies in the environment around the ego vehicle, as these are the most commonly examined type [65, 66].

Evaluation Data. While our goal in the first part of the evaluation was to minimize the domain gap between training and evaluation data, there are no labeled anomalies in the original KITTI dataset [64]. Thus, we utilized data from CODA [67], the only real-world dataset that provides lidar data and includes anomalies [68]. The CODA dataset provides anomaly labels for objects based on three existing datasets: KITTI [64], ONCE [69], and nuScenes [70].

Figure 9: LidarCODA. Annotated lidar scenes in the three data splits ONCE, KITTI, and nuScenes, from left to right. Anomalies are shown in red. Reprinted from [71].

For the CODA-KITTI split, the authors manually reviewed all misc labels available in the ground truth and relabeled some as anomalies according to a labeling policy. Since we trained our models on KITTI-360, this enables us to quantitatively examine our approach with only a small domain gap. For CODA-nuScenes, the authors similarly adopted available annotations in a manual process. Finally, for CODA-ONCE, they deployed an automated anomaly detection approach, making this subset the most relevant. CODA includes 1,500 scenes with a total of 5,937 anomaly instances. Among those, 4,746 belong to the superclass traffic_facility, followed by 929 vehicle and 197 obstruction instances. Most vehicle instances (396) can be found in CODA-KITTI. In Figure 7, we show the outputs of our approach on the CODA subsets, also in comparison to the outputs on the KITTI dataset. We can observe many more detected discrepancies, which is in line with the much higher number of anomalies, but the subsets reveal strongly varying behavior patterns.

For a quantitative analysis, however, labeled anomalies in 3D are required. The original CODA dataset provides anomaly labels only in the form of 2D bounding boxes in images. Therefore, we present LidarCODA, a dataset based on CODA [67], for evaluation. Based on a frustum-based filter, subsequent clustering, and manual inspection, we transferred the original 2D labels from image space into refined, point-wise 3D labels that go beyond the coarse characteristic of the provided bounding boxes. More details can be found in [71]. LidarCODA is the first real-world anomaly dataset with annotated lidar data, as shown in Figure 9. Here, the different lidar systems used also become clearly visible. Due to the sparse point clouds of nuScenes, many small or distant anomalies labeled in image space are covered by only a few or no lidar points.
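
The label transfer can be sketched as follows; selecting the closest cluster within the frustum is a simplification of our pipeline, which additionally relies on manual inspection, and all parameters and names are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def frustum_anomaly_points(points: np.ndarray, P: np.ndarray, box_2d) -> np.ndarray:
    """Return a boolean mask of points assigned to the anomaly behind a 2D box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box_2d
    uvw = (P @ np.hstack([points, np.ones((len(points), 1))]).T).T
    valid = uvw[:, 2] > 0
    z = np.where(valid, uvw[:, 2], 1.0)
    u, v = uvw[:, 0] / z, uvw[:, 1] / z
    in_box = valid & (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)   # frustum membership

    mask = np.zeros(len(points), dtype=bool)
    idx = np.flatnonzero(in_box)
    if len(idx) == 0:
        return mask
    clusters = DBSCAN(eps=0.5, min_samples=5).fit_predict(points[idx])
    best, best_dist = None, np.inf
    for cid in set(clusters) - {-1}:
        cluster_idx = idx[clusters == cid]
        dist = np.linalg.norm(points[cluster_idx], axis=1).mean()
        if dist < best_dist:          # foreground object: closest cluster within the frustum
            best, best_dist = cluster_idx, dist
    if best is not None:
        mask[best] = True
    return mask
```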

Dataset #Frames mIoU ↑ mP ↑ mR ↑ mF1 ↑
LidarCODA 1,412 8.9 13.2 26.2 17.5
LidarCODA-ONCE 1,034 8.9 14.0 27.1 18.4
LidarCODA-KITTI 307 10.9 13.3 29.0 18.3
LidarCODA-nuScenes 71 0.4 0.7 1.0 0.8
TABLE I: LidarCODA Subsets. Evaluation of our approach on LidarCODA and its three subsets. All metrics in %. Adapted from [71].

Anomalies. With LidarCODA, we provide ground truth for object-level anomalies [1] in lidar data. Since our approach is not specific to object-level anomalies but shows its strength in underrepresented or generally challenging or atypical scenarios, which can also include scene- or scenario-level anomalies, we are primarily interested in the sensitivity of our approach towards these specific anomalies. This is different from works in anomaly detection, which aim to detect as many anomalies as possible. As our approach assigns distinct labels per point, we follow the evaluation protocols of semantic segmentation tasks rather than those of anomaly detection tasks, which require an anomaly score per point. First, we need to better understand the suitability of the LidarCODA subsets, given the domain shifts introduced either by new environments or by new sensor setups. Table I shows the evaluation results on the individual subsets. Here, we evaluated all points of the lidar point cloud, even if our approach did not label individual points, e.g., because they were filtered during pre-processing. Such cases were counted as false negatives if an anomaly was missed. This way, we fairly incorporate the limits of our approach and do not create any requirements on the analyzed point clouds. The results clearly show that our approach struggles with the nuScenes subset, which is primarily due to the large domain shift w.r.t. the sensor setup. The subsets ONCE and KITTI, however, show more promising results, as here, our approach is capable of detecting anomalies. This allows for further investigation. As nuScenes is by far the smallest subset, its impact on the evaluation is rather limited.
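
This evaluation protocol can be sketched as follows; the encoding of unlabeled points and all names are illustrative:

```python
import numpy as np

def anomaly_segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred: (N,) in {1: anomaly, 0: normal, -1: not labeled}; gt: (N,) boolean anomaly mask."""
    tp = np.sum((pred == 1) & gt)
    fp = np.sum((pred == 1) & ~gt)
    fn = np.sum((pred != 1) & gt)      # unlabeled points (-1) on anomalies count as false negatives
    iou = tp / max(tp + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"IoU": iou, "P": precision, "R": recall, "F1": f1}
```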

Next, we investigate the sensitivity towards the superclasses provided by CODA. Here, we only evaluated points that have labels assigned by both our approach and the ground truth. As shown in Table II, our approach shows different levels of sensitivity for different types of anomalies. While animals were not detected at all, our approach is more sensitive to cyclists and objects of the misc class. The misc class consists of objects that are “unrecognizable or difficult to categorize” [67]. The results are difficult to interpret, however, as CODA defines an anomaly as an object that “blocks or is about to block a potential path of the self-driving vehicle” [67] and/or “does not belong to any of the common classes of autonomous driving benchmarks” [67]. This risk-aware definition is not always in line with the methodology of our approach, as objects that block the path in front of the ego vehicle are not necessarily hard to segment or predict. Regardless, we can conclude that our methodology shows a heightened sensitivity towards cyclists, who are sometimes difficult to recognize or predict, and towards anomalies that are generally difficult to categorize.

V Conclusion

Superclass #Instances mIoU ↑ mP ↑ mR ↑ mF1 ↑
Pedestrian 16 33.9 44.1 37.3 40.4
Cyclist 22 41.6 58.3 49.5 53.5
Vehicle 736 33.0 48.3 41.0 44.4
Animal 5 0.0 0.0 0.0 0.0
Traffic facility 3,360 28.6 39.9 33.9 36.7
Obstruction 125 20.5 34.0 22.9 27.4
Misc 15 36.7 60.7 37.9 46.7
TABLE II: CODA Superclasses. Evaluation of our approach on seven superclasses. All metrics in %. Adapted from [71].

In this work, we have presented complementary learning to detect real-world model failures. We leverage complementary training paradigms to detect contradicting outputs on the same task. This way, we can detect the failure modes of a model in the real world, far beyond the limited scope of a test split. We demonstrate our approach with motion segmentation of point clouds for autonomous driving, where we leverage a supervised stream for semantic motion labels and a self-supervised stream for predictive motion labels. By inspecting over 20,000 frames, we showed that our approach exposes model failures in repeating scenarios, which allows for the classification of model failures. In addition, we provide LidarCODA, the first anomaly detection benchmark with labeled lidar point clouds, to perform a quantitative analysis. We showed that our approach has an increased sensitivity towards often hard-to-detect and hard-to-predict cyclists as well as hard-to-categorize objects. In the future, we are interested in the effects of including challenging data points in the re-training of our models, and in potentially fine-tuning our self-supervised stream with the supervised one.

Limitations. When both streams are wrong, model failures go undetected, as shown in Figure 10. This behavior is known and unavoidable [21, 33]. Thus, our approach should not be understood as a stand-alone solution but can be seen as one of many triggers to detect challenging scenarios. In addition, our self-supervised stream depends on a clustering strategy, which can lead to faulty clusters and misclassifications.

Acknowledgment

This work results partly from the jbDATA project supported by the German Federal Ministry for Economic Affairs and Climate Action (BMWK), grant number 19A23003H.

Figure 10: Simultaneous Model Failures. Examples where both streams produce model failures. In both cases, the streams are consistent, but their labels are incorrect. Reprinted from [48].

References

  • [1] J. Breitenstein, J. A. Termohlen, D. Lipinski, and T. Fingscheidt, “Systematization of Corner Cases for Visual Perception in Automated Driving,” in Intelligent Vehicles Symposium (IV), 2020.
  • [2] J. Breitenstein, J.-A. Termöhlen, D. Lipinski, and T. Fingscheidt, “Corner cases for visual perception in automated driving: Some guidance on detection approaches,” arXiv preprint:2102.05897, 2021.
  • [3] F. Heidecker, J. Breitenstein, K. Rösch, J. Löhdefink, C. Stiller, T. Fingscheidt, and B. Sick, “An Application-Driven Conceptualization of Corner Cases for Perception in Highly Automated Driving,” in Intelligent Vehicles Symposium (IV), 2021.
  • [4] D. Bogdoll, J. Breitenstein, F. Heidecker, M. Bieshaar, B. Sick, T. Fingscheidt, and J. M. Zöllner, “Description of Corner Cases in Automated Driving: Goals and Challenges,” in International Conference on Computer Vision (ICCV) Workshops, 2021.
  • [5] D. Bogdoll, M. Nitsche, and J. M. Zöllner, “Anomaly Detection in Autonomous Driving: A Survey,” in Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022.
  • [6] L. Vater, M. Sonntag, J. Hiller, P. Schaudt, and L. Eckstein, “A Systematic Approach Towards the Definition of the Terms Edge Case and Corner Case for Automated Driving,” in International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), 2023.
  • [7] H. X. Liu and S. Feng, “Curse of rarity for autonomous vehicles,” Nature Communications, vol. 15, no. 1, 2024.
  • [8] J. Pfeil, J. Wieland, T. Michalke, and A. Theissler, “On Why the System Makes the Corner Case: AI-based Holistic Anomaly Detection for Autonomous Driving,” in Intelligent Vehicles Symposium (IV), 2022.
  • [9] J. Zhou and J. Beyerer, “Corner cases in data-driven automated driving: Definitions, properties and solutions,” in Intelligent Vehicles Symposium (IV), 2023.
  • [10] F. Heidecker, M. Bieshaar, and B. Sick, “Corner Case Definition in Machine Learning Processes for the Perception of Highly Automated Driving,” AI Perspectives & Advances, vol. 6, 2024.
  • [11] P. Kumar and A. Gupta, “Active Learning Query Strategies for Classification, Regression, and Clustering: A Survey,” Journal of Computer Science and Technology, vol. 35, no. 4, 2020.
  • [12] A. Karpathy, “Tesla Keynote: CVPR 2021 Workshop on Autonomous Driving,” https://www.youtube.com/watch?v=g6bOwQdCJrc, 2021, accessed: 2024-06-13.
  • [13] W. Deng and L. Zheng, “Are Labels Always Necessary for Classifier Accuracy Evaluation?” in Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [14] M. Schubert, T. Riedlinger, K. Kahl, D. Kröll, S. Schoenen, S. Šegvić, and M. Rottmann, “Identifying Label Errors in Object Detection Datasets by Loss Inspection,” in Winter Conference on Applications of Computer Vision (WACV), 2024.
  • [15] D. Cohn, L. Atlas, and R. Ladner, “Improving generalization with active learning,” Machine Learning, vol. 15, no. 2, 1994.
  • [16] T. G. Dietterich, “Ensemble methods in machine learning,” in International workshop on multiple classifier systems.   Springer, 2000.
  • [17] T. Pang, K. Xu, C. Du, N. Chen, and J. Zhu, “Improving adversarial robustness via promoting ensemble diversity,” in International Conference on Machine Learning (ICML), 2019.
  • [18] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in Conference on Neural Information Processing Systems (NeurIPS), 2017.
  • [19] H. A. Mehrtens, C. González, and A. Mukhopadhyay, “Improving robustness and calibration in ensembles with diversity regularization,” in German Conference on Pattern Recognition (GCPR), 2022.
  • [20] M. Pagliardini, M. Jaggi, F. Fleuret, and S. P. Karimireddy, “Agree to disagree: Diversity through disagreement for better transferability,” in International Conference on Learning Representations (ICLR), 2023.
  • [21] L. Wang, J. Wang, Y. Zheng, S. Jain, C.-C. M. Yeh, Z. Zhuang, J. Ebrahimi, and W. Zhang, “Learning from Disagreement for Event Detection,” in International Conference on Big Data, 2022.
  • [22] L. I. Kuncheva and C. J. Whitaker, “Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy,” Machine learning, vol. 51, 2003.
  • [23] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, 1991.
  • [24] S. Pavlitskaya, C. Hubschneider, and M. Weber, “Evaluating mixture-of-experts architectures for network aggregation,” in Deep Neural Networks and Data for Automated Driving: Robustness, Uncertainty Quantification, and Insights Towards Safety, 2022.
  • [25] C. Schröder, A. Niekler, and M. Potthast, “Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers,” in Findings of the Association for Computational Linguistics, 2022.
  • [26] A. Roitberg, Z. Al-Halah, and R. Stiefelhagen, “Informed democracy: Voting-based novelty detection for action recognition,” in British Machine Vision Conference (BMVC), 2018.
  • [27] S. W. Yahaya, A. Lotfi, and M. Mahmud, “A consensus novelty detection ensemble approach for anomaly detection in activities of daily living,” Applied Soft Computing, vol. 83, 2019.
  • [28] H. Hino and S. Eguchi, “Active Learning by Query by Committee with Robust Divergences,” Information Geometry, vol. 6, 2023.
  • [29] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning Efficient Object Detection Models with Knowledge Distillation,” in Conference on Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
  • [30] R. Peng, H. Zou, H. Wang, Y. Zeng, Z. Huang, and J. Zhao, “Energy-based Automated Model Evaluation,” 2024.
  • [31] C. Baek, Y. Jiang, A. Raghunathan, and Z. Kolter, “Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift,” in Conference on Neural Information Processing Systems (NeurIPS), 2022.
  • [32] J. Chen, F. Liu, B. Avci, X. Wu, Y. Liang, and S. Jha, “Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles,” in Conference on Neural Information Processing Systems (NeurIPS), 2021.
  • [33] J. Wang, L. Wang, Y. Zheng, C.-C. M. Yeh, S. Jain, and W. Zhang, “Learning-From-Disagreement: A Model Comparison and Visual Analytics Framework,” Transactions on Visualization and Computer Graphics, vol. 29, no. 9, 2023.
  • [34] Y. Yu, Z. Yang, A. Wei, Y. Ma, and J. Steinhardt, “Predicting Out-of-Distribution Error with the Projection Norm,” in International Conference on Machine Learning (ICML), 2022.
  • [35] K. Chachuła, J. Łyskawa, B. Olber, P. Frątczak, A. Popowicz, and K. Radlak, “Combating noisy labels in object detection datasets,” arXiv:2211.13993, 2023.
  • [36] Z. Hu, K. Gao, X. Zhang, J. Wang, H. Wang, and J. Han, “Probability Differential-Based Class Label Noise Purification for Object Detection in Aerial Images,” Geoscience and Remote Sensing Letters, vol. 19, 2022.
  • [37] A. Bär, J. Uhrig, J. P. Umesh, M. Cordts, and T. Fingscheidt, “A Novel Benchmark for Refinement of Noisy Localization Labels in Autolabeled Datasets for Object Detection,” in Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023.
  • [38] C. Northcutt, L. Jiang, and I. Chuang, “Confident Learning: Estimating Uncertainty in Dataset Labels,” Journal of Artificial Intelligence Research, vol. 70, 2021.
  • [39] K. Kowol, M. Rottmann, S. Bracke, and H. Gottschalk, “YOdar: Uncertainty-based Sensor Fusion for Vehicle Detection with Camera and Radar Sensors,” in International Conference on Agents and Artificial Intelligence (ICAART), 2021.
  • [40] A. Shashua, “Intel newsroom: CES 2021: Under the Hood,” https://youtube.com/watch?v=B7YNj66GxRA, 2021, accessed: 2024-06-13.
  • [41] W. Shao, B. Li, W. Yu, J. Xu, and H. Wang, “When Is It Likely to Fail? Performance Monitor for Black-Box Trajectory Prediction Model,” Transactions on Automation Science and Engineering, 2024.
  • [42] C. Buerkle, F. Oboril, J. Burr, and K.-U. Scholl, “Safe perception - a hierarchical monitor approach,” in International Conference on Intelligent Transportation Systems (ITSC), 2022.
  • [43] A. Delić, M. Grcić, and S. Šegvić, “Outlier detection by ensembling uncertainty with negative objectness,” arXiv:2402.15374, 2024.
  • [44] T. Chen, S. Liu, S. Chang, Y. Cheng, L. Amini, and Z. Wang, “Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [45] Y. Cheng, W. Wang, L. Jiang, and W. Macherey, “Self-supervised and Supervised Joint Training for Resource-rich Machine Translation,” in International Conference on Machine Learning (ICML), 2021.
  • [46] T. Cortinhal, G. Tzelepis, and E. E. Aksoy, “Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds for autonomous driving,” in Advances in Visual Computing, 2020.
  • [47] X. Chen, S. Li, B. Mersch, L. Wiesmann, J. Gall, J. Behley, and C. Stachniss, “Moving Object Segmentation in 3D LiDAR Data: A Learning-based Approach Exploiting Sequential Data,” Robotics and Automation Letters, 2021.
  • [48] F. Sartoris, “Anomaly Detection in Lidar Data by Combining Supervised and Self-Supervised Methods,” Bachelor’s Thesis, Karlsruhe Institute of Technology (KIT), 2022.
  • [49] A. Paigwar, O. Erkent, D. Sierra-Gonzalez, and C. Laugier, “GndNet: Fast Ground Plane Estimation and Point Cloud Segmentation for Autonomous Vehicles,” in International Conference on Intelligent Robots and Systems (IROS), 2020.
  • [50] W. Wu, Z. Y. Wang, Z. Li, W. Liu, and L. Fuxin, “Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation,” in European Conference on Computer Vision (ECCV), 2020.
  • [51] H. Mittal, B. Okorn, and D. Held, “Just go with the flow: Self-supervised scene flow estimation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [52] I. Tishchenko, S. Lombardi, M. R. Oswald, and M. Pollefeys, “Self-supervised learning of non-rigid residual flow and ego-motion,” in International Conference on 3D Vision (3DV), 2020.
  • [53] Y. Kittenplon, Y. C. Eldar, and D. Raviv, “Flowstep3d: Model unrolling for self-supervised scene flow estimation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [54] S. Baur, D. Emmerichs, F. Moosmann, P. Pinggera, B. Ommer, and A. Geiger, “SLIM: Self-Supervised LiDAR Scene Flow and Motion Segmentation,” in International Conference on Computer Vision (ICCV), 2021.
  • [55] J. Nubert, S. Khattak, and M. Hutter, “Self-supervised learning of lidar odometry for robotic applications,” in International Conference on Robotics and Automation (ICRA), 2021.
  • [56] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in International Conference on Knowledge Discovery and Data Mining, 1996.
  • [57] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” in International Conference on Computer Vision (ICCV), 2019.
  • [58] Y. Liao, J. Xie, and A. Geiger, “KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,” Pattern Analysis and Machine Intelligence (PAMI), 2022.
  • [59] J. Sanchez, “recoverkitti360label,” https://github.com/JulesSanchez/recoverKITTI360label/pull/3, 2022, accessed: 2024-06-26.
  • [60] M. Menze, C. Heipke, and A. Geiger, “Joint 3d estimation of vehicles and scene flow,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2015.
  • [61] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [62] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [63] S. A. Baur, F. Moosmann, S. Wirges, and C. B. Rist, “Real-time 3D LiDAR Flow for Autonomous Vehicles,” in Intelligent Vehicles Symposium (IV), 2019.
  • [64] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [65] H. Blum, P.-E. Sarlin, J. Nieto, R. Siegwart, and C. Cadena, “Fishyscapes: A Benchmark for Safe Semantic Segmentation in Autonomous Driving,” in International Conference on Computer Vision (ICCV) Workshops, 2019.
  • [66] R. Chan, K. Lis, S. Uhlemeyer, H. Blum, S. Honari, R. Siegwart, M. Salzmann, P. Fua, and M. Rottmann, “SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation,” in Conference on Neural Information Processing Systems (NeurIPS), 2021.
  • [67] K. Li, K. Chen, H. Wang, L. Hong, C. Ye, J. Han, Y. Chen, W. Zhang, C. Xu, D.-Y. Yeung, X. Liang, Z. Li, and H. Xu, “CODA: A Real-World Road Corner Case Dataset for Object Detection in Autonomous Driving,” in European Conference on Computer Vision (ECCV), 2022.
  • [68] D. Bogdoll, S. Uhlemeyer, K. Kowol, and J. M. Zöllner, “Perception datasets for anomaly detection in autonomous driving: A survey,” in Intelligent Vehicles Symposium (IV), 2023.
  • [69] J. Mao, M. Niu, C. Jiang, H. Liang, J. Chen, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li, J. Yu, H. Xu, and C. Xu, “One million scenes for autonomous driving: Once dataset,” in Conference on Neural Information Processing Systems (NeurIPS), 2021.
  • [70] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [71] V. Geppert, “Anomaly Detection with Model Contradictions for Autonomous Driving,” Bachelor’s Thesis, Karlsruhe Institute of Technology (KIT), 2023.