License: arXiv.org perpetual non-exclusive license
arXiv:2110.11334v3 [cs.CV] 23 Jan 2024

Jingkang Yang
S-Lab, Nanyang Technological University, Singapore
E-mail: jingkang001@ntu.edu.sg

Kaiyang Zhou
S-Lab, Nanyang Technological University, Singapore
E-mail: kaiyang.zhou@ntu.edu.sg

Yixuan Li
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, United States
E-mail: sharonli@cs.wisc.edu

Ziwei Liu
S-Lab, Nanyang Technological University, Singapore
E-mail: ziwei.liu@ntu.edu.sg

Generalized Out-of-Distribution Detection: A Survey

Jingkang Yang    Kaiyang Zhou    Yixuan Li    Ziwei Liu
(Received: date / Accepted: date)
Abstract

Out-of-distribution (OOD) detection is critical to ensuring the reliability and safety of machine learning systems. For instance, in autonomous driving, we would like the driving system to issue an alert and hand over control to humans when it detects unusual scenes or objects that it has never seen during training and cannot make a safe decision about. The term "OOD detection" first emerged in 2017 and has since received increasing attention from the research community, leading to a plethora of methods, ranging from classification-based to density-based to distance-based ones. Meanwhile, several other problems, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD), are closely related to OOD detection in terms of motivation and methodology. Despite their common goals, these topics have developed in isolation, and their subtle differences in definition and problem setting often confuse readers and practitioners. In this survey, we first present a unified framework called generalized OOD detection, which encompasses the five aforementioned problems, i.e., AD, ND, OSR, OOD detection, and OD. Under our framework, these five problems can be seen as special cases or sub-tasks, and are easier to distinguish. Although related fields have been surveyed comprehensively, the summarization of OOD detection methods remains incomplete. This paper specifically addresses the gap in recent technical developments in the field of OOD detection. It also provides a comprehensive discussion of representative methods from other sub-tasks and how they relate to and inspire the development of OOD detection methods. The survey concludes by identifying open challenges and potential research directions.

1 Introduction

A trustworthy visual recognition system should not only produce accurate predictions on known context, but also detect unknown examples and reject them (or hand them over to human users for safe handling) concrete16arxiv ; mlsafety21arxiv ; hendrycks2021unsolved ; hendrycks2022x . For instance, a well-trained food classifier should be able to detect non-food images such as selfies uploaded by users, and reject such input instead of blindly classifying them into existing food categories. In safety-critical applications such as autonomous driving, the driving system must issue a warning and hand over the control to drivers when it detects unusual scenes or objects it has never seen during training.

Figure 1: Taxonomy of the generalized OOD detection framework, illustrated with classification tasks. Four bases are used for the task taxonomy: 1) distribution shift to detect: the task focuses on detecting covariate shift or semantic shift; 2) ID data type: the ID data contains a single class or multiple classes; 3) whether the task requires ID classification; 4) transductive tasks require all observations, while inductive tasks follow the train-test scheme. Note that ND is often interchangeable with AD, but ND is more concerned with semantic anomalies. OOD detection is generally interchangeable with OSR for classification tasks.
Figure 2: Illustration of sub-tasks under the generalized OOD detection framework with vision tasks. Tags on test images indicate the model's expected predictions. (a) In sensory anomaly detection, test images with covariate shift are considered OOD. No semantic shift occurs in this setting. (b) In one-class novelty detection, normal/ID images belong to one class. Test images with semantic shift are considered OOD. (c) In multi-class novelty detection, ID images belong to multiple classes. Test images with semantic shift are considered OOD. Note that (b) and (c) compose novelty detection, which is identical to the topic of semantic anomaly detection. (d) Open set recognition is identical to multi-class novelty detection in the task of detection, with the only difference that open set recognition further requires ID classification. Out-of-distribution detection solves the same problem as open set recognition: it canonically aims to detect test samples with semantic shift without losing ID classification accuracy. However, OOD detection encompasses a broader spectrum of learning tasks and solution space. (e) Outlier detection does not follow a train-test scheme; all observations are provided. It fits in the generalized OOD detection framework by defining the majority distribution as ID. Outliers can have any distribution shift from the majority.

Most existing machine learning models are trained based on the closed-world assumption imagenet12nips ; surpass15iccv , where the test data is assumed to be drawn i.i.d. from the same distribution as the training data, known as in-distribution (ID). However, when models are deployed in an open-world scenario openworld06esi , test samples can be out-of-distribution (OOD) and therefore should be handled with caution. The distributional shifts can be caused by semantic shift (e.g., OOD samples are drawn from different classes) oodbaseline17iclr , or covariate shift (e.g., OOD samples from a different domain) domainshift10ml ; deepdg17iccv ; dasurvey18neurocomp .

The detection of semantic distribution shift (e.g., due to the occurrence of new classes) is the focal point of OOD detection tasks, where the label space $\mathcal{Y}$ can be different between ID and OOD data and hence the model should not make any prediction. In addition to OOD detection, several problems adopt the "open-world" assumption and have a similar goal of identifying OOD examples. These include outlier detection (OD) outlierhighd01sigmod ; outliersurvey04aireview ; outlier05handbook ; outlierprogress19ieee , anomaly detection (AD) anomalysurvey21ieee ; anomalyreview20adelaide ; anomalysurvey20dsong ; anomalysurvey19sydney , novelty detection (ND) ndsurveyox14sp ; ndreview10mipro ; ndsurvey03sp01 ; ndsurvey03sp02 , and open set recognition (OSR) boult19aaai ; osrsurvey20pami ; osrsurvey21arxiv . While all these problems are related by similar motivations, subtle differences exist among the sub-topics in terms of their specific definitions. The lack of a comprehensive understanding of the relationships between the different sub-topics leads to confusion for both researchers and practitioners. Even worse, these sub-topics, which could benefit from being compared with and learning from each other, have developed in isolation.

In this survey, we for the first time clarify the similarities and differences between these problems, and present a unified framework termed generalized OOD detection. Under this framework, the five problems (i.e., AD, ND, OSR, OOD detection, and OD) can be viewed as special cases or sub-topics. While other sub-topics have been extensively surveyed, the summarization of OOD detection methods is still inadequate and requires further exploration. This paper fills this gap by focusing specifically on recent technical developments in OOD detection, analyzing fair experimental comparisons among classical methods on common benchmarks. Our survey concludes by highlighting open challenges and outlining potential avenues for future research.

We further conduct a literature review for each sub-topic, with a special focus on the OOD detection task. To sum up, we make three contributions to the research community:

1. A Unified Framework: For the first time, we systematically review five closely related topics of AD, ND, OSR, OOD detection, and OD, and present a unified framework of generalized OOD detection. Under this framework, the similarities and differences of the five sub-topics can be systematically compared and analyzed. We hope our unification helps the community better understand these problems and correctly position their research in the literature.

2. A Comprehensive Survey for OOD Detection: Noticing the existence of comprehensive surveys on AD, ND, OSR, and OD methodologies in recent years anomalysurvey21ieee ; anomalyreview20adelaide ; anomalysurvey20dsong ; anomalysurvey19sydney ; osrsurvey20pami , this survey provides a comprehensive overview of OOD detection methods and thus complements existing surveys. By connecting with methodologies of other sub-topics, which are also briefly reviewed, and sharing insights from a fair comparison on a standard benchmark, we hope to provide readers with a more holistic understanding of the developments in each problem and their interconnections, especially for OOD detection.

3. Future Research Directions: We draw readers' attention to problems and limitations that remain in the current generalized OOD detection field. We conclude this survey with discussions on open challenges and opportunities for future research.

2 Generalized OOD Detection

Framework Overview  In this section, we introduce a unified framework termed generalized OOD detection, which encapsulates five related sub-topics: anomaly detection (AD), novelty detection (ND), open set recognition (OSR), out-of-distribution (OOD) detection, and outlier detection (OD). These sub-topics are similar in the sense that they all define a certain in-distribution, with the common goal of detecting out-of-distribution samples under the open-world assumption. However, subtle differences exist among the sub-topics in terms of the specific definition and properties of ID and OOD data, which are often overlooked by the research community. To this end, we provide a clear introduction and description of each sub-topic in its respective subsection (Sections 2.1 to 2.5). Each subsection details the motivation, background, formal definition, and relative position within the unified framework. Applications and benchmarks are also introduced, with concrete examples that facilitate understanding. Fig. 2 illustrates the settings for each sub-topic. Finally, we conclude this section by introducing neighboring topics to clarify the scope of the generalized OOD detection framework (Section 2.6).

Preliminary: Distribution Shift  In our framework, we recognize the complexity and interconnectedness of distribution shifts, which are central to understanding various OOD scenarios. Distribution shifts can be broadly categorized into covariate shift and semantic (label) shift, but it is important to clarify their interdependence. First, we define the input space as $\mathcal{X}$ (sensory observations) and the label space as $\mathcal{Y}$ (semantic categories). The data distribution is represented by the joint distribution $P(X, Y)$ over the space $\mathcal{X} \times \mathcal{Y}$. Distribution shift can occur in the marginal distribution $P(X)$ alone, or in both $P(Y)$ and $P(X)$. Note that a shift in $P(Y)$ naturally triggers a shift in $P(X)$.

Covariate Shift:  This occurs when there is a change in the marginal distribution $P(X)$, affecting the input space, while the label space $\mathcal{Y}$ remains constant. Examples of covariate distribution shift on $P(X)$ include adversarial examples adversarial15iclr ; adversarial18iclr , domain shift quinonero2009dataset , and style changes gatys2016image .

Semantic Shift:  This involves changes in both $P(Y)$ and, indirectly, $P(X)$. A shift in the label distribution $P(Y)$ implies the introduction of new categories or the alteration of existing ones. This change naturally affects the input distribution $P(X)$ since the nature of the data being observed or collected is now different.

Remark:  Given the interdependence between $P(X)$ and $P(Y)$, it is crucial to distinguish the intentions behind different types of distribution shifts. We define Covariate Shift as scenarios where changes are intended in the input space ($P(X)$) without any deliberate alteration to the label space ($P(Y)$). On the other hand, Semantic Shift specifically aims to modify the semantic content, directly impacting the label space ($P(Y)$) and, consequently, the input space ($P(X)$).

Importantly, we note that covariate shift is more commonly used to evaluate model generalization and robustness performance, where the label space $\mathcal{Y}$ remains the same during test time. On the other hand, the detection of semantic distribution shift (e.g., due to the occurrence of new classes) is the focal point of many detection tasks considered in this framework, where the label space $\mathcal{Y}$ can be different between ID and OOD data and hence the model should not make any prediction.

With the concept of distribution shift in mind, readers can get a general idea of the differences and connections among the sub-topics/tasks in Fig. 1. Notice that the different sub-tasks can be easily identified with the following four dichotomies: 1) covariate/semantic shift dichotomy; 2) single/multiple class dichotomy; 3) whether ID classification is required; 4) inductive/transductive dichotomy. Next, we proceed to elaborate on each sub-topic.

2.1 Anomaly Detection

Background  The notion of “anomaly” stands in contrast with the “normal” defined in advance. The concept of “normal” should be clear and reflect the real task. For example, to create a “not-hotdog detector”, the concept of the normal should be clearly defined as the hotdog class, i.e., a food category, so that objects that violate this definition are identified as anomalies, which include steaks, rice, and non-food objects like cats and dogs. Ideally, “hotdog” would be regarded as a homogeneous concept, regardless of the sub-classes of French or American hotdog.

Current anomaly detection settings often restrict the environment of interest to some specific scenarios. For example, the “not-hotdog detector” only focuses on realistic images, assuming the nonexistence of images from other domains such as sketches. Another realistic example is industrial defect detection, which is based on only one set of assembly lines for a specific product. In other words, the “open-world” assumption is usually not completely “open”. Nevertheless, “not-hotdog” or “defects” can form a large unknown space that breaks the “closed-world” assumption.

In summary, the key to anomaly detection is to define normal clearly (usually without sub-classes) and detect all possible anomalous samples under some specific scenarios.

Definition  Anomaly detection (AD) chandola2009anomaly aims to detect any anomalous samples that deviate from the predefined normality during testing. The deviation can happen due to either covariate shift or semantic shift, which leads to two sub-tasks: sensory AD and semantic AD, respectively anomalysurvey21ieee .

Sensory AD detects test samples with covariate shift, under the assumption that normalities come from the same covariate distribution. No semantic shift takes place in sensory AD settings. On the other hand, semantic AD detects test samples with label shift, assuming that normalities come from the same semantic distribution (category), i.e., normalities should belong to only one class.

Formally, in sensory AD, normalities are from the in-distribution $P(X)$ while anomalies encountered at test time are from the out-of-distribution $P'(X)$, where $P(X) \neq P'(X)$, i.e., only covariate shift occurs. The goal in sensory AD is to detect samples from $P'(X)$. No semantic shift occurs in this setting, i.e., $P(Y) = P'(Y)$. Conversely, for semantic AD, only semantic shift occurs (i.e., $P(Y) \neq P'(Y)$) and the goal is to detect samples that belong to novel classes.

Remark: Sensory/Semantic Dichotomy  Our sensory/semantic dichotomy for the AD sub-task definition comes from the low-level sensory anomalies and high-level semantic anomalies introduced in semanticanomaly20aaai and highlighted in the recent AD survey anomalysurvey21ieee , reflecting the rise of deep learning. Note that although most sensory and semantic AD methods are shown to be mutually inclusive due to the common shift in $P(X)$, some approaches are specialized in one of the sub-tasks (cf. Section 4.2). The research community is also trending toward subdividing anomaly types to develop targeted methods, so that practitioners can select the optimal solution for their own practical problem semanticanomaly20aaai ; zhang2021understanding .

Position in Framework  Under the generalized OOD detection framework, the definition of “normality” seamlessly connects to the notion of “in-distribution”, and “anomaly” corresponds to “out-of-distribution”. Importantly, AD treats ID samples as a whole, which means that regardless of the number of classes (or statistical modalities) in ID data, AD does not require differentiation in the ID samples. This feature is an important distinction between AD and other sub-topics such as OSR and OOD detection.

Application and Benchmark  Sensory AD only focuses on objects with the same or similar semantics, and identifies the observational differences on their surface. Samples with sensory differences are recognized as sensory anomalies. Example applications include adversarial defense akhtar2018threat , forgery recognition of biometrics and artworks spoofsmartphone16 ; facespoof15 ; spoofscheme08 ; forgeryart09 , image forensics deepfakedataset19 ; deeperforensics20 ; forensicsurvey20 , industrial inspection mvtec19cvpr ; rlad20eccv ; healthmonitor18health , etc. The most popular academic AD benchmark is MVTec-AD mvtec19cvpr for industrial inspection.

In contrast to sensory AD, semantic AD only focuses on semantic shift. An example of real-world applications is crime surveillance idrees2018enhancing ; surveillance02ijcnn . Active image crawlers for a specific category also need semantic AD methods to ensure the purity of the collected images optimol10ijcv . An example academic benchmark is to use, in turn, one class from MNIST as ID during training, and ask the model to distinguish it from the remaining 9 classes during testing.

Evaluation  In the AD benchmarks, test samples are annotated as either normal or abnormal. The deployed anomaly detector produces a confidence score for a test sample, indicating how confident the model is that the sample is normal. Samples below a predefined confidence threshold are considered abnormal. By viewing anomalies as positive and true normalities as negative (this convention aligns with MSP msp17iclr ; see the related discussion in OpenOOD), different thresholds produce a series of true positive rates (TPR) and false positive rates (FPR), from which we can calculate the area under the receiver operating characteristic curve (AUROC) auroc06pr . Similarly, precision and recall values can be used to compute F-scores and the area under the precision-recall curve (AUPR) evaluation20jmlt . Note that there can be two variants of AUPR: one treating "normal" as the positive class, and the other treating "abnormal" as the positive class. For AUROC and AUPR, a higher value indicates better detection performance.
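To make the threshold-free evaluation above concrete, the following is a minimal sketch (not taken from any benchmark codebase) of computing AUROC and the two AUPR variants with scikit-learn; the score convention (higher means more normal/ID) and the helper name are assumptions for illustration.

```python
# Minimal sketch of the threshold-free AD/OOD metrics described above.
# Assumes `scores` are confidences where HIGHER means "more normal / more ID",
# and `labels` mark anomalies/OOD as 1 and normal/ID as 0.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_detector(scores: np.ndarray, labels: np.ndarray) -> dict:
    anomaly_score = -scores  # treat anomalies as the positive class
    return {
        # Ranking quality aggregated over all thresholds.
        "AUROC": roc_auc_score(labels, anomaly_score),
        # AUPR with "abnormal" as the positive class.
        "AUPR-out": average_precision_score(labels, anomaly_score),
        # AUPR with "normal" as the positive class.
        "AUPR-in": average_precision_score(1 - labels, scores),
    }
```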

Remark: Alternative Taxonomy on Anomalies  Some previous literature considers anomalies to be three-fold: point anomalies, conditional or contextual anomalies, and group or collective anomalies anomalyreview20adelaide ; anomalysurvey19sydney ; anomalysurvey21ieee . In this survey, we mainly focus on point anomaly detection for its popularity in practical applications and its adequacy to elucidate the similarities and differences between sub-tasks. The other two kinds of anomalies, i.e., contextual anomalies that often occur in time-series tasks and collective anomalies that are common in the data mining field, are not covered in this survey. We refer readers to the recent AD survey anomalysurvey21ieee for an in-depth discussion of them.

Remark: Taxonomy based on Supervision  We use sensory/semantic dichotomy to subdivide AD at the task level. From the perspective of methodologies, some literature categorizes AD techniques into unsupervised and (semi-) supervised settings. Note that these two taxonomies are orthogonal as they focus on tasks and methods respectively.

2.2 Novelty Detection

Background  The word "novel" generally refers to something unknown, new, and interesting. While novelty detection (ND) is often used interchangeably with AD in the community, strictly speaking, their subtle difference is worth noting. In terms of motivation, novelty detection usually does not perceive "novel" test samples as erroneous, fraudulent, or malicious as AD does, but cherishes them as learning resources for potential future use with a positive learning attitude anomalyreview20adelaide ; anomalysurvey21ieee . In fact, novelty detection is also known as "novel class detection" ndsurvey03sp01 ; ndsurvey03sp02 , indicating that it primarily focuses on detecting semantic shift.

Definition  Novelty detection aims to detect any test samples that do not fall into any training category. The detected novel samples are usually prepared for future constructive procedures, such as more specialized analysis, or incremental learning of the model itself. Based on the number of training classes, ND contains two different settings: 1) one-class novelty detection (one-class ND): only one class exists in the training set; 2) multi-class novelty detection (multi-class ND): multiple classes exist in the training set. It is worth noting that despite having many ID classes, the goal of multi-class ND is only to distinguish novel samples from ID. Both one-class and multi-class ND are formulated as binary classification problems.

Position in Framework  Under the generalized OOD detection framework, ND deals with the setting where OOD samples have semantic shift, without the need for classification in the ID set even if possible. Therefore, ND shares the same problem definition with semantic AD.

Application and Benchmark  Real-world ND applications include video surveillance idrees2018enhancing ; surveillance02ijcnn , planetary exploration mars19aaai , and incremental learning incremental15aims ; curiositydriven17icml . For one-class ND, an example academic benchmark can be identical to that of semantic AD, which considers one class from MNIST as ID and the rest as novel. The corresponding MNIST benchmark for multi-class ND may use the first 6 classes during training, and test on the remaining 4 classes as OOD; a construction of this split is sketched below.
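As an illustration of the benchmark construction just described (and reused for OSR in Section 2.3), the following is a minimal sketch assuming torchvision's MNIST dataset; the helper name and split choices are for illustration only.

```python
# Minimal sketch of the MNIST multi-class ND/OSR benchmark split described above:
# classes 0-5 serve as ID (training + ID test), classes 6-9 are held out as OOD.
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

tfm = transforms.ToTensor()
train_set = datasets.MNIST("data", train=True, download=True, transform=tfm)
test_set = datasets.MNIST("data", train=False, download=True, transform=tfm)

id_classes = torch.arange(6)        # "known" classes 0-5
ood_classes = torch.arange(6, 10)   # "unknown" classes 6-9

def split_by_class(dataset, classes):
    mask = torch.isin(dataset.targets, classes)
    return Subset(dataset, torch.nonzero(mask, as_tuple=True)[0].tolist())

train_id = split_by_class(train_set, id_classes)   # train the classifier here
test_id = split_by_class(test_set, id_classes)     # ID accuracy / ID scores
test_ood = split_by_class(test_set, ood_classes)   # OOD detection evaluation
```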

Evaluation  The evaluation of ND is identical to AD, which is based on AUROC, AUPR, or F-scores (see details in Section 2.1).

Remark: One-Class/Multi-Class Dichotomy  Although ND models do not require ID classification even when multi-class annotations are available, methods for multi-class ND can differ from those for one-class ND, as multi-class ND can make use of a multi-class classifier while one-class ND cannot. Also note that semantic AD can be further split into one-class and multi-class semantic AD, matching ND, as semantic AD is equivalent to ND.

Remark: Nuance between AD and ND  Apart from the special interest in semantics, some literature ocgan19cvpr ; xia2015learning also points out that ND is supposed to be fully unsupervised (no novel data in training), while AD might have some abnormal training samples. It is important to note that neither AD nor ND necessitates the classification of ID data. This is a key distinction from OSR and OOD detection, which we discuss in subsequent sections.

2.3 Open Set Recognition

Background  Machine learning models trained in the closed-world setting can incorrectly classify test samples from unknown classes as one of the known categories with high confidence towardosr13pami . Some literature refers to this notorious overconfident behavior of the model as “arrogance”, or “agnostophobia” agnostophobia18nips . Open set recognition (OSR) is proposed to address this problem, with their own terminology of “known known classes” to represent the categories that exist at training, and “unknown unknown classes” for test categories that do not fall into any training category. Some other terms, such as open category detection pac18icml and open set learning bound21icml , are simply different expressions for OSR.

Definition  Open set recognition requires the multi-class classifier to simultaneously: 1) accurately classify test samples from “known known classes”, and 2) detect test samples from “unknown unknown classes”.

Position in Framework  OSR aligns well with our generalized OOD detection framework, where "known known classes" and "unknown unknown classes" correspond to ID and OOD respectively. Formally, OSR deals with the case where OOD samples during testing have semantic shift, i.e., $P(Y) \neq P'(Y)$. The goal of OSR is largely shared with that of multi-class ND; the only difference is that OSR additionally requires accurate classification of ID samples from $P(Y)$.

Application and Benchmark  OSR supports the robust deployment of real-world image classifiers in general, which can reject unknown samples in the open world sorio2010open ; openworld19www . An example academic benchmark on MNIST can be identical to multi-class ND, which considers the first 6 classes as ID and the remaining 4 classes as OOD. In addition, OSR further requires a good classifier on the 6 ID classes.

Evaluation  Similar to AD and ND, the metrics for OSR include F-scores, AUROC, and AUPR. Beyond them, the classification performance is also evaluated by standard ID accuracy. While the above metrics evaluate the novelty detection and ID classification capabilities independently, some works propose criteria for joint evaluation, such as CCR@FPR$x$ agnostophobia18nips , which calculates the class-wise recall when a certain FPR equal to $x$ (e.g., $10^{-1}$) is achieved.

2.4 Out-of-Distribution Detection

Background  With the observation that deep learning models are often overconfident when classifying samples from different semantic distributions in image classification and text categorization msp17iclr , the field of out-of-distribution detection emerged, requiring the model to reject inputs that are semantically different from the training distribution and therefore should not be predicted by the model.

Definition  Out-of-distribution detection, or OOD detection, aims to detect test samples drawn from a distribution that is different from the training distribution, where the definition of "distribution" depends on the target application. For most machine learning tasks, the distribution refers to the "label distribution", which means that OOD samples should not have labels overlapping with the training data. Formally, in OOD detection, the test samples come from a distribution whose semantics are shifted from ID, i.e., $P(Y) \neq P'(Y)$. Note that the training set usually contains multiple classes, and OOD detection should NOT harm the ID classification capability.

Position in Framework  Out-of-distribution detection is canonically equivalent to OSR in common machine learning tasks like multi-class classification: keep the classification performance on test samples from the ID class space $\mathcal{Y}$, and reject OOD test samples whose semantics fall outside the support of $\mathcal{Y}$. Also, the multi-class setting and the requirement of ID classification distinguish the task from AD and ND.

Application and Benchmark  The application of OOD detection usually falls into safety-critical situations, such as autonomous driving huang2020survey ; geiger2012we . An example academic benchmark is to use CIFAR-10 as ID during training and to distinguish CIFAR images from other datasets such as SVHN. Researchers should note that the OOD datasets should NOT have labels overlapping with the ID dataset when building the benchmark.

Evaluation  Apart from F-scores, AUROC, and AUPR, another commonly-used metric is FPR@TPR$x$, which measures the FPR when the TPR is $x$ (e.g., 0.95). Some works also use an alternative metric, TNR@TPR$x$, which is equivalent to 1 - FPR@TPR$x$. OOD detection also concerns the performance of ID classification.
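The FPR@TPR metric above reduces to a simple quantile computation over ID scores; the following is a minimal sketch (an illustrative helper, not from any benchmark suite) assuming higher scores indicate ID.

```python
# Minimal sketch of FPR@TPR95 as described above.
# `scores`: detector confidence, higher = more ID; `is_id`: 1 for ID, 0 for OOD.
import numpy as np

def fpr_at_tpr(scores: np.ndarray, is_id: np.ndarray, tpr: float = 0.95) -> float:
    id_scores = scores[is_id == 1]
    ood_scores = scores[is_id == 0]
    # Threshold chosen so that a fraction `tpr` of ID samples is accepted.
    threshold = np.quantile(id_scores, 1.0 - tpr)
    # FPR: fraction of OOD samples still accepted at that threshold.
    return float(np.mean(ood_scores >= threshold))

# TNR@TPR95 is simply 1.0 - fpr_at_tpr(scores, is_id, 0.95).
```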

Remark: OSR vs OOD Detection  The difference between OSR and OOD detection tasks is three-fold.

1) Different benchmark setup: OSR benchmarks usually split one multi-class classification dataset into ID and OOD parts according to classes, while OOD detection takes one dataset as ID and finds several other datasets as OOD with the guarantee of non-overlapping categories between ID/OOD datasets. However, despite the different benchmark traditions of the two sub-tasks, they are in fact tackling the same problem of semantic shift detection.

2) No additional data in OSR: Due to the requirement of theoretical open-risk bound guarantee, OSR discourages the usage of additional data during training by design boult19aaai . This restriction precludes methods that are more focused on effective performance improvements (e.g., outlier exposures oe18nips ; Zhang_2023_WACV ) but may violate OSR constraints.

3) Broadness of OOD detection: Compared to OSR, OOD detection encompasses a broader spectrum of learning tasks (e.g., multi-label classification hendrycks2019scaling ) and a wider solution space (to be discussed in Section 3).

Remark: Mainstream OOD Detection Focuses on Semantics  While most works in the current community interpret the keyword "out-of-distribution" as "out-of-label/semantic-distribution", some OOD detection works also consider detecting covariate shift godin20cvpr , claiming that covariate shift usually leads to a significant drop in model performance and therefore needs to be identified and rejected. However, although detecting covariate shift is reasonable for some specific tasks (usually due to high-risk or privacy reasons), discussed in the following paragraph, research on this topic remains controversial with respect to OOD generalization tasks (cf. Section 2.6 and Section 6.2). Detecting semantic shift has been the mainstream of OOD detection tasks.

Remark: To Generalize, or To Detect?  We provide another definition from the perspective of generalization: out-of-distribution detection, or OOD detection, aims to detect test samples to which the model cannot or does not want to generalize pleiss2019neural . In most machine learning tasks, such as image classification, models are expected to generalize their prediction capability to samples with covariate shift, and they are only unable to generalize when semantic shift occurs. However, for applications where models are by design non-transferable to other domains, such as many deep reinforcement learning tasks like game AI vinyals2017starcraft ; sedlmeier2019uncertainty , the key term "distribution" should refer to the "data/input distribution", so that the model should refuse to make decisions in environments that differ from the training environment, i.e., $P(X) \neq P'(X)$. Similar applications are high-risk tasks such as medical image classification zimmerer2022mood or privacy-sensitive scenarios tariq2020review , where the models are expected to be very conservative and only make predictions for samples exactly from the training distribution, rejecting any samples that deviate from it. Recent studies averly2023unified also highlight a model-specific view: a robust model should generalize to examples with covariate shift, while a weak model should reject them. Ultimately, an OOD detection task is considered valid when it successfully balances the aspects of "detection" and "generalization", taking into account factors such as meaningfulness and the inherent challenges presented by the task. Nonetheless, detecting semantic shift remains the primary focus of OOD detection tasks and is central to this survey.

2.5 Outlier Detection

Background  According to Wikipedia outlierwiki , an outlier is a data point that differs significantly from other observations. Recall that the problem settings in AD, ND, OSR, and OOD detection detect unseen test samples that are different from the training data distribution. In contrast, outlier detection directly processes all observations and aims to select outliers from the contaminated dataset outlier05handbook ; outliersurvey04aireview ; outlierhighd01sigmod . Since outlier detection does not follow the train-test procedure but has access to all observations, approaches to this problem are usually transductive rather than inductive transductive16iaai .

Definition  Outlier detection aims to detect samples that are markedly different from the others in the given observation set, due to either covariate or semantic shift.

Position in Framework  Different from all previous sub-tasks, whose in-distribution is defined during training, the “in-distribution” for outlier detection refers to the majority of the observations. Outliers may exist due to semantic shift on P(Y)𝑃𝑌P(Y)italic_P ( italic_Y ), or covariate shift on P(X)𝑃𝑋P(X)italic_P ( italic_X ).

Application and Benchmark  While mostly applied in data mining tasks ben2005outlier ; basu2007automatic ; download19asonam , outlier detection is also used in real-world computer vision applications such as video surveillance surveillance15spletter and dataset cleaning dataclean04cce ; dataclean04kdnet ; dataclean05plos . For the application of dataset cleaning, outlier detection is usually used as a pre-processing step for the main tasks such as learning from open-set noisy labels iterativeosnl18cvpr , webly supervised learning webly15iccv , and open-set semi-supervised learning openworldssl21arxiv . To construct an outlier detection benchmark on MNIST, one class should be chosen so that all samples that belong to this class are considered as inliers. A small fraction of samples from other classes are introduced as outliers to be detected.

Evaluation  Apart from F-scores, AUROC, and AUPR, outlier detectors can also be evaluated by the performance of the main task they support. For example, if an outlier detector is used to purify a dataset with noisy labels, the performance of a classifier trained on the cleaned dataset can indicate the quality of the outlier detector.

Remark: On Inclusion of Outlier Detection  Interestingly, the outlier detection task can itself be considered an outlier in the generalized OOD detection framework, since outlier detectors operate in the scenario where all observations are given, rather than following the training-test scheme. Also, publications on exactly this topic are rarely seen in recent deep learning venues. However, we still include outlier detection in our framework because, intuitively speaking, outliers also belong to one type of out-of-distribution, and introducing it can help familiarize readers with the various terms (e.g., OD, AD, ND, OOD) that have confused the community for a long while.

2.6 Related Topics

Apart from the five sub-topics that are described in our generalized OOD detection framework (shown in Figure 1), we further briefly discuss five related topics below, which help clarify the scope of this survey.

Learning with Rejection (LWR)  LWR bartlett2008classification dates back to early works on abstention chow1970optimum ; fumera2002support , which considered simple model families such as SVMs cortes1995support . The phenomenon of neural networks' overconfidence on OOD data was first revealed by fooldnn15cvpr . Despite methodological differences, subsequent works on OOD detection and OSR share the underlying spirit of classification with a rejection option.

Domain Adaptation/Generalization  Domain Adaptation (DA) dasurvey18neurocomp and Domain Generalization (DG) dgsurvey21arxiv also follow the "open-world" assumption. Different from generalized OOD detection settings, DA/DG expects the existence of covariate shift during testing without any semantic shift and requires classifiers to make accurate predictions into the same set of classes liu2020open . Note that OOD detection commonly concerns detecting semantic shift, which is complementary to DA/DG. In the case where both covariate and semantic shift take place, the model should be able to detect semantic shift while being robust to covariate shift. More discussion on the relations between DA/DG and OOD detection is in Section 6.2. The difference between DA and DG is that the former requires extra but few training samples from the target domain, while the latter does not.

Novelty Discovery  Novelty discovery dtc19iccv ; zhao2021novel ; jia2021joint ; vaze2022generalized ; joseph2022novel requires all observations to be given in advance, as outlier detection does. The observations are provided in a semi-supervised manner, and the goal is to explore and discover new categories and classes in the unlabeled set. Different from outlier detection, where outliers are sparse, the unlabeled set in the novelty discovery setting can mostly consist of, and even be dominated by, unknown classes.

Zero-shot Learning  Zero-shot learning zslsurvey19tist has a similar goal to novelty discovery but follows the training-testing scheme. The test set is under the "open-world" assumption with unknown classes, and classifiers trained only on the known classes are expected to classify unknown testing samples with the help of extra information such as label relationships.

Open-world Recognition  Open-world recognition toopenworld15cvpr aims to build a lifelong learning machine that can actively detect novel images liu2019large , label them as new classes, and perform continuous learning. It can be viewed as a combination of novelty detection (or open-set recognition) and incremental learning. More specifically, open-world recognition extends the concept of OSR by adding the ability to incrementally learn new classes over time. In open-world scenarios, the system not only identifies unknown instances but also can update its model to include these new classes as part of the known set. This approach is more dynamic and suited for real-world applications where the environment is not static, and new categories can emerge after the initial training phase parmar2023open .

Conformal Prediction  Conformal prediction (CP) stands as a robust statistical framework in machine learning, primarily designed to provide confidence measures for predictions shafer2008tutorial ; angelopoulos2021gentle . Distinctively, it yields prediction intervals with specified confidence levels, transcending the limitations of mere point estimates. In scenarios of OOD detection, the conformal prediction framework becomes particularly insightful: wider prediction intervals or lower confidence levels generated by conformal prediction methods can serve as indicators of such OOD data. Although research at the intersection of CP and OOD detection is still emerging kaur2022idecode ; kaur2022codit ; cai2021inductive , the potential of applying the conformal prediction framework in this domain is significant and warrants further exploration.

2.7 Organization of Remaining Sections

In this paper, we focus on the methodologies of OOD detection in Section 3, providing a comprehensive overview of the different approaches that have been proposed in the literature. We also briefly introduce methodologies for other sub-tasks including AD, ND, OSR, and OD in Section 4, to provide readers with a broader understanding of OOD-related problems and inspire the development of more effective methods. For each sub-task, we categorize and introduce the methodologies into four groups: 1) classification-based methods: methods that largely rely on classifiers; 2) density-based methods: detecting OOD by modeling data density; 3) distance-based methods: using distance metrics (usually in the feature space) to identify OODs; and 4) reconstruction-based methods: methods featured by reconstruction techniques. To offer readers further insights from an empirical perspective, we conduct a thorough analysis that provides a fair comparison between representative OOD detection methods and methods from other sub-tasks. Additionally, we highlight some of the remaining problems and limitations that exist in the current generalized OOD detection field. We conclude this survey with a discussion on the open challenges and opportunities for future research. It is worth noting that a concurrent survey salehi2021unified provides a detailed explanation of OOD-related methods, which greatly complements our work.

Figure 3: Timeline for representative OOD detection methodologies. Different colors indicate different categories of methodologies. Each method has its corresponding reference (inconspicuous white) in the lower right corner. Methods with high citations and open-source code are prioritized for inclusion in this figure.




Table 1: Paper list for out-of-distribution detection.

§3.1 Classification-based Methods
  §3.1.1 Output-based Methods
    a: Training-free: msp17iclr ; odin18iclr ; mahananobis18nips ; energyood20nips ; gram20icml ; wang2021canmulti ; she23iclr ; sun2021tone ; dong2022neural ; sun2022dice ; sun2022knn ; mood21cvpr ; gram19nipsw ; djurisic2022extremely ; park2023nearest ; park2023understanding ; jiang2023detecting ; liu2023gen
    b: Training-based: confbranch2018arxiv ; wang2021energy ; eloc18eccv ; good20nips ; aloe20arxiv ; whyrelu19cvpr ; blur20iclr ; outliermining21ecml ; ceda19cvpr ; mixup19nips ; cutmix19cvpr ; cutout17arxiv ; augmix19arxiv ; hendrycks2021pixmix ; csi20nips ; ccu20arxiv ; bibas2021single ; wang2022watermarking ; mood21cvpr ; dongneural ; godin20cvpr ; wei2022mitigating ; hierarchical18cvpr ; mos21cvpr ; hierarchyood ; ksemantic18nips ; nearood21arxiv ; gan2021language
  §3.1.2 Methods with Outlier Exposure
    a: Real Outliers: oe18nips ; agnostophobia18nips ; mcd19iccv ; sina2020aaai ; outliermining21ecml ; abstention21arxiv ; oecc21neurocomputing ; ming2022posterior ; backgroundsampling20cvpr ; Zhang_2023_WACV ; pseudolabel20aaai ; yang2021scood ; lu2023uncertainty ; lessbias19bmvc ; katzsamuels2022training ; wang2023learning
    b: Data Generation: confcal18iclr ; oodsg19nipsw ; confgan18nipsw ; maml20nips ; du2022vos ; npos2023iclr ; wang2023out ; zheng2023out ; du2022unknown
  §3.1.3 Gradient-based Methods: odin18iclr ; huang2021importance ; igoe2022useful
  §3.1.4 Bayesian Models: mcdropout16icml ; deepensemble17nips ; practicalbnn19nips ; dpn18nips ; dpn19nips ; dpn20nips ; kim2021locally
  §3.1.5 OOD Detection for Foundation Models: hendrycks2019using ; nearood21arxiv ; pretransformer20arxiv ; ming2023does ; miyai2023can ; miyai2023locoop ; lu2023likelihood ; esmaeilpour2022zero ; ming2022delving ; wang2023clipn
§3.2 Density-based Methods: dagmm18iclr ; autogres19cvpr ; gpnd18nips ; adgan18ecml ; advrec18cvpr ; mahananobis18nips ; flowreview20pami ; residualflow20cvpr ; glow18nips ; pixelcnn16icml ; jiang2021revisiting ; dgmknow19nips ; waic18arxiv ; whyflow20nips ; likelihoodratio19nips ; ratiogm20iclr ; vae20nips ; haoqi2022vim
§3.3 Distance-based Methods: mahananobis18nips ; rmd21arxiv ; cosinesim20accv ; svae20eccv ; onedim21cvpr ; duq20icml ; fss20arxiv ; sun2022knn ; ming2022cider ; kim2023neural
§3.4 Reconstruction-based Methods: denouden2018improving ; zhou2022rethinking ; yang2022mask ; jiang2022read ; li2023mood
§3.5 Theoretical Analysis: zhang2021understanding ; morteza2022provable ; towardosr13pami ; psvm14eccv ; rudd2017extreme ; pac18icml ; bound21icml ; fang2022out

3 OOD Detection: Methodology

In this section, we introduce the methodology for OOD detection. We first explore classification-based methods in Section 3.1, which primarily utilize the model's output, such as softmax scores, to identify OOD instances; we further examine outlier exposure-based methods that leverage external data sources, along with other types of methods. This is followed by density-based methods in Section 3.2 and distance-based methods in Section 3.3. A brief discussion is included at the end.

3.1 Classification-based Methods

Research on OOD detection originated from a simple baseline, that is, using the maximum softmax probability as the indicator score of ID-ness msp17iclr . Early OOD detection methods focus on deriving improved OOD scores based on the output of neural networks.

3.1.1 Output-based Methods

a. Post-hoc Detection  Post-hoc methods have the advantage of being easy to use without modifying the training procedure and objective. This property can be important for the adoption of OOD detection methods in real-world production environments, where the overhead cost of retraining can be prohibitive. The early work ODIN odin18iclr is a post-hoc method that uses temperature scaling and input perturbation to amplify the ID/OOD separability. Key to the method, a sufficiently large temperature has a strong smoothing effect that transforms the softmax score back to the logit space, which effectively distinguishes ID vs. OOD. Note that this is different from confidence calibration, where a much milder $T$ is employed. While calibration focuses on representing the true correctness likelihood of ID data only, the ODIN score is designed to maximize the gap between ID and OOD data and may no longer be meaningful from a predictive confidence standpoint. Building on these insights, later work energyood20nips ; mood21cvpr proposed using an energy score for OOD detection, which enjoys theoretical interpretation from a likelihood perspective morteza2022provable . Test samples with lower energy are considered ID and vice versa. The JointEnergy score wang2021canmulti was then proposed to perform OOD detection for multi-label classification networks. The more recent SHE she23iclr uses stored patterns that represent classes to measure the discrepancy of unseen data for OOD detection, which is hyperparameter-free and computationally efficient compared to classic energy methods. Techniques such as the layer-wise Mahalanobis distance mahananobis18nips and the Gram matrix gram20icml are used to obtain better hidden-feature quality for density estimation.
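To ground the output-based scores discussed above, the following is a minimal sketch (assuming a standard PyTorch classifier that returns logits) of the MSP, temperature-scaled softmax, and energy scoring functions; the helper names and defaults are illustrative, not the original implementations.

```python
# Minimal sketch of post-hoc OOD scores; higher score = more ID-like.
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_score(logits: torch.Tensor) -> torch.Tensor:
    """Maximum softmax probability (MSP) baseline [msp17iclr]."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

@torch.no_grad()
def temperature_scaled_msp(logits: torch.Tensor, T: float = 1000.0) -> torch.Tensor:
    """Temperature-scaled softmax, the scoring part of ODIN [odin18iclr]
    (the full method additionally perturbs the input using gradients)."""
    return F.softmax(logits / T, dim=-1).max(dim=-1).values

@torch.no_grad()
def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Negative free energy [energyood20nips]; higher = more ID."""
    return T * torch.logsumexp(logits / T, dim=-1)

# Usage: flag samples whose score falls below a validation-chosen threshold tau.
# logits = model(x)                      # (batch, num_classes)
# is_ood = energy_score(logits) < tau
```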

Recently, one fundamental cause of the overconfidence issue on OOD data has been revealed: mismatched BatchNorm statistics, estimated on ID data yet blindly applied to OOD data at test time, can trigger abnormally high unit activations and correspondingly high model outputs sun2021tone . Therefore, ReAct sun2021tone proposes truncating the high activations, which establishes strong post-hoc detection performance and further boosts the performance of existing scoring functions. Similarly, NMD dong2022neural uses the activation means from BatchNorm layers for ID/OOD discrepancy. While ReAct considers the activation space, sun2022dice proposes a weight-sparsification-based OOD detection framework termed DICE. DICE ranks weights based on a measure of contribution and selectively uses the most salient weights to derive the output for OOD detection. By pruning away noisy signals, DICE provably reduces the output variance for OOD data, resulting in a sharper output distribution and stronger separability from ID data. In a similar vein, ASH djurisic2022extremely also targets the activation space but adopts a different strategy: it removes a significant portion (e.g., 90%) of an input's feature representations from a late layer based on a top-K criterion, then adjusts the remaining activations (e.g., 10%) either by scaling or by assigning constant values, yielding surprisingly effective results.
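As an illustration of the ReAct-style activation rectification described above, the following is a minimal sketch (not the official implementation; the percentile, helper names, and the choice of combining it with the energy score are assumptions) that clips penultimate-layer activations at a threshold estimated on ID data before recomputing logits.

```python
# Minimal sketch of ReAct-style activation truncation [sun2021tone].
import torch

@torch.no_grad()
def estimate_clip_threshold(id_features: torch.Tensor, percentile: float = 0.90) -> float:
    """id_features: (N, D) penultimate activations collected on ID validation data."""
    return torch.quantile(id_features.flatten(), percentile).item()

@torch.no_grad()
def react_energy_score(features: torch.Tensor, fc_weight: torch.Tensor,
                       fc_bias: torch.Tensor, clip: float) -> torch.Tensor:
    """features: (B, D); fc_weight: (C, D); fc_bias: (C,). Higher score = more ID."""
    clipped = features.clamp(max=clip)              # rectify abnormally high activations
    logits = clipped @ fc_weight.t() + fc_bias      # recompute logits from clipped features
    return torch.logsumexp(logits, dim=-1)          # energy score on rectified logits
```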

b. Training-based Methods  During the training phase, confidence can be developed by designing a confidence-estimating branch confbranch2018arxiv or class wang2021energy , ensembling with a leave-out strategy eloc18eccv , adversarial training good20nips ; aloe20arxiv ; whyrelu19cvpr ; blur20iclr ; outliermining21ecml , stronger data augmentation ceda19cvpr ; mixup19nips ; cutmix19cvpr ; cutout17arxiv ; augmix19arxiv ; hendrycks2021pixmix , pretext training csi20nips , better uncertainty modeling ccu20arxiv ; bibas2021single , input-level manipulation odin18iclr ; wang2022watermarking , and utilizing features or statistics from intermediate layers mood21cvpr ; dongneural . In particular, to enhance sensitivity to covariate shift, some methods focus on the hidden representations in the middle layers of neural networks. Generalized ODIN, or G-ODIN godin20cvpr , extends ODIN odin18iclr by using a specialized training objective termed DeConf-C and choosing hyperparameters such as the perturbation magnitude on ID data. Note that we do not categorize G-ODIN as a post-hoc method as it requires model retraining. Recent work wei2022mitigating shows that the overconfidence issue can be mitigated through Logit Normalization (LogitNorm), a simple fix to the common cross-entropy loss that enforces a constant vector norm on the logits during training; a sketch is given below. Trained with LogitNorm, neural networks produce highly distinguishable confidence scores between in- and out-of-distribution data.
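The following is a minimal sketch of the LogitNorm objective just described (not the official code; the temperature default and function name are assumptions for illustration).

```python
# Minimal sketch of Logit Normalization (LogitNorm) [wei2022mitigating]:
# constrain the logit vector to a constant norm before the cross-entropy loss.
import torch
import torch.nn.functional as F

def logitnorm_loss(logits: torch.Tensor, targets: torch.Tensor,
                   temperature: float = 0.04) -> torch.Tensor:
    """logits: (B, C); targets: (B,) class indices; temperature is a tunable scale."""
    norms = logits.norm(p=2, dim=-1, keepdim=True) + 1e-7    # per-sample logit norm
    normalized_logits = logits / (norms * temperature)        # unit-norm logits, rescaled
    return F.cross_entropy(normalized_logits, targets)
```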

Some works redesign the label space to achieve good OOD detection performance. While commonly used to encode categorical information for classification, one-hot encoding ignores the inherent relationships among labels. For example, it is unreasonable to impose a uniform distance between dog and cat vs. dog and car. To this end, several works attempt to use information in the label space for OOD detection. Some works arrange the large semantic space into a hierarchical taxonomy of known classes hierarchical18cvpr ; mos21cvpr ; hierarchyood . Under the redesigned label architecture, a top-down classification strategy hierarchical18cvpr ; hierarchyood and group softmax training mos21cvpr are demonstrated to be effective. Another set of works uses word embeddings to automatically construct the label space. In ksemantic18nips , the sparse one-hot labels are replaced with several dense word embeddings from different NLP models, forming multiple regression heads for robust training. At test time, the label with the minimal distance to the embedding vectors from the different heads is taken as the prediction; if that minimal distance exceeds a threshold, the sample is classified as "novel". Recent works further take the image features from language-image pre-training models radford2021learning to better detect novel classes, where the image encoding space also contains rich information from the language space nearood21arxiv ; gan2021language .

3.1.2 Methods with Outlier Exposure

a. Real Outliers  Another branch of OOD detection methods makes use of a set of collected OOD samples, or "outliers", during training to help models learn the ID/OOD discrepancy. Starting from the concurrent baselines that encourage flat/high-entropy predictions on the given OOD samples oe18nips ; agnostophobia18nips and suppress OOD feature magnitudes agnostophobia18nips , a follow-up work, MCD mcd19iccv , uses a network with two branches, between which the entropy discrepancy is enlarged on OOD training data. Another straightforward approach with outlier exposure adds an extra abstention (or rejection) class and assigns all given OOD samples to this class sina2020aaai ; outliermining21ecml ; abstention21arxiv . A later work, OECC oecc21neurocomputing , noticed that an extra regularization for confidence calibration brings additional improvement to OE. To effectively utilize the given, usually massive, OOD samples, some works use outlier mining outliermining21ecml ; ming2022posterior and adversarial resampling backgroundsampling20cvpr to obtain a compact yet representative set. In cases where meaningful "near"-OOD images are not available, MixOE Zhang_2023_WACV proposes interpolating between ID and "far"-OOD images to obtain informative outliers for better regularization. Other works consider a more practical scenario where the given OOD samples contain ID samples, and therefore use pseudo-labeling pseudolabel20aaai or ID filtering yang2021scood with an optimal transport scheme lu2023uncertainty to reduce the interference of ID data. In general, OOD detection with outlier exposure can reach much better performance. However, research shows that the performance can be largely affected by the correlation between the given and the real OOD samples lessbias19bmvc . To address the issue, recent work katzsamuels2022training proposes a novel framework that effectively exploits unlabeled in-the-wild data for OOD detection. Unlabeled wild data is frequently available since it is produced essentially for free whenever an existing classifier is deployed in a real-world system. This setting can be viewed as training OOD detectors in their natural habitats, which provides a much better match to the true test-time distribution than data collected offline.
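For concreteness, the following is a minimal sketch of the basic outlier exposure objective referenced at the start of this paragraph (not the official code; the weighting coefficient and function name are assumptions): standard cross-entropy on ID data plus a term pushing predictions on auxiliary outliers toward the uniform distribution.

```python
# Minimal sketch of the outlier exposure (OE) objective [oe18nips].
import torch
import torch.nn.functional as F

def outlier_exposure_loss(id_logits: torch.Tensor, id_targets: torch.Tensor,
                          ood_logits: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """id_logits: (B, C); id_targets: (B,); ood_logits: (B_out, C); lam weights the OE term."""
    ce = F.cross_entropy(id_logits, id_targets)
    # Cross-entropy to the uniform distribution equals logsumexp(logits) - mean(logits).
    uniform_ce = (torch.logsumexp(ood_logits, dim=-1) - ood_logits.mean(dim=-1)).mean()
    return ce + lam * uniform_ce
```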

b. Outlier Data Generation  The outlier exposure approaches impose a strong assumption on the availability of OOD training data, which can be infeasible in practice. When no OOD sample is available, some methods attempt to synthesize OOD samples to enable ID/OOD separability. Existing works leverage GANs to generate OOD training samples and force the model predictions to be uniform confcal18iclr , generate boundary samples in the low-density region oodsg19nipsw or, similarly, high-confidence OOD samples confgan18nipsw , or use meta-learning to update sample generation maml20nips . However, synthesizing images in the high-dimensional pixel space can be difficult to optimize. Recent work VOS du2022vos proposed synthesizing virtual outliers from the low-likelihood region in the feature space, which is more tractable given the lower dimensionality. While VOS du2022vos is a parametric approach that models the feature space as a class-conditional Gaussian distribution, NPOS npos2023iclr also generates outliers from ID data but in a non-parametric manner. Noticing that the generated OOD data can be incorrect or irrelevant, DOE wang2023out synthesizes hard OOD data that leads to the worst judgments to train the OOD detector with a min-max learning scheme, and ATOL zheng2023out uses an auxiliary task to alleviate mistaken OOD generation. In object detection, du2022unknown proposes synthesizing unknown objects from videos in the wild using spatial-temporal unknown distillation.
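To illustrate the feature-space synthesis idea behind VOS, the following is a simplified sketch (assumptions: pre-extracted ID features for one class, a full-covariance Gaussian fit, and a lowest-likelihood selection rule; the official method differs in details such as the sampling scheme and the training loss).

```python
# Simplified sketch of virtual outlier synthesis in feature space [du2022vos]:
# fit a Gaussian to ID features of a class, then keep the lowest-likelihood samples.
import torch

def synthesize_virtual_outliers(class_features: torch.Tensor, num_candidates: int = 10000,
                                num_outliers: int = 100) -> torch.Tensor:
    """class_features: (N, D) ID features of one class. Returns (num_outliers, D)."""
    mean = class_features.mean(dim=0)
    # Jitter the covariance for numerical stability with few samples.
    cov = torch.cov(class_features.t()) + 1e-4 * torch.eye(class_features.shape[1])
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
    candidates = dist.sample((num_candidates,))     # draw from the fitted Gaussian
    log_probs = dist.log_prob(candidates)           # density of each candidate
    low_density_idx = log_probs.topk(num_outliers, largest=False).indices
    return candidates[low_density_idx]              # lowest-likelihood points as outliers
```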

3.1.3 Gradient-based Methods

Existing OOD detection approaches primarily rely on the output (Section 3.1) or feature space for deriving OOD scores, while overlooking information from the gradient space. ODIN odin18iclr first explored using gradient information for OOD detection. In particular, ODIN proposed input pre-processing by adding small perturbations obtained from the input gradients. The goal of the ODIN perturbations is to increase the softmax score of any given input by reinforcing the model’s belief in the predicted label. The perturbations were ultimately found to create a greater gap between the softmax scores of ID and OOD inputs, thus making them more separable and improving OOD detection performance. While ODIN only uses gradients implicitly through input perturbation, the more recent GradNorm huang2021importance explicitly derives a scoring function from the gradient space. GradNorm employs the vector norm of the gradients backpropagated from the KL divergence between the softmax output and a uniform probability distribution. A recent study igoe2022useful demonstrates that while gradient-based methods are effective, their success does not necessarily depend on gradients, but rather on the magnitude of the learned feature embeddings and the predicted output distribution.
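A minimal sketch of a GradNorm-style score is given below; it assumes a PyTorch classifier whose final linear layer is exposed as `model.fc` (an assumption about the model structure) and processes one input batch at a time.

```python
import torch
import torch.nn.functional as F

def gradnorm_style_score(model, x, num_classes, temperature=1.0):
    """GradNorm-style OOD score (sketch).

    Backpropagates the cross-entropy between the softmax prediction and a
    uniform target (equivalent to the KL divergence up to a constant) and
    uses the L1 norm of the gradient of the final linear layer as the
    ID-ness score: larger norms indicate more ID-like inputs.
    """
    model.zero_grad()
    logits = model(x)  # x is a single-sample batch, e.g. shape (1, C, H, W)
    uniform_targets = torch.ones_like(logits) / num_classes
    loss = torch.sum(-uniform_targets * F.log_softmax(logits / temperature, dim=1))
    loss.backward()
    return model.fc.weight.grad.abs().sum().item()
```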

3.1.4 Bayesian Models

A Bayesian model is a statistical model that implements Bayes’ rule to infer all uncertainty within the model jaynes1986bayesian . The most representative method is the Bayesian neural network bnn12book , which draws samples from the posterior distribution of the model via MCMC mcmc06gamerman , Laplace methods laplace92cit ; inbwnbnn20icmlw , or variational inference meanfield89nn , forming the epistemic uncertainty of the model prediction. However, their shortcomings of inaccurate predictions howgoodbnn20icml and high computational costs objectbnn08ba prevent them from wide adoption in practice. Recent works attempt several less principled approximations, including MC-dropout mcdropout16icml and deep ensembles deepensemble00iwmcs ; deepensemble17nips ; maddox2019simple , for faster and better uncertainty estimates, though these methods remain less competitive for OOD uncertainty estimation. Further exploration adopts natural-gradient variational inference to enable practical and affordable training of modern deep networks while preserving the benefits of Bayesian principles practicalbnn19nips . Dirichlet Prior Networks (DPNs) are also used for OOD detection, modeling three different sources of uncertainty, namely model uncertainty, data uncertainty, and distributional uncertainty, and form a line of works dpn18nips ; dpn19nips ; dpn20nips . Recently, the Bayesian hypothesis test has been used to formulate OOD detection, with an upweighting method and Hessian approximation for scalability kim2021locally .
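As a concrete example of such approximations, the following sketch estimates predictive entropy with MC-dropout; note that calling `model.train()` also affects batch-normalization layers, so a real implementation would enable only the dropout modules.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_entropy(model, x, n_samples=20):
    """Predictive entropy from MC-dropout (sketch).

    Keeps dropout active at test time, averages the softmax predictions
    over several stochastic forward passes, and returns the entropy of the
    averaged prediction as an uncertainty / OOD score (higher = more uncertain).
    """
    model.train()  # enables dropout; caution: also switches BatchNorm to train mode
    probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)
    return -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=1)
```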

3.1.5 OOD Detection for Foundation Models

Foundation models bommasani2021opportunities , notably large-scale vision-language models radford2021learning , have demonstrated exceptional performance in a variety of downstream tasks. Their success is largely attributed to extensive pre-training on large-scale datasets. Several works hendrycks2019using ; nearood21arxiv ; pretransformer20arxiv reveal that well-pretrained models can significantly enhance OOD detection, particularly in challenging scenarios. However, adapting (tuning) these models to downstream tasks with a specific semantic (label) space remains a challenge, as simple approaches such as linear probing, prompt tuning zhou2022coop ; zhou2022cocoop ; jia2022visual , and adapter-style fine-tuning gao2023clip do not yield good OOD detection results. To advance the problem, a thorough investigation ming2023does examines how fine-tuned vision-language models perform on OOD detection. Additionally, recent research miyai2023can highlights the impact of large-scale pretraining data and provides a systematic study of how pretraining strategies affect OOD detection performance. On the technical front, LoCoOp miyai2023locoop introduces OOD regularization on a subset of CLIP’s local features identified as OOD, enhancing prompt learning for better ID/OOD differentiation, and LSA lu2023likelihood uses a bidirectional prompt customization mechanism to enhance image-text alignment.

The strong zero-shot learning capabilities of models like CLIP radford2021learning also open avenues for zero-shot OOD detection. This setting aims to categorize known-class samples and detect samples that do not belong to any of the known classes, where the known classes are represented solely through textual descriptions or class names, eliminating the need for explicit training on these classes. Addressing this, ZOC esmaeilpour2022zero trains a decoder on top of CLIP’s visual encoder to create candidate labels for OOD detection. While ZOC is computationally intensive and data-demanding, MCM ming2022delving opts for a temperature-scaled softmax over the similarities between visual features and textual concepts for OOD detection. A recent advancement, CLIPN wang2023clipn , integrates a “no” logic into OOD detection: utilizing new prompts and a dedicated text encoder, along with novel opposite loss functions, CLIPN effectively tackles the challenge of identifying hard-to-distinguish OOD samples in complex scenarios.
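The sketch below illustrates an MCM-style zero-shot score under the assumption that the image and prompt embeddings have already been extracted and L2-normalized with a CLIP-like model; it shows the scoring rule rather than the reference implementation.

```python
import torch

def mcm_style_score(image_feat, text_feats, temperature=0.01):
    """Zero-shot OOD score in the spirit of MCM (sketch).

    Assumes `image_feat` (D,) and `text_feats` (K, D) are L2-normalized
    embeddings, where each row of `text_feats` encodes a prompt such as
    "a photo of a <class name>". The maximum temperature-scaled softmax
    over cosine similarities serves as the ID-ness score; low values
    suggest the input matches none of the known classes.
    """
    sims = text_feats @ image_feat                 # cosine similarities to each class prompt
    probs = torch.softmax(sims / temperature, dim=0)
    return probs.max().item()
```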

3.2 Density-based Methods

Density-based methods in OOD detection explicitly model the in-distribution with probabilistic models and flag test data in low-density regions as OOD. Although OOD detection differs from AD in that multiple classes exist in the in-distribution, density estimation methods used for AD in Section 4.2 can be directly adapted to OOD detection by treating the ID data as a whole dagmm18iclr ; autogres19cvpr ; gpnd18nips ; adgan18ecml ; advrec18cvpr . When the ID contains multiple classes, class-conditional Gaussian distributions can explicitly model the in-distribution so that OOD samples can be identified based on their likelihoods mahananobis18nips . Flow-based methods flowreview20pami ; residualflow20cvpr ; glow18nips ; pixelcnn16icml ; jiang2021revisiting can also be used for probabilistic modeling. While directly estimating the likelihood seems like a natural approach, some works dgmknow19nips ; waic18arxiv ; whyflow20nips find that probabilistic models sometimes assign higher likelihoods to OOD samples. Several works attempt to solve this problem using likelihood ratios likelihoodratio19nips .  ratiogm20iclr finds that the likelihood exhibits a strong bias towards input complexity and proposes a likelihood-ratio-based method to compensate for this influence. Recent methods turn to new scores such as likelihood regret vae20nips or an ensemble of multiple density models waic18arxiv . To directly model the density of the semantic space, the SEM score combines density estimates in the low-level and high-level feature spaces yang2022fsood . Overall, generative models can be prohibitively challenging to train and optimize, and their performance often lags behind that of classification-based approaches (Section 3.1).
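As a minimal illustration of the density-based recipe, the sketch below fits a Gaussian mixture on ID features and uses the log-likelihood as the OOD score; it demonstrates the general idea rather than any particular published method.

```python
from sklearn.mixture import GaussianMixture

def fit_density_scorer(id_feats, n_components=10):
    """Minimal density-based OOD scorer (sketch).

    Fits a Gaussian mixture on penultimate-layer features of the ID
    training data; at test time, the log-likelihood under the fitted
    model is the ID-ness score, and low-density samples are flagged as OOD.
    """
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",
                          random_state=0).fit(id_feats)
    return gmm.score_samples  # callable: feats (N, D) -> log-likelihoods (N,)
```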

3.3 Distance-based Methods

The basic idea of distance-based methods is that test OOD samples should be relatively far away from the centroids or prototypes of in-distribution classes.  mahananobis18nips uses the minimum Mahalanobis distance to all class centroids for detection. A subsequent work splits the images into foreground and background and then calculates the Mahalanobis distance ratio between the two spaces rmd21arxiv . In contrast to the parametric approach, recent work sun2022knn shows the strong promise of a non-parametric nearest-neighbor distance for OOD detection. Unlike Mahalanobis, the non-parametric approach does not impose any distributional assumption on the underlying feature space, hence providing stronger simplicity, flexibility, and generality.
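The non-parametric idea can be sketched as follows, assuming penultimate-layer features have been extracted for the ID training set and the test set; the brute-force distance computation is for illustration only, and a large-scale implementation would rely on an approximate nearest-neighbor index.

```python
import numpy as np

def knn_distance_score(train_feats, test_feats, k=50):
    """Non-parametric k-th nearest-neighbor distance score (sketch).

    Features are L2-normalized; the negative distance to the k-th nearest
    ID training feature is returned as the ID-ness score (higher = more ID).
    No distributional assumption on the feature space is required.
    """
    def normalize(z):
        return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)

    train, test = normalize(train_feats), normalize(test_feats)
    # Pairwise Euclidean distances (small-scale illustration only).
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
    kth_dist = np.sort(d, axis=1)[:, k - 1]
    return -kth_dist
```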

For distance functions, some works use the cosine similarity between test sample features and class features to determine OOD samples cosinesim20accv ; svae20eccv . The one-dimensional subspace spanned by the first singular vector of the training features is shown to be more suitable for cosine-similarity-based detection onedim21cvpr . Moreover, other works leverage distances based on the radial basis function kernel duq20icml , the Euclidean distance fss20arxiv , and the geodesic distance gomes2022igeood between the input’s embedding and the class centroids. Apart from calculating the distance between samples and class centroids, the feature norm in the orthogonal complement of the principal subspace is shown to be effective for OOD detection haoqi2022vim . Recent work CIDER ming2022cider explores embeddings in the hyperspherical space, where inter-class dispersion and intra-class compactness can be encouraged.

3.4 Reconstruction-based Methods

The core idea of reconstruction-based methods is that an encoder-decoder framework trained on ID data usually yields different outcomes for ID and OOD samples, and this difference in model performance can be used as an indicator for detecting anomalies. For example, reconstruction models trained only on ID data cannot recover OOD data well denouden2018improving , and the OOD samples can therefore be identified. While reconstruction with pixel-level comparison is not a popular solution in OOD detection due to its expensive training cost, reconstructing hidden features is shown to be a promising alternative zhou2022rethinking . Rather than reconstructing the entire image, recent work MoodCat yang2022mask masks a random portion of the input image and identifies OOD samples using the quality of the classification-based reconstruction results. READ jiang2022read combines inconsistencies from a classifier and an autoencoder by transforming the reconstruction error of raw pixels to the latent space of the classifier. MOOD li2023mood shows that masked image modeling for pretraining is beneficial to OOD detection compared to contrastive training and classic classifier training.
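A minimal reconstruction-error score can be sketched as follows, assuming an autoencoder trained only on ID data; pixel-level MSE is used here purely for illustration.

```python
import torch

@torch.no_grad()
def reconstruction_error_score(autoencoder, x):
    """Reconstruction-error OOD score (sketch).

    Assumes `autoencoder` was trained only on ID data; OOD inputs tend to
    be reconstructed poorly, so the negative per-sample MSE serves as the
    ID-ness score (higher = more ID).
    """
    recon = autoencoder(x)
    err = ((recon - x) ** 2).flatten(start_dim=1).mean(dim=1)
    return -err
```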

3.5 Theoretical Analysis

Early theoretical research on OOD detection zhang2021understanding delves into the limitations of deep generative models (DGMs) in OOD contexts. This work uncovers a critical flaw: DGMs frequently assign higher probabilities to OOD data than to training data, and attributes the issue primarily to model misestimation rather than to the typical set hypothesis, which posits that relevant out-distributions might be located in high-likelihood regions of the data distribution. The study concludes that any generalized OOD task must restrict the set of distributions that are considered out-of-distribution; without any restriction, the task is impossible. Later work morteza2022provable advances the field by developing a comprehensive analytical framework aimed at enhancing both the theoretical understanding and the practical performance of OOD detection methods in neural networks, culminating in a novel OOD detection method that surpasses existing techniques in both theoretical robustness and empirical performance.

Another series of studies focuses on open-set learning (OSL). The seminal work in this domain towardosr13pami conceptualizes open-space risk for recognizing samples from unknown classes. Follow-up research applies extreme value theory to OSL psvm14eccv ; rudd2017extreme . While probably approximately correct (PAC) theory is applied to OSR pac18icml , the method requires test samples during training. A subsequent investigation of the generalization error bound proves the existence of a low-error OSL algorithm under certain assumptions bound21icml . Still under the PAC framework, a later study establishes necessary and sufficient conditions for the learnability of OOD detection in various scenarios fang2022out , including cases with overlapping and non-overlapping ID and OOD data. Their work also offers theoretical support for existing OOD detection algorithms and suggests that OOD detection is possible under certain practical conditions.

Despite these theoretical advancements, the field eagerly anticipates further research addressing aspects such as generalization in OOD detection, the explainability of these models, the integration of deep learning theory specific to OOD detection, and the exploration of foundation model theories pertinent to this area.

3.6 Discussion

The field of OOD detection has enjoyed rapid development since its emergence, with a large space of solutions. In the multi-class setting, the problem is canonically aligned with OSR (Section 4.1): accurately classify ID test samples within the class space $\mathcal{Y}$, and reject test samples whose semantics lie outside the support of $\mathcal{Y}$. The difference often lies in the evaluation protocol. OSR splits a single dataset into two halves, one set as ID and the other as OOD. In contrast, OOD detection allows a more general and flexible evaluation by considering test samples from different datasets or domains. Moreover, OOD detection encompasses a broader spectrum of learning tasks (e.g., multi-label classification wang2021canmulti , object detection du2022vos ; du2022unknown ) and a larger solution space. Apart from methodology development, theoretical understanding has also received attention in the community morteza2022provable , providing provable guarantees and empirical analysis of how OOD detection performance changes with respect to data distributions.

4 Methodologies from Other Sub-tasks

In this section, we briefly introduce methodologies for the other sub-tasks under the generalized OOD detection framework, including AD, ND, OSR, and OD, in the hope that methods from these sub-tasks can inspire more ideas for the OOD detection community.

4.1 Open Set Recognition

The concept of OSR was first introduced in towardosr13pami , which showed the validity of the 1-class SVM and the binary SVM for solving the OSR problem. In particular, towardosr13pami proposes the 1-vs-Set SVM to manage the open-set risk by solving a two-plane optimization problem instead of the classic half-space of a binary linear classifier. The paper highlighted that, in addition to bounding the ID risk, the open-set space should also be bounded.

Classification-based Methods  Early works focused on logit redistribution using the compact abating probability (CAP) cap14pami and extreme value theory (EVT) evt90appmath ; evt12book ; psvm14eccv , since classic probabilistic models lack consideration of the open-set space. CAP explicitly models the probability of class membership abating from ID points to OOD points, and EVT focuses on modeling the tail distribution with extremely high/low values. In the context of deep learning, OpenMax openmax16cvpr first implements EVT in neural networks. OpenMax replaces the softmax layer with an OpenMax layer, which calibrates the logits with a per-class EVT probabilistic model such as the Weibull distribution.
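A heavily simplified sketch of the OpenMax recalibration is given below; the official implementation relies on the libMR library and recalibrates only the top-ranked classes, whereas this illustration fits SciPy Weibull models to per-class tail distances and recalibrates all classes, so it should be read as an approximation of the idea rather than the reference method.

```python
import numpy as np
from scipy.stats import weibull_min

def fit_weibull_tails(class_means, feats, labels, preds, tail_size=20):
    """Fit per-class Weibull tail models (simplified OpenMax-style sketch).

    For each class c, the largest distances between correctly classified
    training activations and the mean activation vector `class_means[c]`
    are fit with a Weibull distribution.
    """
    tails = {}
    for c, mu in class_means.items():
        correct = feats[(labels == c) & (preds == c)]
        dists = np.linalg.norm(correct - mu, axis=1)
        tails[c] = weibull_min.fit(np.sort(dists)[-tail_size:], floc=0)  # (shape, loc, scale)
    return tails

def openmax_unknown_prob(logits, feat, class_means, tails):
    """Recalibrate logits with the Weibull CDF and derive an 'unknown' probability."""
    unknown_logit = 0.0
    adjusted = logits.copy()
    for c, mu in class_means.items():
        w = weibull_min.cdf(np.linalg.norm(feat - mu), *tails[c])
        unknown_logit += logits[c] * w          # mass moved to the unknown class
        adjusted[c] = logits[c] * (1.0 - w)     # calibrated known-class logit
    scores = np.append(adjusted, unknown_logit)
    scores = np.exp(scores - scores.max())
    return (scores / scores.sum())[-1]          # probability of the unknown class
```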

To bypass open-set risk construction, some works attain good results without EVT. For example, one work uses a membership loss to encourage high activations for known classes and uses large-scale external datasets to learn globally negative filters that reduce the activations of novel images dtl19cvpr . Apart from explicitly forcing a discrepancy between known and unknown classes, other methods extract stronger features through an auxiliary task of transformation classification gdfr20cvpr or mutual information maximization between the input image and its latent features m2iosr21arxiv .

Image generation techniques have been utilized to synthesize unknown samples from known classes, which helps distinguish known from unknown samples gopenmax17bmvc ; osrci18eccv ; proser21cvpr ; kong2021opengan . While these methods are promising on simple images such as handwritten characters, they do not scale to complex natural image datasets due to the difficulty of generating high-quality images in high-dimensional space. Another solution is to successively choose random categories in the training set and treat them as unknown, which helps the classifier shrink its boundaries and gain the ability to identify unknown classes collectdecision20tkde ; onevsrest20arxiv . Moreover, intraclass19eusipco splits the training data into typical and atypical subsets, which also helps learn compact classification boundaries.

Distance-based Methods  Distance-based methods for OSR require the prototypes to be class-conditional, which allows maintaining the ID classification performance. Category-based clustering and prototyping are performed on the visual features extracted by the classifier, and OOD samples are detected by computing distances w.r.t. the clusters metric18bmvc ; podn2020scireport . Some methods also leverage contrastive learning to learn more compact clusters for known classes peeler20cvpr ; rpl20eccv , which enlarges the distance between ID and OOD. CROSR crosr19cvpr enhances the features by concatenating visual embeddings from both the classifier and a reconstruction model for distance computation in the extended feature space. Besides using features from classifiers, GMVAE gmvae20aaai extracts features using a reconstruction VAE and models the embeddings of the training set as a Gaussian mixture with multiple centroids for the subsequent distance-based operations. Nearest-neighbor classifiers are also adapted for the OSR problem nndr17ml : by storing the training samples, the nearest-neighbor distance ratio is used to identify unknown samples at test time.

Reconstruction-based Methods  With motivations similar to those in Section 3.4, reconstruction-based methods expect different reconstruction behavior for ID vs. OOD samples. The difference can be captured in the latent feature space or in the pixel space of the reconstructed images.

By sparsely encoding images from the known classes, open-set samples can be identified based on their dense representation. Techniques such as sparsity concentration index srosr16pami and kernel null space methods knda13cvpr ; iknda17cvpr are used for sparse encoding.

By fixing the visual encoder obtained from standard multi-class training to maintain ID classification performance, C2AE trains a decoder conditioned on label vectors and models the reconstruction errors with EVT to distinguish unknown classes c2ae19cvpr . Subsequent works use conditional Gaussian distributions by forcing different latent features to approximate class-wise Gaussian models, which enables classifying known samples as well as rejecting unknown samples cgdl20cvpr . Other methods generate counterfactual images, which help the model focus more on semantics counterfactual21cvpr . Adversarial defense is also considered in osad20eccv to enhance model robustness.

Discussion  Although there is no independent section for density-based methods, these methods play an important role and are fused as a critical step into some classification-based methods such as OpenMax openmax16cvpr . Density estimation on visual embeddings can effectively detect unknown classes without influencing the classification performance. A hybrid model also uses a flow-based density estimator to detect unknown samples openhybrid20eccv .

As introduced in Section 2.4, the general goal of OSR and OOD detection is aligned: detecting semantic shift from the training data. Therefore, we encourage methods from these two fields to learn more from each other. For example, apart from novel methods, OSR research also shows that a good closed-set classifier vaze2021open is critical to OSR performance, which should also be applicable to OOD detection tasks.

4.2 Anomaly Detection & Novelty Detection

This section reviews methodologies for sensory and semantic AD and one-class ND; note that multi-class ND is covered in the previous section. Given homogeneous in-distribution data, the approaches include density-based, reconstruction-based, distance-based, and classification-based methods. We also discuss theoretical works.

Density-based Methods  Density-based methods model the distribution of normal (ID) data, assuming that anomalous test data has low likelihood while normal data has higher likelihood under the estimated model. Techniques include classic density estimation, density estimation with deep generative models, energy-based models, and frequency-based methods.

Parametric density estimation assumes pre-defined distributions anomalyparametric98pami . Methods involve multivariate Gaussian distribution mahalanobis00cils ; mahalanobis18jesp , mixed Gaussian distribution gmm84siam ; anomalygmm00icml , and Poisson distribution poisson16isi . Non-parametric density estimation handles more complex scenarios nonparametric91jasa with histograms histogram73cstm ; histogram12wireless ; histogram09traffic ; hbos12ki and kernel density estimation (KDE) kde62math ; kdeanomaly98ime ; anomalykde18tkde .

Neural networks generate high-quality features to enhance classic density estimation. Techniques include autoencoder (AE) ae91aiche and variational autoencoder (VAE) vae13arxiv -based models, generative adversarial networks (GANs) gan14nips , flow-based models flows15icml ; flowreview20pami , and representation enhancement strategies.

Energy-based models (EBMs) use scalar energy scores to express probability density energy11icml and provide a solution for AD energyad16icml . Training EBMs can be computationally expensive, but score matching scorematch05jmlr and stochastic gradient Langevin dynamics sgld11icml enable efficient training.

Frequency domain analysis for AD includes methods like CNN kernel smoothing highfreq20cvpr , spectrum-oriented data augmentation amplitude21iccv , and phase spectrum targeting phase21cvpr . These mainly focus on sensory AD.

Reconstruction-based Methods  These AD methods leverage the difference in model performance on normal and abnormal data, measured either in the feature space or by the reconstruction error.

Sparse reconstruction assumes normal samples can be accurately reconstructed using a limited set of basis functions, while anomalies have larger reconstruction costs and a dense representation admm15jsps ; adsparsecrowd17mta ; adsparsevideo13tcsvt . Techniques include $L_1$ norm-based kernel PCA kpca13pr and low-rank embedded networks lren21aaai .

Reconstruction-error methods assume a model trained on normal data will produce better reconstructions for normal test samples than anomalies. Deep models include AEs recerrae18wts , VAEs recerrvae15ie , GANs ganad18iclr , and U-Net framepred18cvpr .

AE/VAE-based models combine reconstruction error with AE/VAE models recerrae18wts ; recerrvae15ie and use strategies such as reconstructing from memorized normality menae19iccv ; memad20cvpr , adapting model architectures rsr20iclr , and partial/conditional reconstruction scadn21aaai ; gpnd18nips ; conad19icml . In semi-supervised AD, CoRA cora19aaai trains two AEs on inliers and outliers respectively, using the reconstruction errors for anomaly detection. Reconstruction-error methods based on GANs leverage the discriminator to calculate the reconstruction error for anomaly detection ganad18iclr . Variants such as denoising GANs advrec18cvpr , class-conditional GANs ocgan19cvpr , and ensembling ganesb21aaai further improve performance. Gradient-based methods observe different training-gradient patterns between normal and anomalous data in a reconstruction task, using gradient-based representations to characterize anomalies gradcon20eccv .

Distance-based Methods  These methods detect anomalies by calculating the distance between samples and prototypes wettschereck1994study , which requires keeping the training data in memory. Methods include K-nearest neighbors tian2014anomaly and prototype-based methods munz2007traffic ; syarif2012unsupervised .

Classification-based Methods  AD and one-class ND are often formulated as unsupervised learning problems, but supervised and semi-supervised methods also exist. One-class classification (OCC) directly learns a decision boundary that corresponds to a desired density level set of the normal data distribution occ02tax , and DeepSVDD deepsvdd18icml introduced classic OCC to the deep learning community. PU learning pusurvey08isip ; pusurvey20ml ; pusurvey19iisa ; deepsad20iclr addresses the semi-supervised AD setting where unlabeled data is available in addition to the normal data. Self-supervised learning methods use pretext tasks such as contrastive learning csi20nips , image transformation prediction goad20iclr ; transform18nips , and future frame prediction ssmtl21cvpr , where anomalies are more likely to make mistakes on the designed task.
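A classic shallow baseline of this family can be sketched with scikit-learn as follows; deep variants such as DeepSVDD replace the kernel machine with a learned feature extractor and a hypersphere objective.

```python
from sklearn.svm import OneClassSVM

def fit_one_class_scorer(normal_feats, nu=0.1):
    """One-class classification baseline (sketch).

    Fits an OC-SVM on features of normal (ID) data only; the signed
    distance to the learned boundary (`decision_function`) serves as the
    normality score, with negative values flagged as anomalies.
    """
    ocsvm = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(normal_feats)
    return ocsvm.decision_function  # callable: feats (N, D) -> scores (N,)
```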

Discussion: Sensory vs Semantic AD  Sensory and semantic AD approaches assume the normal data to be homogeneous, despite the presence of multiple categories within it. While semantic AD methods are largely applicable to sensory AD problems as well, sensory AD can further benefit from techniques that focus on lower-level features (e.g., flow-based and hidden-feature-based methods), local representations, and frequency-based analysis. Although current OOD detection tasks mostly focus on semantic shift, methods for sensory AD might be especially helpful for far-OOD detection, e.g., ImageNet vs. the Texture dataset.

Discussion: Theoretical Analysis  In addition to algorithmic development, theoretical analyses of AD and one-class ND have also been provided in some works. For instance, pac18icml constructs a clean set of ID data and a mixed set of ID/OOD data with identical sample sizes, achieving a PAC-style finite-sample guarantee for detecting a certain portion of anomalies with the minimum number of false alarms. All these works could be beneficial to theoretical work on OOD detection.

4.3 Outlier Detection

Outlier detection (OD) observes all samples jointly and identifies those that deviate significantly from the majority distribution. Though mostly studied in data mining, deep learning-based OD methods are used for data cleaning in learning with open-set noisy data iterativeosnl18cvpr ; webly15iccv and in open-set semi-supervised learning openworldssl21arxiv .

Density-based Methods  Classic OD methods include modeling with the Gaussian distribution std05bmj ; devmedian13jesp , the Mahalanobis distance mahalanobis00cils , Gaussian mixtures gmm09siam , and the local outlier factor (LOF) lof00sigmod . RANSAC fischler1981random estimates the parameters of a mathematical model from data that contains outliers. Classic density methods and NN-based density methods can also be applied.
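As a small transductive example, the sketch below applies the local outlier factor to a set of observations and labels the outliers jointly, reflecting the OD setting where all samples are observed at once.

```python
from sklearn.neighbors import LocalOutlierFactor

def lof_outlier_labels(feats, n_neighbors=20):
    """Local outlier factor sketch for the transductive OD setting.

    Fits on all observations jointly and returns -1 for outliers and
    +1 for inliers.
    """
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    return lof.fit_predict(feats)
```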

Distance-based Methods  Outliers can be detected by neighbor counting distanceod13nips ; distanceod10vldb , DBSCAN clustering dbscan96kdd , and graph-based methods knngraph04icpr ; neibourgraph04jiis ; largegraph10icml ; graphad15dmkd ; graphad03sigkdd ; gbc07ictai ; gbc12iccse ; ngc21iccv ; webly20acmmm .

Classification-based Methods  AD methods like Isolation Forest isoforest08icdm and OC-SVM occ02tax ; deepsvdd18icml can be applied to OD. Deep learning models can identify outliers li2017learning . Techniques for robustness and feature generalizability include ensembling nguyen2019self , co-training han2018co , and distillation li2017learning ; webly20eccv .

Discussion  OD techniques are valuable for open-set semi-supervised learning, learning with open-set noisy labels, and novelty discovery. These solutions are especially applicable when OOD samples are exposed during the training stage yang2021scood .

Figure 4: Illustration of the CIFAR-10 benchmark used in Section 5. The CIFAR-100 benchmark simply swaps the positions of CIFAR-10 and CIFAR-100 in the figure.
Figure 5: Comparison between different methodologies under the generalized OOD detection framework on the CIFAR-10/100 benchmarks. Results are from OpenOOD yang2022openood . Different colors denote the method categories. Each method reports near-OOD (left bar) and far-OOD (right bar) AUROC scores, as introduced in Section 5.1. Method names in black originated from OOD detection works, red denotes AD methods, blue denotes OSR methods, and pink denotes models from model-uncertainty works.

5 Benchmarks and Experiments

In this section, we report a fair comparison of methodologies from different categories on the CIFAR benchmarks cifar-10 . The results originate from the OpenOOD benchmark yang2022openood . We selected several popular AD methods, OOD detection methods (post-hoc, training-required, and extra-data-required), and model robustness methods.

5.1 Benchmarks and Metrics

The common practice for building OOD detection benchmarks is to consider an entire dataset as in-distribution (ID), and then collect several datasets that are disconnected from any ID categories as OOD datasets. In this part, we show the results of two popular OOD benchmarks with the ID datasets CIFAR-10 cifar-10 and CIFAR-100 cifar-100 from OpenOOD (c.f. Figure 4), where each benchmark designs near-OOD and far-OOD datasets to facilitate a detailed analysis of the OOD detectors. Near-OOD datasets only have a semantic shift relative to the ID dataset, while far-OOD datasets further contain an obvious covariate (domain) shift.

CIFAR-10  CIFAR-10 cifar-10 is a 10-class dataset for general object classification, which contains 50k training images and 10k test images. As for the OOD datasets, we construct near-OOD with CIFAR-100 cifar-100 and TinyImageNet imagenet12nips ; note that 1,207 images are removed from TinyImageNet since they actually belong to CIFAR-10 classes yang2021scood . Far-OOD consists of MNIST LeCun2005TheMD , SVHN svhn , Texture texture , and Places365 zhou2017places , with 1,305 images removed due to semantic overlaps.

CIFAR-100  The other OOD detection benchmark uses CIFAR-100 cifar-100 as the in-distribution, which contains 50k training images and 10k test images over 100 classes. For the OOD datasets, near-OOD includes CIFAR-10 cifar-10 and TinyImageNet tinyimages08pami . Similar to the CIFAR-10 benchmark, 2,502 images are removed from TinyImageNet due to overlapping semantics with CIFAR-100 classes yang2021scood . Far-OOD consists of MNIST LeCun2005TheMD , SVHN svhn , Texture texture , and Places365 zhou2017places , with 1,305 images removed.

Metrics  We only report the AUROC scores, which measure the area under the Receiver Operating Characteristic (ROC) curve.
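For reference, the reported AUROC can be computed from per-sample detector scores as in the following sketch, where ID is treated as the positive class and higher scores indicate more ID-like inputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(id_scores, ood_scores):
    """AUROC for OOD detection (sketch).

    `id_scores` and `ood_scores` are ID-ness scores (higher = more ID)
    produced by any detector; ID is treated as the positive class.
    """
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    scores = np.concatenate([id_scores, ood_scores])
    return roc_auc_score(labels, scores)
```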

5.2 Experimental Setup

To ensure a fair comparison across methods that originate from different fields and have different implementations, unified settings with common hyperparameters and architecture choices are used. ResNet-18 he2016deep is the backbone network. If the implemented method requires training, a widely accepted setting is used: an SGD optimizer with a learning rate of 0.1, momentum of 0.9, and weight decay of 0.0005 for 100 epochs. For further details, please refer to OpenOOD yang2022openood ; zhang2023openood .
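A minimal sketch of the stated training configuration is shown below; the actual OpenOOD implementation may adapt the backbone for CIFAR-scale inputs, so this snippet only mirrors the hyperparameters listed above.

```python
import torch
from torchvision.models import resnet18

# Sketch of the unified setup: ResNet-18 backbone with the SGD hyperparameters
# stated above (lr 0.1, momentum 0.9, weight decay 5e-4, 100 epochs).
model = resnet18(num_classes=10)  # CIFAR-10 benchmark; use 100 for CIFAR-100
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
num_epochs = 100
```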

5.3 Experimental Results and Findings

Data Augmentation Methods are the Most Effective  We split Figure 5 into several sections based on the method type. Generally, the most effective methods are those from the model-uncertainty line of work that rely on data augmentation techniques. This group mainly includes simple and effective preprocessing methods such as PixMix hendrycks2021pixmix and CutMix cutmix19cvpr . PixMix achieves 93.1% AUROC on near-OOD for the CIFAR-10 benchmark, the best performance among all methods on this benchmark, and these methods also perform well on most of the other benchmarks. Similarly, other simple and effective methods for enhancing model uncertainty estimation, such as Ensemble deepensemble00iwmcs and Mixup mixup19nips , also demonstrate excellent performance.

Extra Data Seems Not Necessary?  Comparing UDG yang2021scood (the best method from the extra-data part) with KNN sun2022knn (the best from the extra-data-free part), we find that UDG’s advantage appears only on CIFAR-10 near-OOD, which is not satisfactory given that a large quantity of real outlier data is required. In this benchmark, we use the entire TinyImageNet training set as the extra data; since the choice of training outliers could greatly affect the performance of OOD detectors, further exploration is needed.

Post-Hoc Methods Outperform Training in General  Surprisingly, methods that require training do not necessarily perform better; in general, inference-only methods outperform trained methods. Nevertheless, trained models can generally be used in conjunction with post-hoc methods, which could further increase their performance.

Post-Hoc Methods are Making Progress  In general, post-hoc methods proposed since 2021 perform better than earlier ones, indicating that the direction of inference-only methods is promising and making progress. Recent methods also show improvements on more realistic datasets, whereas earlier methods focused on toy datasets. For example, the classic MDS performs well on MNIST but poorly on CIFAR-10 and CIFAR-100, while the recent KNN maintains good performance on MNIST, CIFAR-10, and CIFAR-100, and also shows outstanding performance on ImageNet yang2022openood .

Some AD Methods are Good at Far-OOD  Although anomaly detection (AD) methods were originally designed to detect pixel-level appearance differences on the MVTec-AD dataset, some of them, such as DRAEM and CutPaste, have shown potency in far-OOD detection. Both methods achieve high performance on far-OOD detection, especially when CIFAR-100 is used as the in-distribution dataset.

Explore OpenOOD for More Experimental Findings  Accompanying our survey, we lead the development of OpenOOD yang2022openood , an open-source codebase that provides a unified framework and benchmarking platform for conducting fair comparisons of various model architectures and OOD detection methods. OpenOOD is continuously updated and includes two comprehensive experimental reports yang2022openood ; zhang2023openood that delve into extensive analysis and discovery; we also provide a leaderboard to track SOTA methods. We encourage readers to explore OpenOOD’s resources for a deeper understanding of key aspects such as selecting model architectures, utilizing pre-trained models, practical applications, and detailed implementation insights.

5.4 Exclusion of Covariate-Shift Detection

While OpenOOD does not include settings for pure covariate shift, this was a deliberate choice: the primary focus is on semantic shifts, which are fundamental to OOD detection. By not separately analyzing covariate shifts, we aim to avoid potential misinterpretations and to prevent overemphasis on covariate shift detection. Experiments in yang2022fsood highlight a key finding: most current OOD detectors are more sensitive to covariate shifts than to semantic shifts, which leads to the concept of “full-spectrum OOD detection”, advocating for models that generalize to covariate shifts while simultaneously detecting samples with semantic shifts. More experimental evaluations can be found in OpenOOD v1.5 zhang2023openood .

6 Challenges and Future Directions

In this section, we discuss the challenges and future directions of generalized OOD detection.

6.1 Challenges

a. Proper Evaluation and Benchmarking  We hope this survey can clarify the distinctions and connections among the various sub-tasks and help future works properly identify their target problems and benchmarks within the framework. Mainstream OOD detection works primarily focus on detecting semantic shifts. Admittedly, the field of OOD detection can be very broad due to the diverse nature of distribution shifts. Such a broad OOD definition also leads to some challenges and concerns semanticanomaly20aaai ; gan2021language , which advocate a clear specification of the OOD type under consideration (e.g., semantic OOD, adversarial OOD, etc.) so that proposed solutions can be more specialized. Besides, the motivation for detecting a certain distribution shift also requires clarification: while the need to reject samples with semantic shift is apparent, detecting sensory OOD should be tied to meaningful scenarios that contextualize the necessity and practical relevance of the task.

We also urge the community to carefully construct benchmarks and evaluations. It is noticed that early work msp17iclr ignored the fact that some OOD datasets may contain images of ID categories, causing inaccurate performance evaluation. Fortunately, recent OOD detection works yang2021scood have realized this flaw and paid special attention to removing ID classes from OOD samples to ensure proper evaluation.

b. Outlier-free OOD Detection  The outlier exposure approach oe18nips imposes a strong assumption on the availability of OOD training data, which can be difficult to obtain in practice. Moreover, one needs to perform careful de-duplication to ensure that the outlier training data does not contain ID data. These restrictions may lead to inflexible solutions and prevent the adoption of such methods in the real world. Going forward, a major challenge for the field is to devise outlier-free learning objectives that are less dependent on auxiliary outlier datasets.

c. Tradeoff Between Classification and OOD Detection  In OSR and OOD detection, it is important to achieve two objectives simultaneously: one for the ID task (e.g., image classification), and another for the OOD detection task. For a shared network, an inherent trade-off may exist between the two tasks, and promising solutions should strive for both. Whether the two tasks contradict each other depends on the methodology. For example,  liu2019large advocated the integration of image classification and open-set recognition so that the model possesses discriminative recognition on known classes and sensitivity to novel classes at the same time.  vaze2021open also showed that the ability to detect novel classes can be highly correlated with the accuracy on closed-set classes.  yang2021scood demonstrated that optimizing for the cluster compactness of ID classes may facilitate both improved classification and distance-based OOD detection performance. Such solutions may be more desirable than ND, which develops a binary OOD detector separately from the classification model and requires deploying two models.

d. Real-world Benchmarks and Evaluations  Current methods in OOD detection are predominantly evaluated on smaller datasets like CIFAR. However, it has been observed that strategies effective on CIFAR may not perform as well on larger datasets like ImageNet, which has a more extensive semantic space. This discrepancy underscores the importance of conducting OOD detection evaluations in large-scale, real-world settings. Consequently, we recommend future research to focus on benchmarks based on ImageNet for OOD detection mos21cvpr and to explore large-scale Open Set Recognition (OSR) benchmarks vaze2021open to fully test the effectiveness of these methods. Additionally, recent research bitterwolf2023or highlights the presence of erroneous samples in ImageNet OOD benchmarks and introduces the corrected NINCO dataset for more accurate evaluations. Furthermore, expanding the scope of benchmarks to encompass real-world scenarios, such as more realistic datasets koh2021wilds ; cultrera2023leveraging , and object-level OOD detection du2022vos ; du2022unknown , can provide valuable insights, especially in safety-critical applications like autonomous driving.

6.2 Future Directions

a. Methodologies across Sub-tasks  Due to the inherent connections among different sub-tasks, their solution space can be shared and inspired by each other. For example, the recent emerging density-based OOD detection research (c.f. Section 3.2) can draw insights from the density-based AD methods (c.f. Section 4.2) that have been around for a long time.

b. OOD Detection & Generalization  An open-world classifier should consider two tasks, i.e., being robust to covariate shift while being aware of semantic shift. Existing works pursue these two goals independently. Recent work proposes a semantically coherent OOD detection framework yang2021scood that encourages detecting semantic OOD samples while being robust to negligible covariate shift. Given the vague definition of OOD, ming2022spurious proposed a formalization of OOD detection that explicitly takes into account the separation between invariant (semantically related) features and environmental (non-semantic) features. The work highlighted that spurious environmental features in the training set can significantly impact OOD detection, especially when the semantic OOD data contains the spurious feature. Further, full-spectrum OOD detection yang2022fsood highlights the effects of “covariate-shifted in-distribution” data and shows that most previous OOD detectors are unfortunately sensitive to covariate shift rather than semantic shift; this setting explicitly promotes the generalization ability of OOD detectors. Recent works on open long-tailed recognition liu2019large , open compound domain adaptation liu2020open , open-set domain adaptation panareda2017open , and open-set domain generalization shu2021open consider the potential existence of open-class samples. Looking ahead, we envision great research opportunities on how OOD detection and OOD generalization can better enable each other liu2019large , in terms of both algorithmic design and comprehensive performance evaluation.

c. OOD Detection & Open-Set Noisy Labels  Existing methods of learning from open-set noisy labels focus on suppressing the negative effects of noise iterativeosnl18cvpr ; mopro20iclr . However, the open-set noisy samples can be useful for outlier exposure (c.f. Section 3.1.2) ngc21iccv and can potentially benefit OOD detection. With a similar idea, the setting of open-set semi-supervised learning can be promising for OOD detection. We believe that combining OOD detection with these two fields can provide more insights and possibilities.

d. OOD Detection For Broader Learning Tasks  As mentioned in Section 3.6, OOD detection encompasses a broader spectrum of learning tasks, including multi-label classification wang2021canmulti , object detection du2022vos ; du2022unknown , image segmentation hendrycks2019scaling , time-series prediction kaur2022codit , and LiDAR-based 3D object detection nguyen2022out . For the classification task itself, researchers have also extended OOD detection techniques to improve the reliability of zero-shot pretrained models esmaeilpour2022zero (e.g., CLIP). Furthermore, some studies apply OOD detection methods to produce reliable image captions shalev2022baseline . Recent advancements extend OOD detection to continuously adaptive or online learning environments wu2023meta . Additionally, OOD detection shows promise in addressing model reliability issues in broader applications, such as mitigating hallucination problems in large language models zhou2020detecting . The integration of OOD detection methods promises to enhance the reliability and practicality of models across various fields, and insights from these fields could, in turn, further refine OOD detection techniques.

e. OOD Detection with World Models  Existing works utilizing foundation models, particularly multi-modal ones such as CLIP radford2021learning , have significantly enhanced OOD detection performance, as discussed in Section 3.1.5. Building on this, recent advancements further leverage the extensive world knowledge encapsulated in large language models dai2023exploring . This approach aligns with the rapid development of multi-modal world models yang2023dawn ; liu2023visual ; li2023otter , presenting burgeoning opportunities for further innovation within the OOD detection community.

7 Conclusion

In this survey, we comprehensively review five topics: AD, ND, OSR, OOD detection, and OD, and unify them as a framework of generalized OOD detection. By articulating the motivations and definitions of each sub-task, we encourage follow-up works to accurately locate their target problems and find the most suitable benchmarks. By sorting out the methodologies for each sub-task, we hope that readers can easily grasp the mainstream methods, identify suitable baselines, and contribute future solutions in light of existing ones. By providing insights, challenges, and future directions, we hope that future works will pay more attention to the existing problems and explore more interactions across other tasks within or even outside the scope of generalized OOD detection.

Acknowledgment

This study is supported by the Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOE-T2EP20221-0012), NTU NAP, and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). YL is supported by the Office of the Vice Chancellor for Research and Graduate Education (OVCRGE) with funding from the Wisconsin Alumni Research Foundation (WARF).

Data Availability Statement

The datasets analyzed during the current study in Section 5 are available in the OpenOOD repository, https://github.com/Jingkang50/OpenOOD.

References

  • (1) D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety,” arXiv preprint arXiv:1606.06565, 2016.
  • (2) S. Mohseni, H. Wang, Z. Yu, C. Xiao, Z. Wang, and J. Yadawa, “Practical machine learning safety: A survey and primer,” arXiv preprint arXiv:2106.04823, 2021.
  • (3) D. Hendrycks, N. Carlini, J. Schulman, and J. Steinhardt, “Unsolved problems in ML safety,” arXiv preprint arXiv:2109.13916, 2021.
  • (4) D. Hendrycks and M. Mazeika, “X-risk analysis for AI research,” arXiv preprint arXiv:2206.05862, 2022.
  • (5) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • (6) K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in ICCV, 2015.
  • (7) N. Drummond and R. Shearer, “The open world assumption,” in eSI Workshop, 2006.
  • (8) D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in ICLR, 2017.
  • (9) S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, 2010.
  • (10) D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, “Deeper, broader and artier domain generalization,” in ICCV, 2017.
  • (11) M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, 2018.
  • (12) C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional data,” in ACM SIGMOD, 2001.
  • (13) V. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial intelligence review, 2004.
  • (14) I. Ben-Gal, “Outlier detection,” in Data mining and knowledge discovery handbook, 2005.
  • (15) H. Wang, M. J. Bah, and M. Hammad, “Progress in outlier detection techniques: A survey,” Ieee Access, 2019.
  • (16) L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K.-R. Müller, “A unifying review of deep and shallow anomaly detection,” Proceedings of the IEEE, 2021.
  • (17) G. Pang, C. Shen, L. Cao, and A. v. d. Hengel, “Deep learning for anomaly detection: A review,” arXiv preprint arXiv:2007.02500, 2020.
  • (18) S. Bulusu, B. Kailkhura, B. Li, P. K. Varshney, and D. Song, “Anomalous example detection in deep learning: A survey,” IEEE Access, 2020.
  • (19) R. Chalapathy and S. Chawla, “Deep learning for anomaly detection: A survey,” arXiv preprint arXiv:1901.03407, 2019.
  • (20) M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, “A review of novelty detection,” Signal Processing, 2014.
  • (21) D. Miljković, “Review of novelty detection methods,” in MIPRO, 2010.
  • (22) M. Markou and S. Singh, “Novelty detection: a review—part 1: statistical approaches,” Signal processing, 2003.
  • (23) M. Markou and S. Singh, “Novelty detection: a review—part 2: neural network based approaches,” Signal processing, 2003.
  • (24) T. E. Boult, S. Cruz, A. R. Dhamija, M. Gunther, J. Henrydoss, and W. J. Scheirer, “Learning and the unknown: Surveying steps toward open world recognition,” in AAAI, 2019.
  • (25) C. Geng, S.-j. Huang, and S. Chen, “Recent advances in open set recognition: A survey,” TPAMI, 2020.
  • (26) A. Mahdavi and M. Carvalho, “A survey on open set recognition,” arXiv preprint arXiv:2109.00893, 2021.
  • (27) I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015.
  • (28) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” ICLR, 2018.
  • (29) J. Quiñonero-Candela, M. Sugiyama, N. D. Lawrence, and A. Schwaighofer, Dataset shift in machine learning. Mit Press, 2009.
  • (30) L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in CVPR, 2016.
  • (31) V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM computing surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
  • (32) F. Ahmed and A. Courville, “Detecting semantic anomalies,” in AAAI, 2020.
  • (33) L. Zhang, M. Goldstein, and R. Ranganath, “Understanding failures in out-of-distribution detection with deep generative models,” in ICML, 2021.
  • (34) N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,” IEEE Access, 2018.
  • (35) K. Patel, H. Han, and A. K. Jain, “Secure face unlock: Spoof detection on smartphones,” IEEE transactions on information forensics and security, 2016.
  • (36) D. Wen, H. Han, and A. K. Jain, “Face spoof detection with image distortion analysis,” IEEE Transactions on Information Forensics and Security, 2015.
  • (37) K. A. Nixon, V. Aimale, and R. K. Rowe, “Spoof detection schemes,” in Handbook of biometrics, 2008.
  • (38) G. Polatkan, S. Jafarpour, A. Brasoveanu, S. Hughes, and I. Daubechies, “Detection of forgery in paintings using supervised learning,” in ICIP, 2009.
  • (39) B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer, “The deepfake detection challenge (dfdc) preview dataset,” arXiv preprint arXiv:1910.08854, 2019.
  • (40) L. Jiang, Z. Guo, W. Wu, Z. Liu, Z. Liu, C. C. Loy, S. Yang, Y. Xiong, W. Xia, B. Chen, P. Zhuang, S. Li, S. Chen, T. Yao, S. Ding, J. Li, F. Huang, L. Cao, R. Ji, C. Lu, and G. Tan, “DeeperForensics Challenge 2020 on real-world face forgery detection: Methods and results,” arXiv preprint arXiv:2102.09471, 2021.
  • (41) P. Yang, D. Baracchi, R. Ni, Y. Zhao, F. Argenti, and A. Piva, “A survey of deep learning-based source image forensics,” Journal of Imaging, 2020.
  • (42) P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection,” in CVPR, 2019.
  • (43) W.-H. Chu and K. M. Kitani, “Neural batch sampling with reinforcement learning for semi-supervised anomaly detection,” in ECCV, 2020.
  • (44) D. J. Atha and M. R. Jahanshahi, “Evaluation of deep learning approaches based on convolutional neural networks for corrosion detection,” Structural Health Monitoring, 2018.
  • (45) H. Idrees, M. Shah, and R. Surette, “Enhancing camera surveillance using computer vision: a research note,” Policing: An International Journal, 2018.
  • (46) C. P. Diehl and J. B. Hampshire, “Real-time object classification and novelty detection for collaborative video surveillance,” in IJCNN, 2002.
  • (47) L.-J. Li and L. Fei-Fei, “Optimol: automatic online picture collection via incremental model learning,” IJCV, 2010.
  • (48) D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in ICLR, 2017.
  • (49) T. Fawcett, “An introduction to roc analysis,” Pattern recognition letters, 2006.
  • (50) D. M. Powers, “Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation,” JMLT, 2020.
  • (51) H. R. Kerner, D. F. Wellington, K. L. Wagstaff, J. F. Bell, C. Kwan, and H. B. Amor, “Novelty detection for multispectral images with application to planetary exploration,” in AAAI, 2019.
  • (52) H. Al-Behadili, A. Grumpe, and C. Wöhler, “Incremental learning and novelty detection of gestures in a multi-class system,” in AIMS, 2015.
  • (53) D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in ICML, 2017.
  • (54) P. Perera, R. Nallapati, and B. Xiang, “Ocgan: One-class novelty detection using gans with constrained latent representations,” in CVPR, 2019.
  • (55) Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun, “Learning discriminative reconstructions for unsupervised outlier removal,” in CVPR, 2015.
  • (56) W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,” TPAMI, 2013.
  • (57) A. R. Dhamija, M. Günther, and T. E. Boult, “Reducing network agnostophobia,” in NeurIPS, 2018.
  • (58) S. Liu, R. Garrepalli, T. Dietterich, A. Fern, and D. Hendrycks, “Open category detection with pac guarantees,” in ICML, 2018.
  • (59) Z. Fang, J. Lu, A. Liu, F. Liu, and G. Zhang, “Learning bounds for open-set learning,” in ICML, 2021.
  • (60) E. Sorio, A. Bartoli, G. Davanzo, and E. Medvet, “Open world classification of printed invoices,” in Proceedings of the 10th ACM symposium on Document engineering, 2010.
  • (61) H. Xu, B. Liu, L. Shu, and P. Yu, “Open-world learning and application to product classification,” in WWW, 2019.
  • (62) X. Huang, D. Kroening, W. Ruan, J. Sharp, Y. Sun, E. Thamo, M. Wu, and X. Yi, “A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability,” Computer Science Review, 2020.
  • (63) A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012.
  • (64) D. Hendrycks, M. Mazeika, and T. Dietterich, “Deep anomaly detection with outlier exposure,” in ICLR, 2019.
  • (65) J. Zhang, N. Inkawhich, R. Linderman, Y. Chen, and H. Li, “Mixture outlier exposure: Towards out-of-distribution detection in fine-grained environments,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5531–5540, January 2023.
  • (66) D. Hendrycks, S. Basart, M. Mazeika, M. Mostajabi, J. Steinhardt, and D. Song, “Scaling out-of-distribution detection for real-world settings,” in ICML, 2022.
  • (67) Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira, “Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data,” in CVPR, 2020.
  • (68) G. Pleiss, A. Souza, J. Kim, B. Li, and K. Q. Weinberger, “Neural network out-of-distribution detection for regression tasks,” 2019.
  • (69) O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al., “Starcraft ii: A new challenge for reinforcement learning,” arXiv preprint arXiv:1708.04782, 2017.
  • (70) A. Sedlmeier, T. Gabor, T. Phan, L. Belzner, and C. Linnhoff-Popien, “Uncertainty-based out-of-distribution detection in deep reinforcement learning,” arXiv preprint arXiv:1901.02219, 2019.
  • (71) D. Zimmerer, P. M. Full, F. Isensee, P. Jäger, T. Adler, J. Petersen, G. Köhler, T. Ross, A. Reinke, A. Kascenas, et al., “Mood 2020: A public benchmark for out-of-distribution detection and localization on medical images,” IEEE Transactions on Medical Imaging, 2022.
  • (72) M. I. Tariq, N. A. Memon, S. Ahmed, S. Tayyaba, M. T. Mushtaq, N. A. Mian, M. Imran, and M. W. Ashraf, “A review of deep learning security and privacy defensive techniques,” Mobile Information Systems, vol. 2020, 2020.
  • (73) R. Averly and W.-L. Chao, “Unified out-of-distribution detection: A model-specific perspective,” arXiv preprint arXiv:2304.06813, 2023.
  • (74) Wikipedia contributors, “Outlier from Wikipedia, the free encyclopedia,” 2021. [Online; accessed 12 August 2021].
  • (75) M. Bianchini, A. Belahcen, and F. Scarselli, “A comparative study of inductive and transductive learning with feedforward neural networks,” in Conference of the Italian Association for Artificial Intelligence, 2016.
  • (76) I. Ben-Gal, “Outlier detection,” in Data mining and knowledge discovery handbook, 2005.
  • (77) S. Basu and M. Meckesheimer, “Automatic outlier detection for time series: an application to sensor data,” Knowledge and Information Systems, 2007.
  • (78) Y. Dou, W. Li, Z. Liu, Z. Dong, J. Luo, and P. S. Yu, “Uncovering download fraud activities in mobile app markets,” in ASONAM, 2019.
  • (79) T. Xiao, C. Zhang, and H. Zha, “Learning to detect anomalies in surveillance video,” IEEE Signal Processing Letters, 2015.
  • (80) H. Liu, S. Shah, and W. Jiang, “On-line outlier detection and data cleaning,” Computers & chemical engineering, 2004.
  • (81) A. Loureiro, L. Torgo, and C. Soares, “Outlier detection using clustering methods: a data cleaning application,” in Proceedings of KDNet Symposium on Knowledge-based Systems, 2004.
  • (82) J. Van den Broeck, S. Argeseanu Cunningham, R. Eeckels, and K. Herbst, “Data cleaning: detecting, diagnosing, and editing data abnormalities,” PLoS medicine, 2005.
  • (83) Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S.-T. Xia, “Iterative learning with open-set noisy labels,” in CVPR, 2018.
  • (84) X. Chen and A. Gupta, “Webly supervised learning of convolutional networks,” in ICCV, 2015.
  • (85) K. Cao, M. Brbic, and J. Leskovec, “Open-world semi-supervised learning,” arXiv preprint arXiv:2102.03526, 2021.
  • (86) P. L. Bartlett and M. H. Wegkamp, “Classification with a reject option using a hinge loss,” Journal of Machine Learning Research, vol. 9, no. 8, 2008.
  • (87) C. Chow, “On optimum recognition error and reject tradeoff,” IEEE Transactions on Information Theory, 1970.
  • (88) G. Fumera and F. Roli, “Support vector machines with embedded reject option,” in International Workshop on Support Vector Machines, 2002.
  • (89) C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, 1995.
  • (90) A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in CVPR, 2015.
  • (91) K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. C. Loy, “Domain generalization: A survey,” arXiv preprint arXiv:2103.02503, 2021.
  • (92) Z. Liu, Z. Miao, X. Pan, X. Zhan, D. Lin, S. X. Yu, and B. Gong, “Open compound domain adaptation,” in CVPR, 2020.
  • (93) K. Han, A. Vedaldi, and A. Zisserman, “Learning to discover novel visual categories via deep transfer clustering,” in CVPR, 2019.
  • (94) B. Zhao and K. Han, “Novel visual category discovery with dual ranking statistics and mutual knowledge distillation,” NeurIPS, 2021.
  • (95) X. Jia, K. Han, Y. Zhu, and B. Green, “Joint representation learning and novel category discovery on single-and multi-modal data,” in ICCV, 2021.
  • (96) S. Vaze, K. Han, A. Vedaldi, and A. Zisserman, “Generalized category discovery,” in CVPR, 2022.
  • (97) K. Joseph, S. Paul, G. Aggarwal, S. Biswas, P. Rai, K. Han, and V. N. Balasubramanian, “Novel class discovery without forgetting,” in ECCV, 2022.
  • (98) W. Wang, V. W. Zheng, H. Yu, and C. Miao, “A survey of zero-shot learning: Settings, methods, and applications,” TIST, 2019.
  • (99) A. Bendale and T. Boult, “Towards open world recognition,” in CVPR, 2015.
  • (100) Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu, “Large-scale long-tailed recognition in an open world,” in CVPR, 2019.
  • (101) J. Parmar, S. Chouhan, V. Raychoudhury, and S. Rathore, “Open-world machine learning: applications, challenges, and opportunities,” ACM Computing Surveys, vol. 55, no. 10, pp. 1–37, 2023.
  • (102) G. Shafer and V. Vovk, “A tutorial on conformal prediction,” Journal of Machine Learning Research, vol. 9, no. 3, 2008.
  • (103) A. N. Angelopoulos and S. Bates, “A gentle introduction to conformal prediction and distribution-free uncertainty quantification,” arXiv preprint arXiv:2107.07511, 2021.
  • (104) R. Kaur, S. Jha, A. Roy, S. Park, E. Dobriban, O. Sokolsky, and I. Lee, “idecode: In-distribution equivariance for conformal out-of-distribution detection,” in AAAI, 2022.
  • (105) R. Kaur, K. Sridhar, S. Park, S. Jha, A. Roy, O. Sokolsky, and I. Lee, “Codit: Conformal out-of-distribution detection in time-series data,” arXiv e-prints, 2022.
  • (106) F. Cai, A. I. Ozdagli, N. Potteiger, and X. Koutsoukos, “Inductive conformal out-of-distribution detection based on adversarial autoencoders,” in IEEE International Conference on Omni-Layer Intelligent Systems (COINS), 2021.
  • (107) M. Salehi, H. Mirzaei, D. Hendrycks, Y. Li, M. H. Rohban, and M. Sabokrou, “A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges,” arXiv preprint arXiv:2110.14051, 2021.
  • (108) S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” in ICLR, 2018.
  • (109) K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework for detecting out-of-distribution samples and adversarial attacks,” in NeurIPS, 2018.
  • (110) W. Liu, X. Wang, J. D. Owens, and Y. Li, “Energy-based out-of-distribution detection,” in NeurIPS, 2020.
  • (111) C. S. Sastry and S. Oore, “Detecting out-of-distribution examples with gram matrices,” in ICML, 2020.
  • (112) H. Wang, W. Liu, A. Bocchieri, and Y. Li, “Can multi-label classification networks know what they don’t know?,” NeurIPS, 2021.
  • (113) J. Zhang, Q. Fu, X. Chen, L. Du, Z. Li, G. Wang, S. Han, D. Zhang, et al., “Out-of-distribution detection based on in-distribution data patterns memorization with modern hopfield energy,” in ICLR, 2023.
  • (114) Y. Sun, C. Guo, and Y. Li, “React: Out-of-distribution detection with rectified activations,” in NeurIPS, 2021.
  • (115) X. Dong, J. Guo, A. Li, W.-T. Ting, C. Liu, and H. Kung, “Neural mean discrepancy for efficient out-of-distribution detection,” in CVPR, 2022.
  • (116) Y. Sun and Y. Li, “Dice: Leveraging sparsification for out-of-distribution detection,” in ECCV, 2022.
  • (117) Y. Sun, Y. Ming, X. Zhu, and Y. Li, “Out-of-distribution detection with deep nearest neighbors,” in ICML, 2022.
  • (118) Z. Lin, S. D. Roy, and Y. Li, “Mood: Multi-level out-of-distribution detection,” in CVPR, 2021.
  • (119) C. S. Sastry and S. Oore, “Detecting out-of-distribution examples with in-distribution examples and gram matrices,” in NeurIPS-W, 2019.
  • (120) A. Djurisic, N. Bozanic, A. Ashok, and R. Liu, “Extremely simple activation shaping for out-of-distribution detection,” ICLR, 2023.
  • (121) J. Park, Y. G. Jung, and A. B. J. Teoh, “Nearest neighbor guidance for out-of-distribution detection,” in ICCV, 2023.
  • (122) J. Park, J. C. L. Chai, J. Yoon, and A. B. J. Teoh, “Understanding the feature norm for out-of-distribution detection,” in ICCV, 2023.
  • (123) X. Jiang, F. Liu, Z. Fang, H. Chen, T. Liu, F. Zheng, and B. Han, “Detecting out-of-distribution data through in-distribution class prior,” in ICML, 2023.
  • (124) X. Liu, Y. Lochman, and C. Zach, “Gen: Pushing the limits of softmax-based out-of-distribution detection,” in CVPR, 2023.
  • (125) T. DeVries and G. W. Taylor, “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865, 2018.
  • (126) Y. Wang, B. Li, T. Che, K. Zhou, Z. Liu, and D. Li, “Energy-based open-world uncertainty modeling for confidence calibration,” in ICCV, 2021.
  • (127) A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke, “Out-of-distribution detection using an ensemble of self supervised leave-out classifiers,” in ECCV, 2018.
  • (128) J. Bitterwolf, A. Meinke, and M. Hein, “Certifiably adversarially robust detection of out-of-distribution data,” in NeurIPS, 2020.
  • (129) J. Chen, Y. Li, X. Wu, Y. Liang, and S. Jha, “Robust out-of-distribution detection for neural networks,” arXiv preprint arXiv:2003.09711, 2020.
  • (130) M. Hein, M. Andriushchenko, and J. Bitterwolf, “Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem,” in CVPR, 2019.
  • (131) S. Choi and S.-Y. Chung, “Novelty detection via blurring,” in ICLR, 2020.
  • (132) J. Chen, Y. Li, X. Wu, Y. Liang, and S. Jha, “Atom: Robustifying out-of-distribution detection using outlier mining,” ECML&PKDD, 2021.
  • (133) M. Hein, M. Andriushchenko, and J. Bitterwolf, “Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem,” in CVPR, 2019.
  • (134) S. Thulasidasan, G. Chennupati, J. Bilmes, T. Bhattacharya, and S. Michalak, “On mixup training: Improved calibration and predictive uncertainty for deep neural networks,” in NeurIPS, 2019.
  • (135) S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in CVPR, 2019.
  • (136) T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
  • (137) D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, “Augmix: A simple data processing method to improve robustness and uncertainty,” arXiv preprint arXiv:1912.02781, 2019.
  • (138) D. Hendrycks, A. Zou, M. Mazeika, L. Tang, D. Song, and J. Steinhardt, “Pixmix: Dreamlike pictures comprehensively improve safety measures,” in CVPR, 2022.
  • (139) J. Tack, S. Mo, J. Jeong, and J. Shin, “Csi: Novelty detection via contrastive learning on distributionally shifted instances,” in NeurIPS, 2020.
  • (140) A. Meinke and M. Hein, “Towards neural networks that provably know when they don’t know,” arXiv preprint arXiv:1909.12180, 2019.
  • (141) K. Bibas, M. Feder, and T. Hassner, “Single layer predictive normalized maximum likelihood for out-of-distribution detection,” NeurIPS, 2021.
  • (142) Q. Wang, F. Liu, Y. Zhang, J. Zhang, C. Gong, T. Liu, and B. Han, “Watermarking for out-of-distribution detection,” in NeurIPS, 2022.
  • (143) X. Dong, J. Guo, A. Li, W.-T. Ting, C. Liu, and H. Kung, “Neural mean discrepancy for efficient out-of-distribution detection,” in CVPR, 2022.
  • (144) H. Wei, R. Xie, H. Cheng, L. Feng, B. An, and Y. Li, “Mitigating neural network overconfidence with logit normalization,” in ICML, 2022.
  • (145) K. Lee, K. Lee, K. Min, Y. Zhang, J. Shin, and H. Lee, “Hierarchical novelty detection for visual object recognition,” in CVPR, 2018.
  • (146) R. Huang and Y. Li, “Mos: Towards scaling out-of-distribution detection for large semantic space,” in CVPR, 2021.
  • (147) R. Linderman, J. Zhang, N. Inkawhich, H. Li, and Y. Chen, “Fine-grain inference on out-of-distribution data with hierarchical classification,” in Conference on Lifelong Learning Agents (CoLLAs), 2023.
  • (148) G. Shalev, Y. Adi, and J. Keshet, “Out-of-distribution detection using multiple semantic label representations,” in NeurIPS, 2018.
  • (149) S. Fort, J. Ren, and B. Lakshminarayanan, “Exploring the limits of out-of-distribution detection,” NeurIPS, 2021.
  • (150) W. Gan, “Language guided out-of-distribution detection,” 2021.
  • (151) Q. Yu and K. Aizawa, “Unsupervised out-of-distribution detection by maximum classifier discrepancy,” in ICCV, 2019.
  • (152) S. Mohseni, M. Pitale, J. Yadawa, and Z. Wang, “Self-supervised learning for generalizable out-of-distribution detection,” in AAAI, 2020.
  • (153) S. Thulasidasan, S. Thapa, S. Dhaubhadel, G. Chennupati, T. Bhattacharya, and J. Bilmes, “An effective baseline for robustness to distributional shift,” arXiv preprint arXiv:2105.07107, 2021.
  • (154) A.-A. Papadopoulos, M. R. Rajati, N. Shaikh, and J. Wang, “Outlier exposure with confidence control for out-of-distribution detection,” Neurocomputing, 2021.
  • (155) Y. Ming, Y. Fan, and Y. Li, “Poem: Out-of-distribution detection with posterior sampling,” in ICML, 2022.
  • (156) Y. Li and N. Vasconcelos, “Background data resampling for outlier-aware classification,” in CVPR, 2020.
  • (157) S. Mohseni, M. Pitale, J. Yadawa, and Z. Wang, “Self-supervised learning for generalizable out-of-distribution detection,” in AAAI, 2020.
  • (158) J. Yang, H. Wang, L. Feng, X. Yan, H. Zheng, W. Zhang, and Z. Liu, “Semantically coherent out-of-distribution detection,” in ICCV, 2021.
  • (159) F. Lu, K. Zhu, W. Zhai, K. Zheng, and Y. Cao, “Uncertainty-aware optimal transport for semantically coherent out-of-distribution detection,” in CVPR, 2023.
  • (160) A. Shafaei, M. Schmidt, and J. J. Little, “A less biased evaluation of out-of-distribution sample detectors,” in BMVC, 2019.
  • (161) J. Katz-Samuels, J. Nakhleh, R. Nowak, and Y. Li, “Training ood detectors in their natural habitats,” in ICML, 2022.
  • (162) Q. Wang, Z. Fang, Y. Zhang, F. Liu, Y. Li, and B. Han, “Learning to augment distributions for out-of-distribution detection,” arXiv preprint arXiv:2311.01796, 2023.
  • (163) K. Lee, H. Lee, K. Lee, and J. Shin, “Training confidence-calibrated classifiers for detecting out-of-distribution samples,” in ICLR, 2018.
  • (164) S. Vernekar, A. Gaurav, V. Abdelzad, T. Denouden, R. Salay, and K. Czarnecki, “Out-of-distribution detection in classifiers via generation,” in NeurIPS-W, 2019.
  • (165) K. Sricharan and A. Srivastava, “Building robust classifiers through generation of confident out of distribution examples,” in NeurIPS-W, 2018.
  • (166) T. Jeong and H. Kim, “Ood-maml: Meta-learning for few-shot out-of-distribution detection and classification,” in NeurIPS, 2020.
  • (167) X. Du, Z. Wang, M. Cai, and Y. Li, “Vos: Learning what you don’t know by virtual outlier synthesis,” in ICLR, 2022.
  • (168) L. Tao, X. Du, X. Zhu, and Y. Li, “Non-parametric outlier synthesis,” in ICLR, 2023.
  • (169) Q. Wang, J. Ye, F. Liu, Q. Dai, M. Kalander, T. Liu, J. Hao, and B. Han, “Out-of-distribution detection with implicit outlier transformation,” in ICLR, 2023.
  • (170) H. Zheng, Q. Wang, Z. Fang, X. Xia, F. Liu, T. Liu, and B. Han, “Out-of-distribution detection learning with unreliable out-of-distribution sources,” in NeurIPS, 2023.
  • (171) X. Du, X. Wang, G. Gozum, and Y. Li, “Unknown-aware object detection: Learning what you don’t know from videos in the wild,” in CVPR, 2022.
  • (172) R. Huang, A. Geng, and Y. Li, “On the importance of gradients for detecting distributional shifts in the wild,” in NeurIPS, 2021.
  • (173) C. Igoe, Y. Chung, I. Char, and J. Schneider, “How useful are gradients for ood detection really?,” arXiv preprint arXiv:2205.10439, 2022.
  • (174) Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in ICML, 2016.
  • (175) B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” NeurIPS, 2017.
  • (176) K. Osawa, S. Swaroop, A. Jain, R. Eschenhagen, R. E. Turner, R. Yokota, and M. E. Khan, “Practical deep learning with bayesian principles,” in NeurIPS, 2019.
  • (177) A. Malinin and M. Gales, “Predictive uncertainty estimation via prior networks,” in NeurIPS, 2018.
  • (178) A. Malinin and M. Gales, “Reverse kl-divergence training of prior networks: Improved uncertainty and adversarial robustness,” in NeurIPS, 2019.
  • (179) J. Nandy, W. Hsu, and M. L. Lee, “Towards maximizing the representation gap between in-domain & out-of-distribution examples,” in NeurIPS, 2020.
  • (180) K. Kim, J. Shin, and H. Kim, “Locally most powerful bayesian test for out-of-distribution detection using deep generative models,” NeurIPS, 2021.
  • (181) D. Hendrycks, K. Lee, and M. Mazeika, “Using pre-training can improve model robustness and uncertainty,” in ICML, 2019.
  • (182) D. Hendrycks, X. Liu, E. Wallace, A. Dziedzic, R. Krishnan, and D. Song, “Pretrained transformers improve out-of-distribution robustness,” arXiv preprint arXiv:2004.06100, 2020.
  • (183) Y. Ming and Y. Li, “How does fine-tuning impact out-of-distribution detection for vision-language models?,” IJCV, 2023.
  • (184) A. Miyai, Q. Yu, G. Irie, and K. Aizawa, “Can pre-trained networks detect familiar out-of-distribution data?,” arXiv preprint arXiv:2310.00847, 2023.
  • (185) A. Miyai, Q. Yu, G. Irie, and K. Aizawa, “Locoop: Few-shot out-of-distribution detection via prompt learning,” arXiv preprint arXiv:2306.01293, 2023.
  • (186) F. Lu, K. Zhu, K. Zheng, W. Zhai, and Y. Cao, “Likelihood-aware semantic alignment for full-spectrum out-of-distribution detection,” arXiv preprint arXiv:2312.01732, 2023.
  • (187) S. Esmaeilpour, B. Liu, E. Robertson, and L. Shu, “Zero-shot out-of-distribution detection based on the pretrained model clip,” in AAAI, 2022.
  • (188) Y. Ming, Z. Cai, J. Gu, Y. Sun, W. Li, and Y. Li, “Delving into out-of-distribution detection with vision-language representations,” in NeurIPS, 2022.
  • (189) H. Wang, Y. Li, H. Yao, and X. Li, “Clipn for zero-shot ood detection: Teaching clip to say no,” in ICCV, 2023.
  • (190) B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen, “Deep autoencoding gaussian mixture model for unsupervised anomaly detection,” in ICLR, 2018.
  • (191) D. Abati, A. Porrello, S. Calderara, and R. Cucchiara, “Latent space autoregression for novelty detection,” in CVPR, 2019.
  • (192) S. Pidhorskyi, R. Almohsen, D. A. Adjeroh, and G. Doretto, “Generative probabilistic novelty detection with adversarial autoencoders,” in NeurIPS, 2018.
  • (193) L. Deecke, R. Vandermeulen, L. Ruff, S. Mandt, and M. Kloft, “Image anomaly detection with generative adversarial networks,” in ECML&KDD, 2018.
  • (194) M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, “Adversarially learned one-class classifier for novelty detection,” in CVPR, 2018.
  • (195) I. Kobyzev, S. Prince, and M. Brubaker, “Normalizing flows: An introduction and review of current methods,” TPAMI, 2020.
  • (196) E. Zisselman and A. Tamar, “Deep residual flow for out of distribution detection,” in CVPR, 2020.
  • (197) D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” NeurIPS, 2018.
  • (198) A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in ICML, 2016.
  • (199) D. Jiang, S. Sun, and Y. Yu, “Revisiting flow generative models for out-of-distribution detection,” in ICLR, 2021.
  • (200) E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan, “Do deep generative models know what they don’t know?,” in ICLR, 2019.
  • (201) H. Choi, E. Jang, and A. A. Alemi, “Waic, but why? generative ensembles for robust anomaly detection,” arXiv preprint arXiv:1810.01392, 2018.
  • (202) P. Kirichenko, P. Izmailov, and A. G. Wilson, “Why normalizing flows fail to detect out-of-distribution data,” in NeurIPS, 2020.
  • (203) J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. DePristo, J. V. Dillon, and B. Lakshminarayanan, “Likelihood ratios for out-of-distribution detection,” in NeurIPS, 2019.
  • (204) J. Serrà, D. Álvarez, V. Gómez, O. Slizovskaia, J. F. Núñez, and J. Luque, “Input complexity and out-of-distribution detection with likelihood-based generative models,” in ICLR, 2020.
  • (205) Z. Xiao, Q. Yan, and Y. Amit, “Likelihood regret: An out-of-distribution detection score for variational auto-encoder,” in NeurIPS, 2020.
  • (206) H. Wang, Z. Li, L. Feng, and W. Zhang, “Vim: Out-of-distribution with virtual-logit matching,” in CVPR, 2022.
  • (207) J. Ren, S. Fort, J. Liu, A. G. Roy, S. Padhy, and B. Lakshminarayanan, “A simple fix to mahalanobis distance for improving near-ood detection,” arXiv preprint arXiv:2106.09022, 2021.
  • (208) E. Techapanurak, M. Suganuma, and T. Okatani, “Hyperparameter-free out-of-distribution detection using cosine similarity,” in ACCV, 2020.
  • (209) X. Chen, X. Lan, F. Sun, and N. Zheng, “A boundary based out-of-distribution classifier for generalized zero-shot learning,” in ECCV, 2020.
  • (210) A. Zaeemzadeh, N. Bisagno, Z. Sambugaro, N. Conci, N. Rahnavard, and M. Shah, “Out-of-distribution detection using union of 1-dimensional subspaces,” in CVPR, 2021.
  • (211) J. Van Amersfoort, L. Smith, Y. W. Teh, and Y. Gal, “Uncertainty estimation using a single deep deterministic neural network,” in ICML, 2020.
  • (212) H. Huang, Z. Li, L. Wang, S. Chen, B. Dong, and X. Zhou, “Feature space singularity for out-of-distribution detection,” arXiv preprint arXiv:2011.14654, 2020.
  • (213) Y. Ming, Y. Sun, O. Dia, and Y. Li, “Cider: Exploiting hyperspherical embeddings for out-of-distribution detection,” in ICLR, 2023.
  • (214) J.-H. Kim, S. Yun, and H. O. Song, “Neural relation graph: A unified framework for identifying label noise and outlier data,” in NeurIPS, 2023.
  • (215) T. Denouden, R. Salay, K. Czarnecki, V. Abdelzad, B. Phan, and S. Vernekar, “Improving reconstruction autoencoder out-of-distribution detection with mahalanobis distance,” arXiv preprint arXiv:1812.02765, 2018.
  • (216) Y. Zhou, “Rethinking reconstruction autoencoder-based out-of-distribution detection,” in CVPR, 2022.
  • (217) Y. Yang, R. Gao, and Q. Xu, “Out-of-distribution detection with semantic mismatch under masking,” in ECCV, 2022.
  • (218) W. Jiang, H. Cheng, M. Chen, S. Feng, Y. Ge, and C. Wang, “Read: Aggregating reconstruction error into out-of-distribution detection,” in AAAI, 2023.
  • (219) J. Li, P. Chen, S. Yu, Z. He, S. Liu, and J. Jia, “Rethinking out-of-distribution (ood) detection: Masked image modeling is all you need,” in CVPR, 2023.
  • (220) P. Morteza and Y. Li, “Provable guarantees for understanding out-of-distribution detection,” in AAAI, 2022.
  • (221) L. P. Jain, W. J. Scheirer, and T. E. Boult, “Multi-class open set recognition using probability of inclusion,” in ECCV, 2014.
  • (222) E. M. Rudd, L. P. Jain, W. J. Scheirer, and T. E. Boult, “The extreme value machine,” TPAMI, 2017.
  • (223) Z. Fang, Y. Li, J. Lu, J. Dong, B. Han, and F. Liu, “Is out-of-distribution detection learnable?,” in NeurIPS, 2022.
  • (224) A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  • (225) E. T. Jaynes, “Bayesian methods: General background,” 1986.
  • (226) R. M. Neal, Bayesian learning for neural networks. 2012.
  • (227) D. Gamerman and H. F. Lopes, Markov chain Monte Carlo: stochastic simulation for Bayesian inference. CRC Press, 2006.
  • (228) D. J. C. Mackay, Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.
  • (229) A. Y. Foong, Y. Li, J. M. Hernández-Lobato, and R. E. Turner, “’in-between’ uncertainty in bayesian neural networks,” in ICML-W, 2020.
  • (230) C. Peterson and E. Hartman, “Explorations of the mean field theory learning algorithm,” Neural Networks, 1989.
  • (231) F. Wenzel, K. Roth, B. S. Veeling, J. Światkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin, “How good is the bayes posterior in deep neural networks really?,” in ICML, 2020.
  • (232) A. Gelman, “Objections to bayesian statistics,” Bayesian Analysis, 2008.
  • (233) T. G. Dietterich, “Ensemble methods in machine learning,” in International workshop on multiple classifier systems, 2000.
  • (234) W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson, “A simple baseline for bayesian uncertainty in deep learning,” in NeurIPS, 2019.
  • (235) R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
  • (236) K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision (IJCV), 2022.
  • (237) K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in CVPR, 2022.
  • (238) M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in ECCV, 2022.
  • (239) P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, “Clip-adapter: Better vision-language models with feature adapters,” IJCV, 2023.
  • (240) J. Yang, K. Zhou, and Z. Liu, “Full-spectrum out-of-distribution detection,” arXiv preprint arXiv:2204.05306, 2022.
  • (241) E. D. C. Gomes, F. Alberge, P. Duhamel, and P. Piantanida, “Igeood: An information geometry approach to out-of-distribution detection,” in ICLR, 2022.
  • (242) W. J. Scheirer, L. P. Jain, and T. E. Boult, “Probability models for open set recognition,” TPAMI, 2014.
  • (243) R. L. Smith, “Extreme value theory,” Handbook of applicable mathematics, 1990.
  • (244) E. Castillo, Extreme value theory in engineering. Elsevier, 2012.
  • (245) A. Bendale and T. E. Boult, “Towards open set deep networks,” in CVPR, 2016.
  • (246) P. Perera and V. M. Patel, “Deep transfer learning for multiple class novelty detection,” in CVPR, 2019.
  • (247) P. Perera, V. I. Morariu, R. Jain, V. Manjunatha, C. Wigington, V. Ordonez, and V. M. Patel, “Generative-discriminative feature representations for open-set recognition,” in CVPR, 2020.
  • (248) X. Sun, H. Ding, C. Zhang, G. Lin, and K.-V. Ling, “M2iosr: Maximal mutual information open set recognition,” arXiv preprint arXiv:2108.02373, 2021.
  • (249) Z. Ge, S. Demyanov, Z. Chen, and R. Garnavi, “Generative openmax for multi-class open set classification,” in BMVC, 2017.
  • (250) L. Neal, M. Olson, X. Fern, W.-K. Wong, and F. Li, “Open set learning with counterfactual images,” in ECCV, 2018.
  • (251) D.-W. Zhou, H.-J. Ye, and D.-C. Zhan, “Learning placeholders for open-set recognition,” in CVPR, 2021.
  • (252) S. Kong and D. Ramanan, “Opengan: Open-set recognition via open data generation,” in ICCV, 2021.
  • (253) C. Geng and S. Chen, “Collective decision for open set recognition,” TKDE, 2020.
  • (254) J. Jang and C. O. Kim, “One-vs-rest network-based deep probability model for open set recognition,” arXiv preprint arXiv:2004.08067, 2020.
  • (255) P. Schlachter, Y. Liao, and B. Yang, “Open-set recognition using intra-class splitting,” in EUSIPCO, 2019.
  • (256) M. Masana, I. Ruiz, J. Serrat, J. van de Weijer, and A. M. Lopez, “Metric learning for novelty and anomaly detection,” in BMVC, 2018.
  • (257) Y. Shu, Y. Shi, Y. Wang, T. Huang, and Y. Tian, “p-odn: prototype-based open deep network for open set recognition,” Scientific reports, 2020.
  • (258) B. Liu, H. Kang, H. Li, G. Hua, and N. Vasconcelos, “Few-shot open-set recognition using meta-learning,” in CVPR, 2020.
  • (259) G. Chen, L. Qiao, Y. Shi, P. Peng, J. Li, T. Huang, S. Pu, and Y. Tian, “Learning open set network with discriminative reciprocal points,” in ECCV, 2020.
  • (260) R. Yoshihashi, W. Shao, R. Kawakami, S. You, M. Iida, and T. Naemura, “Classification-reconstruction learning for open-set recognition,” in CVPR, 2019.
  • (261) A. Cao, Y. Luo, and D. Klabjan, “Open-set recognition with gaussian mixture variational autoencoders,” AAAI, 2020.
  • (262) P. R. M. Júnior, R. M. De Souza, R. d. O. Werneck, B. V. Stein, D. V. Pazinato, W. R. de Almeida, O. A. Penatti, R. d. S. Torres, and A. Rocha, “Nearest neighbors distance ratio open-set classifier,” Machine Learning, 2017.
  • (263) H. Zhang and V. M. Patel, “Sparse representation-based open set recognition,” TPAMI, 2016.
  • (264) P. Bodesheim, A. Freytag, E. Rodner, M. Kemmler, and J. Denzler, “Kernel null space methods for novelty detection,” in CVPR, 2013.
  • (265) J. Liu, Z. Lian, Y. Wang, and J. Xiao, “Incremental kernel null space discriminant analysis for novelty detection,” in CVPR, 2017.
  • (266) P. Oza and V. M. Patel, “C2ae: Class conditioned auto-encoder for open-set recognition,” in CVPR, 2019.
  • (267) X. Sun, Z. Yang, C. Zhang, K.-V. Ling, and G. Peng, “Conditional gaussian distribution learning for open set recognition,” in CVPR, 2020.
  • (268) Z. Yue, T. Wang, Q. Sun, X.-S. Hua, and H. Zhang, “Counterfactual zero-shot and open-set visual recognition,” in CVPR, 2021.
  • (269) R. Shao, P. Perera, P. C. Yuen, and V. M. Patel, “Open-set adversarial defense,” in ECCV, 2020.
  • (270) H. Zhang, A. Li, J. Guo, and Y. Guo, “Hybrid models for open set recognition,” in ECCV, 2020.
  • (271) S. Vaze, K. Han, A. Vedaldi, and A. Zisserman, “Open-set recognition: A good closed-set classifier is all you need,” in ICLR, 2022.
  • (272) G. Danuser and M. Stricker, “Parametric model fitting: From inlier characterization to outlier detection,” TPAMI, 1998.
  • (273) R. De Maesschalck, D. Jouan-Rimbaud, and D. L. Massart, “The mahalanobis distance,” Chemometrics and intelligent laboratory systems, 2000.
  • (274) C. Leys, O. Klein, Y. Dominicy, and C. Ley, “Detecting multivariate outliers: Use a robust variant of the mahalanobis distance,” Journal of Experimental Social Psychology, 2018.
  • (275) R. A. Redner and H. F. Walker, “Mixture densities, maximum likelihood and the em algorithm,” SIAM review, 1984.
  • (276) E. Eskin, “Anomaly detection over noisy data using learned probability distributions,” in ICML, 2000.
  • (277) M. Turcotte, J. Moore, N. Heard, and A. McPhall, “Poisson factorization for peer-based anomaly detection,” in IEEE Conference on Intelligence and Security Informatics (ISI), 2016.
  • (278) A. J. Izenman, “Review papers: Recent developments in nonparametric density estimation,” Journal of the American Statistical Association, 1991.
  • (279) J. Van Ryzin, “A histogram method of density estimation,” Communications in Statistics-Theory and Methods, 1973.
  • (280) M. Xie, J. Hu, and B. Tian, “Histogram-based online anomaly detection in hierarchical wireless sensor networks,” in ICTSPCC, 2012.
  • (281) A. Kind, M. P. Stoecklin, and X. Dimitropoulos, “Histogram-based traffic anomaly detection,” IEEE Transactions on Network and Service Management, 2009.
  • (282) M. Goldstein and A. Dengel, “Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm,” KI-2012: Poster and Demo Track, 2012.
  • (283) E. Parzen, “On estimation of a probability density function and mode,” The annals of mathematical statistics, 1962.
  • (284) M. Desforges, P. Jacob, and J. Cooper, “Applications of probability density estimation to the detection of abnormal conditions in engineering,” Proceedings of the institution of mechanical engineers, 1998.
  • (285) W. Hu, J. Gao, B. Li, O. Wu, J. Du, and S. Maybank, “Anomaly detection using local kernel density estimation and context-based regression,” TKDE, 2018.
  • (286) M. A. Kramer, “Nonlinear principal component analysis using autoassociative neural networks,” AIChE journal, 1991.
  • (287) D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • (288) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” NIPS, 2014.
  • (289) D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in ICML, 2015.
  • (290) J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng, “Learning deep energy models,” in ICML, 2011.
  • (291) S. Zhai, Y. Cheng, W. Lu, and Z. Zhang, “Deep structured energy based models for anomaly detection,” in ICML, 2016.
  • (292) A. Hyvärinen and P. Dayan, “Estimation of non-normalized statistical models by score matching,” Journal of Machine Learning Research, 2005.
  • (293) M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient langevin dynamics,” in ICML, 2011.
  • (294) H. Wang, X. Wu, Z. Huang, and E. P. Xing, “High-frequency component helps explain the generalization of convolutional neural networks,” in CVPR, 2020.
  • (295) G. Chen, P. Peng, L. Ma, J. Li, L. Du, and Y. Tian, “Amplitude-phase recombination: Rethinking robustness of convolutional neural networks in frequency domain,” ICCV, 2021.
  • (296) H. Liu, X. Li, W. Zhou, Y. Chen, Y. He, H. Xue, W. Zhang, and N. Yu, “Spatial-phase shallow learning: rethinking face forgery detection in frequency domain,” in CVPR, 2021.
  • (297) A. Adler, M. Elad, Y. Hel-Or, and E. Rivlin, “Sparse coding with anomaly detection,” Journal of Signal Processing Systems, 2015.
  • (298) A. Li, Z. Miao, Y. Cen, and Y. Cen, “Anomaly detection using sparse reconstruction in crowded scenes,” Multimedia Tools and Applications, 2017.
  • (299) X. Mo, V. Monga, R. Bala, and Z. Fan, “Adaptive sparse representations for video anomaly detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2013.
  • (300) Y. Xiao, H. Wang, W. Xu, and J. Zhou, “L1 norm based kpca for novelty detection,” Pattern Recognition, 2013.
  • (301) K. Jiang, W. Xie, J. Lei, T. Jiang, and Y. Li, “Lren: Low-rank embedded network for sample-free hyperspectral anomaly detection,” in AAAI, 2021.
  • (302) Z. Chen, C. K. Yeo, B. S. Lee, and C. T. Lau, “Autoencoder-based network anomaly detection,” in Wireless Telecommunications Symposium, 2018.
  • (303) J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” Special Lecture on IE, 2015.
  • (304) H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar, “Efficient gan-based anomaly detection,” in ICLR-W, 2018.
  • (305) W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection–a new baseline,” in CVPR, 2018.
  • (306) D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” in CVPR, 2019.
  • (307) H. Park, J. Noh, and B. Ham, “Learning memory-guided normality for anomaly detection,” in CVPR, 2020.
  • (308) C.-H. Lai, D. Zou, and G. Lerman, “Robust subspace recovery layer for unsupervised anomaly detection,” ICLR, 2020.
  • (309) X. Yan, H. Zhang, X. Xu, X. Hu, and P.-A. Heng, “Learning semantic context from normal samples for unsupervised anomaly detection,” in AAAI, 2021.
  • (310) D. T. Nguyen, Z. Lou, M. Klar, and T. Brox, “Anomaly detection with multiple-hypotheses predictions,” in ICML, 2019.
  • (311) K. Tian, S. Zhou, J. Fan, and J. Guan, “Learning competitive and discriminative reconstructions for anomaly detection,” in AAAI, 2019.
  • (312) X. Han, X. Chen, and L.-P. Liu, “Gan ensemble for anomaly detection,” arXiv preprint arXiv:2012.07988, 2020.
  • (313) G. Kwon, M. Prabhushankar, D. Temel, and G. AlRegib, “Backpropagated gradient representations for anomaly detection,” in ECCV, 2020.
  • (314) D. Wettschereck, “A study of distance-based machine learning algorithms,” 1994.
  • (315) J. Tian, M. H. Azarian, and M. Pecht, “Anomaly detection using self-organizing maps-based k-nearest neighbor algorithm,” in PHM Society European Conference, 2014.
  • (316) G. Münz, S. Li, and G. Carle, “Traffic anomaly detection using k-means clustering,” in GI/ITG Workshop MMBnet, 2007.
  • (317) I. Syarif, A. Prugel-Bennett, and G. Wills, “Unsupervised clustering approach for network anomaly detection,” in International conference on networked digital technologies, 2012.
  • (318) D. M. J. Tax, “One-class classification: Concept learning in the absence of counter-examples,” 2002.
  • (319) L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, “Deep one-class classification,” in ICML, 2018.
  • (320) B. Zhang and W. Zuo, “Learning from positive and unlabeled examples: A survey,” in International Symposiums on Information Processing, 2008.
  • (321) J. Bekker and J. Davis, “Learning from positive and unlabeled data: A survey,” Machine Learning, 2020.
  • (322) K. Jaskie and A. Spanias, “Positive and unlabeled learning algorithms and applications: A survey,” in International Conference on Information, Intelligence, Systems and Applications, 2019.
  • (323) L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft, “Deep semi-supervised anomaly detection,” ICLR, 2020.
  • (324) L. Bergman and Y. Hoshen, “Classification-based anomaly detection for general data,” in ICLR, 2020.
  • (325) I. Golan and R. El-Yaniv, “Deep anomaly detection using geometric transformations,” in NeurIPS, 2018.
  • (326) M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, “Anomaly detection in video via self-supervised and multi-task learning,” in CVPR, 2021.
  • (327) D. G. Altman and J. M. Bland, “Standard deviations and standard errors,” BMJ, 2005.
  • (328) C. Leys, C. Ley, O. Klein, P. Bernard, and L. Licata, “Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median,” Journal of experimental social psychology, 2013.
  • (329) X. Yang, L. J. Latecki, and D. Pokrajac, “Outlier detection with globally optimal exemplar-based gmm,” in SIAM, 2009.
  • (330) M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “Lof: identifying density-based local outliers,” in SIGMOD, 2000.
  • (331) M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, 1981.
  • (332) M. Sugiyama and K. Borgwardt, “Rapid distance-based outlier detection via sampling,” NIPS, 2013.
  • (333) G. H. Orair, C. H. Teixeira, W. Meira Jr, Y. Wang, and S. Parthasarathy, “Distance-based outlier detection: consolidation and renewed bearing,” Proceedings of the VLDB Endowment, 2010.
  • (334) M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, 1996.
  • (335) V. Hautamaki, I. Karkkainen, and P. Franti, “Outlier detection using k-nearest neighbour graph,” in ICPR, 2004.
  • (336) F. Muhlenbach, S. Lallich, and D. A. Zighed, “Identifying and handling mislabelled instances,” Journal of Intelligent Information Systems, 2004.
  • (337) W. Liu, J. He, and S.-F. Chang, “Large graph construction for scalable semi-supervised learning,” in ICML, 2010.
  • (338) L. Akoglu, H. Tong, and D. Koutra, “Graph based anomaly detection and description: a survey,” Data mining and knowledge discovery, 2015.
  • (339) C. C. Noble and D. J. Cook, “Graph-based anomaly detection,” in SIGKDD, 2003.
  • (340) Y. Kou, C.-T. Lu, and R. F. Dos Santos, “Spatial outlier detection: a graph-based approach,” in 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2007.
  • (341) Z. Mingqiang, H. Hui, and W. Qian, “A graph-based clustering algorithm for anomaly intrusion detection,” in International Conference on Computer Science & Education (ICCSE), 2012.
  • (342) Z.-F. Wu, T. Wei, J. Jiang, C. Mao, M. Tang, and Y.-F. Li, “Ngc: A unified framework for learning with open-world noisy data,” in ICCV, 2021.
  • (343) J. Yang, W. Chen, L. Feng, X. Yan, H. Zheng, and W. Zhang, “Webly supervised image classification with metadata: Automatic noisy label correction via visual-semantic graph,” in ACM Multimedia, 2020.
  • (344) F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in ICDM, 2008.
  • (345) Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li, “Learning from noisy labels with distillation,” in CVPR, 2017.
  • (346) D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox, “Self: Learning to filter noisy labels with self-ensembling,” in ICLR, 2020.
  • (347) B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” in NeurIPS, 2018.
  • (348) J. Yang, L. Feng, W. Chen, X. Yan, H. Zheng, P. Luo, and W. Zhang, “Webly supervised image classification with self-contained confidence,” in ECCV, 2020.
  • (349) J. Yang, P. Wang, D. Zou, Z. Zhou, K. Ding, W. Peng, H. Wang, G. Chen, B. Li, Y. Sun, X. Du, K. Zhou, W. Zhang, D. Hendrycks, Y. Li, and Z. Liu, “Openood: Benchmarking generalized out-of-distribution detection,” in NeurIPS, 2022.
  • (350) A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
  • (351) A. Krizhevsky, V. Nair, and G. Hinton, “Cifar-10 and cifar-100 datasets,” URL: https://www.cs.toronto.edu/~kriz/cifar.html, 2009.
  • (352) Y. LeCun and C. Cortes, “The mnist database of handwritten digits,” 2005.
  • (353) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
  • (354) G. Kylberg, “Kylberg texture dataset v. 1.0,” 2011.
  • (355) B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • (356) A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” TPAMI, 2008.
  • (357) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • (358) J. Zhang, J. Yang, P. Wang, H. Wang, Y. Lin, H. Zhang, Y. Sun, X. Du, K. Zhou, W. Zhang, Y. Li, Z. Liu, Y. Chen, and H. Li, “Openood v1.5: Enhanced benchmark for out-of-distribution detection,” arXiv preprint arXiv:2306.09301, 2023.
  • (359) J. Bitterwolf, M. Müller, and M. Hein, “In or out? fixing imagenet out-of-distribution detection evaluation,” in ICML, 2023.
  • (360) P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, et al., “Wilds: A benchmark of in-the-wild distribution shifts,” in ICML, 2021.
  • (361) L. Cultrera, L. Seidenari, and A. Del Bimbo, “Leveraging visual attention for out-of-distribution detection,” in ICCV, 2023.
  • (362) Y. Ming, H. Yin, and Y. Li, “On the impact of spurious correlation for out-of-distribution detection,” in AAAI, 2022.
  • (363) P. Panareda Busto and J. Gall, “Open set domain adaptation,” in ICCV, 2017.
  • (364) Y. Shu, Z. Cao, C. Wang, J. Wang, and M. Long, “Open domain generalization with domain-augmented meta-learning,” in CVPR, 2021.
  • (365) J. Li, C. Xiong, and S. C. Hoi, “Mopro: Webly supervised learning with momentum prototypes,” ICLR, 2021.
  • (366) V. D. Nguyen, “Out-of-distribution detection for lidar-based 3d object detection,” Master’s thesis, University of Waterloo, 2022.
  • (367) G. Shalev, G.-L. Shalev, and J. Keshet, “A baseline for detecting out-of-distribution examples in image captioning,” arXiv preprint arXiv:2207.05418, 2022.
  • (368) X. Wu, J. Lu, Z. Fang, and G. Zhang, “Meta ood learning for continuously adaptive ood detection,” in ICCV, 2023.
  • (369) C. Zhou, G. Neubig, J. Gu, M. Diab, P. Guzman, L. Zettlemoyer, and M. Ghazvininejad, “Detecting hallucinated content in conditional neural sequence generation,” ACL, 2020.
  • (370) Y. Dai, H. Lang, K. Zeng, F. Huang, and Y. Li, “Exploring large language models for multi-modal out-of-distribution detection,” arXiv preprint arXiv:2310.08027, 2023.
  • (371) Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu, and L. Wang, “The dawn of lmms: Preliminary explorations with gpt-4v(ision),” arXiv preprint arXiv:2309.17421, 2023.
  • (372) H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” arXiv preprint arXiv:2304.08485, 2023.
  • (373) B. Li, Y. Zhang, L. Chen, J. Wang, J. Yang, and Z. Liu, “Otter: A multi-modal model with in-context instruction tuning,” arXiv preprint arXiv:2305.03726, 2023.