
Automatic Open-World Reliability Assessment

2021 IEEE Winter Conference on Applications of Computer Vision (WACV)

Image classification in the open-world must handle out-of-distribution (OOD) images. Systems should ideally reject OOD images, or they will map atop known classes and reduce reliability. Using open-set classifiers that can reject OOD inputs can help. However, the optimal accuracy of open-set classifiers depends on the frequency of OOD data. Thus, for either standard or open-set classifiers, it is important to be able to determine when the world changes and increasing OOD inputs will result in reduced system reliability. However, during operations, we cannot directly assess accuracy as there are no labels. Thus, the reliability assessment of these classifiers must be done by human operators, made more complex because networks are not 100% accurate, so some failures are to be expected. To automate this process, herein, we formalize the open-world recognition reliability problem and propose multiple automatic reliability assessment policies to address this new problem using only the distribution of reported scores/probability data. The distributional algorithms can be applied both to classic classifiers with SoftMax and to the open-world Extreme Value Machine (EVM) to provide automated reliability assessment. We show that all of the new algorithms significantly outperform detection using the mean of SoftMax.

Mohsen Jafarzadeh, Touqeer Ahmad, Akshay Raj Dhamija, Chunchun Li, Steve Cruz, Terrance E. Boult*
University of Colorado, Colorado Springs, Colorado Springs, Colorado 80918, USA
{mjafarzadeh, tahmad, adhamija, cli, scruz, tboult}@vast.uccs.edu

* Supported in part by DARPA SAIL-ON Contract # HR001120C0055

1 Introduction

Many online businesses, especially shopping websites and social media, have created platforms where users can upload an image and the system classifies it into predefined classes, potentially with open-set classifiers, which, in addition to the known classes in the training data set, have an out-of-distribution (OOD) class, also known as a novel, garbage, unknown, reject, or background class, that the system otherwise ignores. The OOD class can be considered as a separate "background" class or as the origin of feature space [11]. Mapping too many images to the reject class, or misclassifying them, reduces users' satisfaction. But since there are no labels during operation, there is no obvious way to detect that the input distribution has changed.

Fig 1: Percentage of images that yield the given score/probability for either SoftMax or EVM. Panels: (a) SoftMax for 6 batches; (b) EVM for 6 batches; (c) scores over different data partitions. In (a) and (b), we show the percentage for six different batches of 100 images. Reliability assessment is about determining when the system's accuracy will be degraded by OOD samples. Can you determine which of the six batches (A–F) are "normal" and which have increased OOD data? As a hint, three are normal; the others have 3%, 5%, and 20% OOD data. See the text for answers. Open-set classifiers deal with OOD inputs by rejecting inputs below a threshold on score/probability. Plot (c) shows the cumulative percentage of images above threshold for knowns and below threshold for unknowns, i.e., the percentage of data correctly handled at that threshold. Assessing a change in reliability, i.e., detecting a shift in the distribution of knowns and unknowns, is a non-trivial problem because of the overlap of scores on knowns and OOD images. This paper formalizes and explores solutions to the open-world reliability problem.
Therefore, online businesses use human laborers to monitor image classification quality and to fine-tune the system when the quality drops. Monitoring using human labor is expensive. In other applications, e.g., robotic or vehicular navigation, reduced reliability from increasing OOD inputs must be handled automatically, as the processing cycle is too fast for human oversight.

Fig. 1 shows score distributions and why thresholding can be problematic, and why detecting changes in the mix of known and OOD data is difficult. Did you correctly guess that A, B, and C are the normal batches while D, E, and F have 3%, 5%, and 20% OOD data, respectively?

As we move toward operational reliability, we start by adding robustness to closed-set classifiers. We can further improve reliability by transitioning to open-set classifiers, or even to open-world classifiers that incrementally add the OOD samples [5]. While there has been a modest amount of research looking at various definitions of classifier reliability determination (see Section 2) and defining open-set and open-world recognition algorithms, no prior work has addressed the problem of reliability when there are changes in the frequency of OOD inputs.

In this work, we take operational reliability one step further by defining and presenting solutions to the problem of automatically assessing classifier reliability in an open-set/open-world setting. Such assessments are necessary to determine when it may be time to fine-tune and incrementally learn new classes; however, the choice of thresholds and processes for incremental learning is domain-specific and beyond the scope of this paper.

Extreme value theory (EVT) is a branch of statistics that studies the behavior of extreme events, the tails of probability distributions [8, 1, 6]. EVT estimates the probability of events that are more extreme than any of those already observed; it is an extrapolation from observed samples to unobserved samples. The extreme value machine (EVM) [26, 16] uses EVT to compute the radial probability of inclusion of a point with respect to the class of another. Therefore, the EVM has been successfully used for instance-level OOD detection. While the EVM is an inherently open-set classifier, it is less widely used, and hence it is also natural to ask if we can detect when there is a shift in the distribution for a standard closed-set classifier and hence the need for retraining. In this paper, we show that distribution-based algorithms, including a Gaussian model with Kullback-Leibler (KL) divergence applied to SoftMax scores or EVM scores, are effective in detection. We also develop a novel detection algorithm for the EVM, which makes Bernoulli distributional assumptions, and find that it works equally well. Thus, the proposed reliability algorithms can be used with both traditional "closed-set" and more recent "open-set" classifiers for open-world reliability assessment.

Our contributions:
• Formalizing open-world reliability assessment.
• Proposing three new policies to automate open-world reliability assessment of closed-set (SoftMax) and open-set (EVM) image classifiers.
• Developing a test-bed to evaluate the performance of reliability assessment policies.
• Comparing the proposed policies with simple approaches, showing a significant improvement over the base algorithm of tracking the mean of SoftMax scores.
2 Related work

The robustness of image classifiers has been studied for many years, with a known trade-off between the accuracy and robustness of image classifiers [28]. The effect of data transformations on robustness is demonstrated in [3]. Stability training is another way of increasing robustness [34]. Although improving robustness can increase an image classifier's reliability to variations of known classes, in practice robustness degrades when the ratio of OOD images to other images increases. When presented with an out-of-distribution input, most image classifiers map the OOD input to a known class.

Recently introduced open-set classifiers [2, 26, 32, 24, 4, 13, 19] have developed improved approaches to reasonably detect if an input is from a novel class (a.k.a. unknown or out-of-distribution) and map it to a reject or novel class label. A family of quasi-linear polyhedral conic discriminants is proposed in [7] to overcome label imbalance. An iterative learning framework for training image classifiers with open-set noisy labels is proposed in [31]. Networks can be trained for joint classification and reconstruction [32]. The class-conditioned auto-encoders proposed in [24] may improve accuracy on knowns while rejecting unknowns. Policies for a guided exploration of the unknown in the open-world are proposed in [21]. The robustness of open-world learning has been studied in [27]. Some techniques, such as [26], allow incremental updating to support open-world learning of these new unknowns, but provide no guidance on when the system's reliability has been reduced to the point of needing incremental updates. The quality of these open-set image classifiers is directly related to the ratio of known to unknown images. As this ratio decreases, the open-set image classifiers should be fine-tuned to maintain reliability on the data. Thus, businesses should invest in a reliability assessment policy.

There have been various studies of the reliability of classifiers [20, 35, 9] that looked at closed-set classifiers' accuracy or error. These do not apply in the open-world setting. Still others considered the largest few scores (extreme values) in the closed-set setting [36, 29, 15], a weak approximation of using extreme values for open-set classification as in [2]. In [22], reliability is defined as the difference between the classification confidence values. Matiz and Barner defined reliability as the confidence values of predictions [23]. In [18], the authors showed that confidence is not equal to reliability because image classifiers may put images in the wrong class with high confidence. Any reliability measure on a single input can be viewed as a type of classifier with rejection, which is a subset of open-set recognition algorithms. Looking at sets of data, either in batches or moving windows, it is necessary to detect the change in the distribution associated with increasing OOD samples.

In [25], the authors defined reliability as the inverse of the standard deviation, i.e., R = σ^(-1). Zhang defined reliability as the sum of rejection rate and accuracy [33]. This definition is not acceptable because if the image classifier puts all images in the reject class, the rejection rate is 1.0, and consequently, the reliability is 1.0. In [18], they used the false positive rate at points where no desirable correct predictions are lost. Their definition of reliability also fails for the same reason: if an image classifier says every image is negative (reject class), then it has a reliability of 1.0.
Considering the data as a sequence in time, one can view the reliability assessment problem as a specialized type of distributional change detection. In the change-detection literature, there is a wide range of algorithms, and an important one uses KL divergence [10]. Thus, while there has been a modest amount of research looking at various definitions of classifier reliability determination, no prior work addresses it from the point of view of open-set classifiers when there are changes in the frequency of OOD inputs.

3 Problem Formalization & Evaluation

The British Standards Institution defined reliability as "the characteristic of an item expressed by the probability that it will perform a required function under stated conditions for a stated period of time" [12]. In this paper, we assume a well-trained network, such that per-class errors are relatively consistent across classes, i.e., the network is trained such that its errors are not a function of the classes. With that assumption and the standard definition, networks should be reasonably reliable to changes in sampling frequency between classes in their training distributions, so distributional shifts between such classes are not important. However, since OOD data is not something we can use in training, we cannot assume uniformity of the network's errors on OOD data. Classifiers become unreliable if the frequency of samples from unseen classes changes and their frequency of rejection increases. Thus, we seek to formalize reliability via changes in the distribution with respect to OOD samples.

Definition 1 (Open-world Reliability Assessment). Let T denote the mixture of known and unknown classes seen in training, and U the classes unseen in training. Let x_1, ..., x_N be samples drawn from distribution D_1, where x ∈ D_1 → x ∈ T, i.e., D_1 is either consistent with training or with some operational validation set and hence is by definition reliable. Let x_{N+1}, ..., x_m be samples drawn from another distribution D_2. Let P_U(x; D) be the probability, given distribution D, that x ∈ U, i.e., the probability that x is from a class unseen in training. Finally, let M be a dissimilarity measure between two probability distributions. Open-world reliability assessment is determining if there is a change in the distribution of samples from the unknown or OOD classes, i.e.,

    M(P_U(x; D_2), P_U(x; D_1)) > δ        (1)

for some user-defined δ. When the above equation holds, we say that D_2 has a change in reliability.

The choice of M and δ may be domain-dependent, as the cost of errors may be influenced by the size and frequency of the distributional shifts. In this paper, we explore multiple models. As defined, this does not say whether reliability is increasing or decreasing, just that it changed. However, with some measures, one can determine the direction of the change. From an experimental point of view, we can introduce different distributions D_2 and measure detection rates based on the ground-truth label of when each was introduced.

Given a sequence of samples, detection of the distributional change from reliable to unreliable is often an important operational criterion. We define reliability localization error as the absolute error in the reported location of the reliability change, i.e., if the detection is reported at time n_r and the ground truth is n_g, the absolute error for a given trial is

    e_l = |n_g - n_r|        (2)

and we can consider the mean absolute error over many trials.
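To make Definition 1 and Eq. (2) concrete, the following is a minimal sketch (not the authors' released code) of a batch-level assessment loop: it takes per-image estimates of P_U(x), compares a dissimilarity measure M between a reference batch and each incoming batch against δ, and reports the localization error of Eq. (2). The particular measure M, the helper names, and the thresholds are illustrative assumptions.

```python
import numpy as np

def mean_shift_measure(p_ref: np.ndarray, p_batch: np.ndarray) -> float:
    """One possible dissimilarity M: absolute shift in the mean estimated
    probability of being unknown. Any divergence between the two
    distributions could be substituted here."""
    return float(abs(p_batch.mean() - p_ref.mean()))

def assess_reliability(p_unknown: np.ndarray, batch_size: int,
                       p_ref: np.ndarray, delta: float) -> int:
    """Return the index (in batches) of the first batch flagged as a
    reliability change, or -1 if no change is detected.
    p_unknown holds per-image estimates of P_U(x) in arrival order."""
    n_batches = len(p_unknown) // batch_size
    for b in range(n_batches):
        batch = p_unknown[b * batch_size:(b + 1) * batch_size]
        if mean_shift_measure(p_ref, batch) > delta:
            return b
    return -1

def localization_error(n_reported: int, n_ground_truth: int) -> int:
    """Absolute localization error of Eq. (2): e_l = |n_g - n_r|."""
    return abs(n_ground_truth - n_reported)
```

For instance, with a batch size of 100 and a change introduced at batch 20, a detection reported at batch 22 gives e_l = 2 (a late detection).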
We define on-time detection as reporting the change with zero absolute error for a given trial; early and late detections have a non-zero error with the appropriate sign.

To implement such an assessment, we need some way to estimate the probability of an input being out of distribution. One simple and common approach is using classifier confidence, i.e., SoftMax, as in [22]. Better OOD detection approaches are based on the more formal open-set classifiers that have been developed, and for this work, we consider the model from the extreme value machine (EVM) [26, 16]. While the EVM may be better for individual OOD detection, we show the proposed reliability assessment approaches work equally well for both and expect them to work well on top of any good per-instance estimate of the probability of an input being out-of-distribution.

Evaluation Protocol

Similar to testing open-set classifiers, testing of open-world reliability assessment needs a mixture of in-distribution classes that were used during training as well as out-of-distribution data. For open-set evaluation, we recommend a large number of classes to ensure good features; a small number of classes produces more degenerate features from overfitting. For this evaluation, we used all 1000 classes of ImageNet 2012 for training and validation as known classes. Then, to create a test set, we used combinations of the 1000 classes of the ImageNet 2012 validation set as known classes and 166 non-overlapping classes of the ImageNet 2010 training set as unknown classes.

We note that while some earlier work, such as [2, 26], used 2012/2010 splits, they claimed 2010 has 360 non-overlapping classes. While that may be true by class name, our analysis showed many super-class/subclass relations or renamed classes, e.g., ax and hatchet. To determine non-overlapping classes, we first processed the 2010 classes with a 2012 classifier to determine which classes had more than 10% of their test images classified within the same 2012 class. We then had a human check if the classes were the same. The final list has 166 classes and is in our GitHub repo at https://github.com/ROBOTICSENGINEER/AutomaticOpen-World-Reliability-Assessment.

Testing uses different percentages of data from the 2012 validation data as knowns and from the 2010 classes as the unknowns or out-of-distribution data. To evaluate the policies, we created 24 configuration tests. Each configuration has 10,000 tests. Each test consists of 4000 images. The first 2000 images are mostly knowns, with 20 of them unknowns (1%). The ratio of unknown images is increased in the second 2000 images, ranging from 2% to 25% in 1% increments. We used a batch size of 100.

4 Background on EVM

Because we use the EVM as the core of multiple new algorithms, we briefly review its properties, so this paper is self-contained. The EVM is a distance-based, kernel-free, non-linear classifier that uses the Weibull family of distributions to compute the radial probability of inclusion of a point with respect to the class of another. The Weibull family of distributions is defined as

    W(x; μ, σ, ξ) = { exp(-(1 + ξ(x - μ)/σ)^(-1/ξ)),  x < μ - σ/ξ
                    { 1,                              x ≥ μ - σ/ξ        (3)

where μ ∈ R, σ ∈ R+, and ξ ∈ R- are the location, scale, and shape parameters [8, 1, 6]. The EVM provides a compact probabilistic representation of each class's decision boundary, characterized in terms of its extreme vectors.
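A minimal numerical sketch of the Weibull-family distribution follows, written directly from Eq. (3) as reconstructed above (so the exact parameterization is our reading of the formula, not the authors' released implementation). Since ξ < 0, the support condition x < μ - σ/ξ is an upper bound.

```python
import numpy as np

def weibull_family_cdf(x, mu: float, sigma: float, xi: float):
    """Weibull family of Eq. (3), i.e., the GEV CDF restricted to shape xi < 0:
    W(x) = exp(-(1 + xi*(x - mu)/sigma)**(-1/xi)) for x < mu - sigma/xi, else 1."""
    assert sigma > 0 and xi < 0, "Eq. (3) assumes sigma > 0 and xi < 0"
    x = np.asarray(x, dtype=float)
    bound = mu - sigma / xi                  # upper end of the support (xi < 0)
    z = 1.0 + xi * (x - mu) / sigma          # positive exactly where x < bound
    z = np.where(x < bound, z, np.nan)       # mask the saturated branch
    w = np.exp(-np.power(z, -1.0 / xi))      # GEV kernel on the support
    return np.where(x < bound, w, 1.0)       # W = 1 at and beyond the bound
```

For ξ < 0 this should agree with scipy.stats.genextreme.cdf(x, c=-xi, loc=mu, scale=sigma), since SciPy uses the opposite sign convention for the GEV shape parameter.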
Each extreme vector has an associated Weibull distribution. The probability of a point belonging to a class is defined as the maximum probability of the point belonging to any extreme vector of that class:

    P̂(C_l | x) = max_k W_{l,k}(x; μ_{l,k}, σ_{l,k}, ξ_{l,k})        (4)

where W_{l,k}(x) is the Weibull probability of x corresponding to the k-th extreme vector in class l. If we denote the unknown label by 0, the predicted class label ŷ ∈ W can be computed as follows:

    m(x) = max_l P̂(C_l | x)                 (5)
    z(x) = argmax_l P̂(C_l | x)              (6)
    q(x; τ) = Heaviside(m(x) - τ)            (7)
    ŷ(x; τ) = q(x; τ) z(x)                   (8)

where τ ∈ R+ is a threshold that can be optimized on some validation set. If the threshold is too large, the EVM tends to put known images into the reject class, i.e., a high false rejection rate. If the threshold is too small, the EVM classifies unknown images into the known classes, i.e., a high false classification rate. Therefore, this threshold acts as a slider in the trade-off between false rejection rate and false classification rate.

5 Algorithms for Open-world Reliability Assessment

We process batches, or windows of data, and attempt to detect the first batch where the distribution has shifted. Given the definition, the natural family of algorithms is to estimate the probability of an input being unknown, then estimate properties of the distributions and compare them.

Mean of SoftMax

Thresholding the SoftMax score of a neural network is well known and is the simplest method for out-of-distribution rejection of single instances [14]. Tracking the mean of recognition scores is a natural way to assess whether the input distribution has changed. To design a reliability assessor, we use this simplest estimate of a distribution, its mean, based on the observation that if the distribution changes to include more OOD samples, the mean SoftMax score should also be reduced. In this method, we collect the maximum value of SoftMax for each image in the batch and save it in a buffer, where the buffer size is equal to the batch size. Then, we compute the mean of the buffer. Finally, reliability can be assessed by comparing the mean with a calibrated mean value, e.g., a mean calibrated to optimize performance on a validation set with some fixed mixing ratio. Algorithm 1 in the supplemental material summarizes this method.

Kullback-Leibler Divergence of Truncated Gaussian Model of SoftMax Scores

The mean of the maximum SoftMax only considers the first moment of the distribution. Therefore, if the distribution of a batch of images changes while the mean of the distribution remains constant or changes only a little, the first method fails. A more advanced method for change detection uses Kullback-Leibler divergence with a more advanced distributional model of the data. The Kullback-Leibler divergence is one of the most fundamental measures in information theory and measures the relative entropy

    KL(P ‖ Q) = ∫_{-∞}^{+∞} p(x) log(p(x) / q(x)) dx        (9)

where x is the maximum SoftMax, p(x) is the probability density function of the testing batch, and q(x) is the probability density function of the training data set. Making the classic Gaussian assumption for the distributions, i.e., letting p(x) ~ N(μ, σ^2) and q(x) ~ N(m, s^2), we can derive the KL divergence measure as

    KL(P ‖ Q) = log(s/σ) + (σ^2 + (μ - m)^2) / (2 s^2) - 1/2        (10)

Algorithm 2 in the supplemental material summarizes this method, which is a direct application of KL divergence as a measure between distributions of SoftMax values.
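As a concrete illustration, the sketch below shows the baseline mean comparison and a plain-Gaussian application of Eq. (10) to a batch of maximum SoftMax scores (the truncated-Gaussian moment matching the paper actually recommends is discussed next). The function and parameter names are ours, not the authors', and the thresholds are assumed to be calibrated on validation data.

```python
import numpy as np

def gaussian_kl(mu: float, sigma: float, m: float, s: float) -> float:
    """KL(N(mu, sigma^2) || N(m, s^2)) as in Eq. (10)."""
    return float(np.log(s / sigma) + (sigma**2 + (mu - m)**2) / (2.0 * s**2) - 0.5)

def mean_softmax_flag(max_softmax: np.ndarray, calibrated_mean: float,
                      margin: float) -> bool:
    """Baseline (spirit of Algorithm 1): flag a reliability change when the
    batch mean of the maximum SoftMax drops below the calibrated mean by a margin."""
    return bool(max_softmax.mean() < calibrated_mean - margin)

def kl_softmax_flag(max_softmax: np.ndarray, ref_mean: float, ref_std: float,
                    kl_threshold: float) -> bool:
    """KL-based policy (spirit of Algorithm 2): fit a Gaussian to the batch of
    maximum SoftMax scores and compare its divergence from the calibrated
    reference Gaussian against a threshold."""
    mu, sigma = float(max_softmax.mean()), float(max_softmax.std()) + 1e-8
    return gaussian_kl(mu, sigma, ref_mean, ref_std) > kl_threshold
```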
Looking at the distribution in Fig. 1(a), one might note that the distribution is not at all symmetric and is bounded above by 1, which is the mode. Hence, it is not well modeled by a Gaussian with the simple mean and standard deviation of the raw data. The important insight here is to consider this a truncated Gaussian and use a moment-matching approach for approximation, such as [17].

KL Divergence of Truncated Gaussian Model of EVM Scores

It is, of course, natural to also consider combining the information-theory-based KL divergence method with EVM-based probability estimates on each batch. Algorithm 4 in the supplement provides the details. Again, the distribution of scores shown in Fig. 1(b) is bounded, and this time, with two disparate peaks, it seems even less Gaussian-like. This is a novel combination of KL divergence and EVM scores and is included in the evaluation to help assess the information gain from EVM probabilities over SoftMax with exactly the same algorithm.

Fusion of KL Divergence Models

The KL divergences of SoftMax and EVM are both viable methods with different error characteristics. Thus, it is natural to ask if there is a way to fuse these algorithms. With KL divergence, an easy way to do this is to generalize to a bi-variate distributional model. We do this with a 2D Gaussian model, using a full 2x2 covariance matrix, and Algorithm 5 summarizes this method. Plots in the supplemental material show its performance, which is not significantly different. Thus, given the added cost, we do not recommend it.

Open-world Novelty Detection with EVM

There are many ways to consider EVM probabilities, and given that the shape of the data in Fig. 1 is not apparently Gaussian, we sought to develop a method with a different distributional assumption. We develop a novel algorithm, which we call open-world novelty detection (OND), designed to use only examples that have significant novelty, i.e., a high probability of being OOD samples. Not only do we build on the extreme-value-theory probabilities from the EVM, but rather than looking at the mean or a Gaussian model of all the data, we consider the mean of only those samples whose probability of being from an unknown class exceeds a threshold. We then combine that with a hypothesis test, which assumes a Bernoulli distribution of known and unknown inputs. This allows us to detect not just a distributional change but also whether the change is improving or reducing reliability.

From (4), we expect

    1 ≥ m(x_known) ≥ τ,    τ > m(x_unknown) ≥ 0        (11)

By subtracting from 1,

    1 - τ ≥ 1 - m(x_known) ≥ 0,    1 ≥ 1 - m(x_unknown) > 1 - τ        (12)

Let us define

    u(x) := 1 - m(x),    δ := 1 - τ        (13)

Then,

    δ ≥ u(x_known) ≥ 0,    1 ≥ u(x_unknown) > δ        (14)

Subtracting Δ ∈ R+,

    δ - Δ ≥ u(x_known) - Δ ≥ -Δ,    1 - Δ ≥ u(x_unknown) - Δ > δ - Δ        (15)

If 1 ≥ Δ > δ > 0,

    max{0, u(x_known) - Δ} = 0,    1 - Δ ≥ max{0, u(x_unknown) - Δ} > 0        (16)

In batch mode with N images,

    Σ max{0, u(x_known) - Δ} = 0,    N(1 - Δ) ≥ Σ max{0, u(x_unknown) - Δ} > 0        (17)

For a given domain, one might assume the mixture of known and OOD classes can be modeled by a Bernoulli distribution with probability ρ:

    f(x) = { ρ,      x ∈ OOD classes
           { 1 - ρ,  x ∈ known classes        (18)

Over time, ρ will change. If ρ is constant or decreases, the image classifier remains reliable. However, if it increases, the image classifier should be updated. By applying (18) in (17),

    ρ N (1 - Δ) ≥ Σ max{0, u(x) - Δ} > 0        (19)

We can find the error of this hypothesis as

    ε := max{0, (1/N) Σ max{0, u(x) - Δ} - ρ(1 - Δ)}        (20)

Finally, we propose the OND policy for automatic reliability assessment via thresholding on ε; see Algorithm 3 in the supplemental material. Note that this algorithm has added parameters Δ and ρ, which need to be estimated on training/validation data. Although not evaluated herein, this algorithm has the operational advantage that it can be trivially tuned to target expected OOD ratios by changing ρ.
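The OND quantity ε is simple to compute from a batch of maximum EVM probabilities m(x); the sketch below follows Eqs. (13)-(20) directly, with Δ and ρ treated as the calibrated parameters described above. The decision threshold on ε and the names are illustrative assumptions, not the released Algorithm 3.

```python
import numpy as np

def ond_epsilon(max_evm_prob: np.ndarray, Delta: float, rho: float) -> float:
    """Eq. (20): eps = max{0, (1/N) * sum_x max{0, u(x) - Delta} - rho*(1 - Delta)},
    where u(x) = 1 - m(x) is the estimated probability that x is unknown (Eq. 13).
    Delta is the parameter of Eqs. (15)-(20), not the delta = 1 - tau of Eq. (13)."""
    u = 1.0 - np.asarray(max_evm_prob, dtype=float)        # Eq. (13)
    excess = np.maximum(0.0, u - Delta)                    # per-image novelty above Delta, Eq. (16)
    return max(0.0, float(excess.mean()) - rho * (1.0 - Delta))   # Eq. (20)

def ond_flag(max_evm_prob: np.ndarray, Delta: float, rho: float,
             eps_threshold: float) -> bool:
    """OND policy (spirit of Algorithm 3): declare a reliability change when
    epsilon exceeds a threshold chosen on validation data."""
    return ond_epsilon(max_evm_prob, Delta, rho) > eps_threshold
```

Because ε grows only when the thresholded novelty mass exceeds what the assumed Bernoulli rate ρ allows, raising or lowering ρ directly targets the expected OOD ratio, which is the tuning advantage noted above.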
6 Experimental Results

For all experiments in this paper, we use an EfficientNet-B3 [30] network that was trained to classify the ImageNet 2012 training data, with 79.3% Top-1 accuracy on the 2012 validation set. We then extract features for images in the training set and train the EVM with these extracted features. In the evaluation, we consider the overall detection rate, on-time detection, and mean absolute error (see Eq. 2) as our primary metrics. We also show a metric called total accuracy. We start by considering the many batches in each test sequence and scoring each batch as 1.0 if it is correctly labeled Reliable before the change or correctly labeled Unreliable after the change, and scoring it 0.0 if incorrectly labeled. Total accuracy is the number of batches correctly labeled divided by the total number of batches.

In Fig. 2, we use a window size of 100 and select each algorithm's threshold to maximize its true detection rate (the sum of on-time and late detection) when the percentage of unknown (out-of-distribution) data is 2%. Operationally, users might determine the minimum variation they want to consider and the measure that is most important to them, then optimize that metric over the validation set to choose the parameter. For example, some users might prefer to maximize on-time detection or to minimize false detection or mean absolute error. Given these parameters, the performance of the various algorithms as the percentage of unknown data increases is shown in Fig. 2. For this setting, all of the distributional algorithms, the KL divergence of EVM or SoftMax (Sec. 5) as well as the OND Algorithm 3, have similarly good performance. The baseline mean-of-SoftMax Algorithm 1 has by far the lowest detection rate, total accuracy, and on-time rate.

Fig 2: Performance of the proposed policies when the threshold is selected to maximize the sum of on-time + late detection rates on a validation test with 2% unknown. Panels: (a) total detection percentage; (b) percentage of on-time detections; (c) mean absolute error.

The remaining plots in the paper are ablation studies, which allow us to understand deeper aspects of performance. In Fig. 3, we use batch/window sizes of 20 and 25 and select each algorithm's threshold to limit false detections to at most 1% when the percentage of unknown (out-of-distribution) data is 2%. With smaller windows, the difference between KL EVM and KL SoftMax becomes significant, while KL EVM and OND remain similar, showing that the difference is due to the superior EVM features for detection of OOD data.

Next, let us consider peak performance, which we shall see is a function of the percentage of OOD/unknown data, which is unknowable. This ablation study determines the impact of and sensitivity to thresholds; Fig. 4 demonstrates the effect of the threshold on the performance of Algorithm 5 at different percentages of unknown data.
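To make the scoring in this section concrete, here is a minimal sketch (our own illustrative code, not the released test-bed) of how total accuracy and the detection summary can be computed for one test sequence, given the per-batch Reliable/Unreliable decisions of a policy and the ground-truth batch index at which the distribution changes. Treating the change batch itself as unreliable is our assumption.

```python
import numpy as np

def total_accuracy(flags_unreliable: np.ndarray, change_batch: int) -> float:
    """Score each batch 1.0 if labeled Reliable before the change and
    Unreliable from the change onward, 0.0 otherwise; return the fraction
    of correctly labeled batches."""
    flags = np.asarray(flags_unreliable, dtype=bool)
    truth = np.arange(len(flags)) >= change_batch   # True where the sequence is unreliable
    return float((flags == truth).mean())

def detection_summary(flags_unreliable: np.ndarray, change_batch: int) -> dict:
    """First reported change, on-time indicator, and the absolute
    localization error of Eq. (2) for one test sequence."""
    flags = np.asarray(flags_unreliable, dtype=bool)
    detected = bool(flags.any())
    reported = int(np.argmax(flags)) if detected else -1
    return {
        "detected": detected,
        "on_time": detected and reported == change_batch,
        "abs_error": abs(change_batch - reported) if detected else None,
    }
```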
7 Discussion

Fig. 1 (on the first page) shows the distribution of the magnitude of the maximum probability of SoftMax and EVM on the known validation and OOD validation sets. There are few known images with low predicted probability, and, for SoftMax, many unknowns with high probability. Therefore, for any given threshold, a significant portion of the unknown images will be classified as known. Thus, with neither of these base models is it possible to design a reliability assessor without failures. This also makes it clear why very simple naive algorithms, such as thresholding on SoftMax and retraining when images with low scores are detected, are useless: because the network is not perfect, even training images will fail such a simple test, and the system will always be retraining. This is part of what makes this problem challenging and interesting; the algorithms must account for the imperfect nature of the core classification algorithm as well as handle the open-set nature of the problem while detecting the change in OOD.

Fig 3: True detection performance with different sliding window sizes for the proposed policies when the threshold is selected to limit false detection to at most 1%. We show the true detection rate, which is the sum of on-time and late detection. Panels: (a) true detection, window size 20; (b) true detection, window size 25.

The current EVM has three important parameters: cover threshold, tail size, and distance multiplier. We use 0.7 and 33998 for the cover threshold and tail size, the same as the first EVM paper [26]. The EVM paper did not discuss a distance multiplier, seeming to use an implicit 0.5. We tried 0.40, 0.45, 0.5, 0.75, and 1.0 for the distance multiplier. Among them, 0.45 demonstrates the best separation between the known validation and unknown validation sets. After training the EVM, the threshold Δ must be selected for the KL EVM Algorithm 4. Because the cover threshold is 0.7, Δ must be greater than 0.3 and less than 0.7. This value could be optimized on the validation data set; however, with such a small region of allowed values, we chose the middle, i.e., 0.5, to avoid bias.

There is no consensus or common-sense way to determine thresholds exactly, especially given the inherent trade-off between false detection and failure. As the percentage of failures decreases, the false detections increase, and vice versa. Studying all possible methods to tune the thresholds is beyond the scope of this paper. One approach is to select a threshold slightly above the signal to avoid early detection. In practice, this can be done by creating a validation set and updating it in a timely manner according to the needs and scenes of the images in a real test.

We note that our KL divergence models assume a truncated Gaussian distribution, which is only an approximation to the shapes of the distributions shown in Fig. 1, and the quality of the approximation will be affected by mixing and by window sizes. While only an approximation, the KL divergence does much better than the simple detection of a shift in the mean. The OND algorithm makes very different assumptions, using a Bernoulli model. The nearly equal performance of KL divergence assuming a truncated Gaussian and OND assuming a Bernoulli suggests the overall detection is relatively stable and not strongly dependent on either set of assumptions.

A good algorithm must detect the change to unreliability even with a small input distribution shift of only a few percent, and the proposed algorithms do. However, our experiments show that even with a large distribution shift, i.e., 25% OOD data, the algorithms fail to always detect the shift on time.
The main reason is that features of unknown images can overlap with those of known images, and therefore the algorithms cannot sense the shift. In a few splits, the Kullback-Leibler divergence of SoftMax (baseline) works somewhat better than the algorithms that use EVM information. However, when the shift is larger, the algorithms that use EVM information outperform the algorithms that use only SoftMax. For lower percentages of unknowns, KL EVM does slightly better in total detection, and for larger percentages, OND has slightly lower mean absolute error of detection, but neither difference is very significant.

The sensitivity and accuracy of all presented methods increase as the batch size increases. However, the batch size is inversely proportional to the speed of detection: if the batch size goes to infinity, the speed of detection goes to zero, which is not practical. Thus, in the trade-off between accuracy and speed, we should find a configuration that reduces the batch size while the distribution remains approximately Gaussian. For the ImageNet data, we found that a batch size of 100 is a good balance for both, and the EVM is reasonably stable down to windows of 20, but KL on SoftMax requires larger batches.

We note that all experiments herein were with ImageNet classifiers, which have 1000 classes. For smaller problems, there may be a greater issue of OOD classes folding on top of known classes (see [11]), and hence the difference between SoftMax- and EVM-based detection may be more apparent for problems with a smaller number of classes.

Fig 4: Ablation study on the performance of the Bivariate KL Fusion Algorithm (Section 5) when the threshold changes. Panels: (a) false detection percentage; (b) total accuracy; (c) total detection percentage; (d) on-time percentage; (e) late percentage; (f) mean absolute error. As the percentage of unknown increases, the breadth of good algorithm performance increases, and the peak performance moves slightly to the right (to higher thresholds).

8 Conclusions

Both standard closed-set and more recent open-set classifiers face an open world, and errors can increase when there is a change in the distribution of the frequency of unknown inputs. While open-set image classifiers can map images to a set of predefined classes and simultaneously reject images that do not belong to the known classes, they cannot predict their own overall reliability when there is a shift in the OOD distribution of the world. Currently, one must allocate human resources to monitor the quality of closed- or open-set image classifiers. A proper automatic reliability assessment policy can reduce cost by decreasing the need for human operators while simultaneously increasing user satisfaction.

In this paper, we formalized the open-world reliability assessment problem and proposed multiple automatic reliability assessment policies (Algorithms 2, 3, 4, and 5) that use only score distributional data. Algorithm 3 uses the mean of the thresholded maximum EVM class probabilities to determine reliability; this algorithm ignores the higher-order moments of the distribution for simplicity. Algorithm 4 uses a Gaussian model of score distributions and the Kullback-Leibler divergence of maximum EVM class probabilities. We also evaluate Algorithm 2, which uses the Kullback-Leibler divergence on a Gaussian model of SoftMax scores, and show it also provides an effective open-world reliability assessment. We used the mean of SoftMax as the baseline, and the new algorithms are significantly better.
We used ImageNet 2012 for known images and non-overlapping ImageNet 2010 classes for unknown images. As in any detection problem, there is an inherent trade-off between false or early detection and true detection rates. If we tune the threshold so that there are zero false detections, then even the best algorithm fails to detect the change in 5.68% of tests. This is because features of unknown images can overlap with those of known images, and therefore the algorithms cannot sense the shift. In future work, we will investigate extreme value theory rather than Gaussian assumptions to design a better automatic reliability assessment policy.

References

[1] Jan Beirlant, Yuri Goegebeur, Johan Segers, and Jozef L Teugels. Statistics of extremes: theory and applications. John Wiley & Sons, 2006.
[2] Abhijit Bendale and Terrance E Boult. Towards open set deep networks. In IEEE CVPR, pages 1563–1572, 2016.
[3] Arjun Nitin Bhagoji, Daniel Cullina, Chawin Sitawarin, and Prateek Mittal. Enhancing robustness of machine learning systems via data transformations. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pages 1–5. IEEE, 2018.
[4] Supritam Bhattacharjee, Devraj Mandal, and Soma Biswas. Multi-class novelty detection using mix-up technique. In The IEEE Winter Conference on Applications of Computer Vision, pages 1400–1409, 2020.
[5] Terrance E Boult, Steve Cruz, Akshay Raj Dhamija, M Gunther, James Henrydoss, and Walter J Scheirer. Learning and the unknown: Surveying steps toward open world recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9801–9807, 2019.
[6] Enrique Castillo. Extreme value theory in engineering. Elsevier, 2012.
[7] Hakan Cevikalp and Halil Saglamlar. Polyhedral conic classifiers for computer vision applications and open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[8] Stuart Coles, Joanna Bawa, Lesley Trenner, and Pat Dorazio. An introduction to statistical modeling of extreme values, volume 208. Springer, 2001.
[9] Antitza Dantcheva, Arun Singh, Petros Elia, and Jean-Luc Dugelay. Search pruning in video surveillance systems: Efficiency-reliability tradeoff. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 1356–1363. IEEE, 2011.
[10] Tamraparni Dasu, Shankar Krishnan, Dongyu Lin, Suresh Venkatasubramanian, and Kevin Yi. Change (detection) you can believe in: Finding distributional shifts in data streams. In International Symposium on Intelligent Data Analysis, pages 21–34. Springer, 2009.
[11] Akshay Raj Dhamija, Manuel Günther, and Terrance Boult. Reducing network agnostophobia. In Advances in Neural Information Processing Systems, pages 9157–9168, 2018.
[12] GWA Dummer, R Winton, and Mike Tooley. An elementary guide to reliability. Elsevier, 1997.
[13] Chuanxing Geng and Songcan Chen. Collective decision for open set recognition. IEEE Transactions on Knowledge and Data Engineering, 2020.
[14] Chuanxing Geng, Sheng-jun Huang, and Songcan Chen. Recent advances in open set recognition: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[15] Lykele Hazelhoff, Ivo Creusen, and Peter HN de With. Robust classification system with reliability prediction for semi-automatic traffic-sign inventory systems. In 2013 IEEE Workshop on Applications of Computer Vision (WACV), pages 125–132. IEEE, 2013.
[16] James Henrydoss, Steve Cruz, Ethan M Rudd, Manuel Gunther, and Terrance E Boult. Incremental open set intrusion recognition using extreme value machine. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1089–1093. IEEE, 2017.
[17] Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6(Oct):1679–1704, 2005.
[18] Salar Latifi, Babak Zamirai, and Scott Mahlke. PolygraphMR: Enhancing the reliability and dependability of CNNs. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 99–112. IEEE, 2020.
[19] Qingming Leng, Mang Ye, and Qi Tian. A survey of open-world person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 30(4):1092–1108, 2019.
[20] Yi Liu and Yuan F Zheng. One-against-all multi-class SVM classification using reliability measures. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pages 849–854. IEEE, 2005.
[21] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.
[22] Ling Ma, Xiabi Liu, Li Song, Chunwu Zhou, Xinming Zhao, and Yanfeng Zhao. A new classifier fusion method based on historical and on-line classification reliability for recognizing common CT imaging signs of lung diseases. Computerized Medical Imaging and Graphics, 40:39–48, 2015.
[23] Sergio Matiz and Kenneth E Barner. Inductive conformal predictor for convolutional neural networks: Applications to active learning for image classification. Pattern Recognition, 90:172–182, 2019.
[24] Poojan Oza and Vishal M Patel. C2AE: Class conditioned auto-encoder for open-set recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2307–2316, 2019.
[25] Santiago Romaní, Pilar Sobrevilla, and Eduard Montseny. On the reliability degree of hue and saturation values of a pixel for color image classification. In The 14th IEEE International Conference on Fuzzy Systems (FUZZ'05), pages 306–311. IEEE, 2005.
[26] Ethan M Rudd, Lalit P Jain, Walter J Scheirer, and Terrance E Boult. The extreme value machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):762–768, 2017.
[27] Vikash Sehwag, Arjun Nitin Bhagoji, Liwei Song, Chawin Sitawarin, Daniel Cullina, Mung Chiang, and Prateek Mittal. Analyzing the robustness of open-world machine learning. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, pages 105–116, 2019.
[28] Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018.
[29] Zhenan Sun, Hui Zhang, Tieniu Tan, and Jianyu Wang. Iris image classification based on hierarchical visual codebook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1120–1133, 2013.
[30] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114, 2019.
[31] Yisen Wang, Weiyang Liu, Xingjun Ma, James Bailey, Hongyuan Zha, Le Song, and Shu-Tao Xia. Iterative learning with open-set noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8688–8696, 2018.
[32] Ryota Yoshihashi, Wen Shao, Rei Kawakami, Shaodi You, Makoto Iida, and Takeshi Naemura. Classification-reconstruction learning for open-set recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4016–4025, 2019.
[33] Bailing Zhang. Reliable classification of vehicle types based on cascade classifier ensembles. IEEE Transactions on Intelligent Transportation Systems, 14(1):322–332, 2012.
[34] Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.
[35] Weibao Zou, Yan Li, and Arthur Tang. Effects of the number of hidden nodes used in a structured-based neural network on the reliability of image classification. Neural Computing and Applications, 18(3):249–260, 2009.
[36] Weiwen Zou and Pong C Yuen. Discriminability and reliability indexes: two new measures to enhance multi-image face recognition. Pattern Recognition, 43(10):3483–3493, 2010.