1. Introduction
Predictive models based on deep learning have seen a dramatic improvement in recent years [1], which has led to widespread adoption in many areas. For critical, high-stakes domains, such as medicine or self-driving cars, it is imperative that mechanisms are in place to ensure safe and reliable operation. Crucial to the notion of safe and reliable deep learning is the effective quantification and communication of predictive uncertainty to the potential end-users of a system. In medicine, for instance, understanding predictive uncertainty could lead to better decision-making through improved allocation of hospital resources, detection of dataset shift in deployed algorithms, or the ability of machine learning models to abstain from making a prediction [2]. For medical classification problems involving many possible labels (i.e., creating a differential diagnosis), methods that provide a set of possible diagnoses when uncertain are natural to consider and align closely with the differential diagnosis procedure used by physicians. The prediction sets and intervals we propose in this work are an intuitive way to quantify uncertainty in machine learning models and provide interpretable metrics for downstream, nontechnical users.
Commonly used approaches to quantifying uncertainty in deep learning generally fall into two broad categories: ensembles and approximate Bayesian methods. Deep ensembles [3] aggregate information from multiple individual models to provide a measure of uncertainty that reflects the ensemble's agreement about a given data point. Bayesian methods offer direct access to predictive uncertainty through the posterior predictive distribution, which combines prior knowledge with the observed data. Although conceptually elegant, calculating exact posteriors for even simple neural models is computationally intractable [4,5], and many approximations have been developed [6,7,8,9,10,11,12]. Although approximate Bayesian methods scale to modern-sized data and models, recent work has questioned the quality of the uncertainty estimates these approximations provide [4,13,14].
Previous work assessing the quality of uncertainty estimates has focused on calibration metrics and scoring rules, such as the negative log-likelihood (NLL), expected calibration error (ECE), and Brier score. Here we provide an alternative perspective based on the notion of empirical coverage, a well-established concept in the statistical literature [15] that evaluates the quality of a predictive set or interval instead of a point prediction. Informally, coverage asks the question: If a model produces a predictive uncertainty interval, how often does that interval actually contain the observed value? Ideally, predictions on examples for which a model is uncertain would produce larger intervals and thus be more likely to cover the observed value.
In this work, we focus on marginal coverage over a dataset for the canonical value of $\alpha = 0.05$, i.e., 95% prediction intervals. For a machine learning model that produces a 95% prediction interval $\hat{C}(x)$ based on a training dataset $\mathcal{D}_{\text{train}}$, we consider what fraction of the points $(x^*, y^*)$ in a test dataset $\mathcal{D}_{\text{test}}$ have their true label $y^*$ contained in $\hat{C}(x^*)$. To measure the robustness of these intervals, we also consider cases when the generating distributions for $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ are not the same (i.e., dataset shift).
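Written out explicitly (using the notation above, which we introduce for exposition), the quantity we report is the fraction of test points whose true label falls inside the model's prediction interval or set:

$$\widehat{\text{coverage}}(\mathcal{D}_{\text{test}}) = \frac{1}{|\mathcal{D}_{\text{test}}|} \sum_{(x^*, y^*) \in \mathcal{D}_{\text{test}}} \mathbb{1}\left[\, y^* \in \hat{C}(x^*) \,\right].$$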
Figure 1 provides a visual depiction of marginal coverage over a dataset for two hypothetical regression models. Throughout this work, we refer to “marginal coverage over a dataset” as “coverage”.
For a machine learning model that produces predictive uncertainty estimates (i.e., approximate Bayesian methods and ensembling), coverage encompasses both the aleatoric and epistemic uncertainties [16] produced by these models. In a regression setting, the predictions from these models can be written as:

$$\hat{y} = \hat{f}(x) + \epsilon,$$

where epistemic uncertainty is captured in the $\hat{f}(x)$ component, while aleatoric uncertainty is captured in the $\epsilon$ term. Since coverage measures how often the predicted interval for $y$ contains the true value, it captures the contributions from both types of uncertainty.
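As a concrete illustration of this decomposition, the sketch below combines the two sources of uncertainty into a single 95% interval via the law of total variance. It is a generic sketch rather than the exact procedure of any particular method evaluated in this work, and the function and argument names are illustrative.

```python
import numpy as np

def gaussian_95_interval(means, noise_vars):
    """Combine epistemic and aleatoric uncertainty into a 95% interval.

    means:      shape (M,) predictive means from M stochastic forward
                passes (e.g., MC dropout samples or ensemble members).
    noise_vars: shape (M,) corresponding predicted noise variances.
    """
    epistemic = np.var(means)                   # spread of the mean predictions
    aleatoric = np.mean(noise_vars)             # average predicted noise variance
    total_sd = np.sqrt(epistemic + aleatoric)   # law of total variance
    center = np.mean(means)
    return center - 1.96 * total_sd, center + 1.96 * total_sd
```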
A complementary metric to coverage is width, which is the size of the prediction interval or set. In regression problems, we typically measure width in terms of the standard deviation of the true label in the training set. As an example, an uncertainty quantification procedure could produce prediction intervals that have 90% marginal coverage with an average width of two standard deviations. For classification problems, width is simply the average size of a prediction set. Width can provide a relative ranking of different methods, i.e., given two methods with the same level of coverage, we should prefer the method that provides intervals with smaller widths.
Contributions: In this study, we investigate the empirical coverage properties of prediction intervals constructed from a catalog of popular uncertainty quantification techniques, such as ensembling, Monte Carlo dropout, Gaussian processes, and stochastic variational inference. We assess the coverage properties of these methods on nine regression tasks and two classification tasks with and without dataset shift. These tasks help us make the following contributions:
We introduce coverage and width over a dataset as natural and interpretable metrics for evaluating predictive uncertainty for deep learning models.
We provide a comprehensive set of coverage evaluations on a suite of popular uncertainty quantification techniques.
We examine how dataset shift affects these coverage properties.
3. Methods
For features $x \in \mathcal{X}$ and a response $y \in \mathbb{R}$ or $y \in \{1, \dots, K\}$ (for regression and classification, respectively) in some dataset $\mathcal{D}$, we consider the prediction intervals $\hat{C}(x) \subseteq \mathbb{R}$ or prediction sets $\hat{C}(x) \subseteq \{1, \dots, K\}$, respectively. Unlike in the definitions of marginal and conditional coverage, we do not assume that the test points are drawn from the same distribution as the training data. Thus, we consider the marginal coverage on a dataset for new test sets $\mathcal{D}_{\text{test}}$ that may have undergone dataset shift relative to the generating distribution of the training set $\mathcal{D}_{\text{train}}$.
In both the regression and classification settings, we analyzed the coverage properties of prediction intervals and sets produced by five different approximate Bayesian and non-Bayesian approaches to uncertainty quantification. These include dropout [16,29], ensembles [3], stochastic variational inference (SVI) [7,8,11,12,30], and last-layer approximations of SVI and dropout [31]. Additionally, we considered prediction intervals from linear regression and the 95% credible interval of a Gaussian process with the squared exponential kernel as baselines in regression tasks. For classification, we also considered temperature scaling [25] and the softmax output of vanilla deep networks [28]. For more detail on our modeling choices, see Appendix B.
3.1. Regression Methods and Metrics
We evaluated the coverage properties of these methods on nine large real-world regression datasets used as a benchmark in Hernández-Lobato and Adams [6] and later Gal and Ghahramani [16]. We used the training, validation, and testing splits publicly available from Gal and Ghahramani [16] and performed nested cross-validation to find hyperparameters. On the training sets, we ran 100 trials of a random search over the hyperparameter space of a multi-layer perceptron architecture trained with the Adam optimizer [32] and selected hyperparameters based on RMSE on the validation set.
Each approach required a slightly different procedure to obtain a 95% prediction interval. For an ensemble of neural networks, we trained a collection of vanilla networks and used the 2.5% and 97.5% quantiles of their predictions as the boundaries of the prediction interval. For dropout and last-layer dropout, we made 200 predictions per sample and similarly used the 2.5% and 97.5% quantiles as interval boundaries. For SVI, last-layer SVI (LL SVI), and Gaussian processes, approximate posterior variances were available, which we used to calculate the prediction interval. We calculated 95% prediction intervals from linear regression using the closed-form solution.
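The sketch below illustrates the two interval constructions described above for a single test point, assuming the sampled predictions or approximate posterior moments are already in hand; function and variable names are illustrative rather than taken from our code.

```python
import numpy as np

def interval_from_samples(preds):
    """95% interval from repeated predictions (ensemble members or the
    200 dropout forward passes): keep the central 95% of the samples."""
    lower, upper = np.percentile(preds, [2.5, 97.5])
    return lower, upper

def interval_from_gaussian(mean, var):
    """95% interval from an approximate posterior mean and variance
    (SVI, LL SVI, Gaussian process)."""
    sd = np.sqrt(var)
    return mean - 1.96 * sd, mean + 1.96 * sd
```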
Then we calculated two metrics:
Coverage: A sample is considered covered if its true label is contained in the 95% prediction interval. We average over all samples in a test set to estimate a method's marginal coverage on that dataset.
Width: The width is the average over the test set of the ranges of the 95% prediction intervals.
Coverage measures how often the true label is in the prediction region, while width measures how specific that prediction region is. Ideally, we would have high levels of coverage with low levels of width on in-distribution data. As data becomes increasingly out of distribution, we would like coverage to remain high while width increases to indicate model uncertainty.
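Both metrics are straightforward to compute once the interval bounds are available; the following is a minimal sketch assuming the bounds and labels are stored as arrays (names are illustrative, not our actual evaluation code).

```python
import numpy as np

def regression_coverage_and_width(lower, upper, y_test, y_train_sd):
    """Marginal coverage and average width over a test set.

    lower, upper: shape (n,) bounds of the 95% prediction intervals.
    y_test:       true labels for the n test points.
    y_train_sd:   standard deviation of the training labels, used to
                  express width in training-set standard deviations.
    """
    covered = (y_test >= lower) & (y_test <= upper)
    coverage = covered.mean()                      # fraction of covered points
    width = ((upper - lower) / y_train_sd).mean()  # average normalized width
    return coverage, width
```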
3.2. Classification Methods and Metrics
Ovadia et al. [14] evaluated model uncertainty on a variety of datasets and made their predictions publicly available. These predictions were made with the five approximate Bayesian methods described above, plus vanilla neural networks, with and without temperature scaling. We focus on the predictions from the MNIST, CIFAR-10, CIFAR-10-C, ImageNet, and ImageNet-C datasets. For MNIST, we calculated the coverage and width of model prediction sets on rotated and translated versions of the test set. For CIFAR-10, Ovadia et al. [14] measured model predictions on translated versions of the test set and on the corrupted images of CIFAR-10-C [28] (see Figure 2). For ImageNet, we only considered the coverage and width of prediction sets on the corrupted images of ImageNet-C [28]. Each of these transformations (rotation, translation, or any of the 16 corruptions) has multiple levels of shift. Rotations range from 15 to 180 degrees in 15-degree increments. Translations shift images in increments of 2 and 4 pixels for MNIST and CIFAR-10, respectively (see Figure 3). Corruptions have five increasing levels of intensity. Figure 2 shows the effects of the 16 corruptions in CIFAR-10-C at the first, third, and fifth levels of intensity.
Given $\alpha = 0.05$ and predicted probabilities $\hat{p}_k(x)$ from a model for all $K$ classes $k \in \{1, \dots, K\}$, the 95% prediction set $\hat{C}(x)$ for a sample $x$ is the minimum-sized set of classes such that:

$$\sum_{k \in \hat{C}(x)} \hat{p}_k(x) \geq 1 - \alpha.$$

This results in a set of size $|\hat{C}(x)|$, consisting of the $|\hat{C}(x)|$ largest probabilities in the full probability distribution over all $K$ classes such that at least $1 - \alpha$ probability has been accumulated. This inherently assumes that the labels are unordered categorical classes, so that including classes $1$ and $K$ does not imply that all classes in between are also included in the set $\hat{C}(x)$. Then we can define:
Coverage: For each example in a dataset, we calculate the prediction set from the predicted label probabilities; coverage is then the fraction of these prediction sets that contain the true label.
Width: The width of a prediction set is simply the number of labels in the set, $|\hat{C}(x)|$. We report the average width of prediction sets over a dataset in our figures.
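The sketch below shows one way to construct such prediction sets and to compute coverage and width from a matrix of predicted class probabilities; function and variable names are illustrative.

```python
import numpy as np

def prediction_set(probs, level=0.95):
    """Smallest set of classes whose probabilities sum to at least `level`.

    probs: shape (K,) predicted class probabilities for one sample.
    Returns the indices of the classes in the set.
    """
    order = np.argsort(probs)[::-1]              # classes by decreasing probability
    cumulative = np.cumsum(probs[order])
    set_size = np.searchsorted(cumulative, level) + 1
    return order[:set_size]

def classification_coverage_and_width(prob_matrix, labels, level=0.95):
    """Marginal coverage and average set size over a dataset."""
    sets = [prediction_set(p, level) for p in prob_matrix]
    coverage = np.mean([y in s for s, y in zip(sets, labels)])
    width = np.mean([len(s) for s in sets])
    return coverage, width
```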
Although both calibration [25] and coverage involve probabilities over a model's output, calibration only considers the most likely label and its corresponding probability, while coverage considers the top $|\hat{C}(x)|$ probabilities. In the classification setting, coverage is more robust to label errors, as it does not penalize models for putting probability on similar classes.
5. Discussion
We have provided the first comprehensive empirical study of the frequentist-style coverage properties of popular uncertainty quantification techniques for deep learning models. In regression tasks, LL SVI, SVI, and Gaussian processes all had high levels of coverage across nearly all benchmarks. LL SVI, in particular, had the lowest widths amongst methods with high coverage. SVI also had excellent coverage properties across most tasks with tighter intervals than GPs and linear regression. In contrast, the methods based on ensembles and Monte Carlo dropout had significantly worse coverage due to their overly confident and tight prediction intervals.
In the classification setting, all methods showed very high coverage in the i.i.d. setting (i.e., no dataset shift), as coverage is reflective of top-1 accuracy in this scenario. On MNIST data, SVI had the best performance, maintaining high levels of coverage under slight dataset shift and scaling the width of its prediction sets more appropriately as shift increased relative to other methods. On CIFAR-10 and ImageNet data, ensemble models were superior: they had the highest coverage relative to other methods, as demonstrated in Figure 9 and Figure 10.
An important consideration throughout this work is that the choice of hyperparameters in nearly all of the analyzed methods has a significant impact on the uncertainty estimates. We set hyperparameters and optimized model parameters according to community best practices in an attempt to reflect what a "real-world" machine learning practitioner might do: selecting hyperparameters by minimizing validation loss over nested cross-validation. Our work is a measurement of the empirical coverage properties of these methods as one would typically use them, rather than an exploration of how pathological hyperparameters can skew uncertainty estimates toward zero or infinity. While this is an inherent limitation on the applicability of our work to every context, our sensible choices provide a relevant benchmark for models in practice.
Of particular note is that the width of a prediction interval or set typically correlated with the degree of dataset shift. For instance, when translation shift is applied to MNIST, both prediction set width and dataset shift are maximized at a translation of around 14 pixels, and there is a 0.9 Pearson correlation between width and shift. Width can therefore serve as a soft proxy for dataset shift and could potentially be used to detect shift in real-world scenarios.
At the same time, the rankings of methods by coverage, Brier score, and ECE are generally consistent. However, coverage is arguably the most interpretable to downstream users of machine learning models. Clinicians, for instance, may not have the technical training to develop an intuition for what specific values of the Brier score or ECE mean in practice, while coverage and width are readily understandable. Manrai et al. [33] have already demonstrated clinicians' general lack of intuition about the positive predictive value (PPV), and these uncertainty quantification metrics are more difficult to internalize than the PPV.
Moreover, proper scoring rules (e.g., the Brier score and negative log-likelihood) can be misleading under model misspecification [34]. Negative log-likelihood, in particular, is vulnerable to a few points that are assigned very low probability: such points can contribute near-infinite terms to the NLL that distort its interpretation. In contrast, marginal coverage over a dataset is less sensitive to the impact of outlying data.
In summary, we find that popular uncertainty quantification methods for deep learning models do not provide good coverage properties under moderate levels of dataset shift. Although the width of prediction regions does increase under increasing amounts of shift, these changes are not enough to maintain the levels of coverage seen on i.i.d. data. We conclude that the methods we evaluated for uncertainty quantification are likely insufficient for use in high-stakes, real-world applications where dataset shift is likely to occur. However, marginal coverage of a prediction interval or set is a natural and intuitive metric for quantifying uncertainty, and the width of a prediction interval or set is an additional tool that captures dataset shift and provides interpretable information to downstream users of machine learning models.