
1 Introduction

Machine learning and deep models are ubiquitous today. It has been shown, however, that these models can sometimes assign high confidence scores to clearly erroneous predictions. Thus, a dog image can be recognized as a panda with near certainty, due to an adversarial noise invisible to the naked eye [4]. In addition, since deep networks by their very nature offer little explainability and interpretability, it becomes all the more important to make their decisions robust and reliable.

There are two popular approaches for estimating the confidence to be placed in the predictions of machine learning algorithms: Bayesian learning and Probably Approximately Correct (PAC) learning. However, both methods have major limitations. The first needs correct prior distributions to produce accurate confidence values, which is often not the case in real-world applications. Experiments conducted in [10] show that when the assumptions are incorrect, Bayesian frameworks give misleading and invalid confidence values (i.e. the probability of error is higher than what the confidence level prescribes). The second, PAC learning, does not rely on a strong underlying prior but generates error bounds that are not helpful in practice, as demonstrated in [13]. Another approach that offers hedged predictions and does not have these drawbacks is conformal prediction [14].

Conformal prediction is a framework that can be implemented on top of any machine learning algorithm in order to add a useful confidence measure to its predictions. It provides predictions that can come in the form of a set of classes whose statistical reliability (the average rate at which the predicted set covers the true class) is guaranteed under the traditional independent and identically distributed (i.i.d.) assumption. This general assumption can be relaxed into a slightly weaker one, exchangeability, meaning that the joint probability distribution of a sequence of examples does not change if the order of the examples in this sequence is altered. The principle of conformal prediction and its extensions are recalled in Sect. 2.

Our work uses an extension of this principle proposed in [6], which uses the density p(x|y) instead of p(y|x) to produce the prediction. This makes it possible to distinguish two kinds of uncertainty: the predictor outputs more than one label compatible with x in case of ambiguity, and it outputs the empty set \(\emptyset \) when the model does not know, i.e. did not see a similar example during training. This approach is recalled in Sect. 2.3. However, the tests in [6] only concern images and Convolutional Neural Networks.

Therefore, the validity and interest of this approach largely remain to be empirically confirmed. This is what we do in Sect. 3, where we show experimentally that the approach is very generic, in the sense that it works for different neural network architectures (Convolutional Neural Network, Gated Recurrent Unit and Multi-Layer Perceptron) and various types of data (image, textual, cross-sectional).

2 Conformal Prediction Methods

Conformal prediction was initially introduced in [14] as a transductive online learning method that directly uses the previous examples to provide an individual prediction for each new example. An inductive variant of conformal prediction, described in [11], instead starts by deriving a general rule on which the predictions are based. This section presents both approaches as well as the density-based approach used in this paper.

2.1 Transductive Conformal Prediction

Let \(z_1=(x_1, y_1), z_2=(x_2, y_2), \dots , z_{n}=(x_{n}, y_{n})\) be successive pairs constituting the examples, with \(x_i \in X\) an object and \(y_i \in Y\) its label. For any sequence \(z_1, z_2, \dots , z_{n} \in Z^* \) and any new object \(x_{n+1} \in X\), we can define a simple predictor D as:

$$\begin{aligned} D : Z^* \times X \xrightarrow {} Y. \end{aligned}$$
(1)

This simple predictor D produces a point prediction \(D(z_1, \dots , z_{n}, x_{n+1}) \in Y\), which is the prediction for \(y_{n+1}\), the true label of \(x_{n+1}\).

By adding another parameter \(\epsilon \in (0, 1)\), the probability of error, called the significance level, this simple predictor becomes a confidence predictor \(\varGamma \) able to predict a subset of Y with confidence level \(1 - \epsilon \), which corresponds to a statistical guarantee of coverage of the true label \(y_{n+1}\). \(\varGamma \) is defined as follows:

$$\begin{aligned} \varGamma : Z^* \times X \times (0, 1) \xrightarrow {} 2^Y, \end{aligned}$$
(2)

where \(2^Y\) denotes the power set of Y. This confidence predictor \(\varGamma ^\epsilon \) must be decreasing with respect to set inclusion as \(\epsilon \) increases, i.e. we must have:

$$\begin{aligned} \forall n > 0,\quad \forall {\epsilon }_1 \ge {\epsilon }_2, \quad {\varGamma }^{{\epsilon }_1}(z_1, \dots , z_{n}, x_{n+1}) \subseteq {\varGamma }^{{\epsilon }_2}(z_1, \dots , z_{n}, x_{n+1}). \end{aligned}$$
(3)

The two main properties desired in confidence predictors are (a) validity, meaning the error rate does not exceed \(\epsilon \) for each chosen significance level \(\epsilon \), and (b) efficiency, i.e. prediction sets are as small as possible: a prediction set with fewer labels is much more informative and useful than a larger one.

To build such a predictor, conformal prediction relies on a non-conformity measure \(A_n\). This measure assigns a score estimating how strange an example \(z_i\) is with respect to a bag of other examples. We then denote by \({\alpha }_i\) the non-conformity score of \(z_i\) compared to the other examples, such that:

$$\begin{aligned} {\alpha }_i = A_n\big (\{z_1, \dots , z_{n}\} \setminus \{z_i\},\; z_i\big ). \end{aligned}$$
(4)

Comparing \({\alpha }_i\) with the other non-conformity scores \({\alpha }_j\), we calculate a p-value of \(z_i\) expressing the proportion of examples that are at least as non-conforming as \(z_i\), with:

$$\begin{aligned} \frac{|\{ j = 1, \dots , n : {\alpha }_j \ge {\alpha }_i \}|}{n}. \end{aligned}$$
(5)

If the p-value approaches its lower bound 1/n, then \(z_i\) is non-conforming with respect to most other examples (an outlier). If, on the contrary, it approaches the upper bound 1, then \(z_i\) is very conforming.

We can then compute the p-value for the new example \(x_{n+1}\) being classified as each possible label \(y \in Y\) by using (5). More precisely, we can consider for each \(y \in Y\) the sequence \(z_1, \dots , z_{n}, (x_{n+1}, y)\) and derive from it the scores \(\alpha _1^y,\ldots ,\alpha _{n+1}^y \). We thus get a conformal predictor by predicting the set:

$$\begin{aligned} \varGamma ^\epsilon (x_{n+1}) = \left\{ y \in Y : \frac{|\{ i = 1, \dots , n, n+1 : {\alpha }^y_i \ge {\alpha }^y_{n+1} \}|}{n+1} > \epsilon \right\} . \end{aligned}$$
(6)

Constructing a conformal predictor therefore amounts to defining a non-conformity measure, which can be built from any machine learning algorithm, called the underlying algorithm of the conformal predictor. Popular underlying algorithms for conformal prediction include Support Vector Machines (SVMs) and k-Nearest Neighbours (k-NN).
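To make the procedure concrete, the following minimal sketch implements Eq. (6) with a simple nearest-neighbour non-conformity measure (the distance to the nearest other example carrying the same tentative label); both the measure and the toy data are our own illustrative choices, not those of [14].

```python
import numpy as np

def nonconformity(X, y, i):
    """Non-conformity score of example i: distance to the nearest
    other example with the same label (larger = stranger)."""
    dists = [np.linalg.norm(X[i] - X[j])
             for j in range(len(X)) if j != i and y[j] == y[i]]
    return min(dists) if dists else np.inf

def tcp_predict_set(X_train, y_train, x_new, labels, epsilon):
    """Transductive conformal prediction set of Eq. (6)."""
    prediction_set = []
    for label in labels:
        X = np.vstack([X_train, x_new])
        y = np.append(y_train, label)  # tentatively label the new object
        alphas = np.array([nonconformity(X, y, i) for i in range(len(X))])
        # p-value: proportion of examples at least as strange as x_new
        p_value = np.mean(alphas >= alphas[-1])
        if p_value > epsilon:
            prediction_set.append(label)
    return prediction_set

# Toy usage: the new point sits near the class-0 cluster.
X_train = np.array([[0.0], [0.1], [1.0], [1.1]])
y_train = np.array([0, 0, 1, 1])
print(tcp_predict_set(X_train, y_train, np.array([0.05]), [0, 1], 0.2))
```

Note that all the scores are recomputed for every candidate label of every new object, which is exactly the computational burden discussed in the next subsection.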

2.2 Inductive Conformal Prediction

One important drawback of Transductive Conformal Prediction (TCP) is that it is not computationally efficient. When dealing with a large amount of data, it is inadequate to use all previous examples to predict the outcome of each new example. Hence, this approach is not suitable for time-consuming training tasks such as deep learning models. Inductive Conformal Prediction (ICP) is a method outlined in [11] to solve this computational inefficiency by replacing the transductive inference with an inductive one. The paper shows that ICP preserves the validity of conformal prediction, at the price of a slight loss in efficiency.

ICP requires the same assumption as TCP (the i.i.d. assumption, or the weaker exchangeability assumption), and can also be applied with any underlying machine learning algorithm. The difference between ICP and TCP is that the inductive approach splits the original training data set into two parts: the first is called the proper training set, and the second, smaller one is called the calibration set. The non-conformity measure \(A_l\) based on the chosen underlying algorithm is trained only on the proper training set. For each example \(i = l+1, \ldots , n\) of the calibration set, a non-conformity score \({\alpha }_i\) is calculated by applying (4), giving the sequence \({\alpha }_{l+1}, \ldots , {\alpha }_{n}\). For a new example \(x_{n+1}\), a non-conformity score \({\alpha }^y_{n+1}\) is computed for each possible \(y \in Y\); the p-values are then obtained and compared to the significance level \(\epsilon \) to get the prediction set:

$$\begin{aligned} \varGamma ^\epsilon (x_{n+1}) = \left\{ y \in Y : \frac{|\{ i = l+1, \dots , n, n+1 : {\alpha }_i \ge {\alpha }^y_{n+1} \}|}{n - l + 1} > \epsilon \right\} . \end{aligned}$$
(7)

In other words, this inductive conformal predictor outputs the set of all plausible labels for each new example of the classification problem without recomputing the non-conformity scores of the previous examples each time, i.e., only \(\alpha ^y_{n+1}\) is recomputed for each y in Eq. (7).
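As an illustration (ours, not from [11]), the following sketch uses a logistic regression as underlying algorithm with the common score \(\alpha = 1 - \hat{P}(y|x)\); labels are assumed to be encoded as \(0, \ldots , |Y|-1\) so that they match the columns of predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def icp_fit(X_proper, y_proper, X_cal, y_cal):
    """Train once on the proper training set, then compute the
    calibration non-conformity scores once (alpha = 1 - P(y|x))."""
    model = LogisticRegression().fit(X_proper, y_proper)
    cal_probs = model.predict_proba(X_cal)
    cal_alphas = 1.0 - cal_probs[np.arange(len(y_cal)), y_cal]
    return model, cal_alphas

def icp_predict_set(model, cal_alphas, x_new, epsilon):
    """Inductive conformal prediction set of Eq. (7)."""
    probs = model.predict_proba(x_new.reshape(1, -1))[0]
    prediction_set = []
    for k, label in enumerate(model.classes_):
        alpha_new = 1.0 - probs[k]
        # the +1 in numerator and denominator accounts for the new example itself
        p_value = (np.sum(cal_alphas >= alpha_new) + 1) / (len(cal_alphas) + 1)
        if p_value > epsilon:
            prediction_set.append(label)
    return prediction_set
```

The calibration scores are computed once at training time; each new prediction only requires one forward pass and one comparison per candidate label.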

2.3 Density-Based Conformal Prediction

The paper [6] uses a density-based conformal prediction approach inspired by the inductive approach, considering a density estimate \(\hat{p}(x|y)\) of p(x|y) for each label \(y \in Y\). This method divides the labeled data into two parts: the first is the proper training data \(D^{tr} = \{X^{tr}, Y^{tr}\}\) used to build \(\hat{p}(x|y)\); the second is the calibration data \(D^{cal} = \{X^{cal}, Y^{cal}\}\) used to evaluate \(\{\hat{p}(x_i|y)\}\) and to set \(\hat{t}_y\), the empirical quantile of order \(\epsilon \) of the values \(\{\hat{p}(x_i|y)\}\):

$$\begin{aligned} \hat{t}_y = \sup \Bigg \{ t : \frac{1}{n_y} \sum _{\{z_i \in D^{cal}_y\}} I(\hat{p}(x_i|y) \ge t) \ge 1 - \epsilon \Bigg \}, \end{aligned}$$
(8)

where \(n_y\) is the number of elements belonging to the class y in \(D^{cal}\), and \(D^{cal}_y=\{z_i \in D^{cal}:y_i=y\}\) is the subset of calibration examples of class y. For a new observation \(x_{n+1}\), we set the conformal predictor \(\varGamma _{d}^{\epsilon }\) such that:

$$\begin{aligned} \varGamma _{d}^{\epsilon }(x_{n+1}) = \{ y \in Y : \hat{p}(x_{n+1}|y) \ge \hat{t}_y \}. \end{aligned}$$
(9)

This ensures that observations with low probability, i.e. those lying in poorly populated regions of the input space, are classified as \(\emptyset \). This split procedure avoids the high computational cost that deep learning would incur with the online (transductive) approach. The paper [6] also shows that \(| P(y \in \varGamma _{d}^{\epsilon }(x_{n+1})) - (1 - \epsilon )| \rightarrow 0\) as \(\min _y n_y \rightarrow \infty \), which ensures the validity of the model. The training and prediction procedures are given in Algorithms 1 and 2.

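A minimal sketch of these two procedures, assuming scikit-learn's KernelDensity and reading the threshold of Eq. (8) as an empirical \(\epsilon \)-quantile (this is our illustration, not the authors' code):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def train_density_cp(X_proper, y_proper, X_cal, y_cal, labels,
                     epsilon, bandwidth=1.0):
    """Algorithm 1 (sketch): fit one KDE per class on the proper
    training set, then set t_y as the empirical epsilon-quantile of
    the calibration densities of class y, as in Eq. (8)."""
    kdes, thresholds = {}, {}
    for y in labels:
        kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
        kde.fit(X_proper[y_proper == y])
        # score_samples returns log-densities, hence the exponential
        cal_dens = np.exp(kde.score_samples(X_cal[y_cal == y]))
        thresholds[y] = np.quantile(cal_dens, epsilon)
        kdes[y] = kde
    return kdes, thresholds

def predict_density_cp(kdes, thresholds, x_new, labels):
    """Algorithm 2 (sketch): the prediction set of Eq. (9); it may be
    empty (epistemic uncertainty) or contain several labels."""
    x = np.asarray(x_new).reshape(1, -1)
    return {y for y in labels
            if np.exp(kdes[y].score_samples(x)[0]) >= thresholds[y]}
```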

We can rewrite (9) so that it approaches (7), with a few differences: \(\varGamma _{d}^{\epsilon }\) uses a conformity measure based on density estimation (measuring how much an example agrees with the others) instead of a non-conformity measure as in \(\varGamma ^{\epsilon }\), with \({\alpha }^y_i = - \hat{p}(x_i|y)\) [14], and the number of examples used to build the prediction set depends on y. Thus, \(\varGamma _{d}^{\epsilon }\) can also be written as:

$$\begin{aligned} \varGamma ^{\epsilon }(x_{n+1}) = \left\{ y \in Y : \frac{|\{ z_i \in D^{cal}_y: {\alpha }^y_i \ge {\alpha }^y_{n+1} \}|}{n_y} > \epsilon \right\} . \end{aligned}$$
(10)

The proof can be found in Appendix A.
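As a quick numerical sanity check of this equivalence (illustrative only, not a substitute for the proof), one can compare the two decision rules on simulated densities:

```python
import numpy as np

rng = np.random.default_rng(0)
cal_dens = rng.random(1000)   # \hat{p}(x_i|y) on the calibration examples of class y
new_dens = rng.random()       # \hat{p}(x_{n+1}|y)
epsilon = 0.1

# Form (9): compare the new density to the empirical epsilon-quantile.
in_set_9 = new_dens >= np.quantile(cal_dens, epsilon)

# Form (10): p-value with the score alpha = -density.
in_set_10 = np.mean(-cal_dens >= -new_dens) > epsilon

# The two decisions agree up to finite-sample tie and interpolation effects.
print(in_set_9, in_set_10)
```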

The final quality of the predictor (its efficiency, robustness) depends in part on the density estimator. The paper [7] suggests that the use of kernel estimators gives good results under weak conditions.

The results of the paper show that the training and prediction for each label are independent of the other classes. This makes conformal prediction an adaptive method: adding or removing a class does not require retraining the model from scratch. However, it does not provide any information on the relationships between classes. In addition, the results depend on \(\epsilon \): when \(\epsilon \) is small, coverage is high but a large number of classes is predicted for each observation; conversely, when \(\epsilon \) is large, more cases are classified as \(\emptyset \) and fewer labels are predicted per observation.
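As an illustration of this adaptivity, with the per-class estimators of the sketch above, adding a class amounts to fitting one more KDE and one more threshold; all variable names below (X_proper_new, X_cal_new, y_new) are hypothetical.

```python
# Adding a new class y_new: fit its own KDE and threshold; the
# estimators of the existing classes are left untouched.
kde_new = KernelDensity(kernel="gaussian", bandwidth=1.0)
kde_new.fit(X_proper_new)                                # proper-training examples of the new class
cal_dens_new = np.exp(kde_new.score_samples(X_cal_new))  # its calibration densities
kdes[y_new] = kde_new
thresholds[y_new] = np.quantile(cal_dens_new, epsilon)
```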

3 Experiments

In order to examine the effectiveness of the conformal method on different types of data, three binary classification data sets were used:

1. CelebA [8]: a face attributes dataset with over 200,000 celebrity images, used here to determine whether a person is a man (1) or a woman (0).

2. IMDb [9]: contains more than 50,000 texts of film reviews for sentiment analysis (1 representing a positive opinion and 0 a negative one).

3. EGSS [1]: contains 10,000 examples for the study of electrical grid stability (1 representing a stable network), with 12 numerical features.

3.1 Approach

The overall approach follows the same steps as density-based conformal prediction [6] and meets the conditions listed above (the i.i.d. or exchangeability assumption). Each data set is divided into proper training, calibration and test sets. A deep learning model dedicated to each type of data is trained on the proper training and calibration sets. The penultimate dense layer serves as a feature extractor, producing a fixed-size vector representing each object (image, text or vector). These feature vectors are then used in the conformal part to estimate the density. Here we used a Gaussian kernel density estimator of bandwidth 1, available in Python's scikit-learn [12]. The architecture of the deep learning models is shown in Fig. 1. It is built following the steps below:

1. Use a basic deep learning model depending on the type of data. In the case of CelebA, it is a CNN with a ResNet50 [5] pre-trained on ImageNet [2] and fine-tuned on CelebA. For IMDb, it is a bidirectional GRU fed with data processed by a tokenizer and padding. For EGSS, it is a multilayer perceptron (MLP).

2. Apply an intermediate dense layer and use it as a feature extractor, producing a vector of size 50 that represents the object and is later used for conformal prediction.

3. Add a final dense layer to obtain the class predicted by the model (0 or 1). (A minimal code sketch of this architecture is given after Fig. 1.)

Fig. 1. Architecture of deep learning models.
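The sketch below instantiates the three steps for EGSS in a Keras style; the framework and hidden layer sizes are our own assumptions, only the 50-dimensional feature layer and the binary output come from the paper.

```python
from tensorflow.keras import layers, models

# Hypothetical instantiation for EGSS (12 numerical features); for CelebA
# the first block would be a ResNet50, for IMDb a bidirectional GRU.
inputs = layers.Input(shape=(12,))
h = layers.Dense(64, activation="relu")(inputs)            # step 1: basic model (size assumed)
features = layers.Dense(50, activation="relu")(h)          # step 2: 50-d feature extractor
outputs = layers.Dense(1, activation="sigmoid")(features)  # step 3: predicted class

model = models.Model(inputs, outputs)                 # trained on the labeled data
feature_extractor = models.Model(inputs, features)    # reused for conformal prediction
```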

Based on the extracted vectors, a Gaussian kernel density estimate is computed on the proper training set of each class to obtain the values \(\hat{p}(x|y)\). Then, the calibration set is used to compute the density scores, which are sorted to determine the threshold corresponding to the given \(\epsilon \), thus delimiting the density region of each class. Finally, the test set is used to measure the performance of the model. The code used for this article is available on GitHub (see Footnote 1).
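A hypothetical end-to-end usage combining this pipeline with the sketches of Sects. 2.3 and 3.1 (all names are ours; X_proper, y_proper, X_cal, y_cal and X_test denote the assumed splits):

```python
# Extract the 50-d features, fit and calibrate the per-class KDEs,
# then build one prediction set per test example.
F_proper = feature_extractor.predict(X_proper)
F_cal = feature_extractor.predict(X_cal)
F_test = feature_extractor.predict(X_test)

kdes, thresholds = train_density_cp(F_proper, y_proper, F_cal, y_cal,
                                    labels=[0, 1], epsilon=0.1, bandwidth=1.0)
prediction_sets = [predict_density_cp(kdes, thresholds, f, labels=[0, 1])
                   for f in F_test]
```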

The visualization of the density regions (Fig. 2) is done via the first two dimensions of a Principal Component Analysis. The results show the distinct regions of classes 0 (in red) and 1 (in blue) with a non-empty intersection (in green) representing a region of aleatoric (random) uncertainty. The points outside these three regions belong to the region of epistemic uncertainty, meaning that the classifier "does not know".

Fig. 2. Conformal prediction density regions for all datasets. (Color figure online)

Fig. 3. The accuracy and the percentages according to \(\epsilon \) for CelebA (top), IMDb (middle) and EGSS (bottom).

3.2 Results on the Test Examples

To obtain more information on the results of this experiment, the accuracy of the models was calculated for different values of \(\epsilon \) between 0.01 and 0.5 used to determine the conformal prediction density threshold, as follows:

  • DL accuracy: the accuracy of the basic deep model (CNN for CelebA, GRU for IMDb or MLP for EGSS) on all the test examples.

  • Valid conformal accuracy: the accuracy of the conformal model when considering only the singleton predictions 0 or 1 (ignoring the \(\{0, 1\}\) and empty-set predictions).

  • Valid DL accuracy: The accuracy of the basic deep model on the test examples that have been predicted as 0 or 1 by the conformal model.

The percentages of empty sets \(\emptyset \) and of \(\{0, 1\}\) sets were also calculated over all the predictions made by the conformal prediction model on the test examples. The results are shown in Fig. 3.
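The following helper sketches how these quantities can be computed from the prediction sets (names are ours; sets holds one prediction set per test example, dl_preds the point predictions of the basic deep model):

```python
import numpy as np

def summarize(sets, dl_preds, y_true):
    """Compute the three accuracies above plus the percentages of
    empty and {0, 1} prediction sets."""
    singleton = [i for i, s in enumerate(sets) if len(s) == 1]
    return {
        "DL accuracy": np.mean(dl_preds == y_true),
        "Valid conformal accuracy":
            np.mean([next(iter(sets[i])) == y_true[i] for i in singleton]),
        "Valid DL accuracy":
            np.mean([dl_preds[i] == y_true[i] for i in singleton]),
        "% empty": np.mean([len(s) == 0 for s in sets]),
        "% {0,1}": np.mean([len(s) == 2 for s in sets]),
    }
```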

The results show that the valid conformal accuracy and the valid DL accuracy are almost equal, and both are better than the accuracy of the base model for all \(\epsilon \) values. In our tests, adding conformal prediction to a deep model does not degrade its performance, and sometimes even improves it (EGSS). This is because the conformal prediction model can abstain from predicting (empty set \(\emptyset \)) or predict both classes for ambiguous examples, thus making the predicted label more reliable. We also notice that as \(\epsilon \) grows, the percentage of predicted \(\{0, 1\}\) sets decreases until such sets are no longer predicted (at \(\epsilon = 0.15\) for CelebA, for example). Conversely, the percentage of empty sets \(\emptyset \) increases as \(\epsilon \) grows.

Fig. 4. Examples of outlier and noisy images compared to the actual image for CelebA.

3.3 Results on Noisy and Foreign Examples

CelebA: Two types of noise were introduced: a noise masking parts of the face, and a Gaussian noise over all the pixels. These perturbations and their predictions are illustrated in Fig. 4, with "CNN" the prediction of the CNN and "CNN + CP" that of the conformal model. This example shows that both the CNN and the conformal prediction model correctly identify the woman in image (a). However, on the masked image (b), the CNN predicts a man with a score of 0.6, whereas the conformal prediction model is more cautious, indicating that it does not know (\(\emptyset \)). When a Gaussian noise is applied over the whole image (c), the CNN predicts a man with an even larger score of 0.91, whereas the conformal model predicts both classes. For outliers, examples (d), (e) and (f) illustrate the ability of the conformal model to identify different outliers as such (\(\emptyset \)), in contrast to the deep model, which predicts them as men with a high score.

IMDb: Fig. 5 displays a comparison of two texts before and after randomly changing a few words (in bold) to other words of the model's vocabulary. The actual text, predicted as a negative opinion by both models, becomes positive for the GRU after the disturbance. The conformal model, however, is more cautious, indicating that both cases are possible (\(\{0, 1\}\)). For the outlier example, formed entirely of vocabulary words, the GRU model predicts positive with a score of 0.99, while the conformal model says that it does not know (\(\emptyset \)).

Fig. 5. Examples of outlier and noisy texts compared to the original one for IMDb.

Fig. 6. Density visualization of real, noisy and outlier examples for EGSS.

EGSS: Fig. 6 displays a comparison of the positions of the test examples in the density regions before (a) and after (b) the addition of a Gaussian noise. Several examples are positioned outside the density regions after the introduction of the disturbances. The outlier examples (c), created by modifying some features of these test examples with extreme values (to simulate a sensor failure, for example), are even further away from the density regions, and are recognized as such by the conformal model (\(\emptyset \)).

4 Conclusions and Perspectives

We used conformal prediction and the technique presented in [6] to obtain a more reliable and cautious deep learning model. The results show the interest of this method for different data types (image, text, tabular) used with different deep learning architectures (CNN, GRU and MLP). In these three cases, the conformal model not only adds reliability and robustness to the deep model by detecting ambiguous examples, but also keeps or even improves the performance of the basic deep model when it predicts a single class. We also illustrated the ability of conformal prediction to handle noisy and outlier examples for all three types of data.

To improve these experiments and results, perspectives include optimizing the density estimation based on neural networks. For instance, at a fixed \(\epsilon \), the problem arises of finding the most efficient model, which could be tackled by modifying the density estimation technique, but also by proposing an end-to-end, integrated estimation method. It would also be useful to compare conformal prediction with calibration methods, for example evidential ones, which are also adapted to cautious predictions [3].