Deep Discriminative to Kernel Density Graph for In- and Out-of-distribution Calibrated Inference

Jayanta Dey$^{1,*}$, Haoyin Xu$^{1,\dagger}$, Will LeVine$^{1,\dagger}$, Ashwin De Silva$^{1,\dagger}$, Tyler M. Tomita$^{1}$, Ali Geisa$^{1}$, Tiffany Chu$^{1}$, Jacob Desman$^{1}$, and Joshua T. Vogelstein$^{1}$

Abstract

Deep discriminative approaches like random forests and deep neural networks have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring confidence calibration for both in-distribution (ID) and out-of-distribution (OOD) data points. Many popular methods for ID calibration, such as isotonic and Platt’s sigmoidal regression, exhibit excellent ID calibration performance. However, these methods are not calibrated for the entire feature space, leading to overconfidence on OOD samples. On the other end of the spectrum, existing OOD calibration methods generally exhibit poor ID calibration. In this paper, we address the ID and OOD calibration problems jointly. We leverage the fact that deep models, including both random forests and deep-nets, learn internal representations which are unions of polytopes with affine activation functions, and conceptualize them both as partitioning rules of the feature space. We replace the affine function in each polytope populated by the training data with a Gaussian kernel. Our experiments on both tabular and vision benchmarks show that the proposed approaches obtain well-calibrated posteriors while mostly preserving or improving the classification accuracy of the original algorithm in the ID region, and extrapolate beyond the training data to handle OOD inputs appropriately.

$^{1}$Johns Hopkins University (JHU). $^{*}$Corresponding author: jdey4@jhu.edu. $^{\dagger}$Denotes equal contribution.

1 Introduction

Machine learning methods, especially deep neural networks and random forests, have shown excellent performance in many real-world tasks, including drug discovery, autonomous driving and clinical surgery [1, 2, 3]. However, calibrating confidence over the whole feature space for these approaches remains a key challenge in the field [4]. Calibrated confidence within the training or in-distribution (ID) region as well as in the out-of-distribution (OOD) region is crucial for safety-critical applications like autonomous driving and computer-assisted surgery, where any aberrant reading should be detected and taken care of immediately [4, 5].

The approaches to calibrate OOD confidence for learning algorithms described in the literature can be roughly divided into two groups: discriminative and generative. Intuitively, the easiest solution for OOD confidence calibration is to learn a function that gives higher scores for in-distribution samples and lower scores for OOD samples [6]. The discriminative approaches try to either modify the loss function [7, 8, 9] or train the network exhaustively on OOD datasets to calibrate on OOD samples [10, 4]. Recently, Hein et al. [4] showed that ReLU networks produce arbitrarily high confidence as the inference point moves far away from the training data. Therefore, calibrating ReLU networks for the whole OOD region is not possible without fundamentally changing the network architecture. As a result, none of the aforementioned algorithms can provide any guarantee about the performance of the network throughout the whole feature space. The other group tries to learn generative models for the in-distribution as well as the out-of-distribution samples. The general idea is to perform a likelihood ratio test for a particular sample using the generative models [11], or to threshold the ID likelihoods to detect OOD samples. However, it is not obvious how to control likelihoods far away from the training data for powerful generative models like variational autoencoders (VAEs) [12] and generative adversarial networks (GANs) [13]. Moreover, Nalisnick et al. [14] and Hendrycks et al. [10] showed that VAEs and GANs can also yield overconfident likelihoods far away from the training data.

The algorithms described so far are concerned with OOD confidence calibration for deep-nets only. However, we show that other approaches which partition the feature space, for example random forests, can also suffer from poor confidence calibration in both the ID and the OOD regions. Moreover, the algorithms described above are concerned with the confidence in the OOD region only and do not address confidence calibration within the ID region at all. This issue is addressed separately in a different body of literature [15, 16, 17, 18, 19, 20]. Instead, we consider both calibration problems jointly and propose an approach that achieves good calibration throughout the whole feature space.

In this paper, we conceptualize both random forest and ReLU networks as partitioning rules with an affine activation over each polytope. We consider replacing the affine functions learned over the polytopes with Gaussian kernels. We propose two novel kernel density estimation techniques named Kernel Density Forest (KDF) and Kernel Density Network (KDN). Our proposed approach completely excludes the need for training on OOD examples for the model (unsupervised OOD calibration). We conduct several simulation and real data studies that show both KDF and KDN are well-calibrated for OOD samples while they maintain good performance in the ID region.

2 Related Works and Our Contributions

There are a number of approaches in the literature which attempt to learn a generative model and control the likelihoods far away from the training data. For example, Ren et al. [11] employed a likelihood ratio test for detecting OOD samples. Wan et al. [8] modified the training loss so that the downstream projected features follow a Gaussian distribution. However, there is no guarantee of performance for OOD detection for the above methods. To the best of our knowledge, apart from us, only Meinke et al. [5] have proposed an approach to guarantee asymptotic performance for OOD detection. Compared to the aforementioned methods, our approach differs in several ways:

• We address the confidence calibration problem for both ReLU-nets and random forests.

• We address the ID and OOD calibration problems as a continuum.

• We provide an algorithm for OOD confidence calibration for both tabular and vision datasets, whereas most of the existing methods are tailor-made for vision problems.

• We propose an unsupervised post-hoc OOD calibration approach.

3 Technical Background

3.1 Setting

Consider a supervised learning problem with independent and identically distributed training samples $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ such that $(\mathbf{X},Y)\sim P_{X,Y}$ , where $\mathbf{X}\sim P_{X}$ is a $\mathcal{X}\subseteq\mathbb{R}^{D}$ valued input and $Y\sim P_{Y}$ is a $\mathcal{Y}=\{1,\cdots,K\}$ valued class label. Let $\mathcal{S}$ be the high density region of the marginal, $P_{X}$ , thus $\mathcal{S}\subsetneq\mathcal{X}$ . Here the goal is to learn a confidence score, $\mathbf{g}:\mathbb{R}^{D}\rightarrow[0,1]^{K}$ , $\mathbf{g}(\mathbf{x})=[g_{1}(\mathbf{x}),g_{2}(\mathbf{x}),\dots,g_{K}(\mathbf{x})]$ such that,

\begin{equation}
g_{y}(\mathbf{x})=\begin{cases}P_{Y|X}(y|\mathbf{x}),&\text{if }\mathbf{x}\in\mathcal{S}\\ P_{Y}(y),&\text{if }\mathbf{x}\notin\mathcal{S}\end{cases},\quad\forall y\in\mathcal{Y} \tag{1}
\end{equation}

where $P_{Y|X}(y|\mathbf{x})$ is the posterior probability for class $y$ given by the Bayes formula:

\begin{equation}
P_{Y|X}(y|\mathbf{x})=\frac{P_{X|Y}(\mathbf{x}|y)P_{Y}(y)}{\sum_{k=1}^{K}P_{X|Y}(\mathbf{x}|k)P_{Y}(k)},\quad\forall y\in\mathcal{Y}. \tag{2}
\end{equation}

Here $P_{X|Y}(\mathbf{x}|y)$ is the class conditional density, which we refer to as $f_{y}(\mathbf{x})$ hereafter for brevity.

3.2 Main Idea

Deep discriminative networks partition the feature space $\mathbb{R}^{D}$ into a union of $p$ affine polytopes $Q_{r}$ such that $\bigcup_{r=1}^{p}Q_{r}=\mathbb{R}^{D}$ , and learn an affine function over each polytope [4, 21]. Mathematically, the unnormalized class-conditional density for the label $y$ estimated by these deep discriminative models at a particular point $\mathbf{x}$ can be expressed as:

\begin{equation}
\hat{f}_{y}(\mathbf{x})=\sum_{r=1}^{p}(\mathbf{a}_{r}^{\top}\mathbf{x}+b_{r})\mathbbm{1}(\mathbf{x}\in Q_{r}). \tag{3}
\end{equation}

For example, in the case of a decision tree, $\mathbf{a}_{r}=\mathbf{0}$ , i.e., a decision tree assumes a uniform distribution for the class-conditional densities over the leaf nodes. Among these polytopes, the ones that lie on the boundary of the training data extend to the whole feature space and hence encompass all the OOD samples. Since the posterior probability for a class is determined by the affine activation over each of these polytopes, the algorithms tend to be overconfident when making predictions on OOD inputs. Moreover, there exist some polytopes that are not populated with training data. These unpopulated polytopes serve to interpolate between the training sample points. If we replace the affine activation function of the populated polytopes with Gaussian kernels and prune the unpopulated ones, the tails of the kernels help interpolate between the training sample points while assigning lower likelihood to the low-density or unpopulated polytope regions of the feature space. This results in better confidence calibration for the proposed modified approach.
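The overconfidence described above can be illustrated with a minimal sketch, which is not taken from the authors' code. It evaluates a softmax over fixed affine class scores, as would hold inside a single unbounded boundary polytope, at points moving away from the origin along one direction. The weights `A`, offsets `B` and the direction are arbitrary assumptions chosen purely for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affine_scores(A, B, x):
    """Class scores a_k^T x + b_k inside one polytope (hypothetical weights)."""
    return [sum(a_i * x_i for a_i, x_i in zip(a, x)) + b for a, b in zip(A, B)]

# Hypothetical affine function of a single boundary polytope, two classes.
A = [[1.0, 0.5], [-0.5, 0.2]]
B = [0.1, -0.1]
direction = [1.0, 1.0]

def max_confidence(scale):
    """Largest softmax probability at the point scale * direction."""
    x = [scale * d for d in direction]
    return max(softmax(affine_scores(A, B, x)))

# Confidence grows toward 1 as the point moves away from the origin.
confidences = [max_confidence(s) for s in (1.0, 10.0, 100.0)]
```

In this toy setting the maximum confidence approaches one as the scale grows, even though no training data supports the prediction there.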

3.3 Proposed Approach

We refer to the above discriminative approaches as the ‘parent approach’ hereafter. Consider the collection of polytope indices $\mathcal{P}$ from the parent approach which are populated by the training data. We replace the affine functions over the populated polytopes with Gaussian kernels $\mathcal{G}(\cdot;\hat{\mu}_{r},\hat{\Sigma}_{r})$ . For a particular inference point $\mathbf{x}$ , we consider the Gaussian kernel with the minimum distance from the center of the kernel to the corresponding point:

\begin{equation}
r^{*}_{\mathbf{x}}=\operatornamewithlimits{argmin}_{r}\|\mu_{r}-\mathbf{x}\|, \tag{4}
\end{equation}

where $\|\cdot\|$ denotes a distance. As we will show later, the choice of distance metric in Equation 4 strongly affects the performance of the proposed model. In short, we modify Equation 3 from the parent ReLU-net or random forest to estimate the (unnormalized) class-conditional density:

\begin{equation}
\tilde{f}_{y}(\mathbf{x})=\frac{1}{n_{y}}\sum_{r\in\mathcal{P}}n_{ry}\mathcal{G}(\mathbf{x};\mu_{r},\Sigma_{r})\mathbbm{1}(r=r^{*}_{\mathbf{x}}), \tag{5}
\end{equation}

where $n_{y}$ is the total number of samples with label $y$ and $n_{ry}$ is the number of samples from class $y$ that end up in polytope $Q_{r}$ . We add a small constant to the class conditional density $\tilde{f}_{y}$ :

\begin{equation}
\hat{f}_{y}(\mathbf{x})=\tilde{f}_{y}(\mathbf{x})+\frac{b}{\log(n)}. \tag{6}
\end{equation}

Note that in Equation 6, $\frac{b}{\log(n)}\rightarrow 0$ as the total number of training points $n\rightarrow\infty$ . The intuition behind the added constant is clarified in Proposition 4.3. The confidence score $\hat{g}_{y}(\mathbf{x})$ for class $y$ given a test point $\mathbf{x}$ is estimated using Bayes' rule as:

\begin{equation}
\hat{g}_{y}(\mathbf{x})=\frac{\hat{f}_{y}(\mathbf{x})\hat{P}_{Y}(y)}{\sum_{k=1}^{K}\hat{f}_{k}(\mathbf{x})\hat{P}_{Y}(k)}, \tag{7}
\end{equation}

where $\hat{P}_{Y}(y)$ is the empirical prior probability of class $y$ estimated from the training data. We estimate the class for a particular inference point $\mathbf{x}$ as:

\begin{equation}
\hat{y}=\operatornamewithlimits{argmax}_{y\in\mathcal{Y}}\hat{g}_{y}(\mathbf{x}). \tag{8}
\end{equation}
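Equations 4 to 8 can be read as a short inference routine. The sketch below is an illustrative assumption rather than the authors' implementation: it uses Euclidean distance for the nearest-kernel search in Equation 4, diagonal covariances, a hypothetical dictionary layout for the kernels, and an arbitrary value for the constant `b`.

```python
import math

def diag_gaussian(x, mu, var):
    """Density of a Gaussian with diagonal covariance at point x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mu, var):
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - 0.5 * (xi - mi) ** 2 / vi
    return math.exp(log_p)

def posterior(x, kernels, class_counts, b=1.0):
    """Confidence scores and predicted label following Equations 4-8.

    kernels: list of dicts with keys 'mu', 'var', 'counts' (n_ry per class).
    class_counts: list n_y of training samples per class.
    b: numerator of the constant in Equation 6 (hypothetical value).
    """
    n = sum(class_counts)
    # Equation 4: pick the kernel whose center is closest to x.
    nearest = min(kernels, key=lambda k: math.dist(k["mu"], x))
    density = diag_gaussian(x, nearest["mu"], nearest["var"])
    priors = [n_y / n for n_y in class_counts]
    # Equations 5 and 6: class-conditional density plus b / log(n).
    f_hat = [nearest["counts"][y] / class_counts[y] * density + b / math.log(n)
             for y in range(len(class_counts))]
    # Equation 7: Bayes rule with empirical priors.
    evidence = sum(f * p for f, p in zip(f_hat, priors))
    g_hat = [f * p / evidence for f, p in zip(f_hat, priors)]
    # Equation 8: predicted label.
    return g_hat, max(range(len(g_hat)), key=g_hat.__getitem__)

# Two hypothetical populated polytopes and two balanced classes.
kernels = [
    {"mu": [0.0, 0.0], "var": [0.1, 0.1], "counts": [40, 10]},
    {"mu": [1.0, 1.0], "var": [0.1, 0.1], "counts": [10, 40]},
]
class_counts = [50, 50]
g_near, y_near = posterior([0.0, 0.0], kernels, class_counts, b=1e-3)
g_far, y_far = posterior([50.0, 50.0], kernels, class_counts, b=1e-3)
```

Near a kernel center the scores follow the class proportions of that polytope, while far from all centers the Gaussian term underflows and the scores fall back to the empirical priors.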

4 Model Parameter Estimation

4.1 Gaussian Kernel Parameter Estimation

We fit Gaussian kernel parameters to the samples that end up in the $r$ -th polytope. We set the kernel center along the $d$ -th dimension:

\begin{equation}
\hat{\mu}^{d}_{r}=\frac{1}{n_{r}}\sum_{i=1}^{n}x^{d}_{i}\mathbbm{1}(\mathbf{x}_{i}\in Q_{r}), \tag{9}
\end{equation}

where $x^{d}_{i}$ is the value of $\mathbf{x}_{i}$ along the $d$ -th dimension. We set the kernel variance along the $d$ -th dimension:

\begin{equation}
(\hat{\sigma}^{d}_{r})^{2}=\frac{1}{n_{r}}\left\{\sum_{i=1}^{n}\mathbbm{1}(\mathbf{x}_{i}\in Q_{r})(x^{d}_{i}-\hat{\mu}^{d}_{r})^{2}+\lambda\right\}, \tag{10}
\end{equation}

where $\lambda$ is a small constant that prevents $\hat{\sigma}^{d}_{r}$ from being $0$ . We constrain our estimated Gaussian kernels to have diagonal covariance.
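As a minimal sketch of Equations 9 and 10, the function below fits the per-dimension mean and variance for the samples of one polytope. The function name and the default value of `lam` (standing in for $\lambda$) are assumptions made for illustration.

```python
def fit_kernel(points, lam=1e-6):
    """Per-dimension mean (Eq. 9) and variance (Eq. 10) for one polytope.

    points: list of samples (lists of floats) that fall in the polytope.
    lam: the small constant lambda of Equation 10 (hypothetical value).
    """
    n_r = len(points)
    dim = len(points[0])
    mu = [sum(p[d] for p in points) / n_r for d in range(dim)]
    # lambda is added inside the braces, before dividing by n_r.
    var = [(sum((p[d] - mu[d]) ** 2 for p in points) + lam) / n_r
           for d in range(dim)]
    return mu, var

# Two samples that agree along the second dimension.
mu, var = fit_kernel([[0.0, 1.0], [2.0, 1.0]])
```

In this example the second dimension has zero empirical spread, and the added constant keeps its variance strictly positive.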

4.2 Sample Size Ratio Estimation

For a high-dimensional dataset with a small training sample size, the polytopes are sparsely populated with training samples. To improve the estimate of the ratio $\frac{n_{ry}}{n_{y}}$ in Equation 5, we incorporate the samples from other polytopes $Q_{s}$ based on the similarity $w_{rs}$ between $Q_{r}$ and $Q_{s}$ as:

\begin{equation}
\frac{\hat{n}_{ry}}{\hat{n}_{y}}=\frac{\sum_{s\in\mathcal{P}}\sum_{i=1}^{n}w_{rs}\mathbbm{1}(\mathbf{x}_{i}\in Q_{s})\mathbbm{1}(y_{i}=y)}{\sum_{r\in\mathcal{P}}\sum_{s\in\mathcal{P}}\sum_{i=1}^{n}w_{rs}\mathbbm{1}(\mathbf{x}_{i}\in Q_{s})\mathbbm{1}(y_{i}=y)}. \tag{11}
\end{equation}

As $n\to\infty$ , the estimated weights $w_{rs}$ should satisfy the condition:

\begin{equation}
w_{rs}\to\begin{cases}0,&\text{if }Q_{r}\neq Q_{s}\\ 1,&\text{if }Q_{r}=Q_{s}.\end{cases} \tag{12}
\end{equation}

We describe the estimation procedure for $w_{rs}$ in Sections 4.3 and 4.4. Note that if Condition 12 is satisfied, then we have $\frac{\hat{n}_{ry}}{\hat{n}_{y}}\to\frac{n_{ry}}{n_{y}}$ as $n\to\infty$ . Therefore, we modify Equation 5 as:

\begin{equation}
\hat{f}_{y}(\mathbf{x})=\frac{1}{\hat{n}_{y}}\sum_{r\in\mathcal{P}}\hat{n}_{ry}\mathcal{G}(\mathbf{x};\hat{\mu}_{r},\hat{\Sigma}_{r})\mathbbm{1}(r=\hat{r}^{*}_{\mathbf{x}}), \tag{13}
\end{equation}

where $\hat{r}^{*}_{\mathbf{x}}=\operatornamewithlimits{argmin}_{r}\|\hat{\mu}_{r}-\mathbf{x}\|$ . We then use $\hat{f}_{y}(\mathbf{x})$ estimated using Equation 13 in Equations 6, 7 and 8, respectively. Below, we describe how we estimate $w_{rs}$ for KDF and KDN.
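The weighted ratio in Equation 11 can be sketched as follows, assuming the per-polytope class counts and the weights $w_{rs}$ are already available as nested lists. The names `counts` and `w` and the example values are hypothetical.

```python
def weighted_ratios(counts, w):
    """Return ratio[r][y] = n_hat_ry / n_hat_y as in Equation 11.

    counts[s][y]: number of training samples of class y in polytope s.
    w[r][s]: similarity weight between polytopes r and s.
    """
    n_poly, n_class = len(counts), len(counts[0])
    # Numerator: weighted class counts borrowed from every polytope s.
    n_hat = [[sum(w[r][s] * counts[s][y] for s in range(n_poly))
              for y in range(n_class)] for r in range(n_poly)]
    # Denominator: the same quantity summed over all polytopes r.
    n_hat_y = [sum(n_hat[r][y] for r in range(n_poly)) for y in range(n_class)]
    return [[n_hat[r][y] / n_hat_y[y] for y in range(n_class)]
            for r in range(n_poly)]

counts = [[3, 1], [1, 3]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # limiting weights of Condition 12
smoothed = [[1.0, 0.5], [0.5, 1.0]]   # hypothetical finite-sample weights
plain = weighted_ratios(counts, identity)
soft = weighted_ratios(counts, smoothed)
```

With identity weights the function reproduces the raw ratios $n_{ry}/n_{y}$, which matches the limiting behaviour required by Condition 12.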

4.3 Forest Kernel

Consider $T$ decision trees in a random forest trained on $n$ i.i.d. training samples $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ . Each tree $t$ partitions the feature space into $p_{t}$ polytopes resulting in a set of polytopes: $\{\{Q_{t,r}\}_{r=1}^{p_{t}}\}_{t=1}^{T}$ . The intersection of these polytopes gives a new set of polytopes $\{Q_{r}\}_{r=1}^{p}$ for the forest. For any two points $\mathbf{x}\in Q_{r}$ and $\mathbf{x}^{\prime}\in Q_{s}$ , we define the kernel $\mathcal{K}(r,s)$ as:

\begin{equation}
\mathcal{K}(r,s)=\frac{t_{rs}}{T}, \tag{14}
\end{equation}

where $t_{rs}$ is the number of trees in which $\mathbf{x}$ and $\mathbf{x}^{\prime}$ end up in the same leaf node. Here, $0\leq\mathcal{K}(r,s)\leq 1$ .

If the two samples end up in the same leaf in all the trees, i.e., $\mathcal{K}(r,s)=1$ , they belong to the same polytope, i.e. $r=s$ . In short, $\mathcal{K}(r,s)$ is the fraction of total trees where the two samples follow the same path from the root to a leaf node. We exponentiate $\mathcal{K}(r,s)$ so that Condition 12 is satisfied:

\begin{equation}
w_{rs}=\mathcal{K}(r,s)^{k\log n}. \tag{15}
\end{equation}

We choose $k$ using grid search on a hold-out dataset.
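Equations 14 and 15 can be sketched with plain lists of leaf indices, one entry per tree. The leaf indices below are hypothetical; in practice they could come from any routine that reports the leaf reached by a sample in each tree (for example, scikit-learn's `RandomForestClassifier.apply`, mentioned here only as one possible source). The default `k=1.0` is likewise an assumption.

```python
import math

def forest_kernel(leaves_a, leaves_b):
    """Equation 14: fraction of trees placing both samples in the same leaf."""
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)

def kernel_weight(kernel_value, n, k=1.0):
    """Equation 15: w_rs = K(r, s) ** (k * log n)."""
    return kernel_value ** (k * math.log(n))

# Hypothetical leaf indices of two samples across T = 4 trees.
leaves_x = [4, 7, 2, 9]
leaves_x_prime = [4, 7, 5, 9]
K = forest_kernel(leaves_x, leaves_x_prime)  # 3 of 4 trees agree
w_small_n = kernel_weight(K, n=100)
w_large_n = kernel_weight(K, n=10**6)
w_same = kernel_weight(forest_kernel(leaves_x, leaves_x), n=10**6)
```

The weight for distinct polytopes shrinks as $n$ grows while the weight of a polytope with itself stays at one, in line with Condition 12.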

4.4 Network Kernel

Consider a fully connected $L$-layer ReLU-net trained on $n$ i.i.d. training samples $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ . We have the set of all nodes denoted by $\mathcal{N}_{l}$ at a particular layer $l$ . We can randomly pick a node $n_{l}\in\mathcal{N}_{l}$ at each layer $l$ , and construct a sequence of nodes starting at the input layer and ending at the output layer which we call an activation path: $m=\{n_{l}\in\mathcal{N}_{l}\}_{l=1}^{L}$ . Note that there are $N=\prod_{l=1}^{L}|\mathcal{N}_{l}|$ possible activation paths for a sample in the ReLU-net. We index each path by a unique identifier number $z\in\mathbb{N}$ and construct a sequence of activation paths as: $\mathcal{M}=\{m_{z}\}_{z=1,\cdots,N}$ . Therefore, $\mathcal{M}$ contains all possible activation pathways from the input to the output of the network.

While pushing a training sample $\mathbf{x}_{i}$ through the network, we define the activation from a ReLU unit at any node as ‘ $1$ ’ when it has positive output and ‘ $0$ ’ otherwise. Therefore, the activation indicates on which side of the affine function at each node the sample falls. The activation for all nodes in an activation path $m_{z}$ for a particular sample creates an activation mode $a_{z}\in\{0,1\}^{L}$ . If we evaluate the activation mode for all activation paths in $\mathcal{M}$ while pushing a sample through the network, we get a sequence of activation modes: $\mathcal{A}_{r}=\{a_{z}^{r}\}_{z=1}^{N}$ . Here $r$ is the index of the polytope where the sample falls in.

If the two sequences of activation modes for two different training samples are identical, they belong to the same polytope. In other words, if $\mathcal{A}_{r}=\mathcal{A}_{s}$ , then $Q_{r}=Q_{s}$ . This statement holds because the above samples will lie on the same side of the affine function at each node in different layers of the network. Now, we define the kernel $\mathcal{K}(r,s)$ as:

\begin{equation}
\mathcal{K}(r,s)=\frac{\sum_{z=1}^{N}\mathbbm{1}(a_{z}^{r}=a_{z}^{s})}{N}. \tag{16}
\end{equation}

Note that $0\leq\mathcal{K}(r,s)\leq 1$ . In short, $\mathcal{K}(r,s)$ is the fraction of total activation paths which are identically activated for two samples in two different polytopes $r$ and $s$ . We exponentiate the kernel using Equation 15. Pseudocode outlining the two algorithms is provided in Appendix D.
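The network kernel in Equation 16 can be sketched by enumerating every activation path for a tiny network. The second function rests on an observation that is not stated in the text and should be read as an assumption of this sketch: because a path picks exactly one node per layer, the number of matching paths factorizes into a product of per-layer agreement counts, which avoids enumerating all $N$ paths. The activation patterns are hypothetical.

```python
from itertools import product

def kernel_brute_force(act_r, act_s):
    """Equation 16 by enumerating every activation path (one node per layer)."""
    layer_sizes = [len(layer) for layer in act_r]
    matches = total = 0
    for path in product(*(range(size) for size in layer_sizes)):
        mode_r = [act_r[l][node] for l, node in enumerate(path)]
        mode_s = [act_s[l][node] for l, node in enumerate(path)]
        matches += mode_r == mode_s
        total += 1
    return matches / total

def kernel_factorized(act_r, act_s):
    """Same value computed as a product of per-layer agreement fractions."""
    value = 1.0
    for layer_r, layer_s in zip(act_r, act_s):
        agree = sum(1 for a, b in zip(layer_r, layer_s) if a == b)
        value *= agree / len(layer_r)
    return value

# Hypothetical 0/1 ReLU activations per layer for two polytopes.
act_r = [[1, 0, 1, 1], [0, 1, 1], [1, 0]]
act_s = [[1, 0, 0, 1], [0, 1, 1], [0, 0]]
k_brute = kernel_brute_force(act_r, act_s)
k_fact = kernel_factorized(act_r, act_s)
```

Both functions agree on this example: 9 of the 24 paths are identically activated, and the factorized form needs only one pass over the nodes.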

4.5 Geodesic Distance

Consider $\mathcal{P}_{n}=\{Q_{1},Q_{2},\cdots,Q_{p}\}$ as a partition of $\mathbb{R}^{D}$ given by a random forest or a ReLU-net after being trained on $n$ training samples. We measure the distance between two points $\mathbf{x}\in Q_{r},\mathbf{x}^{\prime}\in Q_{s}$ using the kernels introduced in Equations 14 and 16, and call it the ‘Geodesic’ distance [22]:

\begin{equation}
d(r,s)=-\mathcal{K}(r,s)+\frac{1}{2}(\mathcal{K}(r,r)+\mathcal{K}(s,s))=1-\mathcal{K}(r,s). \tag{17}
\end{equation}
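The second equality in Equation 17 relies on $\mathcal{K}(r,r)=1$, which holds for both kernels because a polytope agrees with itself on every tree (Equation 14) and on every activation path (Equation 16). Written out for the forest kernel as a short worked step:

```latex
\begin{align*}
\mathcal{K}(r,r) &= \frac{t_{rr}}{T} = \frac{T}{T} = 1, \\
d(r,s) &= -\mathcal{K}(r,s) + \tfrac{1}{2}\bigl(1 + 1\bigr) = 1 - \mathcal{K}(r,s).
\end{align*}
```

The same argument applies to the network kernel with $N$ activation paths in place of $T$ trees.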

Proposition 4.1.

$(\mathcal{P}_{n},d)$ is a metric space.

Proof 4.2.

See Appendix A.1 for the proof.

We use the Geodesic distance to find the nearest polytope to the inference point. As the Geodesic distance cannot distinguish between points within the same polytope, it has a resolution similar to the size of the polytope. To discriminate between two points within the same polytope, we fit a Gaussian kernel within the polytope (described above). As $h_{n}\to 0$ , the resolution of the Geodesic distance improves. In Section 5, we will empirically show that the Geodesic distance scales better with dimension than the Euclidean distance.

Given $n$ training samples $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ , we define the distance of an inference point $\mathbf{x}$ from the training points as: $d_{\mathbf{x}}=\min_{i=1,\cdots,n}\|\mathbf{x}-\mathbf{x}_{i}\|$ , where $\|\cdot\|$ denotes Euclidean distance.

Proposition 4.3 (Asymptotic OOD Convergence).

Given non-zero and bounded bandwidths of the Gaussians, $\hat{g}_{y}$ converges almost surely as:

\[
\lim_{d_{\mathbf{x}}\to\infty}\hat{g}_{y}(\mathbf{x})=\hat{P}_{Y}(y).
\]

Proof 4.4.

See Appendix A.2 for the proof.
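A heuristic reading of Proposition 4.3, offered only as intuition and not as a substitute for the proof in Appendix A.2: with bounded bandwidths, the Gaussian term in Equation 5 vanishes as $d_{\mathbf{x}}\to\infty$, so only the constant from Equation 6 survives, and it cancels in Equation 7 because the empirical priors sum to one.

```latex
\begin{align*}
\tilde{f}_{y}(\mathbf{x}) \to 0
  \quad&\Rightarrow\quad
  \hat{f}_{y}(\mathbf{x}) \to \frac{b}{\log(n)} \quad \forall y\in\mathcal{Y}, \\
\hat{g}_{y}(\mathbf{x}) &\to
  \frac{\frac{b}{\log(n)}\hat{P}_{Y}(y)}{\sum_{k=1}^{K}\frac{b}{\log(n)}\hat{P}_{Y}(k)}
  = \frac{\hat{P}_{Y}(y)}{\sum_{k=1}^{K}\hat{P}_{Y}(k)}
  = \hat{P}_{Y}(y).
\end{align*}
```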

5 Empirical Results

We conduct several experiments on simulated, OpenML-CC18 [23]\footnote{https://www.openml.org/s/99} and vision benchmark datasets to gain insights into the finite-sample performance of KDF and KDN. The details of the simulation datasets and the hyperparameters used for all the experiments are provided in Appendix C. For the Trunk simulation dataset, we follow the simulation setup proposed by Trunk [24], which was designed to demonstrate the ‘curse of dimensionality’. In the Trunk simulation, a binary class dataset is used where each class is sampled from a Gaussian distribution with higher dimensions having increasingly less discriminative information. We use both Euclidean and Geodesic distance to detect the nearest polytope (see Equation 4) on simulation datasets and use only Geodesic distance for benchmark datasets. For the simulation setups, we use classification error, Hellinger distance [25, 26] from the true class conditional posteriors and mean max confidence [4] as performance statistics. To measure in-distribution calibration for the datasets in the OpenML-CC18 data suite, we use the maximum calibration error (MCE) as defined by Guo et al. [18] with a fixed number of bins, $R=15$ , across all the datasets. Given $n$ OOD samples, we define the OOD calibration error (OCE) to measure OOD performance for the benchmark datasets as:

\begin{equation}
\text{OCE}=\frac{1}{n}\sum_{i=1}^{n}\left|\max_{y\in\mathcal{Y}}(\hat{P}_{Y|X}(y|\mathbf{x}_{i}))-\max_{y\in\mathcal{Y}}(\hat{P}_{Y}(y))\right|. \tag{18}
\end{equation}
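Equation 18 translates directly into a few lines of code. The sketch below assumes the estimated posteriors for the OOD samples and the estimated class priors are given as plain lists; the example values are hypothetical.

```python
def ood_calibration_error(posteriors, priors):
    """Equation 18: mean |max posterior - max prior| over OOD samples."""
    max_prior = max(priors)
    return sum(abs(max(p) - max_prior) for p in posteriors) / len(posteriors)

priors = [0.5, 0.5]
overconfident = [[0.99, 0.01], [0.05, 0.95]]   # hypothetical OOD posteriors
calibrated = [[0.5, 0.5], [0.52, 0.48]]        # hypothetical OOD posteriors
oce_bad = ood_calibration_error(overconfident, priors)
oce_good = ood_calibration_error(calibrated, priors)
```

A model whose OOD posteriors stay close to the class priors obtains an OCE near zero, whereas confident OOD predictions are penalized.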

For the tabular and the vision datasets, we use ID calibration approaches, such as Isotonic [15, 16] and Sigmoid [17] regression, as baselines. Additionally, for the vision benchmark dataset, we provide results with OOD calibration approaches such as ACET [4], ODIN [6] and OE (outlier exposure) [10]. For each approach, $70\%$ of the training data was used to fit the model and the rest of the data was used to calibrate the model.

5.1 Empirical Study on Tabular Data

Figure 1: Simulation datasets, classification error, Hellinger distance from true posteriors, and mean max confidence or posterior for A. five two-dimensional and B. a high dimensional (Trunk) simulation experiments, visualized for the first two dimensions. The median performance is shown as a dark curve with the shaded region as error bars.

5.1.1 Simulation Study

The leftmost column of Figure 1 shows $10000$ training samples, with $5000$ samples per class, sampled within the region $[-1,1]\times[-1,1]$ from the six simulation setups described in Appendix C. Therefore, the empty annular region between $[-1,1]\times[-1,1]$ and $[-2,2]\times[-2,2]$ is the low density or OOD region in Figure 1. Figure 1 quantifies the performance of the algorithms, which are visually represented in Appendix Figure 4. KDF and KDN maintain classification accuracy similar to that of their parent algorithms. We measure the Hellinger distance from the true distribution for increasing training sample size within the $[-1,1]\times[-1,1]$ region as a statistic for in-distribution calibration. Columns $3$ and $6$ in Figure 1 show that KDF and KDN are better at estimating the posteriors in the ID region than their parent methods. In all of the simulations, using the Geodesic distance results in better performance than using the Euclidean distance. For measuring OOD performance, we keep the training sample size fixed at $1000$ and normalize the training data by the maximum of their $l_{2}$ norms so that the training data is confined within a unit circle. For inference, we sample $1000$ inference points uniformly from circles of increasing radius and plot the mean max posterior against the distance from the origin. Therefore, distances up to $1$ correspond to in-distribution samples and distances greater than $1$ can be considered the OOD region. As shown in Columns $4$ and $7$ of Figure 1, the mean max confidence for KDF and KDN converges to the maximum of the class priors, i.e., $0.5$ , as we move farther away from the training data.

Row $6$ of Figure 1 shows that KDF-Geodesic and KDN-Geodesic scale better with higher dimensions than their respective Euclidean counterparts.
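The OOD evaluation protocol described above (normalize by the maximum $l_{2}$ norm, then sample inference points at increasing distance from the origin) can be sketched as follows. This is an illustrative assumption about the sampling step, not the authors' code: points are drawn on the surface of a sphere by normalizing Gaussian vectors, and the function names and seed are hypothetical.

```python
import math
import random

def sample_on_sphere(radius, dim, n_points, seed=0):
    """Draw points uniformly from the sphere of a given radius in R^dim."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        # A normalized Gaussian vector is uniformly distributed in direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        points.append([radius * v / norm for v in direction])
    return points

def normalize_by_max_norm(data):
    """Scale training data so that the largest l2 norm equals 1."""
    max_norm = max(math.sqrt(sum(v * v for v in x)) for x in data)
    return [[v / max_norm for v in x] for x in data]

train = normalize_by_max_norm([[3.0, 4.0], [1.0, 0.0], [0.0, -2.0]])
ood_points = sample_on_sphere(radius=3.0, dim=2, n_points=5)
```

Repeating the sampling for a grid of radii and averaging the maximum posterior at each radius would give a curve of the kind plotted against distance in Figure 1.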

5.1.2 OpenML-CC18 Data Study

We use the OpenML-CC18 data suite for the tabular benchmark dataset study. We exclude any dataset that contains categorical features or NaN values\footnote{We also excluded the dataset with dataset id $23517$ as we could not achieve better than chance accuracy using RF and DN on that dataset.} and conduct our experiments on $45$ datasets with varying dimensions and sample sizes. For the OOD experiments, we follow a setup similar to that of the simulation data. We normalize the training data by their maximum $l_{2}$ norm and sample $1000$ testing samples uniformly from hyperspheres whose radii increase from $1$ to $5$ . For each dataset, we measure improvement with respect to the parent algorithm:

\begin{equation}
\frac{\mathcal{E}_{p}-\mathcal{E}_{M}}{\mathcal{E}_{p}}, \tag{19}
\end{equation}

where $\mathcal{E}_{p}$ is the classification error, MCE or OCE of the parent algorithm and $\mathcal{E}_{M}$ is the corresponding statistic for the approach under consideration. Note that a positive improvement implies that the corresponding approach performs better than the parent approach. We report the median improvement across datasets along with error bars in Figure 2. The extended results for each dataset are shown separately in the appendix. The left column of Figure 2 shows that, on average, KDF and KDN have similar or better classification accuracy compared to their respective parent algorithms, whereas Isotonic and Sigmoid regression have lower classification accuracy in most cases. However, according to the middle column of Figure 2, KDF and KDN have in-distribution calibration performance similar to that of the other baseline approaches. Most interestingly, the right column of Figure 2 shows that KDN and KDF improve the OOD calibration of their respective parent algorithms by a large margin, while the baseline approaches fail to address the OOD calibration problem.
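The improvement statistic in Equation 19 and its median over datasets can be sketched as below. The error pairs are hypothetical values chosen for illustration and are not results from the paper.

```python
from statistics import median

def relative_improvement(parent_error, method_error):
    """Equation 19: (E_p - E_M) / E_p; positive means better than the parent."""
    return (parent_error - method_error) / parent_error

# Hypothetical (parent, method) error pairs for three datasets.
pairs = [(0.20, 0.15), (0.10, 0.12), (0.40, 0.10)]
improvements = [relative_improvement(p, m) for p, m in pairs]
median_improvement = median(improvements)
```

Note that the ratio is undefined when the parent error is exactly zero, so such cases would need separate handling.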

5.2 Empirical Study on Vision Data

In vision data, each image pixel contains local information about the neighboring pixels. To extract the local information, we use convolutional or vision transformer encoders at the front-end. More precisely, we have a front-end encoder, $h_{e}:\mathbb{R}^{D}\mapsto\mathbb{R}^{m}$ and typically, $m\ll D$ . After the encoder, there are a few fully connected dense layers for discriminating among the $K$ class labels, $h_{f}:\mathbb{R}^{m}\mapsto\mathbb{R}^{K}$ . Note that the $m$ -dimensional embedding outputs from the encoder are partitioned into polytopes by the dense layers (see Equation 3) and we fit a KDN on the embedding outputs. The above approach results in extraction of better inductive bias by KDN from the parent model and makes KDN more scalable with larger parent models and training sample sizes.

5.2.1 Simulation Study

For the simulation study, we use a simple CNN with one convolutional layer ( $3$ channels with $3\times 3$ kernel) followed by two fully connected layers with $10$ and $2$ nodes, respectively. We train the CNN on $2000$ circle (radius $10$ ) and $2000$ rectangle (sides $20,50$ ) images with their RGB values fixed at $[127,127,127]$ and their centers randomly sampled within a square with sides $100$ . The background pixels, where there is no object (circle, rectangle or ellipse), were set to $0$ .

We perform three experiments while inducing semantic shifts in the inference points as shown in Figure 3. In the first experiment, we randomly sampled data similar to the training points. However, we added the same shift to all the RGB values of an inference point (shown as color intensity in Figure 3 D). Therefore, the inference point is ID for color intensity at $127$ and otherwise OOD. In the second experiment, we kept the RGB values fixed at $[127,127,127]$ while taking a convex combination of a circle and a rectangle. Let images of circles and rectangles be denoted by $X_{c}$ and $X_{r}$ . We derive an inference point $X_{inf}$ as:

\begin{equation}
X_{inf}=\epsilon X_{c}+(1-\epsilon)X_{r} \tag{20}
\end{equation}

Therefore, $X_{inf}$ is maximally distant from the training points for $\epsilon=0.5$ and closest to the ID points at $\epsilon=\{0,1\}$ . In the third experiment, we sampled ellipse images with the same RGB values as the training points. However, this time we gradually change one of the ellipse axes from $0.01$ to $40$ while keeping the other axis fixed at $10$ . As a result, the inference point becomes ID for the axis length of $10$ . As shown in Figure 3 (D, E, F), in all the experiments KDN becomes less confident for the OOD points while the parent CNN remains overconfident throughout the semantic shifts of the test points.
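The convex combination in Equation 20 amounts to a pixel-wise blend of the two images. The sketch below uses tiny hypothetical single-channel arrays in place of the circle and rectangle images.

```python
def convex_combination(x_c, x_r, eps):
    """Equation 20: pixel-wise eps * X_c + (1 - eps) * X_r."""
    return [[eps * c + (1.0 - eps) * r for c, r in zip(row_c, row_r)]
            for row_c, row_r in zip(x_c, x_r)]

# Hypothetical 2x2 single-channel "images" with object pixels at 127.
x_c = [[127.0, 0.0], [0.0, 0.0]]
x_r = [[127.0, 127.0], [0.0, 0.0]]
x_mid = convex_combination(x_c, x_r, 0.5)   # halfway between both shapes
x_rect = convex_combination(x_c, x_r, 0.0)  # recovers the rectangle image
```

Pixels where the two shapes overlap keep their intensity, while pixels covered by only one shape are scaled by its mixing weight.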

5.2.2 Vision Benchmark Datasets Study

In this study, we use a ViT\_B16 vision transformer encoder [27] (provided in the keras-vit package) pretrained on the ImageNet [28] dataset and finetuned on CIFAR-10 [29]. We use the same encoder for all the baseline algorithms and finetune it with the corresponding loss function without freezing any weights. As shown in Table 1, pretrained vision transformers are already well-calibrated for ID, and the OOD approaches (ACET, ODIN, OE) degrade the ID calibration of the parent model. In contrast, ID calibration approaches (Isotonic, Sigmoid) perform poorly compared to KDN in the OOD region. KDN achieves a compromise between ID and OOD performance while having reduced confidence on wrongly classified ID samples. The number of populated polytopes (and Gaussians) for KDN is $9323\pm 353$ . See Appendix F for the corresponding experiments using ResNet-50.

Table 1: KDN achieves good calibration in both the ID and OOD regions, whereas other approaches excel either in the ID or the OOD region. Notably, KDN has reduced confidence on wrongly classified ID points. ‘$\uparrow$’ and ‘$\downarrow$’ indicate whether higher or lower values are better, respectively. MMC$^{*}$ = Mean Max Confidence on wrongly classified ID points.

\begin{tabular}{lllccccccc}
\hline
 & Dataset & Statistics & Parent & KDN & Isotonic & Sigmoid & ACET & ODIN & OE \\
\hline
ID & CIFAR-10 & Accuracy $(\%)\uparrow$ & $98.06\pm 0.00$ & $97.45\pm 0.00$ & $98.16\pm 0.00$ & $98.10\pm 0.00$ & $\mathbf{98.23}\pm 0.00$ & $97.97\pm 0.00$ & $97.94\pm 0.00$ \\
 & & MCE $\downarrow$ & $\mathbf{0.00}\pm 0.00$ & $\mathbf{0.00}\pm 0.00$ & $\mathbf{0.00}\pm 0.00$ & $\mathbf{0.00}\pm 0.00$ & $0.01\pm 0.00$ & $0.02\pm 0.00$ & $0.01\pm 0.00$ \\
 & & MMC$^{*}$ $\downarrow$ & $0.76\pm 0.01$ & $\mathbf{0.65}\pm 0.08$ & $0.74\pm 0.02$ & $0.90\pm 0.01$ & $0.86\pm 0.02$ & $0.97\pm 0.01$ & $0.69\pm 0.01$ \\
\hline
OOD & CIFAR-100 & OCE $\downarrow$ & $0.47\pm 0.01$ & $\mathbf{0.12}\pm 0.01$ & $0.47\pm 0.01$ & $0.69\pm 0.01$ & $0.57\pm 0.01$ & $0.79\pm 0.00$ & $0.29\pm 0.01$ \\
 & SVHN & OCE $\downarrow$ & $0.44\pm 0.06$ & $\mathbf{0.08}\pm 0.02$ & $0.34\pm 0.12$ & $0.64\pm 0.16$ & $0.47\pm 0.04$ & $0.75\pm 0.03$ & $0.11\pm 0.02$ \\
 & Noise & OCE $\downarrow$ & $0.28\pm 0.08$ & $0.03\pm 0.02$ & $0.30\pm 0.04$ & $0.56\pm 0.12$ & $\mathbf{0.01}\pm 0.00$ & $0.53\pm 0.09$ & $0.07\pm 0.02$ \\
\hline
\end{tabular}

6 Discussion

In this paper, we demonstrated a simple intuition that renders traditional deep discriminative models into a type of binning and kerneling approach. The bin boundaries are determined by the internal structure learned by the parent approach, and the Geodesic distance encodes the low-dimensional structure learned by the model. Moreover, the Geodesic distance introduced in this paper may have a broader impact on understanding the internal structure of deep discriminative models, which we will pursue in the future. Our code, including the package and the experiments in this manuscript, will be made publicly available upon acceptance of the paper.

Acknowledgements

The authors thank the support of the NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning (NSF grant 2031985) and THEORINET. This work is graciously supported by the Defense Advanced Research Projects Agency (DARPA) Lifelong Learning Machines program through contracts FA8650-18-2-7834 and HR0011-18-2-0025. Research was partially supported by funding from Microsoft Research and the Kavli Neuroscience Discovery Institute.

References

Guo et al. [2017a] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017a.
Kristiadi et al. [2020] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being bayesian, even just a bit, fixes overconfidence in ReLU networks. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5436–5446. PMLR, 13–18 Jul 2020.
Xu et al. [2021a] Haoyin Xu, Kaleab A. Kinfu, Will LeVine, Sambit Panda, Jayanta Dey, Michael Ainsworth, Yu-Chung Peng, Madi Kusmanov, Florian Engert, Christopher M. White, Joshua T. Vogelstein, and Carey E. Priebe. When are Deep Networks really better than Decision Forests at small sample sizes, and how? arXiv preprint arXiv:2108.13637, 2021a.
Hein et al. [2019] Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 41–50, 2019.
Meinke et al. [2021] Alexander Meinke, Julian Bitterwolf, and Matthias Hein. Provably robust detection of out-of-distribution data (almost) for free. arXiv preprint arXiv:2106.04260, 2021.
Liang et al. [2017] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
Nandy et al. [2020] Jay Nandy, Wynne Hsu, and Mong Li Lee. Towards maximizing the representation gap between in-domain & out-of-distribution examples. Advances in Neural Information Processing Systems, 33:9239–9250, 2020.
Wan et al. [2018] Weitao Wan, Yuanyi Zhong, Tianpeng Li, and Jiansheng Chen. Rethinking feature distribution for loss functions in image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9117–9126, 2018.
DeVries and Taylor [2018] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.
Hendrycks et al. [2018] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.
Ren et al. [2019] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. Advances in neural information processing systems, 32, 2019.
Kingma et al. [2019] Diederik P Kingma, Max Welling, et al. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
Nalisnick et al. [2018] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136, 2018.
Zadrozny and Elkan [2001] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In ICML, volume 1, pages 609–616, 2001.
Caruana [2004] R Caruana. Predicting good probabilities with supervised learning. In Proceedings of NIPS 2004 Workshop on Calibration and Probabilistic Prediction in Supervised Learning, 2004.
Platt et al. [1999] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
Guo et al. [2017b] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017b.
Guo et al. [2019] Richard Guo, Ronak Mehta, Jesus Arroyo, Hayden Helm, Cencheng Shen, and Joshua T Vogelstein. Estimating information-theoretic quantities with uncertainty forests. arXiv, pages arXiv–1907, 2019.
Kull et al. [2019] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. Advances in neural information processing systems, 32, 2019.
Xu et al. [2021b] Haoyin Xu, Kaleab A Kinfu, Will LeVine, Sambit Panda, Jayanta Dey, Michael Ainsworth, Yu-Chung Peng, Madi Kusmanov, Florian Engert, Christopher M White, et al. When are deep networks really better than decision forests at small sample sizes, and how? arXiv preprint arXiv:2108.13637, 2021b.
Schölkopf [2000] Bernhard Schölkopf. The kernel trick for distances. Advances in neural information processing systems, 13, 2000.
Bischl et al. [2017] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G Mantovani, Jan N van Rijn, and Joaquin Vanschoren. Openml benchmarking suites. arXiv preprint arXiv:1708.03731, 2017.
Trunk [1979] Gerard V Trunk. A problem of dimensionality: A simple example. IEEE Transactions on pattern analysis and machine intelligence, (3):306–307, 1979.
Kailath [1967] Thomas Kailath. The divergence and bhattacharyya distance measures in signal selection. IEEE transactions on communication technology, 15(1):52–60, 1967.
Rao [1995] C Radhakrishna Rao. A review of canonical coordinates and an alternative to correspondence analysis using hellinger distance. Qüestiió: quaderns d’estadística i investigació operativa, 1995.
Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 10.1109/CVPR.2009.5206848.
[29] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
Khosla et al. [2020] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673, 2020.

Appendix A Proofs

A.1 Proof of Proposition 4.1

To prove that $d$ is a valid distance metric on $\mathcal{P}_{n}$, we need to establish the following four statements:

1.

$d(r,s)=0$ when $r=s$ .
Proof: By definition, $\mathcal{K}(r,s)=1$ and $d(r,s)=0$ when $r=s$ .
2.

$d(r,s)>0$ when $r\neq s$ .
Proof: By definition, $0\leq\mathcal{K}(r,s)<1$ and $d(r,s)>0$ for $r\neq s$ .
3.

$d$ is symmetric, i.e., $d(r,s)=d(s,r)$ .
Proof: By definition, $\mathcal{K}(r,s)=\mathcal{K}(s,r)$, which implies $d(r,s)=d(s,r)$.
4.

$d$ satisfies the triangle inequality, i.e., for any three polytopes $Q_{r},Q_{s},Q_{t}\in\mathcal{P}_{n}$: $d(r,t)\leq d(r,s)+d(s,t)$.
Proof: Let $\mathcal{A}_{r}$ denote the set of activation modes (in a ReLU-net) or the set of leaf nodes (in a random forest) for a particular polytope $r$, and let $N$ be the total number of possible activation paths in a ReLU-net or the total number of trees in a random forest. Below, $c(\cdot)$ denotes the cardinality of a set. We can write:

\begin{align}
N &\geq c((\mathcal{A}_{r}\cap\mathcal{A}_{s})\cup(\mathcal{A}_{s}\cap\mathcal{A}_{t})) \nonumber\\
&= c(\mathcal{A}_{r}\cap\mathcal{A}_{s})+c(\mathcal{A}_{s}\cap\mathcal{A}_{t})-c(\mathcal{A}_{r}\cap\mathcal{A}_{s}\cap\mathcal{A}_{t}) \nonumber\\
&\geq c(\mathcal{A}_{r}\cap\mathcal{A}_{s})+c(\mathcal{A}_{s}\cap\mathcal{A}_{t})-c(\mathcal{A}_{r}\cap\mathcal{A}_{t}). \tag{21}
\end{align}

The last step holds because $\mathcal{A}_{r}\cap\mathcal{A}_{s}\cap\mathcal{A}_{t}\subseteq\mathcal{A}_{r}\cap\mathcal{A}_{t}$. Rearranging the above inequality and dividing by $N$, we get:

\begin{align}
& N-c(\mathcal{A}_{r}\cap\mathcal{A}_{t})\leq N-c(\mathcal{A}_{r}\cap\mathcal{A}_{s})+N-c(\mathcal{A}_{s}\cap\mathcal{A}_{t}) \nonumber\\
\implies\; & 1-\frac{c(\mathcal{A}_{r}\cap\mathcal{A}_{t})}{N}\leq 1-\frac{c(\mathcal{A}_{r}\cap\mathcal{A}_{s})}{N}+1-\frac{c(\mathcal{A}_{s}\cap\mathcal{A}_{t})}{N} \nonumber\\
\implies\; & d(r,t)\leq d(r,s)+d(s,t). \tag{22}
\end{align}
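As an informal numerical sanity check of the triangle inequality, the following Python sketch treats the random forest case, where each polytope is identified by one leaf per tree, so that $c(\mathcal{A}_{r}\cap\mathcal{A}_{s})$ is the number of trees in which two polytopes share a leaf. The form $d(r,s)=1-c(\mathcal{A}_{r}\cap\mathcal{A}_{s})/N$ is read off from the last step of the proof. The function names and the random leaf assignments are illustrative assumptions and are not taken from any released code.

```python
import itertools
import random


def geodesic_distance(leaves_r, leaves_s):
    """d(r, s) = 1 - c(A_r ∩ A_s) / N, with N the number of trees."""
    n_trees = len(leaves_r)
    shared = sum(a == b for a, b in zip(leaves_r, leaves_s))
    return 1.0 - shared / n_trees


def check_triangle_inequality(polytopes, tol=1e-12):
    """Return True if d(r, t) <= d(r, s) + d(s, t) for every ordered triple."""
    for r, s, t in itertools.product(polytopes, repeat=3):
        lhs = geodesic_distance(r, t)
        rhs = geodesic_distance(r, s) + geodesic_distance(s, t)
        if lhs > rhs + tol:
            return False
    return True


rng = random.Random(0)
n_trees, n_leaves, n_polytopes = 20, 4, 15
# Hypothetical polytopes: one leaf index per tree.
polytopes = [tuple(rng.randrange(n_leaves) for _ in range(n_trees))
             for _ in range(n_polytopes)]
print(check_triangle_inequality(polytopes))
```

Under this reading, $d$ is a normalized Hamming distance between leaf assignments, which is why the check passes for any choice of leaf indices.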

A.2 Proof of Proposition 4.3

Note that we first find the polytope nearest to the inference point $\mathbf{x}$ using the Geodesic distance and then apply the Gaussian kernel of that polytope locally to $\mathbf{x}$. Here the Gaussian kernel uses the Euclidean distance from the kernel center to $\mathbf{x}$ (in the numerator of the exponent), so its value decays exponentially as the distance of the inference point from the kernel center increases. We first expand $\hat{g}_{y}(\mathbf{x})$:

\begin{align*}
\hat{g}_{y}(\mathbf{x}) &= \frac{\hat{f}_{y}(\mathbf{x})\hat{P}_{Y}(y)}{\sum_{k=1}^{K}\hat{f}_{k}(\mathbf{x})\hat{P}_{Y}(k)} \\
&= \frac{\tilde{f}_{y}(\mathbf{x})\hat{P}_{Y}(y)+\frac{b}{\log(n)}\hat{P}_{Y}(y)}{\sum_{k=1}^{K}\left(\tilde{f}_{k}(\mathbf{x})\hat{P}_{Y}(k)+\frac{b}{\log(n)}\hat{P}_{Y}(k)\right)}.
\end{align*}

As the inference point $\mathbf{x}$ becomes more distant from training samples (and more distant from all of the Gaussian centers), we have that $\mathcal{G}(\mathbf{x},\hat{\mu}_{r},\hat{\Sigma}_{r})$ becomes smaller. Thus, $\forall y,\tilde{f}_{y}(\mathbf{x})$ shrinks. More formally, $\forall y$ ,

\[
\lim_{d_{\mathbf{x}}\to\infty}\tilde{f}_{y}(\mathbf{x})=0.
\]

We can use this result to then examine the limiting behavior of our posteriors as the inference point $\mathbf{x}$ becomes more distant from the training data:

\begin{align*}
\lim_{d_{\mathbf{x}}\to\infty}\hat{g}_{y}(\mathbf{x}) &= \lim_{d_{\mathbf{x}}\to\infty}\frac{\tilde{f}_{y}(\mathbf{x})\hat{P}_{Y}(y)+\frac{b}{\log(n)}\hat{P}_{Y}(y)}{\sum_{k=1}^{K}\left(\tilde{f}_{k}(\mathbf{x})\hat{P}_{Y}(k)+\frac{b}{\log(n)}\hat{P}_{Y}(k)\right)} \\
&= \frac{\left(\lim_{d_{\mathbf{x}}\to\infty}\tilde{f}_{y}(\mathbf{x})\right)\hat{P}_{Y}(y)+\frac{b}{\log(n)}\hat{P}_{Y}(y)}{\sum_{k=1}^{K}\left(\left(\lim_{d_{\mathbf{x}}\to\infty}\tilde{f}_{k}(\mathbf{x})\right)\hat{P}_{Y}(k)+\frac{b}{\log(n)}\hat{P}_{Y}(k)\right)} \\
&= \frac{\hat{P}_{Y}(y)}{\sum_{k=1}^{K}\hat{P}_{Y}(k)} \\
&= \hat{P}_{Y}(y),
\end{align*}

where the third line cancels the common factor $\frac{b}{\log(n)}$ and the last line uses the fact that the estimated class priors sum to one.
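The limiting behavior can be illustrated with a small Python sketch. It assumes a toy one-dimensional problem with a single Gaussian kernel per class, so that $\tilde{f}_{y}$ reduces to one Gaussian density; the centers, bandwidth, priors, and $n$ below are hypothetical values chosen only for illustration, and $b$ follows the value listed in the hyperparameter tables.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior(x, centers, sigma, priors, b, n):
    """Class posteriors g_y(x) with the additive bias b / log(n) on each density."""
    bias = b / math.log(n)
    scores = [(gaussian_pdf(x, mu, sigma) + bias) * p
              for mu, p in zip(centers, priors)]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical two-class setup: one Gaussian kernel per class.
centers, sigma = [-1.0, 1.0], 0.5
priors = [0.3, 0.7]
b, n = math.exp(-1e-7), 1000

# Move the inference point away from both kernel centers.
for x in (1.0, 5.0, 50.0):
    print(x, [round(g, 4) for g in posterior(x, centers, sigma, priors, b, n)])
```

Close to a kernel center the posterior is dominated by the corresponding class, while far from both centers the Gaussian terms vanish and only the bias terms remain, so the posterior approaches the class priors.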

Appendix B Hardware and Software Configurations

•

Operating System: Linux (Ubuntu 20.04), macOS (Ventura 13.2.1)
•

VM Size: Azure Standard D96as v4 (96 vcpus, 384 GiB memory)
•

GPU: Apple M1 Max
•

Software: Python 3.8, scikit-learn $\geq$ 0.22.0, tensorflow-macos $\leq$ 2.9, tensorflow-metal $\leq$ 0.5.0.

Appendix C Simulations

We construct six types of binary class simulations:

•

Gaussian XOR is a two-class classification problem with equal class priors. Conditioned on being in class 0, a sample is drawn from a mixture of two Gaussians with means $\pm[0.5,0.5]^{\top}$ and standard deviations of $0.25$. Conditioned on being in class 1, a sample is drawn from a mixture of two Gaussians with means $\pm[0.5,-0.5]^{\top}$ and standard deviations of $0.25$.
•

Spiral is a two-class classification problem with the following data distributions: let $K$ be the number of classes and $S\sim$ multinomial $(\frac{1}{K}\vec{1}_{K},n)$ . Conditioned on $S$ , each feature vector is parameterized by two variables, the radius $r$ and an angle $\theta$ . For each sample, $r$ is sampled uniformly in $[0,1]$ . Conditioned on a particular class, the angles are evenly spaced between $\frac{4\pi(k-1)t_{K}}{K}$ and $\frac{4\pi(k)t_{K}}{K}$ , where $t_{K}$ controls the number of turns in the spiral. To inject noise along the spirals, we add Gaussian noise to the evenly spaced angles $\theta^{\prime}:\theta=\theta^{\prime}+\mathcal{N}(0,0.09)$ . The observed feature vector is then $(r\;\cos(\theta),r\;\sin(\theta))$ .
•

Circle is a two-class classification problem with equal class priors. Conditioned on being in class 0, a sample is drawn from a circle centered at $(0,0)$ with a radius of $r=0.75$ . Conditioned on being in class 1, a sample is drawn from a circle centered at $(0,0)$ with a radius of $r=1$ , which is cut off by the region bounds. To inject noise along the circles, we add Gaussian noise to the circle radii $r^{\prime}:r=r^{\prime}+\mathcal{N}(0,0.01)$ .
•

Sinewave is a two-class classification problem based on sine waves. Conditioned on being in class 0, a sample is drawn from the distribution $y=\cos(\pi x)$ . Conditioned on being in class 1, a sample is drawn from the distribution $y=\sin(\pi x)$ . We inject Gaussian noise to the sine wave heights $y^{\prime}:y=y^{\prime}+\mathcal{N}(0,0.01)$ .
•

Polynomial is a two-class classification problem with the following data distributions: $y=x^{a}$ . Conditioned on being in class 0, a sample is drawn from the distribution $y=x^{1}$ . Conditioned on being in class 1, a sample is drawn from the distribution $y=x^{3}$ . Gaussian noise is added to variables $y^{\prime}:y=y^{\prime}+\mathcal{N}(0,0.01)$ .

•

Trunk is a two-class classification problem with gradually increasing dimension and equal class priors. The class conditional probabilities are Gaussian:

	$\displaystyle P(X\mid Y=0)$	$\displaystyle=\mathcal{G}(\mu_{1},I),$
	$\displaystyle P(X\mid Y=1)$	$\displaystyle=\mathcal{G}(\mu_{2},I),$

where $\mu_{1}=\mu$, $\mu_{2}=-\mu$, $\mu$ is a $d$-dimensional vector whose $i$-th component is $(\frac{1}{i})^{1/2}$, and $I$ is the $d$-dimensional identity matrix.
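As a concrete illustration of one of these simulations, the Trunk problem can be sampled with a few lines of Python. This is a minimal sketch using only the standard library; the function name `sample_trunk` and its arguments are hypothetical and do not refer to the code used for the experiments.

```python
import random


def sample_trunk(n_samples, dim, seed=0):
    """Draw (X, y) from the Trunk simulation with equal class priors."""
    rng = random.Random(seed)
    # The i-th component of mu is (1 / i) ** 0.5 for i = 1, ..., dim.
    mu = [(1.0 / i) ** 0.5 for i in range(1, dim + 1)]
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randrange(2)            # equal class priors
        sign = 1.0 if label == 0 else -1.0  # class 0 mean is mu, class 1 mean is -mu
        X.append([rng.gauss(sign * m, 1.0) for m in mu])  # identity covariance
        y.append(label)
    return X, y


X, y = sample_trunk(n_samples=1000, dim=5)
print(len(X), len(X[0]), sorted(set(y)))
```

Because the components of $\mu$ shrink as $i$ grows, later dimensions carry progressively less class information, which is what makes the problem harder as the dimension increases.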

Table 2: Hyperparameters for RF and KDF.

Hyperparameters	Value
n_estimators	$500$
max_depth	$\infty$
min_samples_leaf	$1$
$\lambda$	$1\times 10^{-6}$
$b$	$\exp{(-10^{-7})}$

Table 3: Hyperparameters for ReLU-net and KDN on tabular data.

Hyperparameters	Value
number of hidden layers	$4$
nodes per hidden layer	$1000$
optimizer	Adam
learning rate	$3\times 10^{-4}$
$\lambda$	$1\times 10^{-6}$
$b$	$\exp{(-10^{-7})}$

Appendix D Pseudocodes

We provide the pseudocode for our proposed algorithms in Algorithms 1, 2, and 3.

\begin{algorithm}
\caption{Fit a KDX model.}
\begin{algorithmic}[1]
\Require
\Statex (1) $\theta$ \Comment{Parent learner (random forest or deep network model)}
\Statex (2) $\mathcal{D}_{n}=(\mathbf{X},\mathbf{y})\in\mathbb{R}^{n\times d}\times\{1,\ldots,K\}^{n}$ \Comment{Training data}
\Ensure $\mathcal{G}$ \Comment{a KDX model}
\Function{KDX.fit}{$\theta,\mathbf{X},\mathbf{y}$}
\For{$i=1,\ldots,n$} \Comment{Iterate over the dataset to calculate the weights}
\For{$j=1,\ldots,n$}
\State $w_{ij}\leftarrow$ \Call{computeWeights}{$\mathbf{x}_{i},\mathbf{x}_{j},\theta$}
\EndFor
\EndFor
\State $\{Q_{r},\mathbf{w}_{rs}\}_{r=1}^{\tilde{p}}\leftarrow$ \Call{getPolytopes}{$\mathbf{w}$} \Comment{Identify the polytopes by clustering the samples with similar weight}
\For{$r=1,\ldots,\tilde{p}$} \Comment{Iterate over each polytope}
\State $\mathcal{G}.\hat{\mu}_{r},\mathcal{G}.\hat{\Sigma}_{r},\mathcal{G}.\hat{n}_{ry}\leftarrow$ \Call{estimateParameters}{$\mathbf{X},\mathbf{y},\{\mathbf{w}_{rs}\}_{s=1}^{\tilde{p}}$} \Comment{Fit Gaussians using MLE}
\EndFor
\State \Return $\mathcal{G}$
\EndFunction
\end{algorithmic}
\end{algorithm}
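The parameter estimation step of Algorithm 1 can be sketched in Python as a weighted maximum likelihood fit of one Gaussian per polytope. The sketch below makes several assumptions that are not stated in the pseudocode: the covariance is restricted to be diagonal, the weights of one polytope are passed as a plain list, and a small constant `reg` is added to each variance for numerical stability. The function name `estimate_parameters` is hypothetical.

```python
def estimate_parameters(X, weights, reg=1e-6):
    """Weighted MLE of a Gaussian with diagonal covariance.

    X       : list of n samples, each a list of d floats
    weights : list of n non-negative weights for one polytope
    reg     : small constant added to each variance (assumed, for stability)
    """
    total = sum(weights)
    dim = len(X[0])
    # Weighted mean per dimension.
    mean = [sum(w * x[k] for w, x in zip(weights, X)) / total
            for k in range(dim)]
    # Weighted variance per dimension, plus the stabilizing constant.
    var = [sum(w * (x[k] - mean[k]) ** 2 for w, x in zip(weights, X)) / total + reg
           for k in range(dim)]
    return mean, var


# Hypothetical polytope: the first three samples belong to it (weight 1),
# the last sample shares nothing with it (weight 0).
X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 5.0]]
weights = [1.0, 1.0, 1.0, 0.0]
mean, var = estimate_parameters(X, weights)
print(mean, var)
```

Samples with zero weight do not influence the fit, so the estimated Gaussian is centered on the three samples that populate the polytope.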

\begin{algorithm}
\caption{Computing weights in KDF}
\begin{algorithmic}[1]
\Require
\Statex (1) $\mathbf{x}_{i},\mathbf{x}_{j}\in\mathbb{R}^{1\times d}$ \Comment{two input samples to be weighted}
\Statex (2) $\theta$ \Comment{parent random forest with $T$ trees}
\Ensure $w_{ij}\in[0,1]$ \Comment{similarity between the $i$-th and $j$-th samples}
\Function{computeWeights}{$\mathbf{x}_{i},\mathbf{x}_{j},\theta$}
\State $\mathcal{I}_{i}\leftarrow$ \Call{pushDownTrees}{$\mathbf{x}_{i},\theta$} \Comment{push $\mathbf{x}_{i}$ down the $T$ trees and get the leaf numbers it ends up in}
\State $\mathcal{I}_{j}\leftarrow$ \Call{pushDownTrees}{$\mathbf{x}_{j},\theta$} \Comment{push $\mathbf{x}_{j}$ down the $T$ trees and get the leaf numbers it ends up in}
\State $l\leftarrow$ \Call{countMatches}{$\mathcal{I}_{i},\mathcal{I}_{j}$} \Comment{count the number of times the samples end up in the same leaf}
\State $w_{ij}\leftarrow\frac{l}{T}$
\State \Return $w_{ij}$
\EndFunction
\end{algorithmic}
\end{algorithm}
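A minimal Python sketch of the weight computation in Algorithm 2 is given below. It assumes the leaf indices have already been obtained for both samples, one index per tree; with scikit-learn, such indices can be read from `RandomForestClassifier.apply`, which returns an array of shape (n_samples, n_estimators). The function name `compute_weights_kdf` and the example leaf indices are hypothetical.

```python
def compute_weights_kdf(leaves_i, leaves_j):
    """Fraction of trees in which two samples fall into the same leaf."""
    if len(leaves_i) != len(leaves_j):
        raise ValueError("both samples need one leaf index per tree")
    n_trees = len(leaves_i)
    matches = sum(a == b for a, b in zip(leaves_i, leaves_j))
    return matches / n_trees


# Hypothetical leaf indices for two samples in a forest with T = 5 trees.
leaves_i = [3, 7, 1, 4, 9]
leaves_j = [3, 2, 1, 4, 0]
print(compute_weights_kdf(leaves_i, leaves_j))  # 3 of 5 trees match -> 0.6
```

A weight of $1$ means that the two samples share a leaf in every tree, while a weight of $0$ means that they never share a leaf.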

\begin{algorithm}
\caption{Computing weights in KDN}
\begin{algorithmic}[1]
\Require
\Statex (1) $\mathbf{x}_{i},\mathbf{x}_{j}\in\mathbb{R}^{1\times d}$ \Comment{two input samples to be weighted}
\Statex (2) $\theta$ \Comment{parent deep-net model}
\Ensure $w_{ij}\in[0,1]$ \Comment{similarity between the $i$-th and $j$-th samples}
\Function{computeWeights}{$\mathbf{x}_{i},\mathbf{x}_{j},\theta$}
\State $\mathcal{A}_{i}\leftarrow$ \Call{pushDownNetwork}{$\mathbf{x}_{i},\theta$} \Comment{get activation modes $\mathcal{A}_{i}$}
\State $\mathcal{A}_{j}\leftarrow$ \Call{pushDownNetwork}{$\mathbf{x}_{j},\theta$} \Comment{get activation modes $\mathcal{A}_{j}$}
\State $l\leftarrow$ \Call{countMatches}{$\mathcal{A}_{i},\mathcal{A}_{j}$} \Comment{count the number of times the two samples activate the activation paths in a similar way}
\State $w_{ij}\leftarrow\frac{l}{N}$ \Comment{$N$ is the total number of activation paths}
\State \Return $w_{ij}$
\EndFunction
\end{algorithmic}
\end{algorithm}
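The idea behind Algorithm 3 can be illustrated with a simplified Python sketch. As a stand-in for activation paths, it records the on/off state of every hidden ReLU unit and takes $N$ to be the number of hidden units; this per-unit counting is a simplifying assumption made for illustration and may differ from how activation paths are counted in the actual method. The network weights and function names are hypothetical.

```python
def activation_pattern(x, layers):
    """Forward pass through a ReLU network; returns the on/off state of every hidden unit.

    layers : list of (W, b) pairs; W is a list of rows (one per unit), b a list of biases.
    """
    pattern, h = [], list(x)
    for W, b in layers:
        pre = [sum(w_k * h_k for w_k, h_k in zip(row, h)) + b_u
               for row, b_u in zip(W, b)]
        pattern.extend(p > 0.0 for p in pre)  # unit is "on" if its pre-activation is positive
        h = [max(p, 0.0) for p in pre]        # ReLU
    return pattern


def compute_weights_kdn(x_i, x_j, layers):
    """Fraction of hidden units whose ReLU state agrees for x_i and x_j."""
    a_i = activation_pattern(x_i, layers)
    a_j = activation_pattern(x_j, layers)
    matches = sum(u == v for u, v in zip(a_i, a_j))
    return matches / len(a_i)


# Hypothetical network: 2 inputs -> 3 hidden units -> 2 hidden units.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.5]),
    ([[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]], [0.0, -0.25]),
]
print(compute_weights_kdn([1.0, 1.0], [1.0, 1.0], layers))    # identical inputs -> 1.0
print(compute_weights_kdn([1.0, 1.0], [-1.0, -1.0], layers))  # 2 of 5 units agree -> 0.4
```

Two inputs that lie in the same polytope produce identical activation patterns and therefore receive a weight of $1$.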

Appendix E Extended Results on OpenML-CC18 data suite

See Figures 5, 6, 7, and 8 for extended results on the OpenML-CC18 data suite.

Appendix F Extended Results on Vision datasets using Resnet-50

Table 4: ID approaches (Sigmoid, Isotonic) are bad at OOD calibration and OOD approaches (ACET, ODIN, OE) are bad at ID calibration. KDN bridges between both ID and OOD calibration approaches. ‘$\uparrow$’ and ‘$\downarrow$’ indicate whether higher and lower values are better, respectively. Bolded indicates most performant, or within the margin of error of the most performant.

	Dataset	Statistics	Parent	KDN	Isotonic	Sigmoid	ACET	ODIN	OE
ID	CIFAR-10	Accuracy $(\%)\uparrow$	$77.78\pm 0.00$	$76.84\pm 0.01$	$\mathbf{78.25}\pm 0.00$	$76.93\pm 0.00$	$75.08\pm 0.03$	$78.00\pm 0.00$	$73.95\pm 0.00$
		MCE $\downarrow$	$0.09\pm 0.00$	$\mathbf{0.04}\pm 0.00$	$\mathbf{0.03}\pm 0.01$	$0.10\pm 0.01$	$0.13\pm 0.00$	$0.09\pm 0.00$	$0.55\pm 0.00$
		MMC$^{*}$ $\downarrow$	$0.47\pm 0.00$	$0.37\pm 0.01$	$0.54\pm 0.01$	$0.43\pm 0.01$	$0.69\pm 0.00$	$0.48\pm 0.01$	$\mathbf{0.13}\pm 0.00$
OOD	CIFAR-100	OCE $\downarrow$	$0.30\pm 0.00$	$0.20\pm 0.01$	$0.37\pm 0.01$	$0.29\pm 0.01$	$0.55\pm 0.00$	$0.31\pm 0.00$	$\mathbf{0.01}\pm 0.00$
	SVHN	OCE $\downarrow$	$0.87\pm 0.00$	$\mathbf{0.01}\pm 0.00$	$0.85\pm 0.00$	$0.69\pm 0.01$	$0.90\pm 0.00$	$0.87\pm 0.00$	$0.04\pm 0.01$
	Noise	OCE $\downarrow$	$0.90\pm 0.00$	$\mathbf{0.00}\pm 0.00$	$0.87\pm 0.00$	$0.71\pm 0.00$	$0.01\pm 0.01$	$0.06\pm 0.00$	$\mathbf{0.00}\pm 0.00$

In these experiments, we use a Resnet-50 encoder pretrained using a contrastive loss [30], as described in http://keras.io/examples/vision/supervised-contrastive-learning. The encoder projects the input images down to a $256$-dimensional latent space, and we add two dense layers with $200$ and $10$ nodes on top of the encoder. We use the same pretrained encoder for all the baseline algorithms.

As shown in Table 4, KDN achieves good calibration for both ID and OOD datasets whereas the ID calibration approaches are poorly calibrated in the OOD regions and the OOD approaches have poor ID calibration.