Deep Discriminative to Kernel Density Graph for In- and Out-of-distribution Calibrated Inference

Jayanta Dey1,∗, Haoyin Xu1,†, Will LeVine1,†, Ashwin De Silva1,†, Tyler M. Tomita1, Ali Geisa1, Tiffany Chu1, Jacob Desman1, and Joshua T. Vogelstein1
Abstract

Deep discriminative approaches like random forests and deep neural networks have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring confidence calibration for both in-distribution and out-of-distribution data points. Many popular methods for in-distribution (ID) calibration, such as isotonic and Platt’s sigmoidal regression, exhibit excellent ID calibration performance. However, these methods are not calibrated for the entire feature space, leading to overconfidence in the case of out-of-distribution (OOD) samples. On the other end of the spectrum, existing OOD calibration methods generally exhibit poor ID calibration. In this paper, we address ID and OOD calibration problems jointly. We leverage the fact that deep models, including both random forests and deep-nets, learn internal representations which are unions of polytopes with affine activation functions to conceptualize them both as partitioning rules of the feature space. We replace the affine function in each polytope populated by the training data with a Gaussian kernel. Our experiments on both tabular and vision benchmarks show that the proposed approaches obtain well-calibrated posteriors while mostly preserving or improving the classification accuracy of the original algorithm in the ID region, and extrapolate beyond the training data to handle OOD inputs appropriately.

1 Johns Hopkins University (JHU). ∗ Corresponding author: jdey4@jhu.edu. † denotes equal contribution.

1 Introduction

Machine learning methods, especially deep neural networks and random forests, have shown excellent performance in many real-world tasks, including drug discovery, autonomous driving and clinical surgery [1, 2, 3]. However, calibrating confidence over the whole feature space for these approaches remains a key challenge in the field [4]. Calibrated confidence within the training or in-distribution (ID) region as well as in the out-of-distribution (OOD) region is crucial for safety-critical applications like autonomous driving and computer-assisted surgery, where any aberrant reading should be detected and taken care of immediately [4, 5].

The approaches to calibrate OOD confidence for learning algorithms described in the literature can be roughly divided into two groups: discriminative and generative. Intuitively, the easiest solution for OOD confidence calibration is to learn a function that gives higher scores for in-distribution samples and lower scores for OOD samples [6]. The discriminative approaches try to either modify the loss function [7, 8, 9] or train the network exhaustively on OOD datasets to calibrate on OOD samples [10, 4]. Recently, Hein et al. [4] showed that ReLU networks produce arbitrarily high confidence as the inference point moves far away from the training data. Therefore, calibrating ReLU networks for the whole OOD region is not possible without fundamentally changing the network architecture. As a result, all of the aforementioned algorithms are unable to provide any guarantee about the performance of the network throughout the whole feature space. The other group tries to learn generative models for the in-distribution as well as the out-of-distribution samples. The general idea is to perform a likelihood ratio test for a particular sample using the generative models [11], or to threshold the ID likelihoods to detect OOD samples. However, it is not obvious how to control likelihoods far away from the training data for powerful generative models like variational autoencoders (VAEs) [12] and generative adversarial networks (GANs) [13]. Moreover, Nalisnick et al. [14] and Hendrycks et al. [10] showed that VAEs and GANs can also yield overconfident likelihoods far away from the training data.

The algorithms described so far are concerned with OOD confidence calibration for deep-nets only. However, we show that other approaches which partition the feature space, for example random forests, can also suffer from poor confidence calibration both in the ID and the OOD regions. Moreover, the algorithms described above are concerned with the confidence in the OOD region only and do not address the confidence calibration within the ID region at all. This issue is addressed separately in a different group of literature [15, 16, 17, 18, 19, 20]. Instead, we consider both calibration problems jointly and propose an approach that achieves good calibration throughout the whole feature space.

In this paper, we conceptualize both random forests and ReLU networks as partitioning rules with an affine activation over each polytope. We consider replacing the affine functions learned over the polytopes with Gaussian kernels. We propose two novel kernel density estimation techniques named Kernel Density Forest (KDF) and Kernel Density Network (KDN). Our proposed approach completely excludes the need for training on OOD examples for the model (unsupervised OOD calibration). We conduct several simulation and real data studies that show both KDF and KDN are well-calibrated for OOD samples while they maintain good performance in the ID region.

2 Related Works and Our Contributions

There are a number of approaches in the literature which attempt to learn a generative model and control the likelihoods far away from the training data. For example, Ren et al. [11] employed a likelihood ratio test for detecting OOD samples. Wan et al. [8] modified the training loss so that the downstream projected features follow a Gaussian distribution. However, there is no guarantee of performance for OOD detection for the above methods. To the best of our knowledge, apart from us, only Meinke et al. [5] have proposed an approach to guarantee asymptotic performance for OOD detection. Compared to the aforementioned methods, our approach differs in several ways:

  • We address the confidence calibration problem for both ReLU-nets and random forests.

  • We address the ID and OOD calibration problems as a continuum.

  • We provide an algorithm for OOD confidence calibration for both tabular and vision datasets, whereas most of the existing methods are tailor-made for vision problems.

  • We propose an unsupervised post-hoc OOD calibration approach.

3 Technical Background

3.1 Setting

Consider a supervised learning problem with independent and identically distributed training samples $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ such that $(\mathbf{X},Y)\sim P_{X,Y}$, where $\mathbf{X}\sim P_{X}$ is a $\mathcal{X}\subseteq\mathbb{R}^{D}$ valued input and $Y\sim P_{Y}$ is a $\mathcal{Y}=\{1,\cdots,K\}$ valued class label. Let $\mathcal{S}$ be the high density region of the marginal, $P_{X}$, thus $\mathcal{S}\subsetneq\mathcal{X}$. Here the goal is to learn a confidence score, $\mathbf{g}:\mathbb{R}^{D}\rightarrow[0,1]^{K}$, $\mathbf{g}(\mathbf{x})=[g_{1}(\mathbf{x}),g_{2}(\mathbf{x}),\dots,g_{K}(\mathbf{x})]$ such that,

$$g_{y}(\mathbf{x})=\begin{cases}P_{Y|X}(y|\mathbf{x}), & \text{if } \mathbf{x}\in\mathcal{S}\\ P_{Y}(y), & \text{if } \mathbf{x}\notin\mathcal{S}\end{cases},\quad\forall y\in\mathcal{Y} \tag{1}$$

where $P_{Y|X}(y|\mathbf{x})$ is the posterior probability for class $y$ given by the Bayes formula:

$$P_{Y|X}(y|\mathbf{x})=\frac{P_{X|Y}(\mathbf{x}|y)P_{Y}(y)}{\sum_{k=1}^{K}P_{X|Y}(\mathbf{x}|k)P_{Y}(k)},\quad\forall y\in\mathcal{Y}. \tag{2}$$

Here $P_{X|Y}(\mathbf{x}|y)$ is the class conditional density which we will refer to as $f_{y}(\mathbf{x})$ hereafter for brevity.
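To make the target in Equations 1 and 2 concrete, the following is a minimal sketch for a one-dimensional, two-class toy problem. The Gaussian class-conditional densities, the priors, and the density threshold used to decide whether a point lies in $\mathcal{S}$ are all illustrative assumptions, not quantities taken from the paper; classes are indexed from 0 in the code.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical toy problem: every number below is an illustrative assumption.
PRIORS = [0.7, 0.3]                        # P_Y(y)
CLASS_PARAMS = [(-1.0, 1.0), (1.0, 1.0)]   # (mean, std) of each f_y
DENSITY_THRESHOLD = 1e-3                   # assumed cut-off defining the region S


def target_confidence(x):
    """Return [g_1(x), ..., g_K(x)] as in Equation 1."""
    joint = [p * gaussian_pdf(x, m, s) for p, (m, s) in zip(PRIORS, CLASS_PARAMS)]
    marginal = sum(joint)                  # P_X(x), the denominator of Equation 2
    if marginal < DENSITY_THRESHOLD:
        return list(PRIORS)                # x outside S: report the class priors
    return [j / marginal for j in joint]   # x inside S: Bayes posterior (Equation 2)
```

Under these assumptions, a point near the class means receives the Bayes posterior, while a point far from all training mass receives the prior `[0.7, 0.3]`.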

3.2 Main Idea

Deep discriminative networks partition the feature space $\mathbb{R}^{d}$ into a union of $p$ affine polytopes $Q_{r}$ such that $\bigcup_{r=1}^{p}Q_{r}=\mathbb{R}^{d}$, and learn an affine function over each polytope [4, 21]. Mathematically, the unnormalized class-conditional density for the label $y$ estimated by these deep discriminative models at a particular point $\mathbf{x}$ can be expressed as:

$$\hat{f}_{y}(\mathbf{x})=\sum_{r=1}^{p}(\mathbf{a}_{r}^{\top}\mathbf{x}+b_{r})\mathbbm{1}(\mathbf{x}\in Q_{r}). \tag{3}$$

For example, in the case of a decision tree, $\mathbf{a}_{r}=\mathbf{0}$, i.e., a decision tree assumes a uniform distribution for the class-conditional densities over the leaf nodes. Among these polytopes, the ones that lie on the boundary of the training data extend to the whole feature space and hence encompass all the OOD samples. Since the posterior probability for a class is determined by the affine activation over each of these polytopes, the algorithms tend to be overconfident when making predictions on the OOD inputs. Moreover, there exist some polytopes that are not populated with training data. These unpopulated polytopes serve to interpolate between the training sample points. If we replace the affine activation function of the populated polytopes with Gaussian kernels and prune the unpopulated ones, the tail of the kernel will help interpolate between the training sample points while assigning lower likelihood to the low density or unpopulated polytope regions of the feature space. This results in better confidence calibration for the proposed modified approach.
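The contrast between an affine activation and a Gaussian kernel on an unbounded boundary polytope can be illustrated with a small one-dimensional sketch. The affine coefficients are hypothetical, and passing the affine scores through a softmax is an assumption made here for illustration (it mirrors the ReLU-network setting discussed by Hein et al. [4]).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical affine scores a_y * x + b_y for two classes on the outermost
# polytope, which extends to infinity along the positive x-axis.
AFFINE = [(1.0, 0.0), (1.5, -0.5)]  # (a_y, b_y), illustrative values only


def affine_confidence(x):
    """Class confidences obtained from the affine scores at x."""
    return softmax([a * x + b for a, b in AFFINE])


def gaussian_kernel(x, mean=0.0, std=1.0):
    """Gaussian kernel centred on the (assumed) training data in the polytope."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
```

With these values the largest affine confidence is modest near the data (`x = 2.0`) but approaches 1 far away (`x = 100.0`), whereas the Gaussian kernel evaluated at the same distant point is essentially zero.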

3.3 Proposed Approach

We will refer to the above discriminative approaches as the ‘parent approach’ hereafter. Consider the collection of polytope indices $\mathcal{P}$ from the parent approach which are populated by the training data. We replace the affine functions over the populated polytopes with Gaussian kernels $\mathcal{G}(\cdot;\hat{\mu}_{r},\hat{\Sigma}_{r})$. For a particular inference point $\mathbf{x}$, we consider the Gaussian kernel with the minimum distance from the center of the kernel to the corresponding point:

$$r^{*}_{\mathbf{x}}=\operatorname*{argmin}_{r}\|\mu_{r}-\mathbf{x}\|, \tag{4}$$

where $\|\cdot\|$ denotes a distance. As we will show later, the type of distance metric considered in Equation 4 strongly affects the performance of the proposed model. In short, we modify Equation 3 from the parent ReLU-net or random forest to estimate the class-conditional density (unnormalized):

$$\tilde{f}_{y}(\mathbf{x})=\frac{1}{n_{y}}\sum_{r\in\mathcal{P}}n_{ry}\mathcal{G}(\mathbf{x};\mu_{r},\Sigma_{r})\mathbbm{1}(r=r^{*}_{\mathbf{x}}), \tag{5}$$

where $n_{y}$ is the total number of samples with label $y$ and $n_{ry}$ is the number of samples from class $y$ that end up in polytope $Q_{r}$. We add a small constant to the class conditional density $\tilde{f}_{y}$:

$$\hat{f}_{y}(\mathbf{x})=\tilde{f}_{y}(\mathbf{x})+\frac{b}{\log(n)}. \tag{6}$$

Note that in Equation 6, $\frac{b}{\log(n)}\rightarrow 0$ as the total training points, $n\rightarrow\infty$. The intuition behind the added constant will be clarified further later in Proposition 4.3. The confidence score $\hat{g}_{y}(\mathbf{x})$ for class $y$ given a test point $\mathbf{x}$ is estimated using the Bayes rule as:

$$\hat{g}_{y}(\mathbf{x})=\frac{\hat{f}_{y}(\mathbf{x})\hat{P}_{Y}(y)}{\sum_{k=1}^{K}\hat{f}_{k}(\mathbf{x})\hat{P}_{Y}(k)}, \tag{7}$$

where $\hat{P}_{Y}(y)$ is the empirical prior probability of class $y$ estimated from the training data. We estimate the class for a particular inference point $\mathbf{x}$ as:

$$\hat{y}=\operatorname*{argmax}_{y\in\mathcal{Y}}\hat{g}_{y}(\mathbf{x}). \tag{8}$$
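The inference steps in Equations 4 to 8 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes Euclidean distance in Equation 4, diagonal covariances, a natural logarithm, a hypothetical default for the constant $b$, and a hypothetical dictionary layout for the fitted kernels. Classes are indexed from 0.

```python
import math


def diag_gaussian(x, mean, var):
    """Density at x of a Gaussian with diagonal covariance (per-dimension var)."""
    log_density = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_density += -0.5 * math.log(2.0 * math.pi * vd) - 0.5 * (xd - md) ** 2 / vd
    return math.exp(log_density)


def confidence(x, kernels, class_counts, b=1e-4):
    """Return (g_hat, y_hat) for a test point x.

    kernels: list of dicts with keys "mean", "var" and "counts",
             where counts[y] plays the role of n_ry (hypothetical layout).
    class_counts: list whose y-th entry is n_y.
    b: constant of Equation 6 (the default is an illustrative assumption).
    """
    n = sum(class_counts)
    # Equation 4: pick the kernel whose centre is closest to x (Euclidean, assumed).
    nearest = min(kernels, key=lambda k: math.dist(k["mean"], x))
    density = diag_gaussian(x, nearest["mean"], nearest["var"])
    scores = []
    for y, n_y in enumerate(class_counts):
        f_tilde = nearest["counts"][y] / n_y * density  # Equation 5
        f_hat = f_tilde + b / math.log(n)               # Equation 6
        scores.append(f_hat * n_y / n)                  # numerator of Equation 7
    total = sum(scores)
    g_hat = [s / total for s in scores]                 # Equation 7
    y_hat = max(range(len(g_hat)), key=g_hat.__getitem__)  # Equation 8
    return g_hat, y_hat


# Hypothetical fitted kernels for a two-class problem with n_0 = 60 and n_1 = 40.
KERNELS = [
    {"mean": [0.0, 0.0], "var": [1.0, 1.0], "counts": [50, 10]},
    {"mean": [4.0, 4.0], "var": [1.0, 1.0], "counts": [10, 30]},
]
CLASS_COUNTS = [60, 40]
```

Far from every kernel centre the Gaussian term vanishes, so every class keeps only the constant $b/\log(n)$ and Equation 7 returns the empirical priors, which matches the OOD target in Equation 1.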

4 Model Parameter Estimation

4.1 Gaussian Kernel Parameter Estimation

We fit Gaussian kernel parameters to the samples that end up in the $r$-th polytope. We set the kernel center along the $d$-th dimension:

$$\hat{\mu}^{d}_{r}=\frac{1}{n_{r}}\sum_{i=1}^{n}x^{d}_{i}\mathbbm{1}(\mathbf{x}_{i}\in Q_{r}), \tag{9}$$

where $x^{d}_{i}$ is the value of $\mathbf{x}_{i}$ along the $d$-th dimension and $n_{r}$ is the number of training samples in $Q_{r}$. We set the kernel variance along the $d$-th dimension:

$$(\hat{\sigma}^{d}_{r})^{2}=\frac{1}{n_{r}}\left\{\sum_{i=1}^{n}\mathbbm{1}(\mathbf{x}_{i}\in Q_{r})(x^{d}_{i}-\hat{\mu}^{d}_{r})^{2}+\lambda\right\}, \tag{10}$$

where $\lambda$ is a small constant that prevents $\hat{\sigma}^{d}_{r}$ from being $0$. We constrain our estimated Gaussian kernels to have diagonal covariance.
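A minimal sketch of Equations 9 and 10 for a single polytope is given below. The function name and the default value of $\lambda$ are hypothetical; the input is assumed to be the non-empty list of training points that fall in $Q_{r}$.

```python
def fit_kernel(points, lam=1e-6):
    """Estimate the diagonal Gaussian kernel of one polytope.

    points: non-empty list of equal-length feature vectors in Q_r.
    lam: the constant lambda of Equation 10 (illustrative default).
    Returns (mean, var), each a list with one entry per dimension.
    """
    n_r = len(points)
    dims = len(points[0])
    # Equation 9: per-dimension sample mean.
    mean = [sum(p[d] for p in points) / n_r for d in range(dims)]
    # Equation 10: per-dimension variance, with lambda added inside the braces.
    var = [
        (sum((p[d] - mean[d]) ** 2 for p in points) + lam) / n_r
        for d in range(dims)
    ]
    return mean, var
```

For the two points `[0.0, 1.0]` and `[2.0, 1.0]`, the second dimension has no spread, and the $\lambda$ term keeps its variance strictly positive.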

4.2 Sample Size Ratio Estimation

For a high dimensional dataset with low training sample size, the polytopes are sparsely populated with training samples. To improve the estimate of the ratio $\frac{n_{ry}}{n_{y}}$ in Equation 5, we incorporate the samples from other polytopes $Q_{s}$ based on the similarity $w_{rs}$ between $Q_{r}$ and $Q_{s}$ as:

$$\frac{\hat{n}_{ry}}{\hat{n}_{y}}=\frac{\sum_{s\in\mathcal{P}}\sum_{i=1}^{n}w_{rs}\mathbbm{1}(\mathbf{x}_{i}\in Q_{s})\mathbbm{1}(y_{i}=y)}{\sum_{r\in\mathcal{P}}\sum_{s\in\mathcal{P}}\sum_{i=1}^{n}w_{rs}\mathbbm{1}(\mathbf{x}_{i}\in Q_{s})\mathbbm{1}(y_{i}=y)}. \tag{11}$$

As $n\to\infty$, the estimated weights $w_{rs}$ should satisfy the condition:

$$w_{rs}\to\begin{cases}0, & \text{if } Q_{r}\neq Q_{s}\\ 1, & \text{if } Q_{r}=Q_{s}.\end{cases} \tag{12}$$

For simplicity, we will describe the estimation procedure for $w_{rs}$ in the next sections. Note that if we satisfy Condition 12, then we have $\frac{\hat{n}_{ry}}{\hat{n}_{y}}\to\frac{n_{ry}}{n_{y}}$ as $n\to\infty$. Therefore, we modify Equation 5 as:

$$\hat{f}_{y}(\mathbf{x})=\frac{1}{\hat{n}_{y}}\sum_{r\in\mathcal{P}}\hat{n}_{ry}\mathcal{G}(\mathbf{x};\hat{\mu}_{r},\hat{\Sigma}_{r})\mathbbm{1}(r=\hat{r}^{*}_{\mathbf{x}}), \tag{13}$$

where $\hat{r}^{*}_{\mathbf{x}}=\operatorname*{argmin}_{r}\|\hat{\mu}_{r}-\mathbf{x}\|$. We then use the estimate from Equation 13 in place of $\tilde{f}_{y}(\mathbf{x})$ in Equation 6, and proceed with Equations 7 and 8. Below, we describe how we estimate $w_{rs}$ for KDF and KDN.
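The weighted ratio in Equation 11 can be sketched with plain lists. The inner sum over samples, $\sum_{i}\mathbbm{1}(\mathbf{x}_{i}\in Q_{s})\mathbbm{1}(y_{i}=y)$, is simply the count $n_{sy}$, so the sketch takes a per-polytope count table as input. The function name and data layout are hypothetical, and every class is assumed to have at least one training sample.

```python
def weighted_ratios(counts, weights):
    """Compute n_hat_ry / n_hat_y for every polytope r and class y.

    counts[s][y]: number of training samples of class y in polytope Q_s (n_sy).
    weights[r][s]: similarity w_rs between polytopes Q_r and Q_s.
    Returns ratios[r][y].
    """
    num_polytopes = len(counts)
    num_classes = len(counts[0])
    # Numerator of Equation 11: n_hat_ry = sum_s w_rs * n_sy.
    n_hat = [
        [
            sum(weights[r][s] * counts[s][y] for s in range(num_polytopes))
            for y in range(num_classes)
        ]
        for r in range(num_polytopes)
    ]
    # Denominator of Equation 11: n_hat_y = sum_r n_hat_ry.
    n_hat_y = [
        sum(n_hat[r][y] for r in range(num_polytopes)) for y in range(num_classes)
    ]
    return [
        [n_hat[r][y] / n_hat_y[y] for y in range(num_classes)]
        for r in range(num_polytopes)
    ]
```

With identity weights (the limit in Condition 12) the function returns the raw ratios $n_{ry}/n_{y}$; with non-zero off-diagonal weights, counts are shared between similar polytopes and the ratios are smoothed.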

4.3 Forest Kernel

Consider $T$ decision trees in a random forest trained on $n$ i.i.d. training samples $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$. Each tree $t$ partitions the feature space into $p_{t}$ polytopes resulting in a set of polytopes: $\{\{Q_{t,r}\}_{r=1}^{p_{t}}\}_{t=1}^{T}$. The intersection of these polytopes gives a new set of polytopes $\{Q_{r}\}_{r=1}^{p}$ for the forest. For any two points $\mathbf{x}\in Q_{r}$ and $\mathbf{x}^{\prime}\in Q_{s}$, we define the kernel $\mathcal{K}(r,s)$ as:

$$\mathcal{K}(r,s)=\frac{t_{rs}}{T}, \tag{14}$$

where $t_{rs}$ is the total number of trees in which $\mathbf{x}$ and $\mathbf{x}^{\prime}$ end up in the same leaf node. Here, $0\leq\mathcal{K}(r,s)\leq 1$.

If the two samples end up in the same leaf in all the trees, i.e., $\mathcal{K}(r,s)=1$, they belong to the same polytope, i.e., $r=s$. In short, $\mathcal{K}(r,s)$ is the fraction of total trees where the two samples follow the same path from the root to a leaf node. We exponentiate $\mathcal{K}(r,s)$ so that Condition 12 is satisfied:

$$w_{rs}=\mathcal{K}(r,s)^{k\log n}. \tag{15}$$

We choose $k$ using grid search on a hold-out dataset.
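Equations 14 and 15 can be sketched from per-tree leaf indices. The function name and input layout are hypothetical, the logarithm is assumed to be natural, and $n>1$ is assumed so that the exponent is positive.

```python
import math


def forest_weight(leaves_r, leaves_s, n, k=1.0):
    """Similarity weight w_rs between two forest polytopes.

    leaves_r[t], leaves_s[t]: index of the leaf reached in tree t by a sample
        from Q_r and Q_s respectively (hypothetical layout).
    n: number of training samples (assumed > 1).
    k: exponent scale of Equation 15 (chosen by grid search in the text).
    """
    num_trees = len(leaves_r)
    # t_rs: number of trees in which both samples land in the same leaf.
    t_rs = sum(1 for a, b in zip(leaves_r, leaves_s) if a == b)
    kernel = t_rs / num_trees            # Equation 14
    return kernel ** (k * math.log(n))   # Equation 15
```

When the leaves agree in every tree the weight is exactly 1 for any $n$; when they disagree in at least one tree, the weight shrinks toward 0 as $n$ grows, which is the behavior required by Condition 12.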

4.4 Network Kernel

Consider a fully connected $L$-layer ReLU-net trained on $n$ iid training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$. We have the set of all nodes denoted by $\mathcal{N}_l$ at a particular layer $l$. We can randomly pick a node $n_l \in \mathcal{N}_l$ at each layer $l$, and construct a sequence of nodes starting at the input layer and ending at the output layer which we call an activation path: $m = \{n_l \in \mathcal{N}_l\}_{l=1}^{L}$. Note that there are $N = \prod_{l=1}^{L} |\mathcal{N}_l|$ possible activation paths for a sample in the ReLU-net. We index each path by a unique identifier number $z \in \mathbb{N}$ and construct a sequence of activation paths as: $\mathcal{M} = \{m_z\}_{z=1,\cdots,N}$. Therefore, $\mathcal{M}$ contains all possible activation pathways from the input to the output of the network.

While pushing a training sample $\mathbf{x}_i$ through the network, we define the activation from a ReLU unit at any node as ‘$1$’ when it has positive output and ‘$0$’ otherwise. Therefore, the activation indicates on which side of the affine function at each node the sample falls. The activation for all nodes in an activation path $m_z$ for a particular sample creates an activation mode $a_z \in \{0,1\}^{L}$. If we evaluate the activation mode for all activation paths in $\mathcal{M}$ while pushing a sample through the network, we get a sequence of activation modes: $\mathcal{A}_r = \{a_z^r\}_{z=1}^{N}$. Here $r$ is the index of the polytope in which the sample falls.
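As an illustrative sketch (not the authors' implementation), the snippet below pushes one sample through a tiny fully connected ReLU-net, records the 0/1 activation of every node, and reads off the activation mode along a single activation path. The weights, the layer sizes, and the helper names `relu_activations` and `activation_mode` are hypothetical choices made only for this example.

```python
def relu_activations(x, weights, biases):
    """Return one 0/1 list per layer: 1 where the ReLU output is positive."""
    patterns = []
    h = list(x)
    for W, b in zip(weights, biases):
        # Affine function at every node of the current layer.
        pre = [sum(w * v for w, v in zip(row, h)) + b_i for row, b_i in zip(W, b)]
        patterns.append([1 if v > 0 else 0 for v in pre])
        # ReLU output fed to the next layer.
        h = [max(v, 0.0) for v in pre]
    return patterns


def activation_mode(patterns, path):
    """Activation mode in {0,1}^L for a path given as one node index per layer."""
    return [patterns[layer][node] for layer, node in enumerate(path)]


# Hypothetical 2-layer net: 2 inputs -> 3 nodes -> 2 nodes.
weights = [
    [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]],
    [[1.0, 1.0, 1.0], [-1.0, 0.0, 1.0]],
]
biases = [[0.0, 0.0, 0.0], [0.0, 0.0]]

patterns = relu_activations([1.0, 2.0], weights, biases)
mode = activation_mode(patterns, path=(1, 0))
```

Two samples with identical `patterns` lie on the same side of every affine function, so under the definitions above they fall in the same polytope.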

If the two sequences of activation modes for two different training samples are identical, they belong to the same polytope. In other words, if $\mathcal{A}_r = \mathcal{A}_s$, then $Q_r = Q_s$. This statement holds because the above samples will lie on the same side of the affine function at each node in different layers of the network. Now, we define the kernel $\mathcal{K}(r,s)$ as:

$$\mathcal{K}(r,s) = \frac{\sum_{z=1}^{N} \mathbbm{1}(a_z^r = a_z^s)}{N}. \tag{16}$$

Note that $0 \leq \mathcal{K}(r,s) \leq 1$. In short, $\mathcal{K}(r,s)$ is the fraction of total activation paths which are identically activated for two samples in two different polytopes $r$ and $s$. We exponentiate the kernel using Equation 15. Pseudocodes outlining the two algorithms are provided in Appendix D.
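The kernel in Equation 16 can be sketched in a few lines. The first function below enumerates every activation path, which is only feasible for very small networks. The second uses an observation that follows from the definition but is not stated in the text: a path is identically activated exactly when every node on it agrees, so the fraction of agreeing paths factors into a product of per-layer agreement fractions. Both function names and the toy activation patterns are assumptions made for illustration.

```python
import itertools
import math


def network_kernel_bruteforce(act_r, act_s):
    """Equation 16 by explicit enumeration of all activation paths."""
    layer_sizes = [len(layer) for layer in act_r]
    n_paths = math.prod(layer_sizes)
    identical = 0
    for path in itertools.product(*(range(k) for k in layer_sizes)):
        mode_r = [act_r[layer][node] for layer, node in enumerate(path)]
        mode_s = [act_s[layer][node] for layer, node in enumerate(path)]
        identical += mode_r == mode_s
    return identical / n_paths


def network_kernel_factored(act_r, act_s):
    """Same value, computed as a product of per-layer agreement fractions."""
    k = 1.0
    for layer_r, layer_s in zip(act_r, act_s):
        agree = sum(a == b for a, b in zip(layer_r, layer_s))
        k *= agree / len(layer_r)
    return k


# Hypothetical 0/1 activations of two samples in a net with layers of 3 and 2 nodes.
act_r = [[1, 0, 1], [1, 1]]
act_s = [[1, 1, 1], [0, 1]]

k_brute = network_kernel_bruteforce(act_r, act_s)
k_fast = network_kernel_factored(act_r, act_s)
```

In this toy case two of the six paths agree, so both functions return 1/3. The factored form avoids the product-sized enumeration, which matters once layers have more than a handful of nodes.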

4.5 Geodesic Distance

Consider $\mathcal{P}_n = \{Q_1, Q_2, \cdots, Q_p\}$ as a partition of $\mathbb{R}^d$ given by a random forest or a ReLU-net after being trained on $n$ training samples. We measure distance between two points $\mathbf{x} \in Q_r, \mathbf{x}' \in Q_s$ using the kernel introduced in Equation 14 and Equation 16, and call it ‘Geodesic’ distance [22]:

$$d(r,s) = -\mathcal{K}(r,s) + \frac{1}{2}\left(\mathcal{K}(r,r) + \mathcal{K}(s,s)\right) = 1 - \mathcal{K}(r,s). \tag{17}$$

Proposition 4.1.

$(\mathcal{P}_n, d)$ is a metric space.

Proof 4.2.

See Appendix A.1 for the proof.

We use Geodesic distance to find the nearest polytope to the inference point. As Geodesic distance cannot distinguish between points within the same polytope, it has a resolution similar to the size of the polytope. For discriminating between two points within the same polytope, we fit a Gaussian kernel within the polytope (described above). As $h_n \to 0$, the resolution for Geodesic distance improves. In Section 5, we will empirically show that Geodesic distance scales better with higher dimensions than Euclidean distance.
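A minimal sketch of the nearest-polytope lookup follows. It assumes each polytope is summarized by a tuple of $N$ entries (activation modes for a ReLU-net, or one leaf index per tree for a forest, which is how the appendix proof describes $\mathcal{A}_r$), so that the kernel is the fraction of matching entries and the distance is $1 - \mathcal{K}$. The function names and the toy tuples are hypothetical.

```python
import itertools


def kernel(modes_r, modes_s):
    """Fraction of entries (paths or trees) on which two polytopes agree."""
    agree = sum(a == b for a, b in zip(modes_r, modes_s))
    return agree / len(modes_r)


def geodesic_distance(modes_r, modes_s):
    """Equation 17: d(r, s) = 1 - K(r, s)."""
    return 1.0 - kernel(modes_r, modes_s)


def nearest_polytope(query_modes, populated):
    """Index of the populated polytope closest to the query polytope."""
    return min(
        range(len(populated)),
        key=lambda r: geodesic_distance(query_modes, populated[r]),
    )


# Hypothetical polytopes, each described by N = 4 entries.
populated = [(0, 0, 1, 2), (0, 1, 1, 2), (3, 1, 0, 0)]
query = (0, 1, 1, 0)

nearest = nearest_polytope(query, populated)

# Numerical spot check of the triangle inequality on this toy set.
triangle_ok = all(
    geodesic_distance(a, c)
    <= geodesic_distance(a, b) + geodesic_distance(b, c) + 1e-12
    for a, b, c in itertools.product(populated + [query], repeat=3)
)
```

Here the query agrees with the second polytope on three of four entries, so that polytope is returned with distance 0.25. The spot check is only a sanity test on four tuples and does not replace the proof in Appendix A.1.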

Given $n$ training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, we define the distance of an inference point $\mathbf{x}$ from the training points as: $d_{\mathbf{x}} = \min_{i=1,\cdots,n} \|\mathbf{x} - \mathbf{x}_i\|$, where $\|\cdot\|$ denotes Euclidean distance.
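For concreteness, $d_{\mathbf{x}}$ is simply the Euclidean distance to the closest training point. The short sketch below computes it with the standard library; the function name and the toy points are illustrative assumptions.

```python
import math


def distance_to_training_set(x, train_points):
    """d_x: Euclidean distance from x to its nearest training point."""
    return min(math.dist(x, x_i) for x_i in train_points)


# Hypothetical two-dimensional training points and one far-away inference point.
train_points = [(0.0, 0.0), (1.0, 0.0)]
d_x = distance_to_training_set((4.0, 4.0), train_points)
```

The nearest training point here is $(1, 0)$, at distance 5.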

Proposition 4.3 (Asymptotic OOD Convergence).

Given non-zero and bounded bandwidth of the Gaussians, we have almost sure convergence for $\hat{g}_y$ as:

$$\lim_{d_{\mathbf{x}} \to \infty} \hat{g}_y(\mathbf{x}) = \hat{P}_Y(y).$$

Proof 4.4.

See Appendix A.2 for the proof.

5 Empirical Results

We conduct several experiments on simulated, OpenML-CC18 [23] (https://www.openml.org/s/99) and vision benchmark datasets to gain insights into the finite-sample performance of KDF and KDN. The details of the simulation datasets and hyperparameters used for all the experiments are provided in Appendix C. For the Trunk simulation dataset, we follow the simulation setup proposed by Trunk [24], which was designed to demonstrate the ‘curse of dimensionality’. In the Trunk simulation, a binary class dataset is used where each class is sampled from a Gaussian distribution with higher dimensions having increasingly less discriminative information. We use both Euclidean and Geodesic distance to detect the nearest polytope (see Equation (4)) on simulation datasets and use only Geodesic distance for benchmark datasets. For the simulation setups, we use classification error, Hellinger distance [25, 26] from the true class conditional posteriors and mean max confidence [4] as performance statistics. While measuring in-distribution calibration for the datasets in the OpenML-CC18 data suite, we used maximum calibration error (MCE) as defined by Guo et al. [18] with a fixed bin number of $R=15$ across all the datasets. Given $n$ OOD samples, we define OOD calibration error (OCE) to measure OOD performance for the benchmark datasets as:

$$\text{OCE} = \frac{1}{n} \sum_{i=1}^{n} \left| \max_{y \in \mathcal{Y}} (\hat{P}_{Y|X}(y|\mathbf{x}_i)) - \max_{y \in \mathcal{Y}} (\hat{P}_Y(y)) \right|. \tag{18}$$
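Both calibration statistics are straightforward to compute. The sketch below implements OCE as written in Equation 18 and a maximum calibration error over equal-width confidence bins, following the usual reading of Guo et al. [18] (largest gap between accuracy and mean confidence over non-empty bins). The function names, the half-open bin convention, and the toy numbers are assumptions for illustration.

```python
def oce(posteriors, priors):
    """OOD calibration error: mean |max_y P(y|x_i) - max_y P(y)| over OOD samples."""
    max_prior = max(priors)
    return sum(abs(max(p) - max_prior) for p in posteriors) / len(posteriors)


def mce(confidences, correct, n_bins=15):
    """Maximum calibration error over equal-width bins (lo, hi] of the confidence."""
    worst = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue  # empty bins do not contribute
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        worst = max(worst, abs(acc - conf))
    return worst


# Hypothetical posteriors on three OOD points for a balanced binary problem.
ood_posteriors = [[0.5, 0.5], [0.9, 0.1], [0.3, 0.7]]
oce_value = oce(ood_posteriors, priors=[0.5, 0.5])

# Hypothetical max-confidences and correctness flags on four ID points.
mce_value = mce([0.95, 0.95, 0.55, 0.55], [1, 1, 0, 0])
```

An OCE of zero means the maximum posterior on every OOD point equals the largest class prior, which is the limiting behavior described in Proposition 4.3.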

For the tabular and the vision datasets, we have used ID calibration approaches, such as Isotonic [15, 16] and Sigmoid [17] regression, as baselines. Additionally, for the vision benchmark dataset, we provide results with OOD calibration approaches such as ACET [4], ODIN [6] and OE (outlier exposure) [10]. For each approach, 70% of the training data was used to fit the model and the rest of the data was used to calibrate the model.

5.1 Empirical Study on Tabular Data

Figure 1: Simulation datasets, classification error, Hellinger distance from true posteriors, and mean max confidence (posterior) for (A) five two-dimensional and (B) one high-dimensional (Trunk) simulation experiments, visualized for the first two dimensions. The median performance is shown as a dark curve with the shaded region as error bars.

5.1.1 Simulation Study

Figure 1 leftmost column shows 10000 training samples with 5000 samples per class sampled within the region $[-1,1] \times [-1,1]$ from the six simulation setups described in Appendix C. Therefore, the empty annular region between $[-1,1] \times [-1,1]$ and $[-2,2] \times [-2,2]$ is the low density or OOD region in Figure 1. Figure 1 quantifies the performance of the algorithms which are visually represented in Appendix Figure 4. KDF and KDN maintain classification accuracy similar to that of their parent algorithms. We measure Hellinger distance from the true distribution for increasing training sample size within the $[-1,1] \times [-1,1]$ region as a statistic for in-distribution calibration. Columns 3 and 6 in Figure 1 show that KDF and KDN are better at estimating the posteriors in the ID region than their parent methods. In all of the simulations, using the Geodesic distance measure results in better performance than using Euclidean distance. For measuring OOD performance, we keep the training sample size fixed at 1000 and normalize the training data by the maximum of their $l_2$ norm so that the training data is confined within a unit circle. For inference, we sample 1000 inference points uniformly from each of a series of circles with increasing radius and plot mean max posterior for increasing distance from the origin. Therefore, distances up to 1 correspond to in-distribution samples and distances farther than 1 can be considered the OOD region. As shown in Columns 4 and 7 of Figure 1, mean max confidence for KDF and KDN converges to the maximum of the class priors, i.e., 0.5, as we go farther away from the training data origin.
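The OOD protocol above can be sketched as follows. The snippet rescales a training set by its maximum $l_2$ norm and then draws inference points on circles of increasing radius. It interprets "uniformly from a circle" as uniformly on the circle's boundary, obtained by normalizing Gaussian draws; that interpretation, the radii, the random training data, and all function names are assumptions for illustration.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))


def normalize_by_max_norm(points):
    """Scale all points so that the largest l2 norm equals 1."""
    max_norm = max(l2_norm(p) for p in points)
    return [[c / max_norm for c in p] for p in points]


def sample_on_sphere(n_points, dim, radius, rng):
    """Uniform samples on a sphere surface: normalize Gaussian draws, then scale."""
    samples = []
    for _ in range(n_points):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = radius / l2_norm(g)
        samples.append([scale * c for c in g])
    return samples


rng = random.Random(0)

# Stand-in training data in [-1, 1] x [-1, 1]; the real setups are in Appendix C.
train = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(1000)]
train = normalize_by_max_norm(train)

# 1000 inference points per radius; radii above 1 lie outside the training support.
ood_by_radius = {r: sample_on_sphere(1000, 2, r, rng) for r in (1, 2, 3, 4, 5)}
```

Because the sampler takes the dimension as an argument, the same sketch covers hyperspheres in higher-dimensional feature spaces.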

Row 6 of Figure 1 shows that KDF-Geodesic and KDN-Geodesic scale better with higher dimensions than their respective Euclidean counterparts.

5.1.2 OpenML-CC18 Data Study

Figure 2: Performance summary of KDF and KDN on the OpenML-CC18 data suite. The dark curve in the middle shows the median performance on 45 datasets with the shaded region as error bars.

We use the OpenML-CC18 data suite for the tabular benchmark dataset study. We exclude any dataset which contains categorical features or NaN values (we also excluded the dataset with dataset id 23517, as we could not achieve better than chance accuracy using RF and DN on that dataset) and conduct our experiments on 45 datasets with varying dimensions and sample sizes. For the OOD experiments, we follow a setup similar to that of the simulation data. We normalize the training data by their maximum $l_2$ norm and sample 1000 testing samples uniformly from hyperspheres where each hypersphere has increasing radius starting from 1 to 5. For each dataset, we measure improvement with respect to the parent algorithm:

$$\frac{\mathcal{E}_p - \mathcal{E}_M}{\mathcal{E}_p}, \tag{19}$$

where $\mathcal{E}_p$ is the classification error, MCE or OCE for the parent algorithm and $\mathcal{E}_M$ represents the performance of the approach in consideration. Note that positive improvement implies the corresponding approach performs better than the parent approach. We report the median of improvement on different datasets along with the error bar in Figure 2. The extended results for each dataset are shown separately in the appendix. Figure 2 left column shows that, on average, KDF and KDN have similar or better classification accuracy compared to their respective parent algorithms, whereas Isotonic and Sigmoid regression have lower classification accuracy in most cases. However, according to Figure 2 middle column, KDF and KDN have similar in-distribution calibration performance to the other baseline approaches. Most interestingly, Figure 2 right column shows that KDN and KDF improve OOD calibration of their respective parent algorithms by a large margin, while the baseline approaches fail to address the OOD calibration problem.
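As a worked example of Equation 19, the sketch below computes the relative improvement per dataset and then the median across datasets. The error values are made up purely to show the sign convention; they are not results from the paper, and the function name is hypothetical.

```python
from statistics import median


def relative_improvement(parent_error, method_error):
    """Equation 19: positive when the method has lower error than its parent."""
    return (parent_error - method_error) / parent_error


# Made-up errors on three datasets (parent algorithm vs. the method under study).
parent_errors = [0.20, 0.10, 0.40]
method_errors = [0.10, 0.10, 0.50]

improvements = [
    relative_improvement(p, m) for p, m in zip(parent_errors, method_errors)
]
median_improvement = median(improvements)
```

The three datasets give an improvement, no change, and a degradation respectively, so the median is zero. Note that the ratio is undefined when the parent error is exactly zero, a case this sketch does not handle.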

5.2 Empirical Study on Vision Data

In vision data, each image pixel contains local information about the neighboring pixels. To extract the local information, we use convolutional or vision transformer encoders at the front-end. More precisely, we have a front-end encoder, $h_e: \mathbb{R}^D \mapsto \mathbb{R}^m$, and typically $m \ll D$. After the encoder there are a few fully connected dense layers for discriminating among the $K$ class labels, $h_f: \mathbb{R}^m \mapsto \mathbb{R}^K$. Note that the $m$-dimensional embedding outputs from the encoder are partitioned into polytopes by the dense layers (see Equation (3)) and we fit a KDN on the embedding outputs. The above approach results in extraction of better inductive bias by KDN from the parent model and makes KDN more scalable with larger parent models and training sample size.

5.2.1 Simulation Study

Figure 3: KDN filters out inference points with different kinds of semantic shifts from the training data. Simulated images: (A) circle with radius 10, (B) rectangle with sides $(20, 50)$ and out-of-distribution test points: (C) ellipse with minor and major axis $(10, 30)$. Mean max confidence of KDN is plotted for semantic shift of the inference points created by (D) changing the color intensity, (E) taking convex combination of circle and rectangle, (F) changing one of the axes of the ellipse.

For the simulation study, we use a simple CNN with one convolutional layer (3 channels with $3 \times 3$ kernel) followed by two fully connected layers with 10 and 2 nodes, respectively. We train the CNN on 2000 circle (radius 10) and 2000 rectangle (sides $20, 50$) images with their RGB values fixed at $[127, 127, 127]$ and their centers randomly sampled within a square with sides 100. The other pixels in the background where there is no object (circle, rectangle or ellipse) were set to 0.

We perform three experiments while inducing semantic shifts in the inference points as shown in Figure 3. In the first experiment, we randomly sampled data similar to the training points. However, we added the same shift to all the RGB values of an inference point (shown as color intensity in Figure 3 D). Therefore, the inference point is ID for color intensity at 127 and otherwise OOD. In the second experiment, we kept the RGB values fixed at $[127, 127, 127]$ while taking a convex combination of a circle and a rectangle. Let images of circles and rectangles be denoted by $X_c$ and $X_r$. We derive an inference point $X_{inf}$ as:

$$X_{inf} = \epsilon X_c + (1 - \epsilon) X_r \tag{20}$$

Therefore, $X_{inf}$ is maximally distant from the training points for $\epsilon = 0.5$ and closest to the ID points at $\epsilon \in \{0, 1\}$. In the third experiment, we sampled ellipse images with the same RGB values as the training points. However, this time we gradually change one of the ellipse axes from 0.01 to 40 while keeping the other axis fixed at 10. As a result, the inference point becomes ID for the axis length of 10. As shown in Figure 3 (D, E, F), in all the experiments KDN becomes less confident for the OOD points while the parent CNN remains overconfident throughout the semantic shifts of the test points.
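The second experiment can be sketched with plain nested lists. The snippet draws a circle and a rectangle with the sizes quoted above and mixes them with Equation 20. Since all three RGB channels share the same value, a single channel is used; that simplification, the $100 \times 100$ canvas, the centered shapes, and the function names are assumptions made only for this example.

```python
def draw_circle(size, center, radius, value=127.0):
    """Single-channel image with a filled circle; background pixels are 0."""
    cx, cy = center
    return [
        [value if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0.0
         for x in range(size)]
        for y in range(size)
    ]


def draw_rectangle(size, center, sides, value=127.0):
    """Single-channel image with a filled axis-aligned rectangle."""
    cx, cy = center
    width, height = sides
    return [
        [value if abs(x - cx) <= width / 2 and abs(y - cy) <= height / 2 else 0.0
         for x in range(size)]
        for y in range(size)
    ]


def convex_combination(img_c, img_r, eps):
    """Equation 20 applied pixel-wise: eps * X_c + (1 - eps) * X_r."""
    return [
        [eps * c + (1 - eps) * r for c, r in zip(row_c, row_r)]
        for row_c, row_r in zip(img_c, img_r)
    ]


x_c = draw_circle(100, center=(50, 50), radius=10)
x_r = draw_rectangle(100, center=(50, 50), sides=(20, 50))
x_inf = convex_combination(x_c, x_r, eps=0.5)
```

At $\epsilon = 0.5$ the pixels covered by only one shape take half the training intensity, which is the setting the text identifies as farthest from the training points, while $\epsilon = 0$ and $\epsilon = 1$ return the rectangle and the circle unchanged.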

5.2.2 Vision Benchmark Datasets Study

In this study, we use a ViT_B16 (provided in the keras-vit package) vision transformer encoder [27] pretrained on the ImageNet [28] dataset and finetuned on CIFAR-10 [29]. We use the same encoder for all the baseline algorithms and finetune it with the corresponding loss function without freezing any weight. As shown in Table 1, pretrained vision transformers are already well-calibrated for ID and the OOD approaches (ACET, ODIN, OE) degrade ID calibration of the parent model. On the contrary, ID calibration approaches (Isotonic, Sigmoid) perform poorly compared to KDN in the OOD region. KDN achieves a compromise between ID and OOD performance while having reduced confidence on wrongly classified ID samples. The number of populated polytopes (and Gaussians) for KDN is $9323 \pm 353$. See Appendix F for the corresponding experiments using ResNet-50.

Table 1: KDN achieves good calibration at both ID and OOD regions whereas other approaches excel either in the ID or the OOD region. Notably, KDN has reduced confidence on wrongly classified ID points. ‘$\uparrow$’ and ‘$\downarrow$’ indicate whether higher or lower values are better, respectively. MMC = Mean Max Confidence on wrongly classified ID points.

| Dataset | Statistics | Parent | KDN | Isotonic | Sigmoid | ACET | ODIN | OE |
|---|---|---|---|---|---|---|---|---|
| ID: CIFAR-10 | Accuracy (%) $\uparrow$ | $98.06 \pm 0.00$ | $97.45 \pm 0.00$ | $98.16 \pm 0.00$ | $98.10 \pm 0.00$ | $\mathbf{98.23} \pm 0.00$ | $97.97 \pm 0.00$ | $97.94 \pm 0.00$ |
| ID: CIFAR-10 | MCE $\downarrow$ | $\mathbf{0.00} \pm 0.00$ | $\mathbf{0.00} \pm 0.00$ | $\mathbf{0.00} \pm 0.00$ | $\mathbf{0.00} \pm 0.00$ | $0.01 \pm 0.00$ | $0.02 \pm 0.00$ | $0.01 \pm 0.00$ |
| ID: CIFAR-10 | MMC $\downarrow$ | $0.76 \pm 0.01$ | $\mathbf{0.65} \pm 0.08$ | $0.74 \pm 0.02$ | $0.90 \pm 0.01$ | $0.86 \pm 0.02$ | $0.97 \pm 0.01$ | $0.69 \pm 0.01$ |
| OOD: CIFAR-100 | OCE $\downarrow$ | $0.47 \pm 0.01$ | $\mathbf{0.12} \pm 0.01$ | $0.47 \pm 0.01$ | $0.69 \pm 0.01$ | $0.57 \pm 0.01$ | $0.79 \pm 0.00$ | $0.29 \pm 0.01$ |
| OOD: SVHN | OCE $\downarrow$ | $0.44 \pm 0.06$ | $\mathbf{0.08} \pm 0.02$ | $0.34 \pm 0.12$ | $0.64 \pm 0.16$ | $0.47 \pm 0.04$ | $0.75 \pm 0.03$ | $0.11 \pm 0.02$ |
| OOD: Noise | OCE $\downarrow$ | $0.28 \pm 0.08$ | $0.03 \pm 0.02$ | $0.30 \pm 0.04$ | $0.56 \pm 0.12$ | $\mathbf{0.01} \pm 0.00$ | $0.53 \pm 0.09$ | $0.07 \pm 0.02$ |

6 Discussion

In this paper, we demonstrated a simple intuition that renders traditional deep discriminative models into a type of binning and kerneling approach. The bin boundaries are determined by the internal structure learned by the parent approach, and Geodesic distance encodes the low dimensional structure learned by the model. Moreover, the Geodesic distance introduced in this paper may have broader impact on understanding the internal structure of deep discriminative models, which we will pursue in the future. Our code, including the package and the experiments in this manuscript, will be made publicly available upon acceptance of the paper.

Acknowledgements

The authors thank the support of the NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning (NSF grant 2031985) and THEORINET. This work is graciously supported by the Defense Advanced Research Projects Agency (DARPA) Lifelong Learning Machines program through contracts FA8650-18-2-7834 and HR0011-18-2-0025. Research was partially supported by funding from Microsoft Research and the Kavli Neuroscience Discovery Institute.

References

  • Guo et al. [2017a] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017a.
  • Kristiadi et al. [2020] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being bayesian, even just a bit, fixes overconfidence in ReLU networks. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5436–5446. PMLR, 13–18 Jul 2020.
  • Xu et al. [2021a] Haoyin Xu, Kaleab A. Kinfu, Will LeVine, Sambit Panda, Jayanta Dey, Michael Ainsworth, Yu-Chung Peng, Madi Kusmanov, Florian Engert, Christopher M. White, Joshua T. Vogelstein, and Carey E. Priebe. When are Deep Networks really better than Decision Forests at small sample sizes, and how? arXiv preprint arXiv:2108.13637, 2021a.
  • Hein et al. [2019] Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 41–50, 2019.
  • Meinke et al. [2021] Alexander Meinke, Julian Bitterwolf, and Matthias Hein. Provably robust detection of out-of-distribution data (almost) for free. arXiv preprint arXiv:2106.04260, 2021.
  • Liang et al. [2017] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
  • Nandy et al. [2020] Jay Nandy, Wynne Hsu, and Mong Li Lee. Towards maximizing the representation gap between in-domain & out-of-distribution examples. Advances in Neural Information Processing Systems, 33:9239–9250, 2020.
  • Wan et al. [2018] Weitao Wan, Yuanyi Zhong, Tianpeng Li, and Jiansheng Chen. Rethinking feature distribution for loss functions in image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9117–9126, 2018.
  • DeVries and Taylor [2018] Terrance DeVries and Graham W Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.
  • Hendrycks et al. [2018] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.
  • Ren et al. [2019] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. Advances in neural information processing systems, 32, 2019.
  • Kingma et al. [2019] Diederik P Kingma, Max Welling, et al. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
  • Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Nalisnick et al. [2018] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136, 2018.
  • Zadrozny and Elkan [2001] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Icml, volume 1, pages 609–616, 2001.
  • Caruana [2004] R Caruana. Predicting good probabilities with supervised learning. In Proceedings of NIPS 2004 Workshop on Calibration and Probabilistic Prediction in Supervised Learning, 2004.
  • Platt et al. [1999] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
  • Guo et al. [2017b] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017b.
  • Guo et al. [2019] Richard Guo, Ronak Mehta, Jesus Arroyo, Hayden Helm, Cencheng Shen, and Joshua T Vogelstein. Estimating information-theoretic quantities with uncertainty forests. arXiv, pages arXiv–1907, 2019.
  • Kull et al. [2019] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. Advances in neural information processing systems, 32, 2019.
  • Xu et al. [2021b] Haoyin Xu, Kaleab A Kinfu, Will LeVine, Sambit Panda, Jayanta Dey, Michael Ainsworth, Yu-Chung Peng, Madi Kusmanov, Florian Engert, Christopher M White, et al. When are deep networks really better than decision forests at small sample sizes, and how? arXiv preprint arXiv:2108.13637, 2021b.
  • Schölkopf [2000] Bernhard Schölkopf. The kernel trick for distances. Advances in neural information processing systems, 13, 2000.
  • Bischl et al. [2017] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G Mantovani, Jan N van Rijn, and Joaquin Vanschoren. Openml benchmarking suites. arXiv preprint arXiv:1708.03731, 2017.
  • Trunk [1979] Gerard V Trunk. A problem of dimensionality: A simple example. IEEE Transactions on pattern analysis and machine intelligence, (3):306–307, 1979.
  • Kailath [1967] Thomas Kailath. The divergence and bhattacharyya distance measures in signal selection. IEEE transactions on communication technology, 15(1):52–60, 1967.
  • Rao [1995] C Radhakrishna Rao. A review of canonical coordinates and an alternative to correspondence analysis using hellinger distance. Qüestiió: quaderns d’estadística i investigació operativa, 1995.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 10.1109/CVPR.2009.5206848.
  • [29] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Khosla et al. [2020] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673, 2020.

Appendix A Proofs

A.1 Proof of Proposition 4.1

For proving that d𝑑ditalic_d is a valid distance metric for 𝒫nsubscript𝒫𝑛\mathcal{P}_{n}caligraphic_P start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, we need to prove the following four statements:

  1. 1.

    d(r,s)=0𝑑𝑟𝑠0d(r,s)=0italic_d ( italic_r , italic_s ) = 0 when r=s𝑟𝑠r=sitalic_r = italic_s.
    Proof: By definition, 𝒦(r,s)=1𝒦𝑟𝑠1\mathcal{K}(r,s)=1caligraphic_K ( italic_r , italic_s ) = 1 and d(r,s)=0𝑑𝑟𝑠0d(r,s)=0italic_d ( italic_r , italic_s ) = 0 when r=s𝑟𝑠r=sitalic_r = italic_s.

  2. $d(r,s) > 0$ when $r \neq s$.
     Proof: By definition, $0 \leq \mathcal{K}(r,s) < 1$ and $d(r,s) > 0$ for $r \neq s$.

  3. $d$ is symmetric, i.e., $d(r,s) = d(s,r)$.
     Proof: By definition, $\mathcal{K}(r,s) = \mathcal{K}(s,r)$, which implies $d(r,s) = d(s,r)$.

  4. $d$ satisfies the triangle inequality, i.e., for any three polytopes $Q_r, Q_s, Q_t \in \mathcal{P}_n$: $d(r,t) \leq d(r,s) + d(s,t)$.
     Proof: Let $\mathcal{A}_r$ denote the set of activation modes in a ReLU-net, or the set of leaf nodes in a random forest, for a particular polytope $r$. $N$ is the total number of possible activation paths in a ReLU-net or the total number of trees in a random forest. Below, $c(\cdot)$ denotes the cardinality of a set. We can write:

    $$\begin{aligned}
    N &\geq c\big((\mathcal{A}_r \cap \mathcal{A}_s) \cup (\mathcal{A}_s \cap \mathcal{A}_t)\big) \\
      &= c(\mathcal{A}_r \cap \mathcal{A}_s) + c(\mathcal{A}_s \cap \mathcal{A}_t) - c(\mathcal{A}_r \cap \mathcal{A}_s \cap \mathcal{A}_t) \\
      &\geq c(\mathcal{A}_r \cap \mathcal{A}_s) + c(\mathcal{A}_s \cap \mathcal{A}_t) - c(\mathcal{A}_r \cap \mathcal{A}_t). \tag{21}
    \end{aligned}$$

    Rearranging the above inequality, we get:

    $$\begin{aligned}
    & N - c(\mathcal{A}_r \cap \mathcal{A}_t) \leq N - c(\mathcal{A}_r \cap \mathcal{A}_s) + N - c(\mathcal{A}_s \cap \mathcal{A}_t) \\
    \implies{} & 1 - \frac{c(\mathcal{A}_r \cap \mathcal{A}_t)}{N} \leq 1 - \frac{c(\mathcal{A}_r \cap \mathcal{A}_s)}{N} + 1 - \frac{c(\mathcal{A}_s \cap \mathcal{A}_t)}{N} \\
    \implies{} & d(r,t) \leq d(r,s) + d(s,t). \tag{22}
    \end{aligned}$$
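As an informal sanity check of Proposition 4.1, the following minimal Python sketch verifies the four statements exhaustively on a small synthetic collection of polytopes. It assumes the random forest case, in which each polytope is identified by one leaf index per tree, and it reads $d(r,s) = 1 - c(\mathcal{A}_r \cap \mathcal{A}_s)/N$ off the last step of the proof. The helper names (`mismatches`, `distance`, `satisfies_metric_axioms`) and the toy sizes are illustrative choices, not part of the method.

```python
import itertools
import random


def mismatches(a_r, a_s):
    """N - c(A_r ∩ A_s): number of trees in which the two polytopes use different leaves."""
    return sum(leaf_r != leaf_s for leaf_r, leaf_s in zip(a_r, a_s))


def distance(a_r, a_s):
    """d(r, s) = 1 - c(A_r ∩ A_s) / N, with N = len(a_r)."""
    return mismatches(a_r, a_s) / len(a_r)


def satisfies_metric_axioms(polytopes):
    """Exhaustively check the four statements on a finite collection of polytopes."""
    for a_r, a_s in itertools.product(polytopes, repeat=2):
        # Statements 1 and 2: zero distance exactly when r = s.
        if (a_r == a_s) != (mismatches(a_r, a_s) == 0):
            return False
        # Statement 3: symmetry.
        if mismatches(a_r, a_s) != mismatches(a_s, a_r):
            return False
    for a_r, a_s, a_t in itertools.product(polytopes, repeat=3):
        # Statement 4: triangle inequality, compared on integer counts.
        if mismatches(a_r, a_t) > mismatches(a_r, a_s) + mismatches(a_s, a_t):
            return False
    return True


rng = random.Random(0)
n_trees, n_leaves = 8, 3  # toy sizes (assumption)
# Hypothetical polytopes: one leaf index per tree, duplicates removed.
polytopes = sorted(
    {tuple(rng.randrange(n_leaves) for _ in range(n_trees)) for _ in range(30)}
)
print(satisfies_metric_axioms(polytopes))
```

Because the counts are integers, the triangle inequality is compared on $N - c(\cdot)$ directly, which avoids floating-point rounding when the inequality holds with equality.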

A.2 Proof of Proposition 4.3

Note that we first find the polytope nearest to the inference point $\mathbf{x}$ using the geodesic distance and then apply a Gaussian kernel locally for $\mathbf{x}$ within that polytope. The Gaussian kernel uses the Euclidean distance from the kernel center to $\mathbf{x}$ (in the numerator of the exponent), so its value decays exponentially as the distance of the inference point from the kernel center increases. We first expand $\hat{g}_y(\mathbf{x})$:

$$\begin{aligned}
\hat{g}_y(\mathbf{x}) &= \frac{\hat{f}_y(\mathbf{x}) \hat{P}_Y(y)}{\sum_{k=1}^{K} \hat{f}_k(\mathbf{x}) \hat{P}_Y(k)} \\
&= \frac{\tilde{f}_y(\mathbf{x}) \hat{P}_Y(y) + \frac{b}{\log(n)} \hat{P}_Y(y)}{\sum_{k=1}^{K} \left( \tilde{f}_k(\mathbf{x}) \hat{P}_Y(k) + \frac{b}{\log(n)} \hat{P}_Y(k) \right)}
\end{aligned}$$

As the inference point $\mathbf{x}$ becomes more distant from the training samples (and more distant from all of the Gaussian centers), $\mathcal{G}(\mathbf{x}, \hat{\mu}_r, \hat{\Sigma}_r)$ becomes smaller. Thus, $\forall y$, $\tilde{f}_y(\mathbf{x})$ shrinks. More formally, $\forall y$,

$$\lim_{d_{\mathbf{x}} \to \infty} \tilde{f}_y(\mathbf{x}) = 0.$$

We can then use this result to examine the limiting behavior of our posteriors as the inference point $\mathbf{x}$ becomes more distant from the training data:

$$\begin{aligned}
\lim_{d_{\mathbf{x}} \to \infty} \hat{g}_y(\mathbf{x}) &= \lim_{d_{\mathbf{x}} \to \infty} \frac{\tilde{f}_y(\mathbf{x}) \hat{P}_Y(y) + \frac{b}{\log(n)} \hat{P}_Y(y)}{\sum_{k=1}^{K} \left( \tilde{f}_k(\mathbf{x}) \hat{P}_Y(k) + \frac{b}{\log(n)} \hat{P}_Y(k) \right)} \\
&= \frac{\left( \lim_{d_{\mathbf{x}} \to \infty} \tilde{f}_y(\mathbf{x}) \right) \hat{P}_Y(y) + \frac{b}{\log(n)} \hat{P}_Y(y)}{\sum_{k=1}^{K} \left( \left( \lim_{d_{\mathbf{x}} \to \infty} \tilde{f}_k(\mathbf{x}) \right) \hat{P}_Y(k) + \frac{b}{\log(n)} \hat{P}_Y(k) \right)} \\
&= \frac{\hat{P}_Y(y)}{\sum_{k=1}^{K} \hat{P}_Y(k)} \\
&= \hat{P}_Y(y).
\end{aligned}$$

The third equality cancels the common factor $\frac{b}{\log(n)}$, and the last one uses $\sum_{k=1}^{K} \hat{P}_Y(k) = 1$.
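The limit above can be illustrated numerically. The sketch below is a minimal example under stated assumptions: `kernel_density` is a stand-in for $\tilde{f}_y$ that averages isotropic Gaussian bumps with unit bandwidth, and the kernel centers, the priors, and the values of $b$ and $n$ are hypothetical. It only demonstrates that the bias term $\frac{b}{\log(n)}$ drives the posterior to the class prior far from the data.

```python
import math


def kernel_density(x, centers, bandwidth=1.0):
    """Stand-in for f~_y(x): average of isotropic Gaussian bumps around a class's centers."""
    total = 0.0
    for center in centers:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(centers)


def posterior(x, class_centers, priors, b, n):
    """g^_y(x) = (f~_y(x) + b/log n) P^(y) / sum_k (f~_k(x) + b/log n) P^(k)."""
    bias = b / math.log(n)
    scores = [
        (kernel_density(x, centers) + bias) * prior
        for centers, prior in zip(class_centers, priors)
    ]
    total = sum(scores)
    return [score / total for score in scores]


# Hypothetical kernel centers for two classes, with unequal priors.
class_centers = [[(0.0, 0.0), (1.0, 0.0)], [(4.0, 4.0)]]
priors = [0.7, 0.3]
b, n = 1e-3, 1000  # illustrative values only

near = posterior((0.5, 0.0), class_centers, priors, b, n)
far = posterior((1e3, 1e3), class_centers, priors, b, n)
print(near)  # dominated by class 0, whose centers are nearby
print(far)   # close to the priors [0.7, 0.3]
```

Without the bias term, both numerator and denominator would underflow to zero far from the data, which is why the additive $\frac{b}{\log(n)}$ matters for the out-of-distribution behavior.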

Appendix B Hardware and Software Configurations

  • Operating System: Linux (Ubuntu 20.04), macOS (Ventura 13.2.1)

  • VM Size: Azure Standard D96as v4 (96 vcpus, 384 GiB memory)

  • GPU: Apple M1 Max

  • Software: Python 3.8, scikit-learn $\geq$ 0.22.0, tensorflow-macos $\leq$ 2.9, tensorflow-metal $\leq$ 0.5.0.

Appendix C Simulations

Figure 4: Visualization of true and estimated posteriors for class 0 from six binary class simulation experiments. Column 1: 10,000 training points with 5,000 samples per class sampled from 6 different simulation setups for binary class classification. The Trunk simulation is shown for the two-dimensional case. The class labels are indicated by yellow and blue colors. Columns 2-8: True and estimated class conditional posteriors from different approaches. The posteriors estimated from KDN and KDF are better calibrated for both in- and out-of-distribution regions compared to those of their parent algorithms.

We construct six types of binary class simulations:

  • Gaussian XOR is a two-class classification problem with equal class priors. Conditioned on being in class 0, a sample is drawn from a mixture of two Gaussians with means $\pm[0.5, -0.5]^\top$ and standard deviations of $0.25$. Conditioned on being in class 1, a sample is drawn from a mixture of two Gaussians with means $\pm[0.5, 0.5]^\top$ and standard deviations of $0.25$.

  • Spiral is a two-class classification problem with the following data distributions: let $K$ be the number of classes and $S \sim \text{multinomial}(\frac{1}{K}\vec{1}_K, n)$. Conditioned on $S$, each feature vector is parameterized by two variables, the radius $r$ and an angle $\theta$. For each sample, $r$ is sampled uniformly in $[0, 1]$. Conditioned on a particular class $k$, the angles are evenly spaced between $\frac{4\pi(k-1)t_K}{K}$ and $\frac{4\pi k t_K}{K}$, where $t_K$ controls the number of turns in the spiral. To inject noise along the spirals, we add Gaussian noise to the evenly spaced angles $\theta'$: $\theta = \theta' + \mathcal{N}(0, 0.09)$. The observed feature vector is then $(r\cos(\theta), r\sin(\theta))$.

  • Circle is a two-class classification problem with equal class priors. Conditioned on being in class 0, a sample is drawn from a circle centered at $(0, 0)$ with a radius of $r = 0.75$. Conditioned on being in class 1, a sample is drawn from a circle centered at $(0, 0)$ with a radius of $r = 1$, which is cut off by the region bounds. To inject noise along the circles, we add Gaussian noise to the circle radii $r'$: $r = r' + \mathcal{N}(0, 0.01)$.

  • Sinewave is a two-class classification problem based on sine waves. Conditioned on being in class 0, a sample is drawn from the distribution $y = \cos(\pi x)$. Conditioned on being in class 1, a sample is drawn from the distribution $y = \sin(\pi x)$. We add Gaussian noise to the sine wave heights $y'$: $y = y' + \mathcal{N}(0, 0.01)$.

  • Polynomial is a two-class classification problem with the following data distributions: $y = x^a$. Conditioned on being in class 0, a sample is drawn from the distribution $y = x^1$. Conditioned on being in class 1, a sample is drawn from the distribution $y = x^3$. Gaussian noise is added to the variables $y'$: $y = y' + \mathcal{N}(0, 0.01)$.

  • Trunk is a two-class classification problem with gradually increasing dimension and equal class priors. The class conditional probabilities are Gaussian:

    $$\begin{aligned}
    P(X \mid Y = 0) &= \mathcal{G}(\mu_1, I), \\
    P(X \mid Y = 1) &= \mathcal{G}(\mu_2, I),
    \end{aligned}$$

    where $\mu_1 = \mu$, $\mu_2 = -\mu$, $\mu$ is a $d$-dimensional vector whose $i$-th component is $(\frac{1}{i})^{1/2}$, and $I$ is the $d$-dimensional identity matrix.
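Two of the simulations above can be sketched with the Python standard library as follows. This is a minimal illustration, not the generator used for Figure 4: the function names are hypothetical, labels are drawn with equal priors rather than fixed at 5,000 per class, and the Gaussian XOR layout assumes class 0 sits on the anti-diagonal centers $\pm[0.5, -0.5]^\top$ and class 1 on the diagonal centers $\pm[0.5, 0.5]^\top$.

```python
import math
import random


def gaussian_xor(n_samples, std=0.25, seed=0):
    """Sample (X, y) from a two-class Gaussian XOR mixture."""
    rng = random.Random(seed)
    # Assumed XOR layout: class 0 on the anti-diagonal, class 1 on the diagonal.
    centers = {
        0: [(0.5, -0.5), (-0.5, 0.5)],
        1: [(0.5, 0.5), (-0.5, -0.5)],
    }
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randrange(2)              # equal class priors
        cx, cy = rng.choice(centers[label])   # pick one mixture component
        X.append((rng.gauss(cx, std), rng.gauss(cy, std)))
        y.append(label)
    return X, y


def trunk(n_samples, dim, seed=0):
    """Sample (X, y) from the Trunk simulation in `dim` dimensions."""
    rng = random.Random(seed)
    # The i-th component of mu is (1/i)^(1/2), for i = 1, ..., dim.
    mu = [1.0 / math.sqrt(i) for i in range(1, dim + 1)]
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randrange(2)
        sign = 1.0 if label == 0 else -1.0    # class 0 mean is +mu, class 1 mean is -mu
        X.append(tuple(rng.gauss(sign * m, 1.0) for m in mu))  # identity covariance
        y.append(label)
    return X, y


X_xor, y_xor = gaussian_xor(1000)
X_trunk, y_trunk = trunk(1000, dim=2)
print(len(X_xor), len(X_trunk[0]))
```

Since the components of $\mu$ shrink as $1/\sqrt{i}$, each added Trunk dimension carries less class signal than the previous one.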

Table 2: Hyperparameters for RF and KDF.

| Hyperparameters | Value |
|---|---|
| n_estimators | $500$ |
| max_depth | $\infty$ |
| min_samples_leaf | $1$ |
| $\lambda$ | $1 \times 10^{-6}$ |
| $b$ | $\exp(-10^{-7})$ |

Table 3: Hyperparameters for ReLU-net and KDN on tabular data.

| Hyperparameters | Value |
|---|---|
| number of hidden layers | $4$ |
| nodes per hidden layer | $1000$ |
| optimizer | Adam |
| learning rate | $3 \times 10^{-4}$ |
| $\lambda$ | $1 \times 10^{-6}$ |
| $b$ | $\exp(-10^{-7})$ |

Appendix D Pseudocodes

We provide the pseudocode for our proposed algorithms in Algorithms 1, 2, and 3.

Algorithm 1 Fit a KDX model.
Input:
  (1) $\theta$ ▷ Parent learner (random forest or deep network model)
  (2) $\mathcal{D}_n = (\mathbf{X}, \mathbf{y}) \in \mathbb{R}^{n \times d} \times \{1, \ldots, K\}^n$ ▷ Training data
Output: $\mathcal{G}$ ▷ a KDX model
function KDX.fit($\theta, \mathbf{X}, \mathbf{y}$)
    for $i = 1, \ldots, n$ do ▷ Iterate over the dataset to calculate the weights
        for $j = 1, \ldots, n$ do
            $w_{ij} \leftarrow$ computeWeights($\mathbf{x}_i, \mathbf{x}_j, \theta$)
        end for
    end for

    $\{Q_r, \mathbf{w}_{rs}\}_{r=1}^{\tilde{p}} \leftarrow$ getPolytopes($\mathbf{w}$) ▷ Identify the polytopes by clustering the samples with similar weights

    for $r = 1, \ldots, \tilde{p}$ do ▷ Iterate over each polytope
        $\mathcal{G}.\hat{\mu}_r, \mathcal{G}.\hat{\Sigma}_r, \mathcal{G}.\hat{n}_{ry} \leftarrow$ estimateParameters($\mathbf{X}, \mathbf{y}, \{\mathbf{w}_{rs}\}_{s=1}^{\tilde{p}}$) ▷ Fit Gaussians using MLE
    end for
    return $\mathcal{G}$
end function
Algorithm 2 Computing weights in KDF
Input:
  (1) $\mathbf{x}_i, \mathbf{x}_j \in \mathbb{R}^{1 \times d}$ ▷ two input samples to be weighted
  (2) $\theta$ ▷ parent random forest with $T$ trees
Output: $w_{ij} \in [0, 1]$ ▷ similarity between the $i$-th and $j$-th samples
function computeWeights($\mathbf{x}_i, \mathbf{x}_j, \theta$)
    $\mathcal{I}_i \leftarrow$ pushDownTrees($\mathbf{x}_i, \theta$) ▷ push $\mathbf{x}_i$ down the $T$ trees and get the leaf numbers it ends up in
    $\mathcal{I}_j \leftarrow$ pushDownTrees($\mathbf{x}_j, \theta$) ▷ push $\mathbf{x}_j$ down the $T$ trees and get the leaf numbers it ends up in
    $l \leftarrow$ countMatches($\mathcal{I}_i, \mathcal{I}_j$) ▷ count the number of times the samples end up in the same leaf
    $w_{ij} \leftarrow \frac{l}{T}$
    return $w_{ij}$
end function
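A minimal Python sketch of Algorithm 2 follows. To stay self-contained, it assumes a toy forest of depth-1 stumps, each a hypothetical `(feature, threshold)` pair with two leaves; with a fitted scikit-learn forest, the per-tree leaf indices would instead come from its `apply` method. The names `push_down_trees` and `compute_weight` mirror the pseudocode and are otherwise illustrative.

```python
def push_down_trees(x, forest):
    """Return the leaf index that x reaches in every tree (here: depth-1 stumps)."""
    return [0 if x[feature] <= threshold else 1 for feature, threshold in forest]


def compute_weight(x_i, x_j, forest):
    """w_ij = l / T: fraction of trees in which both samples land in the same leaf."""
    leaves_i = push_down_trees(x_i, forest)
    leaves_j = push_down_trees(x_j, forest)
    matches = sum(a == b for a, b in zip(leaves_i, leaves_j))
    return matches / len(forest)


# Hypothetical forest with T = 4 stumps over two features.
forest = [(0, 0.0), (1, 0.5), (0, 1.0), (1, -1.0)]

w_same = compute_weight((0.2, 0.1), (0.3, 0.2), forest)
w_diff = compute_weight((0.2, 0.1), (2.0, 0.9), forest)
print(w_same, w_diff)  # 1.0 0.5
```

Here the first pair shares a leaf in all four stumps, while the second pair agrees in only two of them.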
Algorithm 3 Computing weights in KDN
Input:
  (1) $\mathbf{x}_i, \mathbf{x}_j \in \mathbb{R}^{1 \times d}$ ▷ two input samples to be weighted
  (2) $\theta$ ▷ parent deep-net model
Output: $w_{ij} \in [0, 1]$ ▷ similarity between the $i$-th and $j$-th samples
function computeWeights($\mathbf{x}_i, \mathbf{x}_j, \theta$)
    $\mathcal{A}_i \leftarrow$ pushDownNetwork($\mathbf{x}_i, \theta$) ▷ get activation modes $\mathcal{A}_i$
    $\mathcal{A}_j \leftarrow$ pushDownNetwork($\mathbf{x}_j, \theta$) ▷ get activation modes $\mathcal{A}_j$
    $l \leftarrow$ countMatches($\mathcal{A}_i, \mathcal{A}_j$) ▷ count the number of times the two samples activate the activation paths in a similar way
    $w_{ij} \leftarrow \frac{l}{N}$ ▷ $N$ is the total number of activation paths
    return $w_{ij}$
end function
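Algorithm 3 can be sketched in the same spirit. The example below is a simplification under stated assumptions: it treats the activation mode of a sample as the on/off state of every hidden ReLU unit and takes $N$ to be the number of hidden units, which is only a proxy for counting activation paths. The tiny network and its hand-picked parameters are hypothetical.

```python
def activation_pattern(x, layers):
    """Forward pass through a ReLU MLP, recording which hidden units are active."""
    pattern = []
    h = list(x)
    for weights, biases in layers:
        pre = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        pattern.extend(p > 0 for p in pre)   # on/off state of each unit in this layer
        h = [max(p, 0.0) for p in pre]       # ReLU output feeds the next layer
    return pattern


def compute_weight(x_i, x_j, layers):
    """w_ij = l / N: fraction of hidden units whose on/off state agrees for both samples."""
    a_i = activation_pattern(x_i, layers)
    a_j = activation_pattern(x_j, layers)
    matches = sum(u == v for u, v in zip(a_i, a_j))
    return matches / len(a_i)


# Hypothetical hidden stack: 2 inputs -> 3 units -> 2 units, as (weights, biases) pairs.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.5]),
    ([[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]], [0.0, -0.25]),
]

print(compute_weight((1.0, 2.0), (2.0, 1.0), layers))    # 0.8
print(compute_weight((1.0, 2.0), (-1.0, -1.0), layers))  # 0.4
```

Samples that share every on/off state lie in the same polytope of the network and receive weight 1.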

Appendix E Extended Results on OpenML-CC18 data suite

See Figures 5, 6, 7, and 8 for extended results on the OpenML-CC18 data suite.

Figure 5: Extended results on OpenML-CC18 datasets. Left: Performance (classification error, MCE and mean max confidence) of KDF on different OpenML-CC18 datasets. Right: Performance (classification error, MCE and mean max confidence) of KDN on different OpenML-CC18 datasets.
Figure 6: Extended results on OpenML-CC18 datasets (continued). Left: Performance (classification error, MCE and mean max confidence) of KDF on different OpenML-CC18 datasets. Right: Performance (classification error, MCE and mean max confidence) of KDN on different OpenML-CC18 datasets.
Figure 7: Extended results on OpenML-CC18 datasets (continued). Left: Performance (classification error, MCE and mean max confidence) of KDF on different OpenML-CC18 datasets. Right: Performance (classification error, MCE and mean max confidence) of KDN on different OpenML-CC18 datasets.
Figure 8: Extended results on OpenML-CC18 datasets (continued). Left: Performance (classification error, MCE and mean max confidence) of KDF on different OpenML-CC18 datasets. Right: Performance (classification error, MCE and mean max confidence) of KDN on different OpenML-CC18 datasets.

Appendix F Extended Results on Vision datasets using Resnet-50

Table 4: ID approaches (Sigmoid, Isotonic) are bad at OOD calibration and OOD approaches (ACET, ODIN, OE) are bad at ID calibration. KDN bridges between both ID and OOD calibration approaches. ‘$\uparrow$’ and ‘$\downarrow$’ indicate whether higher and lower values are better, respectively. Bolded indicates most performant, or within the margin of error of the most performant.

| Dataset | Statistics | Parent | KDN | Isotonic | Sigmoid | ACET | ODIN | OE |
|---|---|---|---|---|---|---|---|---|
| ID: CIFAR-10 | Accuracy (%) $\uparrow$ | $77.78 \pm 0.00$ | $76.84 \pm 0.01$ | $\mathbf{78.25} \pm 0.00$ | $76.93 \pm 0.00$ | $75.08 \pm 0.03$ | $78.00 \pm 0.00$ | $73.95 \pm 0.00$ |
| ID: CIFAR-10 | MCE $\downarrow$ | $0.09 \pm 0.00$ | $\mathbf{0.04} \pm 0.00$ | $\mathbf{0.03} \pm 0.01$ | $0.10 \pm 0.01$ | $0.13 \pm 0.00$ | $0.09 \pm 0.00$ | $0.55 \pm 0.00$ |
| ID: CIFAR-10 | MMC $\downarrow$ | $0.47 \pm 0.00$ | $0.37 \pm 0.01$ | $0.54 \pm 0.01$ | $0.43 \pm 0.01$ | $0.69 \pm 0.00$ | $0.48 \pm 0.01$ | $\mathbf{0.13} \pm 0.00$ |
| OOD: CIFAR-100 | OCE $\downarrow$ | $0.30 \pm 0.00$ | $0.20 \pm 0.01$ | $0.37 \pm 0.01$ | $0.29 \pm 0.01$ | $0.55 \pm 0.00$ | $0.31 \pm 0.00$ | $\mathbf{0.01} \pm 0.00$ |
| OOD: SVHN | OCE $\downarrow$ | $0.87 \pm 0.00$ | $\mathbf{0.01} \pm 0.00$ | $0.85 \pm 0.00$ | $0.69 \pm 0.01$ | $0.90 \pm 0.00$ | $0.87 \pm 0.00$ | $0.04 \pm 0.01$ |
| OOD: Noise | OCE $\downarrow$ | $0.90 \pm 0.00$ | $\mathbf{0.00} \pm 0.00$ | $0.87 \pm 0.00$ | $0.71 \pm 0.00$ | $0.01 \pm 0.01$ | $0.06 \pm 0.00$ | $\mathbf{0.00} \pm 0.00$ |

In these experiments, we use a ResNet-50 encoder pretrained using contrastive loss [30] as described in http://keras.io/examples/vision/supervised-contrastive-learning. The encoder projects the input images down to a $256$-dimensional latent space, and we add two dense layers with $200$ and $10$ nodes on top of the encoder. We use the same pretrained encoder for all the baseline algorithms.

As shown in Table 4, KDN achieves good calibration for both ID and OOD datasets whereas the ID calibration approaches are poorly calibrated in the OOD regions and the OOD approaches have poor ID calibration.