AdapTable: Test-Time Adaptation for Tabular Data
via Shift-Aware Uncertainty Calibrator and Label Distribution Handler

Changhun Kim1,2,* Taewon Kim1,* Seungyeon Woo1,3 June Yong Yang1 Eunho Yang1,2 (* equal contribution)
Abstract

In real-world scenarios, tabular data often suffer from distribution shifts that threaten the performance of machine learning models. Despite its prevalence and importance, handling distribution shifts in the tabular domain remains underexplored due to the inherent challenges within the tabular data itself. In this sense, test-time adaptation (TTA) offers a promising solution by adapting models to target data without accessing source data, crucial for privacy-sensitive tabular domains. However, existing TTA methods either 1) overlook the nature of tabular distribution shifts, often involving label distribution shifts, or 2) impose architectural constraints on the model, leading to a lack of applicability. To this end, we propose AdapTable, a novel TTA framework for tabular data. AdapTable operates in two stages: 1) calibrating model predictions using a shift-aware uncertainty calibrator, and 2) adjusting these predictions to match the target label distribution with a label distribution handler. We validate the effectiveness of AdapTable through theoretical analysis and extensive experiments on various distribution shift scenarios. Our results demonstrate AdapTable’s ability to handle various real-world distribution shifts, achieving up to a 16% improvement on the HELOC dataset.

1 Introduction

Tabular data is one of the most abundant forms of data across various industries, including healthcare (Johnson et al. 2016), finance (Studies 2019), manufacturing (Hein et al. 2017), and public administration (Gardner, Popovic, and Schmidt 2023). However, tabular learning models often face challenges in real-world applications due to distribution shifts, which severely degrade their integrity and reliability. In this regard, test-time adaptation (TTA) (Lee 2013; Liu et al. 2021; Wang et al. 2021a; Niu et al. 2022, 2023; Boudiaf et al. 2022) offers a promising solution by adapting models under unknown distribution shifts using only unlabeled test data, without access to training data.

Despite its potential, directly applying TTA without considering the characteristics of tabular data results in limited performance gains or model collapse. We identify two primary reasons for this. First, representation learning in the tabular domain is often hindered by the entanglement of covariate shifts and concept drifts (Liu et al. 2023). Consequently, TTA methods that leverage unsupervised objectives relying on the cluster assumption often fail or lead to model collapse. Second, these approaches often do not take label distribution shifts into account, a key factor in the performance decline within the tabular domain. This issue is further aggravated by the tendency for predictions in the target domain to be biased towards the source label distribution.

To address these issues, we propose AdapTable, a novel TTA approach tailored for tabular data. AdapTable consists of two main components: 1) a shift-aware uncertainty calibrator and 2) a label distribution handler. Our shift-aware uncertainty calibrator utilizes graph neural networks to assign per-sample temperature for each model prediction. By treating each column as a node, it captures not only individual feature shifts but also complex patterns across features. Our label distribution handler then adjusts the calibrated model probabilities by estimating the label distribution of the current target batch. This process aligns predictions with the target label distribution, addressing biases towards the source distribution. AdapTable requires no parameter updates, making it model-agnostic and thus compatible with both deep learning models and gradient-boosted decision trees, offering high versatility for tabular data.

We evaluate AdapTable under various distribution shifts and demonstrate that it consistently outperforms baselines, achieving up to 16% gains on the HELOC dataset. Furthermore, we provide theoretical insights into AdapTable’s performance, supported by extensive ablation studies. We summarize our contributions as follows:

  • We analyze the challenges of tabular distribution shifts to reveal why existing TTA methods fail, highlighting the entanglement of covariate and concept shifts, together with label distribution shifts, as key factors in performance degradation.

  • Building on these analyses, we introduce AdapTable, the first model-agnostic TTA method specifically designed for tabular data. AdapTable addresses label distribution shifts by estimating and adjusting the label distribution of the current test batch, while also calibrating model predictions with a shift-aware uncertainty calibrator.

  • Our extensive experiments demonstrate that AdapTable exhibits robust adaptation performance across various model architectures and under diverse natural distribution shifts and common corruptions, further supported by extensive ablation studies.

2 Analysis of Tabular Distribution Shifts

In this section, we examine why prior TTA methods struggle with distribution shifts in the tabular domain. First, we note that deep learning models’ latent representations do not follow label-based cluster assumptions due to the entanglement of covariate and concept shifts, causing TTA methods relying on these assumptions (Wang et al. 2021a; Lee 2013; Niu et al. 2022) to falter in tabular data. Second, we identify label distribution shift as a key driver of performance degradation under distribution shifts, as discussed further in Section 2.1 and Section 2.2.

2.1 Indistinguishable Representations

Figure 1: Latent space visualization with t-SNE comparing (a) tabular data (Gardner, Popovic, and Schmidt 2023) and (b) image data (Bischl et al. 2021). Reliability diagrams of (c) underconfident and (d) overconfident scenarios are shown. All experiments are conducted using an MLP architecture.

We first reveal that deep tabular models fail to learn distinguishable embeddings. In Figures 1 (a) and (b), we visualize the embedding spaces of models trained on two datasets: HELOC (Gardner, Popovic, and Schmidt 2023), a pure tabular dataset, and Optdigits (Bischl et al. 2021), a linearized image dataset. Notably, the deep learning models’ representations adhere to the cluster assumption by labels only in the image data, not in the tabular data.

We attribute this unique behavior of deep tabular models to the high-frequency nature of tabular data. In the tabular domain, weak causality from inputs $X$ to outputs $Y$ due to latent confounders $Z$ often leads to vastly different labels for similar inputs (Grinsztajn, Oyallon, and Varoquaux 2022; Liu et al. 2023). For instance, cardiovascular disease risk predictions based on cholesterol, blood pressure, age, and smoking history are influenced by gender as a latent confounder, resulting in different risk levels for men and women despite identical inputs (Mosca, Barrett-Connor, and Kass Wenger 2011; DeFilippis and Van Spall 2021). This leads to high-frequency functions that are difficult for deep neural networks, which are biased toward low-frequency functions, to accurately model (Beyazit et al. 2024).

Consequently, prior TTA methods, which rely on cluster assumptions and primarily target input covariate shifts, show limited performance gains. Figure 5 demonstrates that these methods fail to improve beyond the vanilla performance of the source model due to the lack of a cluster assumption.

2.2 Importance of Label Distribution Shifts

Figure 2: Label distributions of (a) the source domain, (b) the target domain, (c) the estimate obtained from pseudo labels, and (d) the corrected estimate from AdapTable, using an MLP on the HELOC dataset.

Second, we find that label distribution shift is a primary cause of performance degradation, and that accurately estimating the target label distribution can lead to significant performance gains. A recent benchmark study, TableShift (Gardner, Popovic, and Schmidt 2023), has emphasized this point. Specifically, the authors investigated the relationship between model performance and three key shift factors: input covariate shift ($X$-shift), concept shift ($Y|X$-shift), and label distribution shift ($Y$-shift). They found that label distribution shifts are strongly correlated with performance degradation. Our analysis in Section F further reveals that these shifts are highly prevalent in tabular data. This underscores the need for a test-time adaptation method that addresses label distribution shifts by estimating the target label distribution and adjusting predictions accordingly.

Figure 3: The overall pipeline of the AdapTable framework. AdapTable employs a per-sample temperature scaling to correct overconfident predictions by treating each column as a graph node, enabling a shift-aware uncertainty calibrator with graph neural networks to capture both individual and complex feature shifts (Section 3.2). It also estimates the label distribution of the current test batch and adjusts the model’s predictions accordingly (Section 3.3).

Moreover, we visualize model predictions in Figure 2 and observe that, similar to other domains (Wu et al. 2021; Hwang et al. 2022; Park, Seo, and Yang 2023), the marginal distribution of output labels is biased toward the source label distribution. Given that tabular models are often poorly calibrated (Figure 1), we conduct an experiment using a perfectly calibrated model, which yields high confidence for correct samples and low confidence for incorrect ones. As shown in Table 1, our label distribution adaptation method significantly improves under these conditions. This underscores the need for an uncertainty calibrator specific to tabular data.

Table 1: Key findings demonstrate that uncertainty calibration enhances the performance of the label distribution handler.

Method               HELOC   Voting
Source               47.6    79.3
AdapTable            63.7    79.6
AdapTable (Oracle)   90.1    84.7

3 AdapTable

This section introduces AdapTable, the first model-agnostic test-time adaptation strategy for tabular data. AdapTable uses per-sample temperature scaling to correct overconfident yet incorrect predictions; by treating each column as a graph node, it employs a shift-aware uncertainty calibrator with graph neural networks to capture both individual and complex feature shifts (Section 3.2). It also estimates the average label distribution of the current test batch and adjusts the model’s output predictions accordingly (Section 3.3). We also provide a theoretical justification for how our label distribution estimation reduces the error bound in Section 3.4. The overall framework of AdapTable is depicted in Figure 3.

3.1 Test-Time Adaptation Setup for Tabular Data

We begin by defining the problem setup for test-time adaptation (TTA) for tabular data. Let $f_{\theta}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{C}$ be a pre-trained classifier on the labeled source tabular domain $\mathcal{D}_{s}=\{(\mathbf{x}_{i}^{s},y_{i}^{s})\}_{i}\subset X_{s}\times Y_{s}$, where each pair consists of a tabular input $\mathbf{x}_{i}^{s}\in\mathcal{X}=\mathbb{R}^{D}$ and its corresponding output class label $y_{i}^{s}\in\mathcal{Y}=\{1,\cdots,C\}$. The classifier takes a row $\mathbf{x}_{i}\in\mathbb{R}^{D}$ from a table and returns the output logit $f_{\theta}(\mathbf{x}_{i})\in\mathbb{R}^{C}$. Here, $D$ and $C$ are the number of input features and output classes, respectively. The objective of TTA for tabular data is to adapt $f_{\theta}$ to the unlabeled target tabular domain $\mathcal{D}_{t}=\{\mathbf{x}_{i}^{t}\}_{i}=X_{t}$ during inference, without access to $\mathcal{D}_{s}$. Unlike most TTA methods that fine-tune model parameters $\theta$ with unsupervised objectives, our approach directly adjusts the output prediction $f_{\theta}(\mathbf{x}_{i}^{t})$.

3.2 Shift-Aware Uncertainty Calibrator

This section describes a shift-aware uncertainty calibrator $g_{\phi}:\mathbb{R}^{C}\times\mathbb{R}^{D\times N}\rightarrow\mathbb{R}^{+}$ designed to adjust the poorly calibrated original predictions $p_{t}(y|\mathbf{x}_{i}^{t})=\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t}))$, where $\text{softmax}(z)_{i}=\exp(z_{i})/\sum_{i^{\prime}}\exp(z_{i^{\prime}})$ normalizes the logits. Our shift-aware uncertainty calibrator lowers the confidence of overconfident yet incorrect predictions, thereby 1) facilitating better alignment of these predictions with the estimated target label distribution, and 2) mitigating their adverse impact on the estimation of the target label distribution.

Conventional post-hoc calibration methods (Platt 2000; Stylianou and Flournoy 2002) typically take solely the original model prediction $f_{\theta}(\mathbf{x}_{i}^{t})$ as input and return the corresponding temperature $T_{i}$ without taking input variations into account. We argue that this can be suboptimal as it fails to account for the uncertainty arising from variations in the input itself. Instead, our $g_{\phi}$ not only considers $f_{\theta}(\mathbf{x}_{i}^{t})$ but also incorporates $\mathbf{x}_{i}^{t}$ with the shift trend $\mathbf{s}^{t}$ of the current batch as additional inputs. Capturing the shift patterns shared across the current batch lets the calibrator reflect the uncertainty they cause more accurately.

In detail, the shift trend $\mathbf{s}^{t}=(\mathbf{s}_{u}^{t})_{u=1}^{D}\in\mathbb{R}^{D\times N}$ is defined for a specific column index $u$ as follows:

$$\mathbf{s}_{u}^{t}=\Big(\mathbf{x}^{t}_{iu}-\frac{1}{|\mathcal{D}_{s}|}\sum_{i^{\prime}=1}^{|\mathcal{D}_{s}|}\mathbf{x}_{i^{\prime}u}^{s}\Big)_{i=1}^{N}\in\mathbb{R}^{N}. \tag{1}$$

Here, $\mathbf{s}_{u}^{t}$ represents the difference between the values of the $u$-th column within the current test batch and the average values of the corresponding column in the source data. Using $\mathbf{s}_{u}^{t}$ for each column $u$, we define a shift trend graph where each node $u$ represents a column, and each edge captures the relationship between different columns; the node feature for each node $u$ is defined as $\mathbf{s}_{u}^{t}$, and the adjacency matrix is represented by a $D\times D$ all-ones matrix.
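
As a rough illustration of Equation 1 and the shift trend graph, the following NumPy sketch computes $\mathbf{s}^{t}$ for one test batch. It assumes that the per-column source means were stored ahead of time (the source data itself is unavailable at test time); the function names `shift_trend` and `shift_trend_graph` are hypothetical and not taken from the paper.

```python
import numpy as np

def shift_trend(x_batch, source_col_mean):
    """Equation 1: offset of each test value from its column's source mean.

    x_batch:         (N, D) current test batch.
    source_col_mean: (D,) per-column means of the source data (assumed precomputed).
    Returns s^t with shape (D, N): one N-dimensional feature vector per column.
    """
    x_batch = np.asarray(x_batch, dtype=float)
    source_col_mean = np.asarray(source_col_mean, dtype=float)
    return (x_batch - source_col_mean).T

def shift_trend_graph(x_batch, source_col_mean):
    """Node features s^t plus the D x D all-ones adjacency described in the text."""
    s = shift_trend(x_batch, source_col_mean)
    num_columns = s.shape[0]
    adjacency = np.ones((num_columns, num_columns))
    return s, adjacency

# Two test rows, two columns; the source means are 2.0 and 10.0.
s, adjacency = shift_trend_graph([[1.0, 10.0], [3.0, 14.0]], [2.0, 10.0])
print(s)  # [[-1.  1.]
          #  [ 0.  4.]]
```

Each row of `s` is the node feature of one column, so its length equals the batch size $N$.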

A graph neural network (GNN) is then applied to the graph formed above, enabling the exchange of shift trends between different columns through message passing. This process generates a column-wise contextualized representation, which is then averaged to produce an overall feature representation that encompasses all columns. Finally, the averaged node representation is concatenated with the initial prediction $f_{\theta}(\mathbf{x}_{i}^{t})$ to yield the final output temperature $T_{i}$. This GNN-based uncertainty calibration not only captures shifts in individual columns but also sensitively detects correlation shifts occurring simultaneously across different columns, which are common in the tabular domain. A more detailed explanation of the architecture and training of the shift-aware uncertainty calibrator can be found in Section A.
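
The data flow described above (message passing over columns, mean pooling, concatenation with the logits, positive temperature output) can be sketched as follows. This is a toy stand-in, not the architecture from Section A: the single mean-aggregation layer, the softplus output, the parameter names, and the random untrained weights are all assumptions made for illustration.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); used here to keep temperatures positive.
    return np.logaddexp(0.0, x)

def calibrator_sketch(logits, s, params):
    """Toy g_phi: one message-passing layer, mean pooling, linear head.

    logits: (N, C) source-model logits for the batch.
    s:      (D, N) shift trend, one node feature vector per column.
    params: dict of weights with hypothetical names and shapes.
    Returns per-sample temperatures of shape (N,).
    """
    logits = np.asarray(logits, dtype=float)
    s = np.asarray(s, dtype=float)
    num_columns = s.shape[0]
    # Row-normalised all-ones adjacency: each node receives the mean of all nodes.
    adjacency = np.ones((num_columns, num_columns)) / num_columns
    messages = adjacency @ s                                       # (D, N)
    hidden = np.maximum(
        0.0, messages @ params["w_node"] + s @ params["w_self"]
    )                                                              # (D, H)
    pooled = hidden.mean(axis=0)                                   # (H,)
    # Concatenate the pooled graph feature with each sample's logits.
    tiled = np.broadcast_to(pooled, (logits.shape[0], pooled.size))
    features = np.concatenate([tiled, logits], axis=1)             # (N, H + C)
    return softplus(features @ params["w_out"] + params["b_out"])  # (N,)

rng = np.random.default_rng(0)
N, D, C, H = 4, 3, 2, 8
params = {
    "w_node": rng.normal(scale=0.1, size=(N, H)),
    "w_self": rng.normal(scale=0.1, size=(N, H)),
    "w_out": rng.normal(scale=0.1, size=(H + C,)),
    "b_out": 0.0,
}
temps = calibrator_sketch(rng.normal(size=(N, C)), rng.normal(size=(D, N)), params)
```

Because each node feature lives in $\mathbb{R}^{N}$, the first-layer weights in this sketch are tied to the batch size; how the actual calibrator handles this is left to Section A.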

3.3 Label Distribution Handler

This section introduces a label distribution handler designed to accurately estimate the target label distribution for the current test batch and adjust the model’s output predictions accordingly. This approach is empirically justified by our observation that the marginal distribution of model predictions $p_{t}(y)$ in the target domain tends to be biased towards the source label distribution $p_{s}(y)$, as discussed in Section 2.2 and illustrated in Figure 2.

A straightforward solution to correct this bias is to multiply the prediction by $p_{t}(y)/p_{s}(y)$ to align the marginal label distribution (Berthelot et al. 2020). Specifically, given $p_{t}(y|\mathbf{x}_{i}^{t})=\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t}))$, the adjusted prediction would be:

$$\text{norm}\big(p_{t}(y|\mathbf{x}_{i}^{t})\,p_{t}(y)/p_{s}(y)\big) \tag{2}$$

where $\text{norm}(z)_{i}=z_{i}/\sum_{i^{\prime}}z_{i^{\prime}}$ normalizes the unnormalized probability. However, we find two major issues: 1) $p_{t}(y|\mathbf{x}_{i}^{t})$ is often poorly calibrated and 2) overconfident yet incorrect predictions significantly hinder the accurate estimation of the target label distribution $p_{t}(y)$ (Section 2.2).
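
To make the reweighting in Equation 2 concrete, here is a minimal NumPy sketch. The target label distribution is passed in as an argument because its estimation is a separate step; the function name `align_to_target_prior` and the example numbers are illustrative assumptions.

```python
import numpy as np

def norm(z):
    """Rescale each row so that it sums to one."""
    z = np.asarray(z, dtype=float)
    return z / z.sum(axis=-1, keepdims=True)

def align_to_target_prior(p_cond, p_target, p_source):
    """Equation 2: reweight p_t(y|x) by p_t(y) / p_s(y), then renormalise.

    p_cond:   (N, C) predicted class probabilities.
    p_target: (C,) target label distribution (assumed given here).
    p_source: (C,) source label distribution.
    """
    ratio = np.asarray(p_target, dtype=float) / np.asarray(p_source, dtype=float)
    return norm(np.asarray(p_cond, dtype=float) * ratio)

# Source labels are skewed 80/20, while the target batch is balanced.
adjusted = align_to_target_prior([[0.6, 0.4]], [0.5, 0.5], [0.8, 0.2])
print(adjusted)  # approximately [[0.2727 0.7273]]
```

In this example the prediction flips from the first class to the second once the source skew is divided out, which is the kind of correction the text is after.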

To tackle these challenges, we propose a simple yet effective estimator $\bar{p}_{i}(y|\mathbf{x}_{i}^{t})$ defined as follows:

$$\bar{p}_{i}(y|\mathbf{x}_{i}^{t})=\frac{\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})+\text{norm}\big(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})\,p_{t}(y)/p_{s}(y)\big)}{2}. \tag{3}$$

The key differences between the original Equation 2 and our Equation 3 are: 1) we use the calibrated prediction $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ instead of the original prediction $p_{t}(y|\mathbf{x}_{i}^{t})$ to enhance uncertainty quantification, and 2) we combine the calibrated estimate $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ with the distributionally aligned prediction $\text{norm}(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})\,p_{t}(y)/p_{s}(y))$ for more robust estimation.
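
Equation 3 amounts to averaging the calibrated prediction with its prior-aligned counterpart. A small sketch, under the assumption that the calibrated probabilities and the target label distribution are already available, might look like this; `averaged_estimate` is a hypothetical name.

```python
import numpy as np

def norm(z):
    """Rescale each row so that it sums to one."""
    z = np.asarray(z, dtype=float)
    return z / z.sum(axis=-1, keepdims=True)

def averaged_estimate(p_calibrated, p_target, p_source):
    """Equation 3: mean of the calibrated prediction and its aligned version.

    p_calibrated: (N, C) calibrated probabilities (p-tilde in the text).
    p_target:     (C,) target label distribution (assumed given here).
    p_source:     (C,) source label distribution.
    """
    p_calibrated = np.asarray(p_calibrated, dtype=float)
    ratio = np.asarray(p_target, dtype=float) / np.asarray(p_source, dtype=float)
    aligned = norm(p_calibrated * ratio)
    return 0.5 * (p_calibrated + aligned)

estimate = averaged_estimate([[0.6, 0.4]], [0.5, 0.5], [0.8, 0.2])
```

Since both terms are valid probability vectors, their average is one as well, so no further normalization is needed.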

Given the already-known source label distribution $p_{s}(y)$, we now explain the step-by-step process for estimating $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ and $p_{t}(y)$. $p_{t}(y|\mathbf{x}_{i}^{t})$ is calibrated into $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ through a two-stage uncertainty calibration process. Specifically, for a current test batch $\{\mathbf{x}_{i}^{t}\}_{i=1}^{N}$, we calculate the shift trend $\mathbf{s}^{t}$ using Equation 1 and get the per-sample temperature $T_{i}=g_{\phi}(f_{\theta}(\mathbf{x}_{i}^{t}),\mathbf{s}^{t})$ using the shift-aware uncertainty calibrator $g_{\phi}$ to capture overall distribution shifts, as well as correlation and individual column shifts within the current batch. Here, we define the uncertainty $\delta_{i}$ of $f_{\theta}(\mathbf{x}_{i}^{t})$ as the reciprocal of the margin of the calibrated probability distribution $\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i})$. We then measure the quantiles for each instance $\mathbf{x}_{i}$ using $\delta_{i}$ within the current batch and recalibrate the original probability with $\tilde{T}_{i}$, resulting in $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})=\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t})/\tilde{T}_{i})$. This process calibrates predictions by leveraging relative uncertainty within the batch. Our temperature $\tilde{T}_{i}$ is defined as:

$$\tilde{T}_{i}=\begin{cases}T&\text{if }\delta_{i}\geq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{high}}\big)\\ 1/T&\text{if }\delta_{i}\leq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{low}}\big)\\ 1&\text{otherwise},\end{cases} \tag{4}$$

where Q(X,q)𝑄𝑋𝑞Q(X,q)italic_Q ( italic_X , italic_q ) is a quantile function which gives the value corresponding to the lower q𝑞qitalic_q quantile in X𝑋Xitalic_X, T=1.5maxjps(y)j/minjps(y)j𝑇1.5subscript𝑗subscript𝑝𝑠subscript𝑦𝑗subscript𝑗subscript𝑝𝑠subscript𝑦𝑗T=1.5\max_{j}p_{s}(y)_{j}/\min_{j}p_{s}(y)_{j}italic_T = 1.5 roman_max start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_y ) start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT / roman_min start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_y ) start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is a temperature, and qlowsubscript𝑞lowq_{\text{low}}italic_q start_POSTSUBSCRIPT low end_POSTSUBSCRIPT and qhighsubscript𝑞highq_{\text{high}}italic_q start_POSTSUBSCRIPT high end_POSTSUBSCRIPT represent the low and high uncertainty quantiles, respectively. This two-stage uncertainty calibration comprehensively evaluates the current batch and estimates relative uncertainty using 𝐬tsuperscript𝐬𝑡\mathbf{s}^{t}bold_s start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, gϕsubscript𝑔italic-ϕg_{\phi}italic_g start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT, and T~isubscript~𝑇𝑖\tilde{T}_{i}over~ start_ARG italic_T end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.
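The quantile-based recalibration in Equation 4 can be illustrated with a minimal sketch in plain Python. Several details here are assumptions rather than statements of the original implementation: the margin is taken as the gap between the two largest calibrated probabilities, the quantile uses linear interpolation, and the per-sample temperatures passed in (all 1.0 in the toy batch) merely stand in for the output of $g_\phi$. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def quantile(values, q):
    """Lower q-quantile with linear interpolation (assumed convention)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def uncertainty(logits, temperature):
    """Reciprocal of the top-1 minus top-2 calibrated probability (assumed margin)."""
    probs = sorted(softmax(logits, temperature), reverse=True)
    return 1.0 / max(probs[0] - probs[1], 1e-12)


def recalibration_temperatures(batch_logits, per_sample_T, source_prior,
                               q_low=0.25, q_high=0.75):
    """Assign T, 1/T, or 1 to each sample following Equation 4."""
    T = 1.5 * max(source_prior) / min(source_prior)
    deltas = [uncertainty(z, t) for z, t in zip(batch_logits, per_sample_T)]
    high = quantile(deltas, q_high)
    low = quantile(deltas, q_low)
    temps = []
    for delta in deltas:
        if delta >= high:      # most uncertain samples: soften the prediction
            temps.append(T)
        elif delta <= low:     # most confident samples: sharpen the prediction
            temps.append(1.0 / T)
        else:
            temps.append(1.0)
    return temps


# Toy binary batch; per-sample temperatures of 1.0 stand in for g_phi's output.
batch_logits = [[4.0, 0.0], [0.2, 0.0], [1.0, 0.0], [2.0, 0.0]]
temps = recalibration_temperatures(batch_logits, [1.0] * 4, [0.75, 0.25])
calibrated = [softmax(z, t) for z, t in zip(batch_logits, temps)]
```

In this toy batch, the least confident sample (logits 0.2 and 0.0) receives temperature $T$ and the most confident one receives $1/T$, so only the extremes of the batch are recalibrated.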

Meanwhile, the target label distribution $p_t(y)$ is estimated as follows:

$$p_t(y) = (1-\alpha) \cdot \frac{1}{N} \sum_{i=1}^{N} p_t^{\text{de}}(y|\mathbf{x}_i^t) + \alpha \cdot p_t^{\text{oe}}(y), \tag{5}$$

where $p_t^{\text{de}}(y|\mathbf{x}_i^t) = \text{norm}\big(p_t(y|\mathbf{x}_i^t)/p_s(y)\big)$ is a debiased target label estimator that departs from $p_s(y)$, and $p_t^{\text{oe}}(y)$ is an online target label estimator, initialized as uniform distribution and updated as:

$$p_t^{\text{oe}}(y) = (1-\alpha) \cdot \frac{1}{N} \sum_{i=1}^{N} \bar{p}_t(y|\mathbf{x}_i^t) + \alpha \cdot p_t^{\text{oe}}(y) \tag{6}$$

from the current batch to the next, with a smoothing factor $\alpha$. This online target label estimator leverages label locality between nearby test batches, making it effective for accurately estimating the next batch’s target label distribution. A more detailed explanation of AdapTable is provided in Section A.
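As a rough illustration of Equations 5 and 6, the sketch below computes the blended target label estimate for one batch and then advances the online estimator. It is a minimal sketch under stated assumptions: the function names are hypothetical, the probabilities passed to `update_online_prior` stand in for $\bar{p}_t(y|\mathbf{x}_i^t)$, and the source prior is assumed to be strictly positive.

```python
def normalize(vec):
    """Rescale a non-negative vector so that it sums to one."""
    total = sum(vec)
    return [v / total for v in vec]


def debias(probs, source_prior):
    """Debiased estimator p^de: divide by the source prior, then renormalize."""
    return normalize([p / s for p, s in zip(probs, source_prior)])


def mean_distribution(rows):
    """Class-wise average of a batch of probability vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def estimate_target_prior(batch_probs, source_prior, online_prior, alpha=0.1):
    """Equation 5: blend the debiased batch mean with the online estimate."""
    debiased_mean = mean_distribution(
        [debias(p, source_prior) for p in batch_probs]
    )
    return [(1 - alpha) * d + alpha * o
            for d, o in zip(debiased_mean, online_prior)]


def update_online_prior(adapted_probs, online_prior, alpha=0.1):
    """Equation 6: moving-average update carried over to the next batch."""
    batch_mean = mean_distribution(adapted_probs)
    return [(1 - alpha) * b + alpha * o
            for b, o in zip(batch_mean, online_prior)]


# Toy two-class batch with an imbalanced source prior.
source_prior = [0.75, 0.25]
online_prior = [0.5, 0.5]  # uniform initialization
batch_probs = [[0.9, 0.1], [0.6, 0.4]]

target_prior = estimate_target_prior(batch_probs, source_prior, online_prior)
next_online_prior = update_online_prior(batch_probs, online_prior)
```

Dividing by the source prior before averaging pulls the estimate away from the majority class of the source domain, which is why the second class receives more mass in `target_prior` than in the raw batch mean.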

Table 2: The average balanced accuracy (%) and macro F1 score (%) with their standard errors for both supervised models and TTA baselines are reported across six datasets including natural distribution shifts within the TableShift (Gardner, Popovic, and Schmidt 2023) benchmark. The results are averaged over three random repetitions.

| Method | HELOC bAcc. | HELOC F1 | Voting bAcc. | Voting F1 | Hospital Readmission bAcc. | Hospital Readmission F1 | ICU Mortality bAcc. | ICU Mortality F1 | Childhood Lead bAcc. | Childhood Lead F1 | Diabetes bAcc. | Diabetes F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $k$-NN | 62.0 ± 0.0 | 40.3 ± 0.0 | 76.9 ± 0.0 | 71.1 ± 0.0 | 57.7 ± 0.0 | 56.9 ± 0.0 | 81.5 ± 0.3 | 47.6 ± 0.0 | 57.6 ± 0.1 | 56.9 ± 0.0 | 67.9 ± 0.3 | 53.3 ± 0.1 |
| LogReg | 63.5 ± 0.0 | 44.2 ± 0.0 | 80.2 ± 0.0 | 76.2 ± 0.0 | 61.4 ± 0.0 | 58.9 ± 0.0 | 61.6 ± 0.0 | 62.2 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 71.0 ± 0.0 | 55.4 ± 0.0 |
| RandomForest | 58.2 ± 7.6 | 32.2 ± 1.5 | 81.7 ± 0.1 | 68.4 ± 0.7 | 64.4 ± 0.5 | 42.1 ± 1.2 | 85.2 ± 0.4 | 52.0 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 76.5 ± 0.1 | 46.9 ± 0.1 |
| XGBoost | 57.6 ± 7.2 | 39.9 ± 4.9 | 80.5 ± 0.2 | 75.8 ± 0.4 | 63.1 ± 0.1 | 61.3 ± 0.4 | 79.9 ± 0.1 | 64.3 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 71.5 ± 0.1 | 56.2 ± 0.1 |
| CatBoost | 65.4 ± 0.0 | 51.7 ± 0.0 | 80.4 ± 0.0 | 76.8 ± 0.0 | 63.4 ± 0.0 | 61.8 ± 0.5 | 81.4 ± 0.0 | 59.8 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 65.0 ± 0.0 | 59.3 ± 0.0 |
| + AdapTable | 65.5 ± 0.0 | 65.4 ± 0.0 | 79.6 ± 0.0 | 78.6 ± 0.0 | 65.4 ± 0.0 | 62.5 ± 0.3 | 82.6 ± 0.0 | 64.8 ± 0.3 | 62.8 ± 0.4 | 61.7 ± 0.3 | 74.2 ± 0.0 | 62.5 ± 0.3 |
| Source | 53.2 ± 1.5 | 38.2 ± 3.5 | 76.5 ± 0.5 | 77.3 ± 0.4 | 61.1 ± 0.1 | 60.2 ± 0.3 | 56.3 ± 0.0 | 58.1 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.2 ± 0.0 | 55.5 ± 0.0 |
| PL | 51.8 ± 1.0 | 34.9 ± 2.3 | 75.6 ± 0.5 | 76.6 ± 0.5 | 60.5 ± 0.1 | 58.9 ± 0.3 | 56.3 ± 0.0 | 58.0 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.1 ± 0.1 | 55.3 ± 0.0 |
| TTT++ | 53.2 ± 1.5 | 38.2 ± 3.6 | 76.8 ± 0.5 | 77.6 ± 0.2 | 61.1 ± 0.1 | 60.2 ± 0.3 | 56.6 ± 0.5 | 58.5 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.4 ± 0.0 | 55.7 ± 0.0 |
| TENT | 51.2 ± 1.2 | 33.2 ± 2.6 | 74.0 ± 0.6 | 74.9 ± 0.6 | 60.2 ± 0.1 | 58.3 ± 0.3 | 55.1 ± 0.1 | 56.3 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.0 ± 0.0 | 55.0 ± 0.0 |
| EATA | 53.2 ± 1.5 | 38.2 ± 3.6 | 76.5 ± 0.5 | 77.3 ± 0.4 | 61.1 ± 0.1 | 60.2 ± 0.4 | 56.3 ± 0.0 | 58.1 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.2 ± 0.0 | 55.5 ± 0.0 |
| SAR | 50.0 ± 0.0 | 30.1 ± 0.0 | 62.0 ± 1.2 | 59.4 ± 1.6 | 57.1 ± 1.1 | 51.3 ± 2.2 | 51.1 ± 0.1 | 49.1 ± 0.2 | 50.0 ± 0.0 | 47.9 ± 0.0 | 53.4 ± 0.0 | 52.2 ± 0.0 |
| LAME | 50.0 ± 0.0 | 30.1 ± 0.0 | 54.6 ± 0.5 | 46.8 ± 1.0 | 54.9 ± 0.5 | 46.9 ± 1.0 | 50.0 ± 0.0 | 46.7 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 54.8 ± 0.1 | 54.8 ± 0.2 |
| AdapTable | 65.8 ± 0.6 | 64.5 ± 0.3 | 78.4 ± 0.3 | 78.6 ± 0.0 | 61.7 ± 0.0 | 61.7 ± 0.0 | 65.9 ± 0.1 | 65.4 ± 0.1 | 69.2 ± 0.1 | 60.9 ± 0.3 | 70.9 ± 0.1 | 68.3 ± 0.1 |

3.4 Theoretical Insights

Theorem 3.1.

Let $\hat{Y}|X$ and $\hat{Y}_o|X$ be defined as follows:

$$\hat{Y}|X = \Big\{\operatorname*{arg\,max}_{j\in\mathcal{Y}} f_\theta(\mathbf{x})_j \,\Big|\, \mathbf{x}\in X\Big\}, \tag{7}$$
$$\hat{Y}_o|X = \Big\{\operatorname*{arg\,max}_{j\in\mathcal{Y}} \big(f_\theta(\mathbf{x})_j + \log p_t^{oe}(y)_j\big) \,\Big|\, \mathbf{x}\in X\Big\}. \tag{8}$$

Given the error $\epsilon(\hat{Y}|X) = \mathbb{P}(\hat{Y}\neq Y|X)$, with true labels $Y$ of inputs $X$, the error gap $|\epsilon(\hat{Y}|X_s) - \epsilon(\hat{Y}_o|X_t)|$ is upper bounded by

$$K_1 \Big\|1 - \frac{p_t^{oe}(y)}{p_t(y)}\Big\|_1 BSE(\hat{Y}) + K_2 \Delta_{CE}(\hat{Y}), \tag{9}$$

where $K_1$ and $K_2$ are constants related to $p_t(y)$, and $p_s(y)$, respectively.

Theorem 3.1 extends Theorem 2.3 in ODS (Zhou et al. 2023) to cases where the source label distribution is not uniform. It decomposes the error gap between the original model on the source domain and the model adapted with $p_t^{oe}(y)$ on the target domain into several components. These components include $\|1 - p_t^{oe}(y)/p_t(y)\|_1$, which is the error of the estimated target label distribution; $BSE(\hat{Y})$, which reflects the model’s performance on the source domain; and $\Delta_{CE}(\hat{Y})$, which measures the generalization of feature representations adapted by the TTA algorithm. Overall, Theorem 3.1 underscores the importance of tracking label distributions and efficiently adapting models to handle label distribution shifts. The detailed explanation and proof of Theorem 3.1 can be found in Section B.
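To make Equations 7 and 8 concrete, the following minimal sketch contrasts the plain arg-max prediction with the prior-adjusted one and evaluates the label-distribution error term $\|1 - p_t^{oe}(y)/p_t(y)\|_1$ that appears in Equation 9. The function names and toy numbers are illustrative assumptions, the sketch works on one sample at a time rather than on a whole input set $X$, and both priors are assumed to be strictly positive.

```python
import math


def predict(logits):
    """Equation 7 for one sample: index of the largest logit."""
    return max(range(len(logits)), key=lambda j: logits[j])


def predict_with_prior(logits, online_prior):
    """Equation 8 for one sample: arg-max of logit plus log online prior."""
    scores = [z + math.log(p) for z, p in zip(logits, online_prior)]
    return max(range(len(scores)), key=lambda j: scores[j])


def prior_estimation_error(online_prior, target_prior):
    """L1 norm of 1 - p_oe(y) / p_t(y), the first factor in Equation 9."""
    return sum(abs(1.0 - o / t) for o, t in zip(online_prior, target_prior))


# A borderline sample: class 0 wins on logits alone, but the estimated
# target prior favours class 1 strongly enough to flip the decision.
logits = [1.0, 0.8]
online_prior = [0.2, 0.8]

plain_label = predict(logits)
adjusted_label = predict_with_prior(logits, online_prior)

# The error term vanishes when the online estimate matches the true prior.
perfect = prior_estimation_error([0.2, 0.8], [0.2, 0.8])
mismatched = prior_estimation_error([0.5, 0.5], [0.25, 0.75])
```

The closer the online estimate tracks the true target label distribution, the smaller the first term of the bound becomes, which is the motivation for maintaining $p_t^{oe}(y)$ across batches.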

4 Experiments

This section validates AdapTable’s effectiveness. We begin with an overview of our experimental setup in Section 4.1 and then address key research questions:

  • Is AdapTable effective across various tabular distribution shifts, including natural shifts and common corruptions across different tabular models? (Section 4.2)

  • Do AdapTable’s components contribute to overall performance improvements, and do they function as intended? (Section 4.3)

  • Does AdapTable demonstrate strengths in computational efficiency and hyperparameter sensitivity, which are crucial for test-time adaptation? (Section 4.4)

4.1 Experimental Setup

Datasets.

We evaluate AdapTable on six diverse datasets—HELOC, Voting, Hospital Readmission, ICU Mortality, Childhood Lead, and Diabetes—within the tabular distribution shift benchmark (Gardner, Popovic, and Schmidt 2023), covering healthcare, finance, and politics with both numerical and categorical features. Additionally, we verify its robustness against six common corruptions—Gaussian, Uniform, Random Drop, Column Drop, Numerical, and Categorical—to ensure its efficacy beyond label distribution shifts. More details of these shifts are in Section C.

Model architectures.

To verify the proposed method under various tabular model architectures, we mainly use MLP, a widely used tabular learning architecture. Additionally, we validate AdapTable on CatBoost (Dorogush, Ershov, and Gulin 2017) and three other representative deep tabular learning models—AutoInt (Song et al. 2019), ResNet (Gorishniy et al. 2021), and FT-Transformer (Gorishniy et al. 2021).

Baselines.

We compare AdapTable with six TTA baselines: PL (Lee 2013), TTT++ (Liu et al. 2021), TENT (Wang et al. 2021a), EATA (Niu et al. 2022), SAR (Niu et al. 2023), and LAME (Boudiaf et al. 2022). TabLog (Ren et al. 2024) is excluded due to its architectural constraint on logical neural networks (Riegel et al. 2020). We also provide performance references from classical machine learning models: $k$-nearest neighbors ($k$-NN), logistic regression (LogReg), random forest (RandomForest), XGBoost (Chen and Guestrin 2016), and CatBoost (Dorogush, Ershov, and Gulin 2017).

Evaluation metrics.

As shown in Figure 2 and Section F, tabular data often exhibit extreme class imbalance. Since accuracy may not be effective in these cases, we use macro F1 score (F1) and balanced accuracy (bAcc.) as the primary evaluation metrics.
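The gap between plain accuracy and the two imbalance-aware metrics can be seen in a small worked example. The sketch below is a plain-Python illustration using the standard definitions of balanced accuracy (mean per-class recall) and macro F1 (unweighted mean of per-class F1); the helper names and the toy labels are assumptions made for illustration only.

```python
def per_class_counts(y_true, y_pred, classes):
    """True positives, false positives, and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[pred]["fp"] += 1
            counts[truth]["fn"] += 1
    return counts


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes that occur in y_true."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, classes)
    recalls = [
        counts[c]["tp"] / (counts[c]["tp"] + counts[c]["fn"])
        for c in sorted(set(y_true))
    ]
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (0 when a class has no hits)."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, classes)
    scores = []
    for c in classes:
        tp, fp, fn = counts[c]["tp"], counts[c]["fp"], counts[c]["fn"]
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Nine negatives, one positive; the model always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bacc = balanced_accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here a majority-class predictor reaches 90% accuracy while its balanced accuracy is 50% and its macro F1 is below 50%, which is why accuracy alone can be misleading under extreme class imbalance.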

Implementation details.

For all experiments, we use a fixed batch size of 64, a common setting in TTA baselines (Schneider et al. 2020; Wang et al. 2021a). The smoothing factor $\alpha$, low uncertainty quantile $q_{\text{low}}$, and high uncertainty quantile $q_{\text{high}}$ are set to 0.1, 0.25, and 0.75, respectively. In all tables, the best and second-best results are highlighted in bold and underline, respectively. Further implementation and hyperparameter details are provided in Section E.

Figure 4: The average macro F1 score for AdapTable and TTA baselines is reported under six common corruptions using MLP across three datasets within the TableShift (Gardner, Popovic, and Schmidt 2023) benchmark.

4.2 Main Results

Result on natural distribution shifts.

Table 2 presents results on natural distribution shifts. Existing TTA methods, although successful in computer vision, struggle in the tabular domain, often failing to outperform the source model or offering only limited gains. In contrast, AdapTable achieves state-of-the-art results among TTA methods across all datasets, with improvements of up to 26% in macro F1 on the HELOC dataset (38.2% for the source model versus 64.5% for AdapTable). Since AdapTable does not rely on model parameter tuning, it can be easily applied to classical machine learning models; when integrated with CatBoost, AdapTable improves the macro F1 score on all six datasets, showcasing its versatility. Other baselines cannot be integrated in the same way because they require model parameter updates.

Result on common corruptions.

We further evaluate the efficacy of AdapTable across six types of common corruptions in real-world applications by applying them to the test sets of three datasets (HELOC, Voting, and Childhood Lead). As shown in Figure 4, prior TTA methods fail considerably, showing only marginal gains over the unadapted source model across all corruption types. Notably, these methods have demonstrated significant improvements under common corruptions in vision data, which highlights the difference between corruptions in the tabular domain and their counterparts in the vision domain. Meanwhile, AdapTable shows substantial improvements across all corruption types, with gains of more than 10% in every scenario, demonstrating its robustness.

Figure 5: The average macro F1 score of AdapTable and TTA baselines across three datasets (HELOC, Voting, Childhood Lead) using various backbone architectures.

Result across diverse model architectures.

In Figure 5, we report AdapTable’s effectiveness across three mainstream tabular learning architectures—AutoInt (Song et al. 2019), ResNet (Gorishniy et al. 2021), and FT-Transformer (Gorishniy et al. 2021). We report the average macro F1 score across three datasets—HELOC, Voting, and Childhood Lead. None of the baselines outperform the original source model, with LAME (Boudiaf et al. 2022) even showing significant performance drops. In contrast, AdapTable consistently achieves significant improvements across all architectures, highlighting its robustness and versatility.

Figure 6: Ablation study on the shift-aware uncertainty calibrator using MLP for the HELOC dataset. (a) and (b) show reliability diagrams before and after calibration, while (c) depicts the average temperature relative to the maximum mean discrepancy (MMD) between the training set and the sampled test sets.
Table 3: Ablation study comparing the shift-aware uncertainty calibrator with classical methods—Platt scaling (PS) and isotonic regression (IR). The results are averaged over three random repetitions.

| Method | HELOC | Voting | Hospital Readmission |
|---|---|---|---|
| Source | 38.2 ± 3.5 | 77.3 ± 0.4 | 60.2 ± 0.3 |
| PS | 61.6 ± 1.3 | 73.3 ± 0.2 | 59.4 ± 0.3 |
| IR | 61.3 ± 1.7 | 74.3 ± 0.2 | 58.0 ± 0.4 |
| AdapTable | 64.5 ± 0.6 | 78.6 ± 0.0 | 61.7 ± 0.0 |

4.3 Ablation Study

Shift-aware uncertainty calibrator.

First, we validate the shift-aware uncertainty calibrator described in Section 3.2. Figure 6 (a) and (b) present reliability diagrams before and after applying calibration. The results demonstrate that our calibrator significantly reduces both overconfidence and underconfidence. Next, we assess the shift-awareness of our calibrator in Figure 6 (c), where we plot the average temperature against the maximum mean discrepancy (MMD) with the training data. As expected, greater shifts from the training data correspond to higher temperatures, indicating increased uncertainty. The strong positive correlation between MMD and average temperature confirms the calibrator’s effectiveness in capturing prediction uncertainty under distribution shifts in tabular data. Finally, we compare our shift-aware uncertainty calibrator with classical methods, namely Platt scaling and isotonic regression, as shown in Table 3. The results indicate that prior methods exhibit inconsistent performance across different datasets. For example, Platt scaling improves performance on the HELOC dataset but degrades it on Voting and Hospital Readmission. In contrast, our shift-aware uncertainty calibrator consistently outperforms these classical methods, which do not account for domain shift during calibration.
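For readers unfamiliar with the shift measure on the x-axis of Figure 6 (c), the sketch below shows one common way to estimate the squared MMD between two samples. It is only an illustration under stated assumptions: the RBF kernel, its bandwidth `gamma`, and the biased (V-statistic) estimator are choices made here, since this section does not specify the kernel or estimator used in the experiments, and the function names are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2) between two rows."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two samples of feature rows."""
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2.0 * mean_kernel(source, target, gamma))


def shift_rows(rows, offset):
    """Add a constant offset to every feature, simulating a covariate shift."""
    return [[v + offset for v in row] for row in rows]


# Toy training sample and two test samples with increasing shift.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
near = shift_rows(train, 1.0)
far = shift_rows(train, 3.0)

mmd_same = mmd_squared(train, train)
mmd_near = mmd_squared(train, near)
mmd_far = mmd_squared(train, far)
```

In this toy setting the estimate is zero for identical samples and grows with the size of the offset, which matches the intended reading of the x-axis as a measure of how far a sampled test set lies from the training data.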

Figure 7: Jensen-Shannon (JS) Divergence of the estimated target label distribution before and after applying the label distribution handler using MLP on three datasets. The x-axis indicates the online batch index, and the y-axis shows the per-batch JS divergence from the ground truth labels.

Label distribution handler.

Here, we validate the behavior and efficacy of the label distribution handler. First, Figure 7 compares the Jensen–Shannon (JS) divergence between the true and estimated label distributions for each online batch. The results show that our handler significantly improves the accuracy of label distribution estimation, yielding low JS divergence across all datasets. Next, we assess the robustness of our label distribution handler with respect to the test batch distribution. Namely, we evaluate under 1) a severe class imbalance scenario with an imbalance ratio of 10, and 2) a class-wise temporal correlation scenario, where class labels exhibit strong temporal locality. Results are shown in Table 4. AdapTable shows performance gains in all scenarios, with improvements of up to 27% and 19% in the class imbalance and temporal correlation scenarios on HELOC and Childhood Lead, respectively. More experimental details are provided in Section C.
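The per-batch comparison in Figure 7 relies on the JS divergence between an estimated label distribution and the empirical distribution of the ground-truth labels in that batch. A minimal sketch of this computation is given below, assuming natural logarithms (so the divergence is bounded by $\ln 2$); the helper names and the toy batch are illustrative assumptions.

```python
import math


def label_histogram(labels, num_classes):
    """Empirical label distribution of a batch of integer class labels."""
    counts = [0] * num_classes
    for label in labels:
        counts[label] += 1
    return [c / len(labels) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Ground-truth labels of one toy batch and two candidate estimates.
true_dist = label_histogram([0, 0, 0, 1], num_classes=2)
good_estimate = [0.7, 0.3]
poor_estimate = [0.3, 0.7]

js_good = js_divergence(true_dist, good_estimate)
js_poor = js_divergence(true_dist, poor_estimate)
```

Because the mixture is strictly positive wherever either input is, the divergence stays finite even when a class is absent from the batch, which is convenient for small online batches.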

Table 4: The average macro F1 score (%) with standard errors for TTA baselines is reported using MLP across three datasets with 1) class imbalance and 2) temporal correlation from the TableShift benchmark. The results are averaged over three random repetitions.

| Method | Class Imbalance: HELOC | Class Imbalance: Voting | Class Imbalance: Childhood Lead | Temporal Correlation: HELOC | Temporal Correlation: Voting | Temporal Correlation: Childhood Lead |
|---|---|---|---|---|---|---|
| Source | 32.5 ± 3.5 | 52.3 ± 4.9 | 36.7 ± 6.5 | 31.6 ± 0.3 | 62.2 ± 0.1 | 35.1 ± 0.2 |
| PL | 32.0 ± 3.6 | 52.1 ± 4.9 | 36.7 ± 6.5 | 30.9 ± 0.2 | 54.9 ± 0.1 | 35.1 ± 0.2 |
| TENT | 32.5 ± 3.5 | 52.3 ± 4.9 | 36.7 ± 6.5 | 31.6 ± 0.3 | 55.7 ± 0.1 | 35.1 ± 0.2 |
| EATA | 32.5 ± 3.5 | 52.3 ± 4.9 | 36.7 ± 6.5 | 31.6 ± 0.3 | 55.7 ± 0.1 | 35.1 ± 0.2 |
| SAR | 31.8 ± 3.5 | 57.1 ± 5.3 | 36.7 ± 6.5 | 32.0 ± 0.2 | 54.4 ± 0.5 | 35.1 ± 0.2 |
| LAME | 29.9 ± 3.5 | 58.7 ± 4.0 | 36.7 ± 6.5 | 29.0 ± 0.1 | 38.0 ± 0.4 | 35.1 ± 0.2 |
| AdapTable | 59.7 ± 0.8 | 62.0 ± 4.6 | 63.9 ± 1.0 | 56.1 ± 0.3 | 64.5 ± 0.0 | 64.8 ± 0.3 |

4.4 Further Analysis

Computational efficiency.

The leftmost part of Figure 8 compares the computational efficiency of AdapTable with TTA baselines. On the HELOC dataset, AdapTable’s total elapsed time is approximately 1.54 seconds, translating to about 0.0002 seconds per sample, which is highly desirable. Moreover, AdapTable achieves an optimal efficiency-efficacy trade-off.

Hyperparameter sensitivity.

Figure 8 further analyzes the hyperparameter sensitivity of AdapTable on the Childhood Lead dataset. As shown in the figure, AdapTable remains highly insensitive to changes in the smoothing factor $\alpha$, low uncertainty quantile $q_{\text{low}}$, and high uncertainty quantile $q_{\text{high}}$.

Figure 8: Computational efficiency (leftmost figure) and hyperparameter sensitivity analysis of AdapTable (three figures on the right) using MLP on the HELOC and Childhood Lead datasets, respectively.

5 Related Work

Machine learning for tabular data.

The distinct nature of tabular data reduces the effectiveness of deep neural networks, making gradient-boosted decision trees (Chen and Guestrin 2016; Dorogush, Ershov, and Gulin 2017) more suitable. However, research continues to develop deep learning models tailored for tabular data (Murtagh 1991; Song et al. 2019; Arik and Pfister 2021; Gorishniy et al. 2021), including recent efforts involving large language models (Fang et al. 2024; Hegselmann et al. 2023; Hollmann et al. 2023; Dinh et al. 2022) that leverage textual prior knowledge. Notably, our method is architecture-agnostic and can be applied to any model.

Distribution shifts in the tabular domain.

Recently, distribution shift benchmarks for tabular data have been introduced (Liu et al. 2023; Gardner, Popovic, and Schmidt 2023). WhyShift (Liu et al. 2023) reveals that concept shifts ($Y|X$-shifts) are more prevalent and detrimental than covariate shifts ($X$-shifts). TableShift (Gardner, Popovic, and Schmidt 2023) offers a benchmark with 15 classification tasks, highlighting a strong correlation between shift gaps and label distribution shifts ($Y$-shifts), which supports the validity of our method.

Test-time adaptation.

Over the past years, test-time adaptation (TTA) methods have been proposed across various domains, such as computer vision (Wang et al. 2021a; Gong et al. 2022; Niu et al. 2023; Shim, Kim, and Yang 2024), natural language processing (Shi et al. 2024; Liang, He, and Tan 2024), and speech processing (Kim et al. 2023). These methods adapt pre-trained models to unlabeled target domains without requiring access to source data, making them well-suited for sensitive tabular data. TabLog (Ren et al. 2024) is a recent TTA method specifically for tabular data, but it has architectural constraints and lacks a comprehensive analysis of distribution shifts. This underscores the need for model-agnostic TTA methods with a deeper understanding of tabular data, which we address in this paper.

6 Conclusion

In this paper, we have introduced AdapTable, a test-time adaptation framework tailored for tabular data. AdapTable overcomes the limitations of previous methods, which fail to address label distribution shifts and lack versatility across architectures. Our approach combines a shift-aware uncertainty calibrator, which enhances calibration by modeling column shifts, with a label distribution handler, which adjusts the output distribution based on real-time estimates of the current batch’s label distribution. Extensive experiments show that AdapTable achieves state-of-the-art performance across various datasets and architectures, effectively managing both natural distribution shifts and common corruptions.

References

  • Arik and Pfister (2021) Arik, S. Ö.; and Pfister, T. 2021. TabNet: Attentive interpretable tabular learning. In AAAI Conference on Artificial Intelligence (AAAI).
  • Association (2018) Association, A. D. 2018. Economic costs of diabetes in the US in 2017. Diabetes care.
  • Berthelot et al. (2020) Berthelot, D.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Sohn, K.; Zhang, H.; and Raffel, C. 2020. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. In International Conference on Learning Representations (ICLR).
  • Beyazit et al. (2024) Beyazit, E.; Kozaczuk, J.; Li, B.; Wallace, V.; and Fadlallah, B. 2024. An inductive bias for tabular deep learning. In Conference on Neural Information Processing Systems (NeurIPS).
  • Bischl et al. (2021) Bischl, B.; Casalicchio, G.; Feurer, M.; Hutter, F.; Lang, M.; Mantovani, R. G.; van Rijn, J. N.; and Vanschoren, J. 2021. OpenML Benchmarking Suites. In Conference on Neural Information Processing Systems (NeurIPS).
  • Boudiaf et al. (2022) Boudiaf, M.; Mueller, R.; Ayed, I. B.; and Bertinetto, L. 2022. Parameter-free Online Test-time Adaptation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Brown et al. (2018) Brown, K.; Doran, D.; Kramer, R.; and Reynolds, B. 2018. HELOC Applicant Risk Performance Evaluation by Topological Hierarchical Decomposition. In NeurIPS Workshop on Challenges and Opportunities for AI in Financial Services.
  • Chen and Guestrin (2016) Chen, T.; and Guestrin, C. 2016. XGBoost: A Scalable Tree Boosting System. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
  • Clore et al. (2014) Clore, J.; Cios, K.; DeShazo, J.; and Strack, B. 2014. Diabetes 130-US hospitals for years 1999-2008. UCI Machine Learning Repository.
  • DeFilippis and Van Spall (2021) DeFilippis, E. M.; and Van Spall, H. G. 2021. Is it time for sex-specific guidelines for cardiovascular disease? Journal of the American College of Cardiology (JACC).
  • Dinh et al. (2022) Dinh, T.; Zeng, Y.; Zhang, R.; Lin, Z.; Gira, M.; Rajput, S.; Sohn, J.-y.; Papailiopoulos, D.; and Lee, K. 2022. Lift: Language-interfaced fine-tuning for non-language machine learning tasks. In Conference on Neural Information Processing Systems (NeurIPS).
  • Dorogush, Ershov, and Gulin (2017) Dorogush, A. V.; Ershov, V.; and Gulin, A. 2017. CatBoost: gradient boosting with categorical features support. In NeurIPS Workshop on ML Systems.
  • Fang et al. (2024) Fang, X.; Xu, W.; Tan, F. A.; Zhang, J.; Hu, Z.; Qi, Y. J.; Nickleach, S.; Socolinsky, D.; Sengamedu, S.; Faloutsos, C.; et al. 2024. Large language models (LLMs) on tabular data: Prediction, generation, and understanding-a survey. Transactions on Machine Learning Research (TMLR).
  • Fey and Lenssen (2019) Fey, M.; and Lenssen, J. E. 2019. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
  • for Disease Control, Prevention et al. (2003) for Disease Control, C.; Prevention; et al. 2003. National Health and Nutrition Examination Survey (NHANES) Data. NCfHS, editor. NCHS.
  • Gandelsman et al. (2022) Gandelsman, Y.; Sun, Y.; Chen, X.; and Efros, A. 2022. Test-time training with masked autoencoders. In Conference on Neural Information Processing Systems (NeurIPS).
  • Gardner, Popovic, and Schmidt (2023) Gardner, J.; Popovic, Z.; and Schmidt, L. 2023. Benchmarking Distribution Shift in Tabular Data with TableShift. In Conference on Neural Information Processing Systems (NeurIPS).
  • Gong et al. (2022) Gong, T.; Jeong, J.; Kim, T.; Kim, Y.; Shin, J.; and Lee, S.-J. 2022. Note: Robust continual test-time adaptation against temporal correlation. In Conference on Neural Information Processing Systems (NeurIPS).
  • Gorishniy et al. (2021) Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; and Babenko, A. 2021. Revisiting Deep Learning Models for Tabular Data. In Conference on Neural Information Processing Systems (NeurIPS).
  • Grinsztajn, Oyallon, and Varoquaux (2022) Grinsztajn, L.; Oyallon, E.; and Varoquaux, G. 2022. Why do tree-based models still outperform deep learning on tabular data? In Conference on Neural Information Processing Systems (NeurIPS).
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Hegselmann et al. (2023) Hegselmann, S.; Buendia, A.; Lang, H.; Agrawal, M.; Jiang, X.; and Sontag, D. 2023. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics (AISTATS).
  • Hein et al. (2017) Hein, D.; Depeweg, S.; Tokic, M.; Udluft, S.; Hentschel, A.; Runkler, T. A.; and Sterzing, V. 2017. A benchmark environment motivated by industrial control problems. In IEEE Symposium Series on Computational Intelligence (SSCI).
  • Hollmann et al. (2023) Hollmann, N.; Müller, S.; Eggensperger, K.; and Hutter, F. 2023. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. In International Conference on Learning Representations (ICLR).
  • Hwang et al. (2022) Hwang, S.; Lee, S.; Kim, S.; Ok, J.; and Kwak, S. 2022. Combating label distribution shift for active domain adaptation. In European Conference on Computer Vision (ECCV).
  • Johnson et al. (2021) Johnson, A.; Bulgarelli, L.; Pollard, T.; Horng, S.; Celi, L. A.; and Mark, R. 2021. MIMIC-IV.
  • Johnson et al. (2016) Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific data.
  • Kim et al. (2023) Kim, C.; Park, J.; Shim, H.; and Yang, E. 2023. SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization. In Conference of the International Speech Communication Association (INTERSPEECH).
  • Lee (2013) Lee, D.-H. 2013. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. In ICML Workshop on Challenges in Representation Learning.
  • Liang, He, and Tan (2024) Liang, J.; He, R.; and Tan, T. 2024. A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision (IJCV).
  • Lin et al. (2017) Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In IEEE/CVF International Conference on Computer Vision (ICCV).
  • Liu et al. (2023) Liu, J.; Wang, T.; Cui, P.; and Namkoong, H. 2023. On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets. In Conference on Neural Information Processing Systems (NeurIPS).
  • Liu et al. (2021) Liu, Y.; Kothari, P.; Van Delft, B.; Bellot-Gurlet, B.; Mordan, T.; and Alahi, A. 2021. TTT++: When does self-supervised test-time training fail or thrive? In Conference on Neural Information Processing Systems (NeurIPS).
  • Menon et al. (2021) Menon, A. K.; Jayasumana, S.; Rawat, A. S.; Jain, H.; Veit, A.; and Kumar, S. 2021. Long-tail learning via logit adjustment. In International Conference on Learning Representations (ICLR).
  • Mosca, Barrett-Connor, and Kass Wenger (2011) Mosca, L.; Barrett-Connor, E.; and Kass Wenger, N. 2011. Sex/gender differences in cardiovascular disease prevention: what a difference a decade makes. Circulation.
  • Murtagh (1991) Murtagh, F. 1991. Multilayer perceptrons for classification and regression. Neurocomputing.
  • Niu et al. (2022) Niu, S.; Wu, J.; Zhang, Y.; Chen, Y.; Zheng, S.; Zhao, P.; and Tan, M. 2022. Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning (ICML).
  • Niu et al. (2023) Niu, S.; Wu, J.; Zhang, Y.; Wen, Z.; Chen, Y.; Zhao, P.; and Tan, M. 2023. Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations (ICLR).
  • Park, Seo, and Yang (2023) Park, J.; Seo, H.; and Yang, E. 2023. Pc-adapter: Topology-aware adapter for efficient domain adaption on point clouds with rectified pseudo-label. In IEEE/CVF International Conference on Computer Vision (ICCV).
  • Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems (NeurIPS).
  • Platt (2000) Platt, J. 2000. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers.
  • Ren et al. (2024) Ren, W.; Li, X.; Chen, H.; Rakesh, V.; Wang, Z.; Das, M.; and Honavar, V. G. 2024. TabLog: Test-Time Adaptation for Tabular Data Using Logic Rules. In International Conference on Machine Learning (ICML).
  • Riegel et al. (2020) Riegel, R.; Gray, A.; Luus, F.; Khan, N.; Makondo, N.; Akhalwaya, I. Y.; Qian, H.; Fagin, R.; Barahona, F.; Sharma, U.; et al. 2020. Logical neural networks. arXiv preprint arXiv:2006.13155.
  • Schneider et al. (2020) Schneider, S.; Rusak, E.; Eck, L.; Bringmann, O.; Brendel, W.; and Bethge, M. 2020. Improving robustness against common corruptions by covariate shift adaptation. In Conference on Neural Information Processing Systems (NeurIPS).
  • Shi et al. (2024) Shi, W.; Xu, R.; Zhuang, Y.; Yu, Y.; Wu, H.; Yang, C.; and Wang, M. D. 2024. MedAdapter: Efficient Test-Time Adaptation of Large Language Models towards Medical Reasoning. arXiv preprint arXiv:2405.03000.
  • Shim, Kim, and Yang (2024) Shim, H.; Kim, C.; and Yang, E. 2024. CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation. In European Conference on Computer Vision (ECCV).
  • Song et al. (2019) Song, W.; Shi, C.; Xiao, Z.; Duan, Z.; Xu, Y.; Zhang, M.; and Tang, J. 2019. Autoint: Automatic feature interaction learning via self-attentive neural networks. In ACM International Conference on Information and Knowledge Management (CIKM).
  • Studies (2019) Studies, A. N. E. 2019. FICO. The Explainable Machine Learning Challenge.
  • Studies (2022) Studies, A. N. E. 2022. ANES Time Series Cumulative Data File [dataset and documentation]. September 16, 2022 version.
  • Stylianou and Flournoy (2002) Stylianou, M.; and Flournoy, N. 2002. Dose Finding Using the Biased Coin Up-and-Down Design and Isotonic Regression. In Biometrics.
  • Sun et al. (2020) Sun, Y.; Wang, X.; Liu, Z.; Miller, J.; Efros, A.; and Hardt, M. 2020. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning (ICML).
  • Tachet des Combes et al. (2020) Tachet des Combes, R.; Zhao, H.; Wang, Y.-X.; and Gordon, G. J. 2020. Domain adaptation with conditional distribution matching and generalized label shift. In Conference on Neural Information Processing Systems (NeurIPS).
  • Wang et al. (2021a) Wang, D.; Shelhamer, E.; Liu, S.; Olshausen, B.; and Darrell, T. 2021a. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations (ICLR).
  • Wang et al. (2021b) Wang, X.; Liu, H.; Shi, C.; and Yang, C. 2021b. Be confident! towards trustworthy graph neural networks via confidence calibration. In Conference on Neural Information Processing Systems (NeurIPS).
  • Wu et al. (2021) Wu, R.; Guo, C.; Su, Y.; and Weinberger, K. Q. 2021. Online adaptation to label distribution shift. In Conference on Neural Information Processing Systems (NeurIPS).
  • Zhou et al. (2023) Zhou, Z.; Guo, L.-Z.; Jia, L.-H.; Zhang, D.; and Li, Y.-F. 2023. ODS: Test-Time Adaptation in the Presence of Open-World Data Shift. In International Conference on Machine Learning (ICML).

Appendix

Appendix A Detailed Algorithm of AdapTable

Post-training shift-aware uncertainty calibrator.

Given a pre-trained tabular classifier $f_{\theta}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{C}$ on the source domain $\mathcal{D}_{s}=\{(\mathbf{x}_{i}^{s},y_{i}^{s})\}_{i}$, we introduce a post-training phase for a shift-aware uncertainty calibrator $g_{\phi}:\mathbb{R}^{C}\times\mathbb{R}^{D\times N}\rightarrow\mathbb{R}^{+}$. This calibrator is trained after the initial training of $f_{\theta}$ using the same training dataset $\mathcal{D}_{s}$.
For a given training batch $\{(\mathbf{x}_{i}^{s},y_{i}^{s})\}_{i=1}^{N}$, we compute the shift trend $\mathbf{s}^{s}=(\mathbf{s}_{u}^{s})_{u=1}^{D}$ for a specific column index $u$ as follows:

$$\mathbf{s}_{u}^{s}=\Big(\mathbf{x}^{s}_{iu}-\frac{1}{|\mathcal{D}_{s}|}\sum_{i^{\prime}=1}^{|\mathcal{D}_{s}|}\mathbf{x}_{i^{\prime}u}^{s}\Big)_{i=1}^{N},$$

where we add a linear layer to $\mathbf{s}_{u}^{s}$ for each categorical column $u$ to transform it into a one-dimensional representation, ensuring alignment with the numerical columns. Using $\mathbf{s}^{s}$, we construct a shift trend graph, where each node $u$ represents a column, and edges capture the relationships between columns. The node features are given by $\mathbf{s}_{u}^{s}$, and the graph is fully connected via an all-ones adjacency matrix. A graph neural network (GNN) is applied to this graph, facilitating the exchange of shift trends between columns through message passing, which generates a contextualized column-wise representation $\bm{h}_{u}^{s}$.
These representations are averaged to form a global feature representation $\bm{h}^{s}=\frac{1}{D}\sum_{u=1}^{D}\bm{h}_{u}^{s}$, which is then concatenated with the initial model prediction $f_{\theta}(\mathbf{x}_{i}^{s})$ to produce the final output temperature $T_{i}$. Given the calibrated probability $p_{i}=\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{s})/T_{i}\big)$ obtained with this per-sample temperature, we define the most plausible and second most plausible class indices $j^{*}$ and $j^{**}$ as follows:

$$j^{*}=\operatorname*{arg\,max}_{j\in\mathcal{Y}}p_{ij}\quad\text{and}\quad j^{**}=\operatorname*{arg\,max}_{j\in\mathcal{Y},\,j\neq j^{*}}p_{ij}.$$
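The temperature scaling and top-two selection above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `calibrated_probs` and `top_two_classes`, the toy logits, and the temperatures are all hypothetical, and the temperatures are supplied directly rather than produced by $g_{\phi}$.

```python
import numpy as np

def calibrated_probs(logits, temperatures):
    """Temperature-scaled softmax: p_i = softmax(f(x_i) / T_i), one T_i per row."""
    scaled = logits / temperatures[:, None]
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)

def top_two_classes(probs):
    """Return (j_star, j_2star): indices of the largest and second-largest probability."""
    order = np.argsort(-probs, axis=1, kind="stable")
    return order[:, 0], order[:, 1]

# Hypothetical logits for N = 2 samples and C = 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.3, 0.2]])
temps = np.array([1.0, 2.0])  # assumed per-sample temperatures T_i > 0

probs = calibrated_probs(logits, temps)
j_star, j_2star = top_two_classes(probs)
```

Because each $T_{i}$ is positive, dividing by it does not change the ordering of the classes, so the same indices would be obtained from the raw logits.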

The focal loss $\mathcal{L}_{\text{FL}}$ (Lin et al. 2017) and the calibration loss $\mathcal{L}_{\text{CAL}}$ (Wang et al. 2021b) are used to train the shift-aware uncertainty calibrator $g_{\phi}$, defined as:

\begin{align}
\mathcal{L}_{\text{FL}}(\mathbf{x}_{i}^{s},y_{i}^{s}) &= -\sum_{j=1}^{C}\mathbb{1}_{\{y_{i}^{s}\}}(j)\,(1-p_{ij})^{\gamma}\log p_{ij}, \tag{10}\\
\mathcal{L}_{\text{CAL}}(\mathbf{x}_{i}^{s},y_{i}^{s}) &= \mathbb{1}_{\{y_{i}^{s}\}}(j^{*})\,(1-p_{ij^{*}}+p_{ij^{**}})+\mathbb{1}_{\mathcal{Y}\backslash\{y_{i}^{s}\}}(j^{*})\,(p_{ij^{*}}-p_{ij^{**}}), \tag{11}
\end{align}

where $\mathbb{1}_{A}(x)$ is an indicator function:

$$\mathbb{1}_{A}(x)=\begin{cases}1&\text{if }x\in A\\ 0&\text{otherwise}.\end{cases}$$

$\mathcal{L}_{\text{FL}}$ addresses class imbalance by reducing the impact of easily classified examples, while $\mathcal{L}_{\text{CAL}}$ encourages a large gap between $p_{ij^{*}}$ and $p_{ij^{**}}$ for correct predictions and encourages the two to converge for incorrect predictions. For all experiments, we set $\gamma=2$ and $\lambda_{\text{CAL}}=0.1$.
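As a rough illustration of the two losses, the sketch below evaluates them per sample on already-calibrated probabilities. The function names and toy inputs are hypothetical. The focal loss is written with the conventional leading minus sign of Lin et al. (2017) so that it is a quantity to minimize, and combining the two terms as a weighted sum with $\lambda_{\text{CAL}}$ is an assumption about how the weight enters the objective.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Per-sample focal loss: -(1 - p_true)^gamma * log(p_true)."""
    p_true = probs[np.arange(len(labels)), labels]
    return -((1.0 - p_true) ** gamma) * np.log(p_true)

def calibration_loss(probs, labels):
    """Per-sample calibration loss based on the top-two probability margin.

    Correct top-1 prediction:   1 - p_top + p_second  (small when the margin is large)
    Incorrect top-1 prediction: p_top - p_second      (small when the margin is small)
    """
    rows = np.arange(len(labels))
    order = np.argsort(-probs, axis=1, kind="stable")
    p_top = probs[rows, order[:, 0]]
    p_second = probs[rows, order[:, 1]]
    correct = order[:, 0] == labels
    return np.where(correct, 1.0 - p_top + p_second, p_top - p_second)

# Hypothetical calibrated probabilities: first sample is correct, second is not.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1]])
labels = np.array([0, 1])

fl = focal_loss(probs, labels, gamma=2.0)
cal = calibration_loss(probs, labels)
lambda_cal = 0.1                           # value reported in the text
total = np.mean(fl + lambda_cal * cal)     # assumed weighted-sum objective
```

For the first sample the calibration term is $1-0.7+0.2=0.5$, and for the second it is $0.6-0.3=0.3$, matching the two branches of the calibration loss.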

Label distribution handler.

During the test phase after post-training $g_{\phi}$, we introduce a label distribution handler using an estimator $\bar{p}_{i}(y|\mathbf{x}_{i}^{t})$, defined as:

$$\bar{p}_{i}(y|\mathbf{x}_{i}^{t})=\frac{\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})+\text{norm}\big(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})\,p_{t}(y)/p_{s}(y)\big)}{2},$$

where $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ represents the calibrated prediction. This approach enhances uncertainty quantification and combines the calibrated estimation with the distributionally aligned prediction for more robust estimation. To compute $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$, we perform a two-stage uncertainty calibration. Specifically, for a given test batch $\{\mathbf{x}_{i}^{t}\}_{i=1}^{N}$, we calculate the shift trend $\mathbf{s}^{t}=(\mathbf{s}_{u}^{t})_{u=1}^{D}\in\mathbb{R}^{D\times N}$ as:

$$\mathbf{s}_{u}^{t}=\Big(\mathbf{x}^{t}_{iu}-\frac{1}{|\mathcal{D}_{s}|}\sum_{i^{\prime}=1}^{|\mathcal{D}_{s}|}\mathbf{x}_{i^{\prime}u}^{s}\Big)_{i=1}^{N}\in\mathbb{R}^{N}.$$
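For numerical columns, the shift trend is simply the test batch centered by the source column means. The following sketch assumes those means were cached during post-training (source data are not accessed at test time); `shift_trend`, `source_column_means`, and the toy arrays are hypothetical names and values, and the linear layer used for categorical columns is omitted.

```python
import numpy as np

def shift_trend(batch, source_column_means):
    """Shift trend with shape (D, N): s[u, i] = x[i, u] - source mean of column u."""
    return (batch - source_column_means[None, :]).T

# Hypothetical source data with |D_s| = 2 rows and D = 2 numerical columns.
source = np.array([[1.0, 10.0],
                   [3.0, 30.0]])
source_means = source.mean(axis=0)  # column means, assumed to be cached: [2.0, 20.0]

# Hypothetical test batch with N = 3 rows.
test_batch = np.array([[2.0, 25.0],
                       [5.0, 20.0],
                       [2.0, 14.0]])

s_t = shift_trend(test_batch, source_means)
# Row u of s_t holds the per-sample deviations of column u from its source mean.
```

Each row of `s_t` then serves as the feature vector of one column node in the shift trend graph.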

Then, a per-sample temperature $T_{i}=g_{\phi}(f_{\theta}(\mathbf{x}_{i}^{t}),\mathbf{s}^{t})$, as defined in Equation 1, is computed. The uncertainty $\delta_{i}$ of $f_{\theta}(\mathbf{x}_{i}^{t})$ is defined as the reciprocal of the margin of the calibrated probability distribution $\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i})$:

$$\delta_{i}=\frac{1}{\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}\big)_{j^{*}}-\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}\big)_{j^{**}}},$$

where $j^{*}$ and $j^{**}$ are the most plausible and second most plausible class indices:

$$j^{*}=\operatorname*{arg\,max}_{j\in\mathcal{Y}}f_{\theta}(\mathbf{x}_{i}^{t})_{j}\quad\text{and}\quad j^{**}=\operatorname*{arg\,max}_{j\in\mathcal{Y},\,j\neq j^{*}}f_{\theta}(\mathbf{x}_{i}^{t})_{j}.$$
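The margin-based uncertainty can be sketched as follows. This is an illustrative NumPy version rather than the authors' code: `margin_uncertainty` is a hypothetical name, the temperatures are passed in directly instead of coming from $g_{\phi}$, and the small `eps` floor on the margin is an added assumption to avoid division by zero when the top two probabilities tie.

```python
import numpy as np

def margin_uncertainty(logits, temperatures, eps=1e-12):
    """delta_i = 1 / (p_top - p_second) for the temperature-scaled softmax."""
    scaled = logits / temperatures[:, None]
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum(axis=1, keepdims=True)
    top_two = np.sort(probs, axis=1)[:, -2:]             # [second-largest, largest]
    margin = top_two[:, 1] - top_two[:, 0]
    return 1.0 / np.maximum(margin, eps)                 # eps is an assumed safeguard

# Hypothetical binary logits: a confident sample and a near-tie.
logits = np.array([[2.0, 0.0],
                   [0.1, 0.0]])
temps = np.array([1.0, 1.0])

delta = margin_uncertainty(logits, temps)
# The near-tie sample has a much smaller margin and hence a larger delta.
```

A small margin between the top two classes yields a large $\delta_{i}$, so samples near the decision boundary are treated as the most uncertain.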

Based on $\delta_{i}$, the recalibrated temperature $\tilde{T}_{i}$ is applied:

$$
\tilde{T}_{i}=\begin{cases}T&\text{if }\delta_{i}\geq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{high}}\big)\\ 1/T&\text{if }\delta_{i}\leq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{low}}\big)\\ 1&\text{otherwise},\end{cases}
$$

where $T=1.5\max_{j}p_{s}(y)_{j}/\min_{j}p_{s}(y)_{j}$, and $q_{\text{low}}$ and $q_{\text{high}}$ are the low and high uncertainty quantiles, respectively. The target label distribution $p_{t}(y)$ is then estimated using the following formula:

$$
p_{t}(y)=(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y),
$$

where $p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})=\text{norm}\big(p_{t}(y|\mathbf{x}_{i}^{t})/p_{s}(y)\big)$ serves as a debiased target label estimator, deviating from the source label distribution $p_{s}(y)$. The online target label estimator $p_{t}^{\text{oe}}(y)$ is initialized with a uniform distribution and updated with each new batch as follows:

$$
p^{\text{oe}}_{t}(y)=(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}\bar{p}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y),
$$

where $\alpha$ is a smoothing factor. This update process leverages information from the current batch to refine the target label distribution estimation over time. The overall procedure of the proposed AdapTable method is summarized in Algorithm 1.
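The quantile thresholding and the two smoothed estimates above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: all helper names are hypothetical, and the linear-interpolation rule inside `quantile` is an assumption, since the text does not state how $Q$ interpolates between order statistics.

```python
def quantile(values, q):
    """Linear-interpolation quantile (the interpolation rule is an assumption)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def recalibrated_temperatures(deltas, source_prior, q_low, q_high):
    """T~_i: T for the most uncertain samples, 1/T for the least uncertain, 1 otherwise."""
    base_t = 1.5 * max(source_prior) / min(source_prior)
    high_cut = quantile(deltas, q_high)
    low_cut = quantile(deltas, q_low)
    temps = []
    for delta in deltas:
        if delta >= high_cut:
            temps.append(base_t)
        elif delta <= low_cut:
            temps.append(1.0 / base_t)
        else:
            temps.append(1.0)
    return temps


def batch_mean(rows):
    return [sum(row[j] for row in rows) / len(rows) for j in range(len(rows[0]))]


def estimate_target_prior(batch_probs, source_prior, online_prior, alpha):
    """p_t(y): blend of the batch-averaged debiased predictions and p_t^oe(y)."""
    debiased = [normalize([p / s for p, s in zip(probs, source_prior)])
                for probs in batch_probs]
    mean = batch_mean(debiased)
    return [(1 - alpha) * m + alpha * o for m, o in zip(mean, online_prior)]


def update_online_prior(final_probs, online_prior, alpha):
    """p_t^oe(y): moving average over the batch-averaged final predictions."""
    mean = batch_mean(final_probs)
    return [(1 - alpha) * m + alpha * o for m, o in zip(mean, online_prior)]


source_prior = [0.75, 0.25]
temps = recalibrated_temperatures([1.1, 2.0, 5.0, 40.0], source_prior,
                                  q_low=0.25, q_high=0.75)
target_prior = estimate_target_prior([[0.6, 0.4], [0.9, 0.1]], source_prior,
                                     online_prior=[0.5, 0.5], alpha=0.1)
online_prior = update_online_prior([[0.4, 0.6], [0.8, 0.2]], [0.5, 0.5], alpha=0.1)
```

In this toy batch the source prior is imbalanced 3:1, so $T = 4.5$: the most uncertain sample is softened with temperature 4.5, the least uncertain one is sharpened with 1/4.5, and the two in between are left unchanged.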

Algorithm 1 AdapTable
1: Input: Pre-trained classifier $f_{\theta}(\cdot)$, post-trained shift-aware uncertainty calibrator $g_{\phi}(\cdot,\cdot)$, indicator function $\mathbb{1}_{(\cdot)}(\cdot)$, quantile function $Q(\cdot,\cdot)$, softmax function $\text{softmax}(\cdot)$, normalization function $\text{norm}(\cdot)$, source data $\mathcal{D}_{s}=\{(\mathbf{x}_{i}^{s},y_{i}^{s})\}_{i}$, current test batch $\{\mathbf{x}_{i}^{t}\}_{i=1}^{N}$
2: Parameters: Smoothing factor $\alpha$, low uncertainty quantile $q_{\text{low}}$, high uncertainty quantile $q_{\text{high}}$
3: $p_{s}(y),\ T\leftarrow\big(\frac{1}{|\mathcal{D}_{s}|}\sum_{i=1}^{|\mathcal{D}_{s}|}\mathbb{1}_{\{j\}}(y_{i}^{s})\big)_{j=1}^{C},\ 1.5\max_{j}p_{s}(y)_{j}/\min_{j}p_{s}(y)_{j}$
4: for $u=1$ to $D$ do
5:     $\mathbf{s}_{u}^{t}\leftarrow\big(\mathbf{x}^{t}_{iu}-\frac{1}{|\mathcal{D}_{s}|}\sum_{i^{\prime}=1}^{|\mathcal{D}_{s}|}\mathbf{x}_{i^{\prime}u}^{s}\big)_{i=1}^{N}$ ▷ Compute shift trend $\mathbf{s}^{t}$
6: end for
7: for $i=1$ to $N$ do
8:     $p_{t}(y|\mathbf{x}_{i}^{t})\leftarrow\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})\big)$
9:     $T_{i}\leftarrow g_{\phi}\big(f_{\theta}(\mathbf{x}_{i}^{t}),\mathbf{s}^{t}\big)$ ▷ Determine per-sample temperature for $\mathbf{x}_{i}^{t}$
10:     $j^{*},\ j^{**}\leftarrow\operatorname*{arg\,max}_{1\leq j\leq C}p_{t}(y|\mathbf{x}_{i}^{t})_{j},\ \operatorname*{arg\,max}_{1\leq j\leq C,\,j\neq j^{*}}p_{t}(y|\mathbf{x}_{i}^{t})_{j}$
11:     $\delta_{i}\leftarrow\big(\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}\big)_{j^{*}}-\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}\big)_{j^{**}}\big)^{-1}$ ▷ Define uncertainty of $\mathbf{x}_{i}^{t}$ as a margin of $f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}$
12:     $p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})\leftarrow\text{norm}\big(p_{t}(y|\mathbf{x}_{i}^{t})/p_{s}(y)\big)$ ▷ Compute debiased target label estimator
13: end for
14: $p_{t}(y)\leftarrow(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y)$ ▷ Estimate target label distribution
15: for $i=1$ to $N$ do
16:     if $\delta_{i}\geq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{high}}\big)$ then
17:         $\tilde{T}_{i}\leftarrow T$
18:     else if $\delta_{i}\leq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{low}}\big)$ then
19:         $\tilde{T}_{i}\leftarrow 1/T$ ▷ Calculate temperature $\tilde{T}_{i}$ using uncertainty $\delta_{i}$
20:     else
21:         $\tilde{T}_{i}\leftarrow 1$
22:     end if
23:     $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})\leftarrow\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/\tilde{T}_{i}\big)$ ▷ Perform temperature scaling with $\tilde{T}_{i}$
24:     $\bar{p}_{t}(y|\mathbf{x}_{i}^{t})\leftarrow\big(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})+\text{norm}(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})\,p_{t}(y)/p_{s}(y))\big)/2$ ▷ Perform self-ensembling
25: end for
26: $p^{\text{oe}}_{t}(y)\leftarrow(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}\bar{p}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y)$ ▷ Update online target label estimator
27: Output: Final predictions $\{\bar{p}_{i}(y)\}_{i=1}^{N}$
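Two steps of Algorithm 1, the shift trend (line 5) and the self-ensembling (line 24), can be sketched in plain Python as follows. The function names are hypothetical, and the sketch assumes every column is numeric; how categorical columns are encoded before the subtraction is not specified in this section.

```python
def shift_trend(test_batch, source_rows):
    """s^t: offset of each test value from the source mean of its column."""
    num_cols = len(source_rows[0])
    source_mean = [sum(row[u] for row in source_rows) / len(source_rows)
                   for u in range(num_cols)]
    # Result is indexed [column u][sample i], matching s_u^t in Algorithm 1.
    return [[row[u] - source_mean[u] for row in test_batch]
            for u in range(num_cols)]


def self_ensemble(scaled_probs, target_prior, source_prior):
    """Average p~ with its prior-corrected version norm(p~ * p_t(y) / p_s(y))."""
    corrected = [p * t / s
                 for p, t, s in zip(scaled_probs, target_prior, source_prior)]
    total = sum(corrected)
    corrected = [c / total for c in corrected]
    return [(p + c) / 2 for p, c in zip(scaled_probs, corrected)]


# Two test rows and two source rows, each with two numeric columns.
trend = shift_trend([[3.0, 10.0], [5.0, 14.0]], [[1.0, 10.0], [3.0, 12.0]])

# The source is dominated by class 0, while the estimated target prior favors class 1.
final = self_ensemble([0.6, 0.4], target_prior=[0.3, 0.7], source_prior=[0.75, 0.25])
```

In the toy call, the temperature-scaled prediction favors class 0, but reweighting by $p_t(y)/p_s(y)$ moves enough mass to class 1 that the averaged prediction flips. Averaging with the uncorrected $\tilde{p}_t$ limits how far a poorly estimated target prior can move the output.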

Appendix B Proof of Theorem 3.1

We first define the balanced source error $BSE(\hat{Y})$ on the source dataset and the conditional error gap $\Delta_{CE}(\hat{Y})$ between $\mathbb{P}(\hat{Y}\neq Y|X_{s})$ and $\mathbb{P}(\hat{Y}\neq Y|X_{t})$ as follows:

$$
BSE(\hat{Y})=\max_{i\in\mathcal{Y}}\mathbb{P}(\hat{Y}\neq i|Y=i,X_{s}), \tag{12}
$$

$$
\Delta_{CE}(\hat{Y})=\max_{i\neq i^{\prime}\in\mathcal{Y}}\Big|\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})-\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\Big|. \tag{13}
$$
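As an illustration of Equations 12 and 13, both quantities can be estimated from row-normalized confusion matrices. The sketch below is only an empirical plug-in under stated assumptions: the helper names are hypothetical, labels are integers in `0..C-1`, and every class is assumed to occur at least once in each label list. Note that computing the target-side matrix requires target labels, which are not available at test time, so this is an analysis tool rather than part of the adaptation procedure.

```python
def conditional_confusion(y_true, y_pred, num_classes):
    """Row-normalized confusion matrix: entry [k][i] estimates P(Yhat = i | Y = k)."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for truth, pred in zip(y_true, y_pred):
        counts[truth][pred] += 1
    # Assumption: each class appears at least once, so no row sums to zero.
    return [[c / sum(row) for c in row] for row in counts]


def balanced_source_error(conf_source):
    """BSE: worst per-class error rate on the source domain (Equation 12)."""
    return max(1.0 - conf_source[i][i] for i in range(len(conf_source)))


def conditional_error_gap(conf_source, conf_target):
    """Delta_CE: largest off-diagonal discrepancy between domains (Equation 13)."""
    n = len(conf_source)
    return max(abs(conf_source[k][i] - conf_target[k][i])
               for k in range(n) for i in range(n) if i != k)


conf_s = conditional_confusion([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0], 2)
conf_t = conditional_confusion([0, 0, 1, 1, 1, 1], [0, 1, 1, 1, 1, 0], 2)
bse = balanced_source_error(conf_s)
gap = conditional_error_gap(conf_s, conf_t)
```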
Definition B.1.

(Generalized Label Shift in Tachet des Combes et al. (2020)). Both input covariate distribution $\mathbb{P}(X_{s})\neq\mathbb{P}(X_{t})$ and output label distribution $\mathbb{P}(Y|X_{s})\neq\mathbb{P}(Y|X_{t})$ change. Yet, there exists a hidden representation $H=g^{*}(X)$ such that the conditional distribution of $H$ given $Y$ remains the same across both domains, i.e., $\forall i\in\mathcal{Y}$,

$$
\mathbb{P}(H|Y=i,X_{s})=\mathbb{P}(H|Y=i,X_{t}). \tag{14}
$$
Theorem B.2.

Let $\hat{Y}|X$ and $\hat{Y}_{o}|X$ be defined as follows:

$$
\hat{Y}|X=\{\operatorname*{arg\,max}_{j\in\mathcal{Y}}f_{\theta}(\mathbf{x})_{j}\,|\,\mathbf{x}\in X\}, \tag{15}
$$

$$
\hat{Y}_{o}|X=\{\operatorname*{arg\,max}_{j\in\mathcal{Y}}f_{\theta}(\mathbf{x})_{j}+\log p_{t}^{oe}(y)_{j}\,|\,\mathbf{x}\in X\}. \tag{16}
$$

Given the error $\epsilon(\hat{Y}|X)=\mathbb{P}(\hat{Y}\neq Y|X)$, with true labels $Y$ of inputs $X$, the error gap $|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})|$ is upper bounded by

$$
K_{1}\Big\|1-\frac{p_{t}^{oe}(y)}{p_{t}(y)}\Big\|_{1}BSE(\hat{Y})+K_{2}\Delta_{CE}(\hat{Y}), \tag{17}
$$

where $K_{1}$ and $K_{2}$ are constants related to $p_{t}(y)$ and $p_{s}(y)$, respectively.
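To make the objects in the theorem concrete, the following sketch computes the two predictors of Equations 15 and 16 for a single sample, together with the prior-mismatch factor that multiplies $BSE(\hat{Y})$ in Equation 17. The function names are hypothetical, the online estimate is assumed to be strictly positive so that its logarithm exists, and the constants $K_1$ and $K_2$ are not computed because the statement leaves them unspecified.

```python
import math


def predict(logits):
    """Yhat: plain argmax over the classifier logits (Equation 15)."""
    return max(range(len(logits)), key=lambda j: logits[j])


def predict_adjusted(logits, online_prior):
    """Yhat_o: argmax after adding log p_t^oe(y)_j to each logit (Equation 16)."""
    # Assumption: every entry of online_prior is strictly positive.
    return max(range(len(logits)),
               key=lambda j: logits[j] + math.log(online_prior[j]))


def prior_mismatch(online_prior, target_prior):
    """|| 1 - p_t^oe(y) / p_t(y) ||_1, the factor multiplying BSE in Equation 17."""
    return sum(abs(1.0 - o / t) for o, t in zip(online_prior, target_prior))


logits = [1.0, 0.8]
online_prior = [0.2, 0.8]
plain = predict(logits)
adjusted = predict_adjusted(logits, online_prior)
exact = prior_mismatch(online_prior, [0.2, 0.8])
uniform = prior_mismatch([0.5, 0.5], [0.2, 0.8])
```

When the online estimate matches the true target label distribution, the mismatch factor is zero and only the $K_2\Delta_{CE}(\hat{Y})$ term of the bound remains; the uniform initialization in the last line shows how far the factor can be from zero before any batches have been observed.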

Proof.

We start by applying the law of total probability and triangle inequality to derive the following inequality:

$$
\begin{aligned}
&\Big|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})\Big|\\
&=\Big|\mathbb{P}(\hat{Y}\neq Y|X_{s})-\mathbb{P}(\hat{Y}_{o}\neq Y|X_{t})\Big|\\
&=\Big|\sum_{i\neq i^{\prime}}\mathbb{P}(\hat{Y}=i,Y=i^{\prime}|X_{s})-\sum_{i\neq i^{\prime}}\mathbb{P}(\hat{Y}_{o}=i,Y=i^{\prime}|X_{t})\Big|\\
&=\Big|\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{s})\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})-\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\mathbb{P}(\hat{Y}_{o}=i|Y=i^{\prime},X_{t})\Big|\\
&\leq\sum_{i\neq i^{\prime}}\Big|\mathbb{P}(Y=i^{\prime}|X_{s})\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})-\mathbb{P}(Y=i^{\prime}|X_{t})\mathbb{P}(\hat{Y}_{o}=i|Y=i^{\prime},X_{t})\Big|.
\end{aligned} \tag{18}
$$

According to Equation 8 in (Menon et al. 2021), $\hat{Y}_{o}$ satisfies the following condition under the generalized label shift condition in Definition B.1:

$$\mathbb{P}(\hat{Y}_{o}=i|H,X_{t})=\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\mathbb{P}(\hat{Y}=i|H,X_{t}). \tag{19}$$

By multiplying both sides of Equation 19 by $\mathbb{P}(H|Y,X_{t})$ and marginalizing over $H$, we obtain:

$$\begin{aligned}
\mathbb{P}(\hat{Y}_{o}=i|H,X_{t})\mathbb{P}(H|Y,X_{t})&=\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\mathbb{P}(\hat{Y}=i|H,X_{t})\mathbb{P}(H|Y,X_{t})\\
\mathbb{P}(\hat{Y}_{o}=i|Y,X_{t})&=\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\mathbb{P}(\hat{Y}=i|Y,X_{t}).
\end{aligned} \tag{20}$$

Next, by substituting Equation 20 into Equation 18, and letting $Y=i^{\prime}$, we have:

$$\begin{aligned}
&\Big|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})\Big|\\
&\leq\sum_{i\neq i^{\prime}}\Big|\mathbb{P}(Y=i^{\prime}|X_{s})\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})-\mathbb{P}(Y=i^{\prime}|X_{t})\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\Big|.
\end{aligned} \tag{21}$$

Using Lemma A.2 from (Tachet des Combes et al. 2020), we can further estimate the upper bound of Equation 21 as follows:

$$\begin{aligned}
&\Big|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})\Big|\\
&\leq\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\Bigg|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg|\left(\alpha_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})+\beta_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\right)\\
&\quad+\mathbb{P}(Y=i^{\prime}|X_{s})\Delta_{CE}(\hat{Y})+\mathbb{P}(Y=i^{\prime}|X_{t})\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Delta_{CE}(\hat{Y})\\
&\stackrel{(i)}{\leq}\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\Bigg|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg|\left(\alpha_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})+\beta_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\right)\\
&\quad+(C-1)\Delta_{CE}(\hat{Y})+\left(\sum_{i\neq i^{\prime}}\frac{\mathbb{P}(Y=i^{\prime}|X_{t})}{\mathbb{P}(Y=i|X_{s})}\right)\left(\sum_{i\neq i^{\prime}}p_{t}^{oe}(y)_{i}\right)\Delta_{CE}(\hat{Y})\\
&\leq\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\Bigg|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg|\left(\alpha_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})+\beta_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\right)\\
&\quad+(C-1)\Delta_{CE}(\hat{Y})+\frac{(C-1)^{2}}{\min_{i\in\mathcal{Y}}\mathbb{P}(Y=i|X_{s})}\Delta_{CE}(\hat{Y}),
\end{aligned} \tag{22}$$

where $\alpha_{i^{\prime}},\beta_{i^{\prime}}\geq 0$ and $\alpha_{i^{\prime}}+\beta_{i^{\prime}}=1$, and $(i)$ holds by Hölder’s inequality. By letting $\alpha_{i^{\prime}}=1$ and $\beta_{i^{\prime}}=0$ for all $i^{\prime}\in\mathcal{Y}$, and defining $K_{1}$ and $K_{2}$ as:

$$\begin{aligned}
K_{1}&=C(C-1)^{2}\max_{i\in\mathcal{Y}}\mathbb{P}(Y=i|X_{t}),\\
K_{2}&=(C-1)+\frac{(C-1)^{2}}{\min_{i\in\mathcal{Y}}\mathbb{P}(Y=i|X_{s})},
\end{aligned}$$

we finally get:

$$\begin{aligned}
&\Big|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})\Big|\\
&\leq\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\Bigg|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg|\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})+K_{2}\Delta_{CE}(\hat{Y})\\
&\leq\max_{i^{\prime}\in\mathcal{Y}}\mathbb{P}(Y=i^{\prime}|X_{t})\sum_{i\neq i^{\prime}}\Bigg|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg|\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})+K_{2}\Delta_{CE}(\hat{Y})\\
&\stackrel{(i)}{\leq}\max_{i^{\prime}\in\mathcal{Y}}\mathbb{P}(Y=i^{\prime}|X_{t})\left(\sum_{i\neq i^{\prime}}\Bigg|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg|\right)\left(\sum_{i\neq i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})\right)+K_{2}\Delta_{CE}(\hat{Y})\\
&\stackrel{(ii)}{\leq}\max_{i^{\prime}\in\mathcal{Y}}\mathbb{P}(Y=i^{\prime}|X_{t})(C-1)\sum_{i=1}^{C}\Big|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Big|C(C-1)BSE(\hat{Y})+K_{2}\Delta_{CE}(\hat{Y})\\
&=\max_{i^{\prime}\in\mathcal{Y}}\mathbb{P}(Y=i^{\prime}|X_{t})C(C-1)^{2}\Big\|1-\frac{p_{t}^{oe}(y)}{p_{t}(y)}\Big\|_{1}BSE(\hat{Y})+K_{2}\Delta_{CE}(\hat{Y})\\
&\stackrel{(iii)}{=}K_{1}\Big\|1-\frac{p_{t}^{oe}(y)}{p_{t}(y)}\Big\|_{1}BSE(\hat{Y})+K_{2}\Delta_{CE}(\hat{Y}),
\end{aligned} \tag{23}$$

where $(i)$ holds by Hölder’s inequality, $(ii)$ holds by the definition of $BSE(\hat{Y})$, and $(iii)$ holds by the definition of $K_{1}$. ∎
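To get a feel for the scale of the constants in the final bound, the definitions of $K_{1}$ and $K_{2}$ can be evaluated numerically. The following is only an illustrative sketch: the function name `bound_constants` and the example label priors are hypothetical and do not come from the paper's code or experiments.

```python
def bound_constants(p_source, p_target):
    """Evaluate K1 and K2 from their definitions in the proof.

    p_source: source label prior P(Y=i|X_s), one entry per class.
    p_target: target label prior P(Y=i|X_t), one entry per class.
    Both inputs are illustrative assumptions, not values from the paper.
    """
    if len(p_source) != len(p_target):
        raise ValueError("both priors must cover the same classes")
    if min(p_source) <= 0:
        raise ValueError("K2 requires strictly positive source priors")
    num_classes = len(p_source)  # C in the paper's notation
    k1 = num_classes * (num_classes - 1) ** 2 * max(p_target)
    k2 = (num_classes - 1) + (num_classes - 1) ** 2 / min(p_source)
    return k1, k2


# Hypothetical binary task: balanced source, imbalanced target.
k1, k2 = bound_constants([0.5, 0.5], [0.3, 0.7])
```

Under these made-up priors the sketch gives $K_{1}=1.4$ and $K_{2}=3$. Note that $K_{2}$ grows as the rarest source class becomes rarer, since the minimum source prior appears in its denominator.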

We observe that in practice, using $\hat{Y}_{o}|X=\{\operatorname*{arg\,max}_{j\in\mathcal{Y}}f_{\theta}(\mathbf{x})_{j}+\log p_{t}^{oe}(y)_{j}\,|\,\mathbf{x}\in X\}$ can result in performance degradation due to error accumulation in $p_{t}^{oe}(y)$. However, our approach, which integrates a two-stage uncertainty calibration with $g_{\phi}$ and a debiased target label estimator $p_{t}^{de}(y)$, demonstrates empirical efficacy across various experiments.
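The prediction rule for $\hat{Y}_{o}$ discussed above, which adds the log of an estimated target label prior to the logits before taking the arg max, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name `adjust_prediction`, the clipping constant `eps`, and the example numbers are assumptions introduced here.

```python
import math


def adjust_prediction(logits, target_prior, eps=1e-12):
    """Return argmax_j of logits[j] + log(target_prior[j]).

    logits: model outputs f_theta(x)_j for a single sample.
    target_prior: an estimate of the target label distribution.
    eps: hypothetical floor that avoids log(0); not part of the paper.
    """
    scores = [z + math.log(max(p, eps)) for z, p in zip(logits, target_prior)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical sample: the raw logits slightly favour class 0,
# but the estimated target prior strongly favours class 1.
raw_choice = adjust_prediction([1.0, 0.8], [0.5, 0.5])
adjusted_choice = adjust_prediction([1.0, 0.8], [0.1, 0.9])
```

With a uniform prior the arg max is unchanged, while the skewed prior flips the decision to class 1. The same mechanism explains the sensitivity noted above: an inaccurate prior estimate shifts every decision boundary in the batch.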

Appendix C Dataset Descriptions

C.1 Natural Distribution Shifts

In our experiments, we verify our method on six datasets (HELOC, Voting, Hospital Readmission, ICU Mortality, Childhood Lead, and Diabetes) from the TableShift benchmark (Gardner, Popovic, and Schmidt 2023), all of which include natural distribution shifts between training and test data. For all datasets, numerical features are standardized (subtracting the mean and dividing by the standard deviation), while categorical features are one-hot encoded. We find that different encoding types do not play a significant role in terms of accuracy, as noted in Grinsztajn, Oyallon, and Varoquaux (2022). Detailed statistics of each dataset are listed in Table 5.

  • HELOC: This task predicts Home Equity Line of Credit (HELOC) (Brown et al. 2018) repayment using FICO data (Studies 2019), focusing on shifts in third-party risk estimates. The dataset includes 10,459 observations, and a distribution shift occurs by using the ’External Risk Estimate’ as a domain split. Estimates above 63 are used for training, while those 63 or below are held out for testing, illustrating potential biases in credit assessments.

  • Voting: Using ANES (Studies 2022) data, this task predicts U.S. presidential election voting behavior with 8,280 observations. Distribution shift is introduced by splitting the data based on geographic region, with the southern U.S. serving as the out-of-domain region. This simulates how voter behavior predictions might vary when polling data is collected in one region and used to predict outcomes in another.

  • Hospital Readmission: Hospital Readmission (Clore et al. 2014) predicts 30-day readmission of diabetic patients using data from 130 U.S. hospitals over 10 years. The distribution shift occurs by splitting the data based on admission source, with emergency room admissions held out as the target domain. This tests how well models trained on other sources perform when applied to patients admitted through the emergency room.

  • ICU Mortality: The task predicts ICU patient mortality using MIMIC-III data (Johnson et al. 2016), focusing on shifts related to insurance type. The dataset includes 23,944 observations, and a distribution shift is created by excluding Medicare and Medicaid patients from the training set, designating them as the target domain. This highlights how insurance type can affect mortality predictions.

  • Childhood Lead: This task predicts elevated blood lead levels in children using NHANES data (for Disease Control, Prevention et al. 2003), with 27,499 observations. A distribution shift is introduced by splitting the data based on poverty using the poverty-income ratio (PIR) as a threshold. Those with a PIR of 1.3 or lower are held out for testing, simulating risk assessment in lower-income households.

  • Diabetes: This task predicts diabetes using BRFSS data (Association 2018), focusing on racial shifts across 1.4 million observations. Distribution shift occurs by focusing on the differences in diabetes risk between racial and ethnic groups, particularly highlighting the higher risk faced by non-white groups compared to White non-Hispanic individuals.
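Several of the splits above are defined by thresholding a single domain variable, for example the HELOC split at an External Risk Estimate of 63. A threshold-based domain split of this kind can be sketched as below. The helper `split_by_threshold`, the dictionary key `"ExternalRiskEstimate"`, and the toy rows are hypothetical and only illustrate the idea; they are not taken from the benchmark's code.

```python
def split_by_threshold(rows, column, threshold):
    """Split rows into (in_domain, out_of_domain) on one column.

    Rows with a value strictly above the threshold are kept for
    training; rows at or below it are held out for testing.
    """
    in_domain = [row for row in rows if row[column] > threshold]
    out_of_domain = [row for row in rows if row[column] <= threshold]
    return in_domain, out_of_domain


# Toy rows with a made-up column name, mirroring the HELOC description.
toy_rows = [
    {"ExternalRiskEstimate": 60, "label": 0},
    {"ExternalRiskEstimate": 63, "label": 1},
    {"ExternalRiskEstimate": 64, "label": 1},
    {"ExternalRiskEstimate": 80, "label": 0},
]
train_rows, test_rows = split_by_threshold(toy_rows, "ExternalRiskEstimate", 63)
```

Because the split variable is usually correlated with the label, a split like this tends to change the label distribution between the two domains as well as the feature distribution.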

Table 5: Summary of the datasets used in our experiments, including the total number of instances (Total Instances), the number of instances allocated to training, validation, and test sets (Training Set, Validation Set, Test Set), the total number of features (Total Features), and a breakdown into numerical and categorical features (Numerical Features, Categorical Features). All tasks involve binary classification.
| Statistic | HELOC | Voting | Hospital Readmission | ICU Mortality | Childhood Lead | Diabetes |
| --- | --- | --- | --- | --- | --- | --- |
| Total Instances | 9,412 | 60,376 | 89,542 | 21,549 | 24,749 | 1,299,758 |
| Training Set | 2,220 | 34,796 | 34,288 | 7,116 | 11,807 | 969,229 |
| Validation Set | 278 | 4,349 | 4,286 | 889 | 1,476 | 121,154 |
| Test Set | 6,914 | 21,231 | 50,968 | 13,544 | 11,466 | 209,375 |
| Total Features | 22 | 54 | 46 | 7,491 | 7 | 25 |
| Numerical Features | 20 | 8 | 12 | 7,490 | 4 | 6 |
| Categorical Features | 2 | 46 | 34 | 1 | 3 | 19 |
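The preprocessing described in Section C.1 (standardizing numerical features and one-hot encoding categorical ones) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, the statistics are assumed to be fitted on the training split, and the population standard deviation is used, which the text does not specify.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_column):
    """Return (mean, std) of a numerical column from the training split."""
    mean = fmean(train_column)
    std = pstdev(train_column)  # population std; an assumption made here
    if std == 0:
        std = 1.0  # guard for constant columns, also an assumption
    return mean, std


def standardize(column, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return [(value - mean) / std for value in column]


def one_hot(column, categories):
    """Encode each value as an indicator vector over known categories."""
    return [[1.0 if value == c else 0.0 for c in categories] for value in column]


# Toy columns; the category names are made up for illustration.
mean, std = fit_standardizer([1.0, 2.0, 3.0])
scaled = standardize([1.0, 2.0, 3.0], mean, std)
encoded = one_hot(["ER", "referral", "ER"], ["ER", "referral"])
```

Reusing the training-split statistics on test data keeps the transformation fixed at test time, so any shift in the test features remains visible to the model instead of being normalized away.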

C.2 Common Corruptions

Let $\mathbf{x}_{i}^{t}=(\mathbf{x}_{ij}^{t})_{j=1}^{D}\in\mathbb{R}^{D}$ be the $i$-th row of a table with $D$ columns in the test data. We define $\bar{\mathbf{x}}_{j}^{s}$ as a random variable that follows the empirical marginal distribution of the $j$-th column in the training set $\mathcal{D}_{s}$, given by:

$$\mathbb{P}(\bar{\mathbf{x}}_{j}^{s}=k)=\frac{1}{|\mathcal{D}_{s}|}\sum_{i=1}^{|\mathcal{D}_{s}|}\mathbb{1}_{\{k\}}(\mathbf{x}_{ij}^{s}),$$

where k𝑘k\in\mathbb{R}italic_k ∈ blackboard_R. Additionally, let μjs=𝔼[𝐱¯js]superscriptsubscript𝜇𝑗𝑠𝔼delimited-[]superscriptsubscript¯𝐱𝑗𝑠\mu_{j}^{s}=\mathbb{E}[\bar{\mathbf{x}}_{j}^{s}]italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT = blackboard_E [ over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ] and σjs=Var(𝐱¯js)superscriptsubscript𝜎𝑗𝑠Varsuperscriptsubscript¯𝐱𝑗𝑠\sigma_{j}^{s}=\sqrt{\text{Var}(\bar{\mathbf{x}}_{j}^{s})}italic_σ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT = square-root start_ARG Var ( over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ) end_ARG be the mean and standard deviation of the random variable 𝐱¯jssuperscriptsubscript¯𝐱𝑗𝑠\bar{\mathbf{x}}_{j}^{s}over¯ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT, respectively. To effectively simulate natural distribution shifts that commonly occur beyond label distribution shifts, we introduce six types of corruptions—Gaussian noise (Gaussian), uniform noise (Uniform), random missing values (Random Drop), common column missing across all test data (Column Drop), important numerical column shift (Numerical), and important categorical column shift (Categorical)—as follows:

  • Gaussian: For $\mathbf{x}_{ij}^{t}$, Gaussian noise $z\sim\mathcal{N}(0,0.1^{2})$ is independently injected as:

    $$\mathbf{x}_{ij}^{t}\leftarrow\mathbf{x}_{ij}^{t}+z\cdot\sigma_{j}^{s}.$$
  • Uniform: For $\mathbf{x}_{ij}^{t}$, uniform noise $u\sim\mathcal{U}(-0.1,0.1)$ is independently injected as:

    $$\mathbf{x}_{ij}^{t}\leftarrow\mathbf{x}_{ij}^{t}+u\cdot\sigma_{j}^{s}.$$
  • Random Drop: For each column $\mathbf{x}_{ij}^{t}$, a random mask $m_{ij}\sim\text{Bernoulli}(0.2)$ is applied, and the feature is replaced by a random sample $\bar{\mathbf{x}}_{j}^{s}$ drawn from the empirical marginal distribution of the $j$-th column of the training set:

    $$\mathbf{x}_{ij}^{t}\leftarrow(1-m_{ij})\cdot\mathbf{x}_{ij}^{t}+m_{ij}\cdot\bar{\mathbf{x}}_{j}^{s}.$$
  • Column Drop: For each column $\mathbf{x}_{ij}^{t}$, a random mask $m_{j}\sim\text{Bernoulli}(0.2)$ is applied, and the feature is replaced by a random sample $\bar{\mathbf{x}}_{j}^{s}$ as follows:

    $$\mathbf{x}_{ij}^{t}\leftarrow(1-m_{j})\cdot\mathbf{x}_{ij}^{t}+m_{j}\cdot\bar{\mathbf{x}}_{j}^{s}.$$

    Unlike random drop corruption, where the mask $m_{ij}$ is resampled for each $j$-th column of the $i$-th test instance $\mathbf{x}_{ij}^{t}$, a single random mask $m_{j}$ is sampled for each $j$-th column and applied uniformly across all test data.

  • Numerical: Important numerical column shift simulates natural domain shifts where the test distribution of the most important numerical column deviates significantly from the training distribution. We first identify the most important numerical column, $j^{*}$, using a pre-trained XGBoost (Chen and Guestrin 2016). A Gaussian distribution

    $$\mathcal{N}(z|\mu_{j^{*}}^{s},\sigma_{j^{*}}^{s})=\frac{1}{\sqrt{2\pi(\sigma_{j^{*}}^{s})^{2}}}\exp\left(-\frac{(z-\mu_{j^{*}}^{s})^{2}}{2(\sigma_{j^{*}}^{s})^{2}}\right)$$

    is then fitted to the $j^{*}$-th column of the training data, using $\mu_{j^{*}}^{s}$ and $\sigma_{j^{*}}^{s}$. The likelihood of each test sample $\mathbf{x}_{i}^{t}$ is then computed as $\mathcal{N}(\mathbf{x}_{ij^{*}}^{t}|\mu_{j^{*}}^{s},\sigma_{j^{*}}^{s})$. Finally, test samples are drawn with probability inversely proportional to their likelihood, with the sampling probability $\mathbb{P}(\mathbf{x}_{i}^{t})$ of $\mathbf{x}_{i}^{t}$ defined as:

    $$\mathbb{P}(\mathbf{x}_{i}^{t})=\frac{\mathcal{N}(\mathbf{x}_{ij^{*}}^{t}|\mu_{j^{*}}^{s},\sigma_{j^{*}}^{s})^{-1}}{\sum_{i^{\prime}=1}^{|\mathcal{D}_{t}|}\mathcal{N}(\mathbf{x}_{i^{\prime}j^{*}}^{t}|\mu_{j^{*}}^{s},\sigma_{j^{*}}^{s})^{-1}}.$$
  • Categorical: Important categorical column shift simulates natural domain shifts where the test distribution of the most important categorical column deviates significantly from the training distribution. Again, we first identify the most important categorical column, $j^{*}$, using a pre-trained XGBoost (Chen and Guestrin 2016). A categorical distribution, which generalizes the Bernoulli distribution,

    $$\mathcal{C}(z|p_{1},\cdots,p_{K})=p_{1}^{\mathbb{1}_{\{1\}}(z)}\cdots p_{K}^{\mathbb{1}_{\{K\}}(z)},$$

    is then fitted to the $j^{*}$-th column of the training data, where $K$ is the number of distinct categories in the $j^{*}$-th column, and $p_{k}=\mathbb{P}(\bar{\mathbf{x}}_{j^{*}}^{s}=k)$ for $k=1,\cdots,K$. The likelihood of each test sample $\mathbf{x}_{i}^{t}$ is then computed as $\mathcal{C}(\mathbf{x}_{ij^{*}}^{t}|p_{1},\cdots,p_{K})$. Finally, test samples are drawn with probability inversely proportional to their likelihood, with the sampling probability $\mathbb{P}(\mathbf{x}_{i}^{t})$ of $\mathbf{x}_{i}^{t}$ defined as:

    $$\mathbb{P}(\mathbf{x}_{i}^{t})=\frac{\mathcal{C}(\mathbf{x}_{ij^{*}}^{t}|p_{1},\cdots,p_{K})^{-1}}{\sum_{i^{\prime}=1}^{|\mathcal{D}_{t}|}\mathcal{C}(\mathbf{x}_{i^{\prime}j^{*}}^{t}|p_{1},\cdots,p_{K})^{-1}}.$$
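
The four feature-level corruptions (Gaussian, Uniform, Random Drop, Column Drop) can be illustrated with a minimal sketch in standard-library Python. The function name `corrupt_features`, the row-of-lists data layout, and the choice to draw a fresh replacement value for every masked cell are assumptions made for illustration, not details taken from the released code.

```python
import random
import statistics


def corrupt_features(test_rows, train_rows, kind, seed=0):
    """Apply one feature-level corruption to every numerical test row.

    kind is one of "gaussian", "uniform", "random_drop", "column_drop".
    """
    rng = random.Random(seed)
    num_cols = len(train_rows[0])
    # Column-wise view of the training set: train_cols[j] holds column j.
    train_cols = [[row[j] for row in train_rows] for j in range(num_cols)]
    # sigma_j^s: standard deviation of the empirical marginal (population form).
    sigma = [statistics.pstdev(col) for col in train_cols]
    # Column Drop uses a single mask m_j per column, shared by all test rows.
    col_mask = [rng.random() < 0.2 for _ in range(num_cols)]

    corrupted = []
    for row in test_rows:
        new_row = []
        for j, value in enumerate(row):
            if kind == "gaussian":
                # z ~ N(0, 0.1^2), scaled by the training standard deviation.
                value = value + rng.gauss(0.0, 0.1) * sigma[j]
            elif kind == "uniform":
                # u ~ U(-0.1, 0.1), scaled by the training standard deviation.
                value = value + rng.uniform(-0.1, 0.1) * sigma[j]
            elif kind == "random_drop":
                # m_ij ~ Bernoulli(0.2) is resampled for every cell.
                if rng.random() < 0.2:
                    value = rng.choice(train_cols[j])
            elif kind == "column_drop":
                # Assumption: each masked cell gets its own training sample.
                if col_mask[j]:
                    value = rng.choice(train_cols[j])
            else:
                raise ValueError(f"unknown corruption: {kind}")
            new_row.append(value)
        corrupted.append(new_row)
    return corrupted
```

Drawing `rng.choice(train_cols[j])` is equivalent to sampling from the empirical marginal distribution of column $j$, since each training value is picked with probability $1/|\mathcal{D}_{s}|$.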

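The Numerical and Categorical shifts share one mechanism: score each test sample by its likelihood under a distribution fitted to the training column $j^{*}$, then resample with probability proportional to the inverse likelihood. The sketch below shows that mechanism under stated assumptions: all function names are hypothetical, resampling is done with replacement (the text does not say), and every test category is assumed to occur in the training column so that no $p_{k}$ is zero.

```python
import math
import random
import statistics
from collections import Counter


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, with sigma the standard deviation."""
    norm = math.sqrt(2.0 * math.pi * sigma ** 2)
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / norm


def inverse_likelihood_probs(likelihoods):
    """Normalize inverse likelihoods into sampling probabilities."""
    # Raises ZeroDivisionError if a likelihood underflows to exactly 0.
    inverse = [1.0 / value for value in likelihoods]
    total = sum(inverse)
    return [value / total for value in inverse]


def numerical_shift_probs(test_col, train_col):
    """Sampling probabilities for the Numerical shift on column j*."""
    mu = statistics.fmean(train_col)
    sigma = statistics.pstdev(train_col)
    return inverse_likelihood_probs([gaussian_pdf(x, mu, sigma) for x in test_col])


def categorical_shift_probs(test_col, train_col):
    """Sampling probabilities for the Categorical shift on column j*."""
    counts = Counter(train_col)
    n = len(train_col)
    # p_k is the empirical frequency of category k in the training column.
    return inverse_likelihood_probs([counts[x] / n for x in test_col])


def resample(rows, probs, size, seed=0):
    """Draw `size` rows with replacement according to `probs` (an assumption)."""
    rng = random.Random(seed)
    return rng.choices(rows, weights=probs, k=size)
```

For example, with a training column `["a", "a", "a", "b"]`, a test sample with category `"b"` has likelihood 0.25 and is therefore three times as likely to be drawn as one with category `"a"`.
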
C.3 Label Distribution Shifts

  • Class Imbalance: This label distribution shift simulates a highly class-imbalanced test stream, where labels that are rare in the training set are more likely to appear frequently in the test set. Given a class imbalance ratio $\rho=10$, we first rank the output labels $y_{i}^{t}\in\mathcal{Y}$ for each test sample $\mathbf{x}_{i}^{t}$ in ascending order of their frequency in the training set, assigning ranks from 1 to $C$, where $C$ is the number of classes. Specifically, $\text{rank}(y_{i}^{t})=1$ indicates that $y_{i}^{t}$ is the least frequent label in the training set, while $\text{rank}(y_{i}^{t})=C$ indicates that $y_{i}^{t}$ is the most frequent. We then define the unnormalized sampling probability for each test sample $\mathbf{x}_{i}^{t}$ as:

    $$\tilde{\mathbb{P}}(\mathbf{x}_{i}^{t})=\frac{\text{rank}(y_{i}^{t})}{C}(\rho-1)+1.$$

    The normalized sampling probability $\mathbb{P}(\mathbf{x}_{i}^{t})$ for each test sample $\mathbf{x}_{i}^{t}$ is then defined as:

    $$\mathbb{P}(\mathbf{x}_{i}^{t})=\frac{\tilde{\mathbb{P}}(\mathbf{x}_{i}^{t})}{\sum_{i^{\prime}=1}^{|\mathcal{D}_{t}|}\tilde{\mathbb{P}}(\mathbf{x}_{i^{\prime}}^{t})}.$$
  • Temporal Correlation: To simulate temporal correlations in test data, we employ a custom sampling strategy using the Dirichlet distribution. This approach effectively captures temporal dependencies by dynamically adjusting the label distribution over time. We begin with a uniform probability distribution $\mathbb{P}_{0}=\left(1/C\right)_{j=1}^{C}$, where $C$ is the number of classes. For sampling the $i$-th test instance, a probability distribution $\bm{\pi}_{i}$ is drawn from the Dirichlet distribution:

    $$\bm{\pi}_{i}\sim\text{Dirichlet}(\mathbb{P}_{i-1}),$$

    and then smoothed using $\eta=10^{-6}$ to avoid zero probabilities for any class $j$:

    $$\bm{\pi}_{i}=\frac{\max(\eta,\bm{\pi}_{i})}{\sum_{j=1}^{C}\max(\eta,\bm{\pi}_{ij})}.$$

    A label $y_{i}^{t}$ is subsequently sampled according to $\bm{\pi}_{i}$, and the corresponding test instance $\mathbf{x}_{i}^{t}$ is randomly selected from the test data with label $y_{i}^{t}$. After the $i$-th sampling, the distribution $\mathbb{P}_{i}$ is updated using the recent history of sampled labels within a sliding window of size $w=5$:

    $$\mathbb{P}_{i}\leftarrow\bigg(\frac{1}{w}\sum_{i^{\prime}=i-w+1}^{i}\mathbb{1}_{\{j\}}(y_{i^{\prime}}^{t})\bigg)_{j=1}^{C}.$$
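
The Class Imbalance weights can be computed directly from the two formulas in that item. The following is a minimal sketch that implements the formulas exactly as written, so a label with a larger rank receives a larger unnormalized weight, ranging from $(\rho-1)/C+1$ at rank 1 to $\rho$ at rank $C$. The function name, the tie-breaking rule for equally frequent classes, and the assumption that every test label also occurs in the training labels are illustrative choices.

```python
from collections import Counter


def class_imbalance_probs(test_labels, train_labels, rho=10.0):
    """Normalized sampling probability of each test sample."""
    freq = Counter(train_labels)
    # Rank classes in ascending order of training frequency (rank 1 = rarest).
    # Ties are broken by label value; the source does not specify a rule.
    ordered = sorted(freq, key=lambda c: (freq[c], c))
    rank = {c: r for r, c in enumerate(ordered, start=1)}
    num_classes = len(ordered)
    # Unnormalized weight: rank / C * (rho - 1) + 1.
    weights = [rank[y] / num_classes * (rho - 1.0) + 1.0 for y in test_labels]
    total = sum(weights)
    return [w / total for w in weights]
```

With three classes whose training counts are 6, 3, and 1, the unnormalized weights are 10, 7, and 4, so the normalized probabilities for one test sample of each class are 10/21, 7/21, and 4/21.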

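The Temporal Correlation procedure above can be sketched as a short sampling loop. Several details here are assumptions rather than facts from the source: the helper names are hypothetical, the Dirichlet draw is built from Gamma variates, test instances are selected with replacement, and zero entries of $\mathbb{P}_{i-1}$ are clamped to $\eta$ before the draw because a Dirichlet distribution needs strictly positive parameters and the text does not say how zeros are handled.

```python
import random
from collections import deque


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:
        # Guard against numerical underflow of every component.
        return [1.0 / len(alpha)] * len(alpha)
    return [d / total for d in draws]


def temporally_correlated_indices(test_labels, length, window=5, eta=1e-6, seed=0):
    """Return indices into the test set forming a temporally correlated stream."""
    rng = random.Random(seed)
    pools = {}
    for idx, label in enumerate(test_labels):
        pools.setdefault(label, []).append(idx)
    classes = sorted(pools)
    num_classes = len(classes)

    prior = [1.0 / num_classes] * num_classes  # P_0: uniform over classes
    recent = deque(maxlen=window)  # sliding window of sampled class positions
    picked = []
    for _ in range(length):
        # Assumption: clamp zero concentrations to eta before the draw.
        alpha = [max(eta, p) for p in prior]
        pi = sample_dirichlet(alpha, rng)
        # Smoothing step: elementwise max with eta, then renormalize.
        smoothed = [max(eta, p) for p in pi]
        total = sum(smoothed)
        pi = [p / total for p in smoothed]
        j = rng.choices(range(num_classes), weights=pi, k=1)[0]
        # Assumption: pick the instance with replacement from its class pool.
        picked.append(rng.choice(pools[classes[j]]))
        recent.append(j)
        # P_i: label frequencies over the last `window` samples, divided by w.
        prior = [sum(1 for r in recent if r == c) / window for c in range(num_classes)]
    return picked
```

Because the concentration vector is concentrated on recently sampled classes, consecutive samples tend to share a label, which is the temporal correlation the procedure is meant to produce.
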
Appendix D Baseline Details

D.1 Deep Tabular Learning Architectures

  • MLP: Multi-Layer Perceptron (MLP) (Murtagh 1991) is a foundational deep learning architecture characterized by multiple layers of interconnected nodes, where each node applies a non-linear activation function to a weighted sum of its inputs. In the tabular domain, MLP is often employed as a default deep learning model, with each input feature corresponding to a node in the input layer.

  • AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks (AutoInt) (Song et al. 2019) is a model that automatically learns complex feature interactions in tasks like click-through rate (CTR) prediction, where features are typically sparse and high-dimensional. It uses a multi-head self-attentive neural network to map features into a low-dimensional space and capture high-order combinations, eliminating the need for manual feature engineering. According to its authors, AutoInt handles large datasets efficiently, outperforms the methods it was compared against, and offers good explainability.

  • ResNet: ResNet for tabular data (Gorishniy et al. 2021) is a modified version of the original ResNet architecture (He et al. 2016), tailored to capture intricate patterns within structured datasets. Although earlier efforts yielded modest results, recent studies have re-explored ResNet’s capabilities, inspired by its success in computer vision and NLP. This ResNet-like model for tabular data is characterized by a streamlined design that facilitates optimization through nearly direct paths from input to output, enabling the effective learning of deeper feature representations.

  • FT-Transformer: Feature Tokenizer along with Transformer (FT-Transformer) (Gorishniy et al. 2021) is a straightforward modification of the Transformer architecture tailored for tabular data. In this model, the feature tokenizer converts all features, whether categorical or numerical, into tokens. A stack of Transformer layers is then applied to these tokens together with an added [CLS] token. The representation of the [CLS] token in the final Transformer layer is then used for prediction.

D.2 Supervised Baselines

  • $k$-NN: $k$-Nearest Neighbors ($k$-NN) is a fundamental model in tabular learning that identifies the $k$ closest data points based on a chosen metric. It makes predictions through majority voting for classification or weighted averaging for regression. The hyperparameter $k$ influences the model’s sensitivity.

  • LogReg: Logistic Regression (LogReg) is a linear classification model that estimates the probability of class membership using a logistic function, which maps the linear combination of features to a range of $[0,1]$. With proper regularization, LogReg can achieve performance comparable to state-of-the-art tabular models.

  • RandomForest: Random Forest is an ensemble learning algorithm that builds multiple decision trees to improve accuracy and reduce overfitting. It is particularly effective at capturing non-linear patterns and is robust against outliers.

  • XGBoost: Extreme Gradient Boosting (XGBoost) (Chen and Guestrin 2016) is a boosting algorithm that sequentially builds weak learners, typically decision trees, to correct errors made by previous models. XGBoost is known for its high predictive performance and ability to handle complex relationships through regularization.

  • CatBoost: CatBoost (Dorogush, Ershov, and Gulin 2017), like XGBoost, is a boosting algorithm; it excels at handling categorical features without extensive preprocessing. It is highly effective on real-world datasets, offering strong performance, albeit at the cost of increased computational resources and the need for parameter tuning.

D.3 Test-Time Adaptation Baselines

  • PL: Pseudo-Labeling (PL) (Lee 2013) leverages a pseudo-labeling strategy to update model parameters during test time.

  • TTT++: Improved Test-Time Training (TTT++) (Liu et al. 2021) enhances test-time adaptation by using feature alignment strategies and regularization, eliminating the need to access source data during adaptation.

  • TENT: Test ENTropy minimization (TENT) (Wang et al. 2021a) updates the scale and bias parameters in the batch normalization layer during test time by minimizing entropy within a given test batch.

  • EATA: Efficient Anti-forgetting Test-time Adaptation (EATA) (Niu et al. 2022) mitigates the risk of unreliable gradients by filtering out high-entropy samples and applying a Fisher regularizer to constrain key model parameters during adaptation.

  • SAR: Sharpness-Aware and Reliable optimization (SAR) (Niu et al. 2023) builds on TENT by filtering samples with large entropy, which can cause model collapse during test time, using a predefined threshold.

  • LAME: Laplacian Adjusted Maximum-likelihood Estimation (LAME) (Boudiaf et al. 2022) employs an output adaptation strategy during test time, focusing on adjusting the model’s output probabilities rather than tuning its parameters.
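
Several of the baselines above (TENT, EATA, SAR) are built around the entropy of the model's predictive distribution. The quantity itself can be sketched in a few lines of standard-library Python; this is only an illustration of the entropy objective and of filtering high-entropy samples, not an implementation of any of these methods. The function names are hypothetical and the threshold is a free parameter whose value the text does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reliable_batch_entropy(batch_logits, threshold):
    """Mean entropy over samples whose entropy is below `threshold`.

    Returns the mean entropy of the kept samples and how many were kept.
    """
    entropies = [entropy(softmax(logits)) for logits in batch_logits]
    kept = [h for h in entropies if h < threshold]
    mean = sum(kept) / len(kept) if kept else 0.0
    return mean, len(kept)
```

In a gradient-based method, the returned mean entropy would play the role of the loss minimized on each test batch, while the filter discards samples the model is too uncertain about.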

Appendix E Further Experimental Details

E.1 Further Implementation Details

All experiments are conducted on two servers. The first server is equipped with a 40-core Intel Xeon E5-2630 v4 CPU, 252GB RAM, 4 NVIDIA TITAN Xp GPUs, and runs Ubuntu 18.04.4. The second server has a 40-core Intel Xeon E5-2640 v4 CPU, 128GB RAM, 8 NVIDIA TITAN Xp GPUs, and runs Ubuntu 22.04.4. All architectures are implemented using Python 3.8.16 with PyTorch (Paszke et al. 2019) and PyTorch Geometric (Fey and Lenssen 2019). The specific versions of all software libraries and frameworks used are provided in the AdapTable/requirements.txt file of the supplementary materials. We also include our source code in the AdapTable folder of the supplementary materials; please refer to it for all experimental details.

E.2 Hyperparameters for Supervised Baselines

For $k$-NN, LogReg, RandomForest, XGBoost, and CatBoost, optimal parameters are determined for each dataset using a random search with 10 iterations on the validation set. The search space for each method is specified in Table 6.

Table 6: Hyperparameter search space of supervised baselines. # neighbors denotes the number of neighbors, # estim denotes the number of estimators, depth denotes the maximum depth, and lr denotes the learning rate, respectively.

Method Search Space
$k$-NN # neighbors: $\{2,\cdots,12\}$
RandomForest # estim: $\{50,100,150,200\}$, depth: $\{2,3,\cdots,12\}$
XGBoost # estim: $\{50,100,150,200\}$, depth: $\{2,3,\cdots,12\}$, lr: $\{0.01,0.01+(1-0.01)/19,\cdots,1\}$, gamma: $\{0,0.05,\cdots,0.5\}$
CatBoost # iterations: $\{50,100,\cdots,2000\}$, lr: $\{0.01,0.01+(1-0.01)/19,\cdots,1\}$, depth: $\{5,\cdots,40\}$
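
To make the search space in Table 6 concrete, the sketch below encodes it as a dictionary and draws 10 random configurations per method. The parameter key names (for example `n_estimators`, `max_depth`) are illustrative labels chosen here, the step sizes of the gamma and iterations grids are inferred from the first two listed values, and sampling each hyperparameter uniformly and independently is an assumption about how the random search is performed.

```python
import random


def linspace(start, stop, num):
    """Evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Search space transcribed from Table 6 (key names are illustrative).
SEARCH_SPACE = {
    "k-NN": {"n_neighbors": list(range(2, 13))},
    "RandomForest": {
        "n_estimators": [50, 100, 150, 200],
        "max_depth": list(range(2, 13)),
    },
    "XGBoost": {
        "n_estimators": [50, 100, 150, 200],
        "max_depth": list(range(2, 13)),
        # {0.01, 0.01 + (1 - 0.01) / 19, ..., 1}: 20 evenly spaced values.
        "learning_rate": linspace(0.01, 1.0, 20),
        "gamma": [round(0.05 * i, 2) for i in range(11)],
    },
    "CatBoost": {
        "iterations": list(range(50, 2001, 50)),
        "learning_rate": linspace(0.01, 1.0, 20),
        "depth": list(range(5, 41)),
    },
}


def sample_configs(method, n_iter=10, seed=0):
    """Draw n_iter random configurations for one method."""
    rng = random.Random(seed)
    space = SEARCH_SPACE[method]
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]
```

Each sampled configuration would then be scored on the validation set, and the best-scoring one kept for that dataset.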

E.3 Hyperparameters for TTA Baselines

In scenarios where the test set is unknown, tuning the hyperparameters of TTA methods on the test set would be considered cheating. Therefore, we tune all hyperparameters for each TTA method and backbone classifier architecture using the Numerical common corruption on the CMC tabular dataset from the OpenML-CC18 benchmark (Bischl et al. 2021), which we did not use as test data. PL (Lee 2013), TENT (Wang et al. 2021a), and SAR (Niu et al. 2023) require three main hyperparameters: the learning rate, the number of adaptation steps per batch, and the option for episodic adaptation, where the model is reset after each batch. PL and TENT use a learning rate of 0.0001 with 1 adaptation step and episodic updates. SAR additionally requires a threshold to filter high-entropy samples and is configured with a learning rate of 0.001, 1 adaptation step, and episodic updates. For TTT++ (Liu et al. 2021), EATA (Niu et al. 2022), and LAME (Boudiaf et al. 2022), we follow the authors’ hyperparameter settings, except for the learning rate and adaptation steps. TTT++ and EATA are configured with a learning rate of 0.00001, 10 adaptation steps, and episodic updates. LAME, which only adjusts output logits, does not require hyperparameters related to gradient updates. For all baselines, hyperparameter choices remain consistent across different architectures, including MLP, AutoInt, ResNet, and FT-Transformer. The search space of the common hyperparameters is detailed in Table 7.

Table 7: Hyperparameter search space of test-time adaptation baselines. Only the common hyperparameters are listed here; method-specific hyperparameters are described in Section E.3.
Hyperparameter Search Space
lr $\{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}$
# steps $\{1, 5, 10, 15, 20\}$
episodic {True, False}
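As a rough illustration of the size of this search space, the sketch below enumerates every combination of the three common hyperparameters. It assumes an exhaustive grid, which the text does not state explicitly; the dictionary keys are illustrative names.

```python
from itertools import product

# Common hyperparameters from Table 7.
search_space = {
    "lr": [1e-3, 1e-4, 1e-5, 1e-6],
    "steps": [1, 5, 10, 15, 20],
    "episodic": [True, False],
}

# Cartesian product of all values: 4 * 5 * 2 = 40 candidate configurations.
configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]

# The settings reported in Section E.3 are members of this grid.
pl_tent = {"lr": 1e-4, "steps": 1, "episodic": True}
sar = {"lr": 1e-3, "steps": 1, "episodic": True}
ttt_eata = {"lr": 1e-5, "steps": 10, "episodic": True}
print(len(configs), pl_tent in configs, sar in configs, ttt_eata in configs)
```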

E.4 Hyperparameters for AdapTable

AdapTable requires three test-time hyperparameters: the smoothing factor $\alpha$, and the low and high uncertainty quantiles $q_{\text{low}}$ and $q_{\text{high}}$. For fairness, we tune all AdapTable hyperparameters across different backbone architectures using the Numerical common corruption on the CMC dataset from the OpenML-CC18 benchmark (Bischl et al. 2021), which is not used as test data. We observe that AdapTable’s hyperparameter choices remain consistent across various architectures, including MLP, AutoInt, ResNet, and FT-Transformer. Notably, AdapTable is largely insensitive to variations in $\alpha$, $q_{\text{low}}$, and $q_{\text{high}}$, which are uniformly set to 0.1, 0.25, and 0.75, respectively, across all datasets and architectures.
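For intuition about what these three values control, the sketch below shows one plausible reading: the quantiles split a batch into low- and high-uncertainty samples, and the smoothing factor blends a running estimate with a new one. This is an assumption made for illustration, not AdapTable's actual update rule; the function names `split_by_uncertainty` and `smooth` are hypothetical, and the direction of the weighting by $\alpha$ is likewise assumed.

```python
import numpy as np

ALPHA, Q_LOW, Q_HIGH = 0.1, 0.25, 0.75  # values reported in the text

def split_by_uncertainty(uncertainty, q_low=Q_LOW, q_high=Q_HIGH):
    """Return boolean masks for the low- and high-uncertainty samples of a batch."""
    low_thr, high_thr = np.quantile(uncertainty, [q_low, q_high])
    return uncertainty <= low_thr, uncertainty >= high_thr

def smooth(previous, current, alpha=ALPHA):
    """Exponential smoothing: keep (1 - alpha) of the old estimate (assumed form)."""
    return (1.0 - alpha) * previous + alpha * current

# Toy batch of eight uncertainty scores (made-up values).
uncertainty = np.arange(8, dtype=float)
low_mask, high_mask = split_by_uncertainty(uncertainty)

# Toy running label-distribution estimate blended with a new batch estimate.
blended = smooth(np.array([0.5, 0.5]), np.array([1.0, 0.0]))
print(low_mask.sum(), high_mask.sum(), blended)
```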

Appendix F Additional Analysis

F.1 Latent Space Visualizations

In Figure 9, we further visualize latent spaces of test instances using t-SNE across six different datasets and four representative deep tabular learning architectures to illustrate the observation discussed in Section 2.1. This visualization highlights the complex decision boundaries within the latent space of tabular data, which are significantly more intricate than those observed in other domains. By comparing the upper four rows—HELOC, Voting, Hospital Readmission, and Childhood Lead—with the lower two rows—linearized image data (MFEAT-PIXEL) and homogeneous DNA string sequences (DNA)—it becomes evident that the latent space decision boundaries in the tabular domain are particularly complex. According to WhyShift (Liu et al. 2023), this complexity is primarily due to latent confounders inherent in tabular data and concept shifts, where such confounders cause output labels to vary greatly for nearly identical inputs. As discussed in Section 2.1, this further underscores the limitations of existing TTA methods (Sun et al. 2020; Gandelsman et al. 2022; Liu et al. 2021; Boudiaf et al. 2022; Zhou et al. 2023), which often depend on the cluster assumption.

F.2 Reliability Diagrams

Figure 10 presents additional reliability diagrams across five different datasets and four representative deep tabular learning architectures, illustrating that tabular data often displays a mix of overconfident and underconfident prediction patterns. This contrasts with the consistent overconfidence observed in the image domain (Stylianou and Flournoy 2002) and underconfidence in the graph domain (Wang et al. 2021b). As shown in Figure 10, the Voting and Hospital Readmission datasets consistently exhibit overconfident behavior across all architectures, while the HELOC, Childhood Lead, and Diabetes datasets demonstrate underconfident tendencies. These observations underscore the need for a tabular-specific uncertainty calibration method.
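A reliability diagram compares, within each confidence bin, the mean predicted confidence to the empirical accuracy. The sketch below computes these per-bin statistics and the standard expected calibration error (ECE) with NumPy. It is a generic illustration rather than the plotting code behind Figure 10; the function names and the choice of 10 equal-width bins are assumptions. A bin where confidence exceeds accuracy indicates overconfidence, and the reverse indicates underconfidence.

```python
import numpy as np

def reliability_bins(probs, labels, n_bins=10):
    """Return (bin index, mean confidence, accuracy, count) for each non-empty bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)                       # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only: confidence in (edges[i], edges[i+1]] maps to bin i.
    idx = np.digitize(conf, edges[1:-1], right=True)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((b, conf[mask].mean(), correct[mask].mean(), int(mask.sum())))
    return rows

def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted average of |accuracy - confidence| over bins."""
    n = len(labels)
    return sum(cnt / n * abs(acc - c) for _, c, acc, cnt in reliability_bins(probs, labels, n_bins))

# Toy overconfident model: always 95% confident, but only half the predictions are right.
toy_probs = [[0.95, 0.05]] * 4
toy_labels = [0, 0, 1, 1]
toy_ece = expected_calibration_error(toy_probs, toy_labels)
print(reliability_bins(toy_probs, toy_labels), toy_ece)
```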

F.3 Label Distribution Shifts and Prediction Bias Towards Source Label Distributions

We demonstrate that the data distribution shift we primarily target in the tabular domain—label distribution shift—occurs frequently in practice. Figure 11 presents the source label distribution (a), target label distribution (b), pseudo label distribution for test data using the source model (c), and the estimated target label distribution after applying our label distribution handler (d) across the five datasets. Comparing (a) and (b) in each row, it is evident that label distribution shift occurs across all datasets. In (c), we observe that the marginal label distribution predicted by the source model is commonly biased towards the source label distribution. Lastly, (d) illustrates that our label distribution handler effectively estimates the target label distribution, guiding the pseudo label distribution towards the target label distribution.
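The comparison between panels (a), (b), and (c) can be quantified by measuring how far the pseudo label distribution lies from the source and target label distributions. The sketch below uses total variation distance, which is a choice made here for illustration and not a metric reported in the paper; the label arrays are made-up toy data.

```python
import numpy as np

def label_distribution(labels, n_classes):
    """Empirical marginal label distribution as a probability vector."""
    counts = np.bincount(np.asarray(labels), minlength=n_classes)
    return counts / counts.sum()

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

# Toy binary example: the class prior flips between source and target,
# while the source model's pseudo labels stay close to the source prior.
source = label_distribution([0] * 70 + [1] * 30, n_classes=2)
target = label_distribution([0] * 30 + [1] * 70, n_classes=2)
pseudo = label_distribution([0] * 60 + [1] * 40, n_classes=2)

tv_to_source = total_variation(pseudo, source)
tv_to_target = total_variation(pseudo, target)
print(tv_to_source, tv_to_target)
```

In this toy case the pseudo label distribution is closer to the source distribution than to the target one, which is the kind of bias described in (c).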

F.4 Entropy Distributions

We highlight a unique characteristic of tabular data: model prediction entropy consistently shows a strong bias toward underconfidence. To illustrate this, we present entropy distribution histograms for test instances across six datasets and four representative deep tabular learning architectures in Figure 12. A clear pattern emerges when comparing the upper four rows (HELOC, Voting, Hospital Readmission, Childhood Lead) with the lower two (Optdigits, DNA). The upper rows exhibit consistently high entropy, indicating a skew toward underconfidence, while the lower rows do not. The one exception among the upper rows is Childhood Lead, where extreme class imbalance causes the model to collapse to the majority class. This analysis highlights the distinct bias of tabular data toward underconfident predictions, a pattern less common in other domains. It also matters for adaptation, since applying unsupervised objectives like entropy minimization to high-entropy samples can result in gradient explosions and model collapse (Niu et al. 2023).
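The entropies in Figure 12 are normalized by the maximum entropy $\log C$, where $C$ is the number of classes, so that values lie in $[0, 1]$ regardless of the dataset. A minimal sketch of this normalization, assuming NumPy and an illustrative function name:

```python
import numpy as np

def normalized_entropy(probs, eps=1e-12):
    """Prediction entropy divided by log(C), so 0 is certain and 1 is uniform."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)  # avoid log(0)
    n_classes = probs.shape[1]
    entropy = -(probs * np.log(probs)).sum(axis=1)
    return entropy / np.log(n_classes)

# Toy predictions: uniform (maximally uncertain) and one-hot (fully confident).
toy = normalized_entropy([[0.5, 0.5], [1.0, 0.0], [0.25, 0.25, 0.25, 0.25][:2]])
uniform_4 = normalized_entropy([[0.25, 0.25, 0.25, 0.25]])
print(toy, uniform_4)
```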

Appendix G Additional Experiments

G.1 Detailed Results Across Common Corruptions and Datasets

Figure 4 presents the average F1 score across six types of common corruption and three datasets. Here, we provide more detailed results, including the standard errors. As shown in Table 8, AdapTable outperforms baseline TTA methods by a large margin across all datasets and corruption types. This further highlights the empirical efficacy of AdapTable, not only in handling label distribution shifts but also in addressing various common corruptions.
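For reference, the reported quantities can be computed as in the sketch below: macro F1 averages the per-class F1 scores with equal weight, and the standard error is taken over repeated runs. The function names are illustrative, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not specify the estimator.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def mean_and_standard_error(values):
    """Mean and standard error of the mean over repetitions (ddof=1 assumed)."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1) / np.sqrt(len(values)))

# Toy example: class 0 has F1 = 0.8 and class 1 has F1 = 2/3.
toy_f1 = macro_f1([0, 0, 1, 1], [0, 0, 0, 1], n_classes=2)
toy_mean, toy_se = mean_and_standard_error([1.0, 2.0, 3.0])
print(toy_f1, toy_mean, toy_se)
```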

G.2 All Results Across Datasets and Model Architectures

In Figure 5, we demonstrate the effectiveness of AdapTable across various tabular model architectures by reporting the average performance across three datasets. Here, we provide the mean and standard error for each dataset and architecture. As shown in Table 9, AdapTable consistently achieves state-of-the-art performance with significant improvements across all model architectures and datasets. This further underscores the versatility and robustness of AdapTable.

G.3 Additional Computational Efficiency Analysis

One may wonder whether the post-training time required for AdapTable’s shift-aware uncertainty calibrator is prohibitively long. To address this concern, we measure and report the elapsed real time for post-training our shift-aware uncertainty calibrator on the medium-scale Hospital Readmission dataset using the FT-Transformer architecture. The post-training process takes approximately 9.2 seconds. For small- and medium-scale datasets, the post-training process typically requires only a few seconds, and even in our largest experimental setting, the time remains minimal, taking at most a few minutes.
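Elapsed real time of this kind can be measured with a wall-clock timer around the post-training call. The sketch below is a generic helper using Python's standard library; `timed` is a hypothetical name, and the built-in `sum` merely stands in for the post-training routine, which is not shown in the text.

```python
import time

def timed(fn, *args, **kwargs):
    """Call `fn` and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Placeholder workload standing in for calibrator post-training.
result, elapsed = timed(sum, [1, 2, 3])
print(f"result={result}, elapsed={elapsed:.6f}s")
```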

Appendix H Limitations and Broader Impacts

H.1 Limitations

Similar to other test-time training (TTT) methods (Sun et al. 2020; Liu et al. 2021; Gandelsman et al. 2022), AdapTable requires an additional post-training stage to integrate a shift-aware uncertainty calibrator during the source model’s training phase. While full test-time adaptation methods (Wang et al. 2021a; Niu et al. 2022, 2023) avoid this, our analysis in Section 2.1 and experiments in Section 4 show that they fail in the tabular domain due to their focus on input covariate shifts, which are often entangled with concept shifts. According to WhyShift (Liu et al. 2023), concept shifts, driven by changes in latent confounders, require natural language descriptions of the shift conditions, necessitating a data-centric approach. Additionally, while AdapTable performs well across various corruptions beyond label distribution shifts (Figure 4), it is primarily focused on addressing label distribution shifts. Further exploration is needed to assess its effectiveness in handling input covariate shifts or concept shifts.

H.2 Broader Impacts

Tabular data is prevalent across industries such as healthcare (Johnson et al. 2016, 2021), finance (Studies 2019, 2022), manufacturing (Hein et al. 2017), and public administration (Gardner, Popovic, and Schmidt 2023). Our research addresses the critical yet underexplored challenge of distribution shifts in tabular data, a problem that has not received sufficient attention. We believe that our approach can significantly enhance the performance of machine learning models in various industries by improving model adaptation to tabular data, thereby creating meaningful value in practical applications. Through our data-centric analysis in Section 2, we identify why existing TTA methods fail in the tabular domain and introduce a tabular-specific approach for handling label distribution shifts in Section 3. We hope this work will provide valuable insights for future research on test-time adaptation in tabular data. Additionally, by making our source code publicly available, we aim to support real-world applications across various fields, benefiting both academia and industry.

Table 8: Average macro F1 score (%) with standard errors for TTA baselines across six common corruptions (Gaussian, Uniform, Random Drop, Column Drop, Numerical, and Categorical) on three datasets (HELOC, Voting, and Childhood Lead). The results are averaged over three random repetitions.
Method Gaussian Uniform Random Drop Column Drop Numerical Categorical
HELOC Source 33.1 ± 0.0 33.0 ± 0.0 31.4 ± 0.1 32.3 ± 1.4 33.8 ± 0.2 32.3 ± 0.3
PL 31.2 ± 0.0 31.2 ± 0.0 30.6 ± 0.0 31.1 ± 0.7 32.1 ± 0.2 30.4 ± 0.2
TENT 33.1 ± 0.0 33.0 ± 0.0 31.4 ± 0.1 32.3 ± 1.4 33.8 ± 0.2 32.3 ± 0.3
EATA 33.1 ± 0.0 33.0 ± 0.0 31.4 ± 0.1 32.3 ± 1.4 33.8 ± 0.2 32.3 ± 0.3
SAR 31.9 ± 0.1 32.0 ± 0.1 30.7 ± 0.2 31.3 ± 0.8 32.4 ± 0.4 31.4 ± 0.3
LAME 30.1 ± 0.0 30.1 ± 0.0 30.1 ± 0.0 30.1 ± 0.0 30.9 ± 0.1 29.4 ± 0.2
AdapTable 57.6 ± 0.1 57.8 ± 0.0 53.0 ± 0.1 52.1 ± 3.2 58.1 ± 0.1 58.9 ± 0.4
Voting Source 76.6 ± 0.0 76.5 ± 0.0 72.5 ± 0.2 72.8 ± 0.4 76.3 ± 0.1 85.2 ± 0.1
PL 75.6 ± 0.3 75.2 ± 0.3 71.1 ± 0.5 70.6 ± 0.5 75.9 ± 0.1 85.1 ± 0.1
TENT 76.6 ± 0.0 76.5 ± 0.0 72.5 ± 0.2 72.8 ± 0.4 76.3 ± 0.1 85.2 ± 0.1
EATA 76.6 ± 0.0 76.5 ± 0.0 72.5 ± 0.2 72.8 ± 0.4 76.3 ± 0.1 85.2 ± 0.1
SAR 67.2 ± 1.0 64.0 ± 0.2 61.8 ± 1.0 60.8 ± 0.8 69.9 ± 0.1 84.2 ± 0.1
LAME 39.4 ± 0.2 39.4 ± 0.1 37.3 ± 0.0 37.8 ± 0.2 39.4 ± 0.2 81.4 ± 0.2
AdapTable 78.9 ± 0.0 78.6 ± 0.1 74.9 ± 0.1 75.5 ± 0.5 78.0 ± 0.1 85.0 ± 0.4
Childhood Lead Source 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 48.1 ± 0.0 48.8 ± 0.0
PL 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 48.1 ± 0.0 48.8 ± 0.0
TENT 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 48.1 ± 0.0 48.8 ± 0.0
EATA 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 48.1 ± 0.0 48.8 ± 0.0
SAR 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 48.1 ± 0.0 48.8 ± 0.0
LAME 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 47.9 ± 0.0 48.1 ± 0.0 48.8 ± 0.0
AdapTable 61.4 ± 0.1 61.5 ± 0.0 58.0 ± 0.1 55.9 ± 1.6 62.8 ± 0.2 53.1 ± 0.2
Table 9: Average macro F1 score (%) with standard errors for TTA baselines on three datasets (HELOC, Voting, and Childhood Lead) using three model architectures (AutoInt, ResNet, and FT-Transformer). The results are averaged over three random repetitions.
Method HELOC Voting Childhood Lead
AutoInt Source 34.9 ± 0.0 77.5 ± 0.0 47.9 ± 0.0
PL 31.6 ± 0.0 76.5 ± 0.1 47.9 ± 0.0
TENT 34.9 ± 0.0 77.5 ± 0.0 47.9 ± 0.0
EATA 34.9 ± 0.0 77.5 ± 0.0 47.9 ± 0.0
SAR 62.0 ± 0.4 31.2 ± 0.7 47.9 ± 0.0
LAME 30.1 ± 0.0 37.3 ± 0.0 47.9 ± 0.0
AdapTable 56.3 ± 0.1 79.2 ± 0.0 61.8 ± 0.1
ResNet Source 52.0 ± 0.0 76.6 ± 0.0 47.9 ± 0.0
PL 34.3 ± 0.1 73.3 ± 0.1 47.9 ± 0.0
TENT 52.0 ± 0.0 76.6 ± 0.0 47.9 ± 0.0
EATA 52.0 ± 0.0 76.7 ± 0.0 47.9 ± 0.0
SAR 55.1 ± 0.5 52.2 ± 0.5 47.9 ± 0.0
LAME 30.1 ± 0.0 75.1 ± 0.1 47.9 ± 0.0
AdapTable 61.9 ± 0.0 78.7 ± 0.0 61.3 ± 0.1
FT-Transformer Source 33.0 ± 0.0 77.3 ± 0.0 47.9 ± 0.0
PL 30.6 ± 0.0 76.0 ± 0.1 47.9 ± 0.0
TENT 33.0 ± 0.0 77.3 ± 0.0 47.9 ± 0.0
EATA 33.0 ± 0.0 77.3 ± 0.0 47.9 ± 0.0
SAR 35.3 ± 0.1 73.6 ± 0.3 47.9 ± 0.0
LAME 30.7 ± 0.1 71.5 ± 0.1 47.9 ± 0.0
AdapTable 55.0 ± 0.0 79.2 ± 0.1 61.7 ± 0.1
Figure 9: Latent space visualizations of test samples using t-SNE across six diverse datasets, including tabular datasets (HELOC, Voting, Hospital Readmission, and Childhood Lead) and non-tabular datasets (Optdigits, DNA), applied to various deep tabular learning architectures.
Figure 10: Reliability diagrams for test instances across five different tabular datasets (HELOC, Voting, Hospital Readmission, Childhood Lead, and Diabetes) and four representative deep tabular learning architectures (MLP, AutoInt, ResNet, FT-Transformer).
Figure 11: Label distribution histograms for test instances showing (a) source label distribution, (b) target label distribution, (c) pseudo label distribution, and (d) estimated target label distribution after applying our label distribution handler, across five tabular datasets (HELOC, Voting, Hospital Readmission, Childhood Lead, and Diabetes) using MLP.
Figure 12: Entropy distribution histograms of test samples across six diverse datasets, including tabular datasets (HELOC, Voting, Hospital Readmission, and Childhood Lead) and non-tabular datasets (Optdigits, DNA), applied to four deep tabular learning architectures. Prediction entropies are normalized by dividing by the maximum entropy, $\log C$, where $C$ represents the number of classes for each dataset.