AdapTable: Test-Time Adaptation for Tabular Data
via Shift-Aware Uncertainty Calibrator and Label Distribution Handler

Changhun Kim^1,2* Taewon Kim^1* Seungyeon Woo^1,3 June Yong Yang^1 Eunho Yang^1,2 (* equal contribution)

Abstract

In real-world scenarios, tabular data often suffer from distribution shifts that threaten the performance of machine learning models. Despite its prevalence and importance, handling distribution shifts in the tabular domain remains underexplored due to the inherent challenges within the tabular data itself. In this sense, test-time adaptation (TTA) offers a promising solution by adapting models to target data without accessing source data, crucial for privacy-sensitive tabular domains. However, existing TTA methods either 1) overlook the nature of tabular distribution shifts, often involving label distribution shifts, or 2) impose architectural constraints on the model, leading to a lack of applicability. To this end, we propose AdapTable, a novel TTA framework for tabular data. AdapTable operates in two stages: 1) calibrating model predictions using a shift-aware uncertainty calibrator, and 2) adjusting these predictions to match the target label distribution with a label distribution handler. We validate the effectiveness of AdapTable through theoretical analysis and extensive experiments on various distribution shift scenarios. Our results demonstrate AdapTable’s ability to handle various real-world distribution shifts, achieving up to a 16% improvement on the HELOC dataset.

1 Introduction

Tabular data is one of the most abundant forms of data across various industries, including healthcare (Johnson et al. 2016), finance (Studies 2019), manufacturing (Hein et al. 2017), and public administration (Gardner, Popovic, and Schmidt 2023). However, tabular learning models often face challenges in real-world applications due to distribution shifts, which severely degrade their integrity and reliability. In this regard, test-time adaptation (TTA) (Lee 2013; Liu et al. 2021; Wang et al. 2021a; Niu et al. 2022, 2023; Boudiaf et al. 2022) offers a promising solution by adapting models under unknown distribution shifts using only unlabeled test data, without access to training data.

Despite its potential, directly applying TTA without considering the characteristics of tabular data results in limited performance gains or model collapse. We identify two primary reasons for this. First, representation learning in the tabular domain is often hindered by the entanglement of covariate shifts and concept drifts (Liu et al. 2023). Consequently, TTA methods that leverage unsupervised objectives relying on the cluster assumption often fail or lead to model collapse. Second, these approaches often do not take label distribution shifts into account, a key factor in the performance decline within the tabular domain. This issue is further aggravated by the tendency for predictions in the target domain to be biased towards the source label distribution.

To address these issues, we propose AdapTable, a novel TTA approach tailored for tabular data. AdapTable consists of two main components: 1) a shift-aware uncertainty calibrator and 2) a label distribution handler. Our shift-aware uncertainty calibrator utilizes graph neural networks to assign per-sample temperature for each model prediction. By treating each column as a node, it captures not only individual feature shifts but also complex patterns across features. Our label distribution handler then adjusts the calibrated model probabilities by estimating the label distribution of the current target batch. This process aligns predictions with the target label distribution, addressing biases towards the source distribution. AdapTable requires no parameter updates, making it model-agnostic and thus compatible with both deep learning models and gradient-boosted decision trees, offering high versatility for tabular data.

We evaluate AdapTable under various distribution shifts and demonstrate that it consistently outperforms baselines, achieving up to 16% gains on the HELOC dataset. Furthermore, we provide theoretical insights into AdapTable’s performance, supported by extensive ablation studies. We summarize our contributions as follows:

• We analyze the challenges of tabular distribution shifts to reveal why existing TTA methods fail, highlighting the entanglement of covariate shifts, concept shifts, and label distribution shifts as key factors in performance degradation.
• Building on these analyses, we introduce AdapTable, the first model-agnostic TTA method specifically designed for tabular data. AdapTable addresses label distribution shifts by estimating and adjusting the label distribution of the current test batch, while also calibrating model predictions with a shift-aware uncertainty calibrator.
• Our extensive experiments demonstrate that AdapTable exhibits robust adaptation performance across various model architectures and under diverse natural distribution shifts and common corruptions, further supported by extensive ablation studies.

2 Analysis of Tabular Distribution Shifts

In this section, we examine why prior TTA methods struggle with distribution shifts in the tabular domain. First, we note that deep learning models’ latent representations do not follow label-based cluster assumptions due to the entanglement of covariate and concept shifts, causing TTA methods relying on these assumptions (Wang et al. 2021a; Lee 2013; Niu et al. 2022) to falter in tabular data. Second, we identify label distribution shift as a key driver of performance degradation under distribution shifts, as discussed further in Section 2.1 and Section 2.2.

2.1 Indistinguishable Representations

Figure 1: Latent space visualization with t-SNE comparing (a) tabular data (Gardner, Popovic, and Schmidt 2023) and (b) image data (Bischl et al. 2021). Reliability diagrams of (c) underconfident and (d) overconfident scenarios are shown. All experiments are conducted using an MLP architecture.

We first reveal that deep tabular models fail to learn distinguishable embeddings. In Figures 1 (a) and (b), we visualize the embedding spaces of models trained on two datasets: HELOC (Gardner, Popovic, and Schmidt 2023), a pure tabular dataset, and Optdigits (Bischl et al. 2021), a linearized image dataset. Notably, the deep learning models’ representations adhere to the cluster assumption by labels only in the image data, not in the tabular data.

We attribute this unique behavior of deep tabular models to the high-frequency nature of tabular data. In the tabular domain, weak causality from inputs $X$ to outputs $Y$ due to latent confounders $Z$ often leads to vastly different labels for similar inputs (Grinsztajn, Oyallon, and Varoquaux 2022; Liu et al. 2023). For instance, cardiovascular disease risk predictions based on cholesterol, blood pressure, age, and smoking history are influenced by gender as a latent confounder, resulting in different risk levels for men and women despite identical inputs (Mosca, Barrett-Connor, and Kass Wenger 2011; DeFilippis and Van Spall 2021). This leads to high-frequency functions that are difficult for deep neural networks, which are biased toward low-frequency functions, to accurately model (Beyazit et al. 2024).

Consequently, prior TTA methods, which rely on cluster assumptions and primarily target input covariate shifts, show limited performance gains. Figure 5 demonstrates that these methods fail to improve beyond the vanilla performance of the source model due to the lack of a cluster assumption.

2.2 Importance of Label Distribution Shifts

Second, we find that label distribution shift is a primary cause of performance degradation, and that accurately estimating the target label distribution can lead to significant performance gains. A recent benchmark study, TableShift (Gardner, Popovic, and Schmidt 2023), has also emphasized that label distribution shift is a primary cause of performance degradation in tabular data. Specifically, its authors investigated the relationship between model performance and three key shift factors, namely input covariate shift ($X$-shift), concept shift ($Y|X$-shift), and label distribution shift ($Y$-shift), and discovered that label distribution shifts are strongly correlated with performance degradation. Our analysis in Section F further reveals that these shifts are highly prevalent in tabular data. This underscores the need for a test-time adaptation method that addresses label distribution shifts by estimating the target label distribution and adjusting predictions accordingly.

Moreover, we visualize model predictions in Figure 2 and observe that, similar to other domains (Wu et al. 2021; Hwang et al. 2022; Park, Seo, and Yang 2023), the marginal distribution of output labels is biased toward the source label distribution. Given that tabular models are often poorly calibrated (Figure 1), we conduct an experiment using a perfectly calibrated (oracle) model, which yields high confidence for correct samples and low confidence for incorrect ones. As shown in Table 1, our label distribution adaptation method improves significantly under these conditions, from 63.7 to 90.1 on HELOC and from 79.6 to 84.7 on Voting. This underscores the need for an uncertainty calibrator specific to tabular data.

Table 1: Key findings demonstrate that uncertainty calibration enhances the performance of the label distribution handler.

| Method | HELOC | Voting |
|---|---|---|
| Source | 47.6 | 79.3 |
| AdapTable | 63.7 | 79.6 |
| AdapTable (Oracle) | 90.1 | 84.7 |

3 AdapTable

This section introduces AdapTable, the first model-agnostic test-time adaptation strategy for tabular data. AdapTable uses per-sample temperature scaling to correct overconfident yet incorrect predictions; by treating each column as a graph node, it employs a shift-aware uncertainty calibrator with graph neural networks to capture both individual and complex feature shifts (Section 3.2). It also estimates the average label distribution of the current test batch and adjusts the model’s output predictions accordingly (Section 3.3). We also provide a theoretical justification for how our label distribution estimation reduces the error bound in Section 3.4. The overall framework of AdapTable is depicted in Figure 3.

3.1 Test-Time Adaptation Setup for Tabular Data

We begin by defining the problem setup for test-time adaptation (TTA) for tabular data. Let $f_{\theta}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{C}$ be a pre-trained classifier on the labeled source tabular domain $\mathcal{D}_{s}=\{(\mathbf{x}_{i}^{s},y_{i}^{s})\}_{i}\subset X_{s}\times Y_{s}$ , where each pair consists of a tabular input $\mathbf{x}_{i}^{s}\in\mathcal{X}=\mathbb{R}^{D}$ and its corresponding output class label $y_{i}^{s}\in\mathcal{Y}=\{1,\cdots,C\}$ . The classifier takes a row $\mathbf{x}_{i}\in\mathbb{R}^{D}$ from a table and returns output logit $f_{\theta}(\mathbf{x}_{i})\in\mathbb{R}^{C}$ . Here, $D$ and $C$ are the number of input features and output classes, respectively. The objective of TTA for tabular data is to adapt $f_{\theta}$ to the unlabeled target tabular domain $\mathcal{D}_{t}={\{\mathbf{x}_{i}^{t}\}}_{i}=X_{t}$ during inference, without access to $\mathcal{D}_{s}$ . Unlike most TTA methods that fine-tune model parameters $\theta$ with unsupervised objectives, our approach directly adjusts the output prediction $f_{\theta}(\mathbf{x}_{i}^{t})$ .

3.2 Shift-Aware Uncertainty Calibrator

This section describes a shift-aware uncertainty calibrator $g_{\phi}:\mathbb{R}^{C}\times\mathbb{R}^{D\times N}\rightarrow\mathbb{R}^{+}$ designed to adjust the poorly calibrated original predictions $p_{t}(y|\mathbf{x}_{i}^{t})=\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t}))$ , where $\text{softmax}(z)_{i}=\exp(z_{i})/\sum_{i^{\prime}}\exp(z_{i^{\prime}})$ normalizes the logits. Our shift-aware uncertainty calibrator lowers the confidence of overconfident yet incorrect predictions, thereby 1) facilitating better alignment of these predictions with the estimated target label distribution, and 2) mitigating their impact on the inaccurate estimation of the target label distribution.

Conventional post-hoc calibration methods (Platt 2000; Stylianou and Flournoy 2002) typically take only the original model prediction $f_{\theta}(\mathbf{x}_{i}^{t})$ as input and return the corresponding temperature $T_{i}$ without taking input variations into account. We argue that this can be suboptimal as it fails to account for the uncertainty arising from variations in the input itself. Instead, our $g_{\phi}$ not only considers $f_{\theta}(\mathbf{x}_{i}^{t})$ but also incorporates $\mathbf{x}_{i}^{t}$ with the shift trend $\mathbf{s}^{t}$ of the current batch as additional inputs. Capturing the common shift patterns within the current batch allows the calibrator to reflect the uncertainty that these patterns induce more accurately.

In detail, the shift trend $\mathbf{s}^{t}=(\mathbf{s}_{u}^{t})_{u=1}^{D}\in\mathbb{R}^{D\times N}$ is defined for a specific column index $u$ as follows:

$$\mathbf{s}_{u}^{t}=\Big(\mathbf{x}^{t}_{iu}-\frac{1}{|\mathcal{D}_{s}|}\sum_{i^{\prime}=1}^{|\mathcal{D}_{s}|}\mathbf{x}_{i^{\prime}u}^{s}\Big)_{i=1}^{N}\in\mathbb{R}^{N}. \tag{1}$$

Here, $\mathbf{s}_{u}^{t}$ represents the difference between the values of the $u$-th column within the current test batch and the average value of the corresponding column in the source data. Using $\mathbf{s}_{u}^{t}$ for each column $u$, we define a shift trend graph where each node $u$ represents a column, and each edge captures the relationship between different columns; the node feature for each node $u$ is defined as $\mathbf{s}_{u}^{t}$, and the adjacency matrix is represented by a $D\times D$ all-ones matrix.
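As a minimal sketch of Equation 1 and the shift trend graph, the snippet below computes the node features and the all-ones adjacency matrix for one test batch. It assumes that the per-column source means were stored at training time (the source data themselves are not available at test time) and that categorical features are already numerically encoded; the function names `shift_trend` and `all_ones_adjacency` are hypothetical.

```python
def shift_trend(test_batch, source_means):
    """Equation 1: return s^t as D columns, each holding N per-sample offsets.

    test_batch:   N rows, each a list of D numerical feature values.
    source_means: D per-column means of the source data (assumed precomputed).
    """
    num_cols = len(source_means)
    return [
        [row[u] - source_means[u] for row in test_batch]
        for u in range(num_cols)
    ]


def all_ones_adjacency(num_cols):
    """Adjacency of the shift trend graph: every column is linked to every column."""
    return [[1] * num_cols for _ in range(num_cols)]


# Illustrative batch with N = 3 rows and D = 2 columns.
batch = [[1.0, 10.0], [3.0, 14.0], [5.0, 12.0]]
s_t = shift_trend(batch, source_means=[2.0, 10.0])   # node u has feature s_t[u]
adjacency = all_ones_adjacency(2)
```

Each entry `s_t[u]` is the feature vector of node `u`, so a batch whose columns sit at the source means produces all-zero node features.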

A graph neural network (GNN) is then applied to the graph formed above, enabling the exchange of shift trends between different columns through message passing. This process generates a column-wise contextualized representation, which is then averaged to produce an overall feature representation that encompasses all columns. Finally, the averaged node representation is concatenated with the initial prediction $f_{\theta}(\mathbf{x}_{i}^{t})$ to yield the final output temperature $T_{i}$ . This GNN-based uncertainty calibration not only captures shifts in individual columns but also sensitively detects correlation shifts occurring simultaneously across different columns, which are common in the tabular domain. A more detailed explanation of the architecture and training of the shift-aware uncertainty calibrator can be found in Section A.

3.3 Label Distribution Handler

This section introduces a label distribution handler designed to accurately estimate the target label distribution for the current test batch and adjust the model’s output predictions accordingly. This approach is empirically justified by our observation that the marginal distribution of model predictions $p_{t}(y)$ in the target domain tends to be biased towards the source label distribution $p_{s}(y)$ , as discussed in Section 2.2 and illustrated in Figure 2.

A straightforward solution to correct this bias is to simply multiply $p_{t}(y)/p_{s}(y)$ to align the marginal label distribution (Berthelot et al. 2020). Specifically, given $p_{t}(y|\mathbf{x}_{i}^{t})=\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t}))$ , the adjusted prediction would be:

$$\text{norm}\big(p_{t}(y|\mathbf{x}_{i}^{t})\,p_{t}(y)/p_{s}(y)\big) \tag{2}$$

where $\text{norm}(z)_{i}=z_{i}/\sum_{i^{\prime}}{z_{i^{\prime}}}$ normalizes the unnormalized probability. However, we find two major issues: 1) $p_{t}(y|\mathbf{x}_{i}^{t})$ is often poorly calibrated and 2) overconfident yet incorrect predictions significantly hinder the accurate estimation of the target label distribution $p_{t}(y)$ (Section 2.2).

To tackle these challenges, we propose a simple yet effective estimator $\bar{p}_{i}(y|\mathbf{x}_{i}^{t})$ defined as follows:

$$\bar{p}_{i}(y|\mathbf{x}_{i}^{t})=\frac{\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})+\text{norm}\big(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})\,p_{t}(y)/p_{s}(y)\big)}{2}. \tag{3}$$

The key differences between the original Equation 2 and our Equation 3 are: 1) we use the calibrated prediction $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ instead of the original prediction $p_{t}(y|\mathbf{x}_{i}^{t})$ to enhance uncertainty quantification, and 2) we combine the calibrated estimate $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ with the distributionally aligned prediction $\text{norm}(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})p_{t}(y)/p_{s}(y))$ for more robust estimation.
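To make the difference between Equation 2 and Equation 3 concrete, the following sketch applies both to a single prediction. The numerical values are illustrative only, the calibrated prediction is a stand-in rather than the output of the calibrator, $p_{t}(y)$ is taken as given, and the function names are hypothetical.

```python
def norm(values):
    """Rescale non-negative scores so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def aligned_prediction(probs, p_target, p_source):
    """Equation 2: reweight each class by p_t(y) / p_s(y), then renormalize."""
    return norm([p * t / s for p, t, s in zip(probs, p_target, p_source)])


def combined_prediction(calibrated_probs, p_target, p_source):
    """Equation 3: average the calibrated and the distributionally aligned prediction."""
    aligned = aligned_prediction(calibrated_probs, p_target, p_source)
    return [(c + a) / 2 for c, a in zip(calibrated_probs, aligned)]


# Illustrative values only: a source skewed towards class 0, a balanced target.
p_source = [0.8, 0.2]
p_target = [0.5, 0.5]
calibrated = [0.6, 0.4]   # stand-in for a calibrated prediction

aligned = aligned_prediction(calibrated, p_target, p_source)
combined = combined_prediction(calibrated, p_target, p_source)
```

In this example the aligned prediction is roughly (0.27, 0.73), while the averaged estimate is roughly (0.44, 0.56): the correction still flips the predicted class, but the average keeps the estimate closer to the calibrated prediction.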

Given the known source label distribution $p_{s}(y)$, we now explain the step-by-step process for estimating $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ and $p_{t}(y)$. The original prediction $p_{t}(y|\mathbf{x}_{i}^{t})$ is calibrated into $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ through a two-stage uncertainty calibration process. Specifically, for the current test batch $\{\mathbf{x}_{i}^{t}\}_{i=1}^{N}$, we calculate the shift trend $\mathbf{s}^{t}$ using Equation 1 and obtain the per-sample temperature $T_{i}=g_{\phi}(f_{\theta}(\mathbf{x}_{i}^{t}),\mathbf{s}^{t})$ using the shift-aware uncertainty calibrator $g_{\phi}$, which captures overall distribution shifts as well as correlation and individual column shifts within the current batch. Here, we define the uncertainty $\delta_{i}$ of $f_{\theta}(\mathbf{x}_{i}^{t})$ as the reciprocal of the margin of the calibrated probability distribution $\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i})$. We then compute the quantile of each instance $\mathbf{x}_{i}^{t}$ within the current batch according to $\delta_{i}$ and recalibrate the original probability with $\tilde{T}_{i}$, resulting in $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})=\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t})/\tilde{T}_{i})$. This process calibrates predictions by leveraging relative uncertainty within the batch. Our temperature $\tilde{T}_{i}$ is defined as:

$$\tilde{T}_{i}=\begin{cases}T&\text{if }\delta_{i}\geq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{high}}\big)\\ 1/T&\text{if }\delta_{i}\leq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{low}}\big)\\ 1&\text{otherwise},\end{cases} \tag{4}$$

where $Q(X,q)$ is a quantile function which gives the value corresponding to the lower $q$ quantile in $X$ , $T=1.5\max_{j}p_{s}(y)_{j}/\min_{j}p_{s}(y)_{j}$ is a temperature, and $q_{\text{low}}$ and $q_{\text{high}}$ represent the low and high uncertainty quantiles, respectively. This two-stage uncertainty calibration comprehensively evaluates the current batch and estimates relative uncertainty using $\mathbf{s}^{t}$ , $g_{\phi}$ , and $\tilde{T}_{i}$ .
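The second calibration stage in Equation 4 can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the margin is assumed to be the gap between the two largest class probabilities, the quantile function uses linear interpolation between order statistics, and the per-sample temperatures `temps` are placeholders for the outputs of $g_{\phi}$.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the maximum for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def uncertainty(probs):
    """Reciprocal of the margin (assumed here: top-1 minus top-2 probability)."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return 1.0 / max(top1 - top2, 1e-12)   # guard against a zero margin


def quantile(values, q):
    """Lower q-quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def recalibrate(logits_batch, temps, p_source, q_low=0.25, q_high=0.75):
    """Equation 4: choose T, 1/T, or 1 per sample from its relative uncertainty."""
    base_t = 1.5 * max(p_source) / min(p_source)
    deltas = [uncertainty(softmax(z, t)) for z, t in zip(logits_batch, temps)]
    low_cut, high_cut = quantile(deltas, q_low), quantile(deltas, q_high)
    recalibrated = []
    for z, delta in zip(logits_batch, deltas):
        if delta >= high_cut:
            t_tilde = base_t          # most uncertain samples: flatten
        elif delta <= low_cut:
            t_tilde = 1.0 / base_t    # most certain samples: sharpen
        else:
            t_tilde = 1.0             # leave the rest unchanged
        recalibrated.append(softmax(z, t_tilde))
    return recalibrated


logits_batch = [[4.0, 0.0], [1.0, 0.0], [0.2, 0.0], [2.0, 0.0], [0.5, 0.0]]
temps = [1.0] * len(logits_batch)   # placeholder for the calibrator outputs T_i
calibrated = recalibrate(logits_batch, temps, p_source=[0.6, 0.4])
```

With these inputs, the two samples with the smallest margins are flattened, the two with the largest margins are sharpened, and the remaining sample keeps its original probabilities.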

Meanwhile, the target label distribution $p_{t}(y)$ is estimated as follows:

$$p_{t}(y)=(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y), \tag{5}$$

where $p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})=\text{norm}\big(p_{t}(y|\mathbf{x}_{i}^{t})/p_{s}(y)\big)$ is a debiased target label estimator that departs from $p_{s}(y)$, and $p^{\text{oe}}_{t}(y)$ is an online target label estimator, initialized as a uniform distribution and updated as:

$$p^{\text{oe}}_{t}(y)=(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}\bar{p}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y) \tag{6}$$

from the current batch to the next, with a smoothing factor $\alpha$ . This online target label estimator leverages label locality between nearby test batches, making it effective for accurately estimating the next batch’s target label distribution. A more detailed explanation of AdapTable is provided in Section A.
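The interplay between Equation 5 and Equation 6 can be sketched with a small stateful helper. The class name `OnlineLabelEstimator` and its method names are hypothetical, and the predictions passed to `update` below are stand-ins for the final adjusted predictions of a batch.

```python
def norm(values):
    """Rescale non-negative scores so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def batch_mean(rows):
    """Class-wise mean over a batch of probability vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


class OnlineLabelEstimator:
    """Tracks p^oe_t(y) across batches and estimates p_t(y) for the current one."""

    def __init__(self, p_source, alpha=0.1):
        self.p_source = p_source
        self.alpha = alpha
        # The online estimator starts from a uniform distribution.
        self.p_oe = [1.0 / len(p_source)] * len(p_source)

    def estimate_target(self, preds):
        """Equation 5: mix the debiased batch mean with the online estimate."""
        debiased = [
            norm([p / s for p, s in zip(pred, self.p_source)]) for pred in preds
        ]
        mean_de = batch_mean(debiased)
        return [
            (1 - self.alpha) * m + self.alpha * o
            for m, o in zip(mean_de, self.p_oe)
        ]

    def update(self, final_preds):
        """Equation 6: carry the mean adjusted prediction over to the next batch."""
        mean_final = batch_mean(final_preds)
        self.p_oe = [
            (1 - self.alpha) * m + self.alpha * o
            for m, o in zip(mean_final, self.p_oe)
        ]


estimator = OnlineLabelEstimator(p_source=[0.8, 0.2], alpha=0.1)
p_t = estimator.estimate_target([[0.6, 0.4], [0.7, 0.3]])
estimator.update([[0.4, 0.6], [0.5, 0.5]])   # stand-in adjusted predictions
```

Note that, following the equations as written, the weight $\alpha$ is placed on the previous online estimate, so with $\alpha=0.1$ the current batch dominates both quantities.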

Table 2: The average balanced accuracy (%) and macro F1 score (%) with their standard errors for both supervised models and TTA baselines are reported across six datasets including natural distribution shifts within the TableShift (Gardner, Popovic, and Schmidt 2023) benchmark. The results are averaged over three random repetitions.

| Method | HELOC bAcc. | HELOC F1 | Voting bAcc. | Voting F1 | Hospital Readmission bAcc. | Hospital Readmission F1 | ICU Mortality bAcc. | ICU Mortality F1 | Childhood Lead bAcc. | Childhood Lead F1 | Diabetes bAcc. | Diabetes F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $k$-NN | 62.0 ± 0.0 | 40.3 ± 0.0 | 76.9 ± 0.0 | 71.1 ± 0.0 | 57.7 ± 0.0 | 56.9 ± 0.0 | 81.5 ± 0.3 | 47.6 ± 0.0 | 57.6 ± 0.1 | 56.9 ± 0.0 | 67.9 ± 0.3 | 53.3 ± 0.1 |
| LogReg | 63.5 ± 0.0 | 44.2 ± 0.0 | 80.2 ± 0.0 | 76.2 ± 0.0 | 61.4 ± 0.0 | 58.9 ± 0.0 | 61.6 ± 0.0 | 62.2 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 71.0 ± 0.0 | 55.4 ± 0.0 |
| RandomForest | 58.2 ± 7.6 | 32.2 ± 1.5 | 81.7 ± 0.1 | 68.4 ± 0.7 | 64.4 ± 0.5 | 42.1 ± 1.2 | 85.2 ± 0.4 | 52.0 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 76.5 ± 0.1 | 46.9 ± 0.1 |
| XGBoost | 57.6 ± 7.2 | 39.9 ± 4.9 | 80.5 ± 0.2 | 75.8 ± 0.4 | 63.1 ± 0.1 | 61.3 ± 0.4 | 79.9 ± 0.1 | 64.3 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 71.5 ± 0.1 | 56.2 ± 0.1 |
| CatBoost | 65.4 ± 0.0 | 51.7 ± 0.0 | 80.4 ± 0.0 | 76.8 ± 0.0 | 63.4 ± 0.0 | 61.8 ± 0.5 | 81.4 ± 0.0 | 59.8 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 65.0 ± 0.0 | 59.3 ± 0.0 |
| + AdapTable | 65.5 ± 0.0 | 65.4 ± 0.0 | 79.6 ± 0.0 | 78.6 ± 0.0 | 65.4 ± 0.0 | 62.5 ± 0.3 | 82.6 ± 0.0 | 64.8 ± 0.3 | 62.8 ± 0.4 | 61.7 ± 0.3 | 74.2 ± 0.0 | 62.5 ± 0.3 |
| Source | 53.2 ± 1.5 | 38.2 ± 3.5 | 76.5 ± 0.5 | 77.3 ± 0.4 | 61.1 ± 0.1 | 60.2 ± 0.3 | 56.3 ± 0.0 | 58.1 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.2 ± 0.0 | 55.5 ± 0.0 |
| PL | 51.8 ± 1.0 | 34.9 ± 2.3 | 75.6 ± 0.5 | 76.6 ± 0.5 | 60.5 ± 0.1 | 58.9 ± 0.3 | 56.3 ± 0.0 | 58.0 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.1 ± 0.1 | 55.3 ± 0.0 |
| TTT++ | 53.2 ± 1.5 | 38.2 ± 3.6 | 76.8 ± 0.5 | 77.6 ± 0.2 | 61.1 ± 0.1 | 60.2 ± 0.3 | 56.6 ± 0.5 | 58.5 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.4 ± 0.0 | 55.7 ± 0.0 |
| TENT | 51.2 ± 1.2 | 33.2 ± 2.6 | 74.0 ± 0.6 | 74.9 ± 0.6 | 60.2 ± 0.1 | 58.3 ± 0.3 | 55.1 ± 0.1 | 56.3 ± 0.1 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.0 ± 0.0 | 55.0 ± 0.0 |
| EATA | 53.2 ± 1.5 | 38.2 ± 3.6 | 76.5 ± 0.5 | 77.3 ± 0.4 | 61.1 ± 0.1 | 60.2 ± 0.4 | 56.3 ± 0.0 | 58.1 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 55.2 ± 0.0 | 55.5 ± 0.0 |
| SAR | 50.0 ± 0.0 | 30.1 ± 0.0 | 62.0 ± 1.2 | 59.4 ± 1.6 | 57.1 ± 1.1 | 51.3 ± 2.2 | 51.1 ± 0.1 | 49.1 ± 0.2 | 50.0 ± 0.0 | 47.9 ± 0.0 | 53.4 ± 0.0 | 52.2 ± 0.0 |
| LAME | 50.0 ± 0.0 | 30.1 ± 0.0 | 54.6 ± 0.5 | 46.8 ± 1.0 | 54.9 ± 0.5 | 46.9 ± 1.0 | 50.0 ± 0.0 | 46.7 ± 0.0 | 50.0 ± 0.0 | 47.9 ± 0.0 | 54.8 ± 0.1 | 54.8 ± 0.2 |
| AdapTable | 65.8 ± 0.6 | 64.5 ± 0.3 | 78.4 ± 0.3 | 78.6 ± 0.0 | 61.7 ± 0.0 | 61.7 ± 0.0 | 65.9 ± 0.1 | 65.4 ± 0.1 | 69.2 ± 0.1 | 60.9 ± 0.3 | 70.9 ± 0.1 | 68.3 ± 0.1 |

3.4 Theoretical Insights

Theorem 3.1.

Let $\hat{Y}|X$ and $\hat{Y}_{o}|X$ be defined as follows:

$$\hat{Y}|X=\{\operatorname*{arg\,max}_{j\in\mathcal{Y}}f_{\theta}(\mathbf{x})_{j}\mid\mathbf{x}\in X\}, \tag{7}$$
$$\hat{Y}_{o}|X=\{\operatorname*{arg\,max}_{j\in\mathcal{Y}}f_{\theta}(\mathbf{x})_{j}+\log p_{t}^{oe}(y)_{j}\mid\mathbf{x}\in X\}. \tag{8}$$

Given the error $\epsilon(\hat{Y}|X)=\mathbb{P}(\hat{Y}\neq Y|X)$ , with true labels $Y$ of inputs $X$ , the error gap $|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})|$ is upper bounded by

$$K_{1}\Big\|1-\frac{p_{t}^{oe}(y)}{p_{t}(y)}\Big\|_{1}BSE(\hat{Y})+K_{2}\Delta_{CE}(\hat{Y}), \tag{9}$$

where $K_{1}$ and $K_{2}$ are constants related to $p_{t}(y)$ , and $p_{s}(y)$ , respectively.

Theorem 3.1 extends Theorem 2.3 in ODS (Zhou et al. 2023) to cases where the source label distribution is not uniform. It decomposes the error gap between the original model on the source domain and the model adapted with $p_{t}^{oe}(y)$ on the target domain into several components. These components include $\|1-p_{t}^{oe}(y)/p_{t}(y)\|_{1}$, which is the error of the estimated target label distribution, $BSE(\hat{Y})$, which reflects the model’s performance on the source domain, and $\Delta_{CE}(\hat{Y})$, which measures the generalization of feature representations adapted by the TTA algorithm. Overall, Theorem 3.1 underscores the importance of tracking label distributions and efficiently adapting models to handle label distribution shifts. The detailed explanation and proof of Theorem 3.1 can be found in Section B.
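The two predictors compared in Theorem 3.1 differ only in the additive log-prior term of Equation 8. The sketch below illustrates this on a single logit vector; the values are made up, the function names are hypothetical, and class indices start at 0 rather than 1 as in the text.

```python
import math


def predict(logits):
    """Equation 7: the index of the largest logit."""
    return max(range(len(logits)), key=lambda j: logits[j])


def predict_with_prior(logits, p_oe):
    """Equation 8: add the log of the estimated target label distribution first."""
    adjusted = [z + math.log(p) for z, p in zip(logits, p_oe)]
    return predict(adjusted)


logits = [1.0, 0.8]   # the unadjusted model narrowly prefers class 0
p_oe = [0.2, 0.8]     # illustrative online estimate favouring class 1

plain = predict(logits)
adjusted = predict_with_prior(logits, p_oe)
```

Here the unadjusted prediction is class 0, whereas the prior-adjusted prediction is class 1, since the small logit gap is outweighed by the difference in log-prior.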

4 Experiments

This section validates AdapTable’s effectiveness. We begin with an overview of our experimental setup in Section 4.1 and then address key research questions:

• Is AdapTable effective across various tabular distribution shifts, including natural shifts and common corruptions across different tabular models? (Section 4.2)
• Do AdapTable’s components contribute to overall performance improvements, and do they function as intended? (Section 4.3)
• Does AdapTable demonstrate strengths in computational efficiency and hyperparameter sensitivity, which are crucial for test-time adaptation? (Section 4.4)

4.1 Experimental Setup

Datasets.

We evaluate AdapTable on six diverse datasets—HELOC, Voting, Hospital Readmission, ICU Mortality, Childhood Lead, and Diabetes—within the tabular distribution shift benchmark (Gardner, Popovic, and Schmidt 2023), covering healthcare, finance, and politics with both numerical and categorical features. Additionally, we verify its robustness against six common corruptions—Gaussian, Uniform, Random Drop, Column Drop, Numerical, and Categorical—to ensure its efficacy beyond label distribution shifts. More details of these shifts are in Section C.

Model architectures.

To verify the proposed method under various tabular model architectures, we mainly use MLP, a widely used tabular learning architecture. Additionally, we validate AdapTable on CatBoost (Dorogush, Ershov, and Gulin 2017) and three other representative deep tabular learning models—AutoInt (Song et al. 2019), ResNet (Gorishniy et al. 2021), and FT-Transformer (Gorishniy et al. 2021).

Baselines.

We compare AdapTable with six TTA baselines—PL (Lee 2013), TTT++ (Liu et al. 2021), TENT (Wang et al. 2021a), EATA (Niu et al. 2022), SAR (Niu et al. 2023), and LAME (Boudiaf et al. 2022). TabLog (Ren et al. 2024) is excluded due to its architectural constraint on logical neural networks (Riegel et al. 2020). We also provide performance references from classical machine learning models: $k$ -nearest neighbors ( $k$ -NN), logistic regression (LogReg), random forest (RandomForest), XGBoost (Chen and Guestrin 2016), and CatBoost (Dorogush, Ershov, and Gulin 2017).

Evaluation metrics.

As shown in Figure 2 and Section F, tabular data often exhibit extreme class imbalance. Since accuracy may not be effective in these cases, we use macro F1 score (F1) and balanced accuracy (bAcc.) as the primary evaluation metrics.
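To illustrate why plain accuracy can mislead under class imbalance, the following sketch computes accuracy, balanced accuracy, and macro F1 for a classifier that always predicts the majority class. The data are synthetic and the helper names are hypothetical; balanced accuracy is taken as the mean of per-class recalls and macro F1 as the unweighted mean of per-class F1 scores.

```python
def class_counts(y_true, y_pred, label):
    """True positives, false positives, and false negatives for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        tp, _, fn = class_counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no hits scores 0."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp, fp, fn = class_counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Synthetic example: nine negatives, one positive, always predicting negative.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
bacc = balanced_accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this example accuracy is 0.9, while balanced accuracy is 0.5 and macro F1 is about 0.47, which exposes that the minority class is never predicted.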

Implementation details.

For all experiments, we use a fixed batch size of 64, a common setting in TTA baselines (Schneider et al. 2020; Wang et al. 2021a). The smoothing factor $\alpha$ , low uncertainty quantile $q_{\text{low}}$ , and high uncertainty quantile $q_{\text{high}}$ are set to 0.1, 0.25, and 0.75, respectively. In all tables, the best and second-best results are highlighted in bold and underline, respectively. Further implementation and hyperparameter details are provided in Section E.

4.2 Main Results

Result on natural distribution shifts.

Table 2 presents results on natural distribution shifts. Existing TTA methods, successful in computer vision, struggle in the tabular domain, often failing to outperform the source model or offering limited performance gains. In contrast, AdapTable achieves state-of-the-art results across all datasets, with dramatic performance improvements of up to 26% on the HELOC dataset. Since AdapTable does not rely on model parameter tuning, it can be easily applied to classical machine learning models; when integrated with CatBoost, AdapTable consistently improves performance across all datasets, showcasing its versatility, whereas other baselines cannot be similarly integrated as they require model parameter updates.

Result on common corruptions.

We further evaluate the efficacy of AdapTable across six types of common corruptions in real-world applications by applying them to the test sets of three datasets: HELOC, Voting, and Childhood Lead. As shown in Figure 4, prior TTA methods fail considerably, showing only marginal gains over the unadapted source model across all corruption types. It is worth noting that previous TTA methods have demonstrated significant improvements when dealing with common corruptions in vision data, which highlights the difference between corruptions in the tabular domain and their counterparts in the vision domain. Meanwhile, AdapTable shows substantial improvements across all types of corruptions, with accuracy gains of more than 10% in every scenario, demonstrating its robustness across different types of corruptions.

Result across diverse model architectures.

In Figure 5, we report AdapTable’s effectiveness across three mainstream tabular learning architectures—AutoInt (Song et al. 2019), ResNet (Gorishniy et al. 2021), and FT-Transformer (Gorishniy et al. 2021). We report the average macro F1 score across three datasets—HELOC, Voting, and Childhood Lead. None of the baselines outperform the original source model, with LAME (Boudiaf et al. 2022) even showing significant performance drops. In contrast, AdapTable consistently achieves significant improvements across all architectures, highlighting its robustness and versatility.

Table 3: Ablation study comparing the shift-aware uncertainty calibrator with classical methods—Platt scaling (PS) and isotonic regression (IR). The results are averaged over three random repetitions.

| Method | HELOC | Voting | Hospital Readmission |
|---|---|---|---|
| Source | 38.2 ± 3.5 | 77.3 ± 0.4 | 60.2 ± 0.3 |
| PS | 61.6 ± 1.3 | 73.3 ± 0.2 | 59.4 ± 0.3 |
| IR | 61.3 ± 1.7 | 74.3 ± 0.2 | 58.0 ± 0.4 |
| AdapTable | 64.5 ± 0.6 | 78.6 ± 0.0 | 61.7 ± 0.0 |

4.3 Ablation Study

Shift-aware uncertainty calibrator.

First, we validate the shift-aware uncertainty calibrator described in Section 3.2. Figures 6 (a) and (b) present reliability diagrams before and after applying calibration. The results demonstrate that our calibrator significantly reduces both overconfidence and underconfidence. Next, we assess the shift-awareness of our calibrator in Figure 6 (c), where we plot the average temperature against the maximum mean discrepancy (MMD) with the training data. As expected, greater shifts from the training data correspond to higher temperatures, indicating increased uncertainty. The strong positive correlation between MMD and average temperature confirms the calibrator’s effectiveness in capturing prediction uncertainty under distribution shifts in tabular data. Finally, we compare our shift-aware uncertainty calibrator with classical methods, namely Platt scaling and isotonic regression, as shown in Table 3. The results indicate that prior methods exhibit inconsistent performance across different datasets. For example, Platt scaling improves performance on the HELOC dataset but degrades it on Voting and Hospital Readmission. In contrast, our shift-aware uncertainty calibrator consistently outperforms these classical methods, which do not account for domain shift during calibration.

Label distribution handler.

Here, we validate the workings and efficacy of the label distribution handler. First, Figure 7 compares the Jensen–Shannon (JS) divergence between the true and estimated label distributions across each online batch. The results show that our handler significantly improves the accuracy of label distribution estimation, yielding low JS divergence across all datasets. Next, we assess the robustness of our label distribution handler with respect to the test batch distribution. Namely, we evaluate under 1) a severe class imbalance scenario with an imbalance ratio of 10, and 2) a class-wise temporal correlation scenario, where class labels exhibit strong temporal locality. Results are shown in Table 4. AdapTable shows performance gains in all scenarios, with improvements of up to 27% and 19% in the class imbalance and temporal correlation scenarios on HELOC and Childhood Lead, respectively. More experimental details are provided in Section C.
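For reference, the JS divergence between a true and an estimated label distribution can be computed as in the sketch below. The natural logarithm is an assumption, since the base is not stated here; with it, the divergence ranges from 0 for identical distributions to $\log 2$ for distributions with disjoint support. The example distributions are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric divergence measured against the midpoint distribution."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


true_dist = [0.3, 0.7]
estimated = [0.4, 0.6]   # illustrative estimate of the label distribution

divergence = js_divergence(true_dist, estimated)
```

The midpoint distribution is strictly positive wherever either input is, so the computation stays finite even when an estimate assigns zero mass to a class.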

Table 4: The average macro F1 score (%) with standard errors for TTA baselines is reported using MLP across three datasets with 1) class imbalance and 2) temporal correlation from the TableShift benchmark. The results are averaged over three random repetitions.

| Method | HELOC (class imbalance) | Voting (class imbalance) | Childhood Lead (class imbalance) | HELOC (temporal correlation) | Voting (temporal correlation) | Childhood Lead (temporal correlation) |
|---|---|---|---|---|---|---|
| Source | 32.5 $\pm$ 3.5 | 52.3 $\pm$ 4.9 | 36.7 $\pm$ 6.5 | 31.6 $\pm$ 0.3 | 62.2 $\pm$ 0.1 | 35.1 $\pm$ 0.2 |
| PL | 32.0 $\pm$ 3.6 | 52.1 $\pm$ 4.9 | 36.7 $\pm$ 6.5 | 30.9 $\pm$ 0.2 | 54.9 $\pm$ 0.1 | 35.1 $\pm$ 0.2 |
| TENT | 32.5 $\pm$ 3.5 | 52.3 $\pm$ 4.9 | 36.7 $\pm$ 6.5 | 31.6 $\pm$ 0.3 | 55.7 $\pm$ 0.1 | 35.1 $\pm$ 0.2 |
| EATA | 32.5 $\pm$ 3.5 | 52.3 $\pm$ 4.9 | 36.7 $\pm$ 6.5 | 31.6 $\pm$ 0.3 | 55.7 $\pm$ 0.1 | 35.1 $\pm$ 0.2 |
| SAR | 31.8 $\pm$ 3.5 | 57.1 $\pm$ 5.3 | 36.7 $\pm$ 6.5 | 32.0 $\pm$ 0.2 | 54.4 $\pm$ 0.5 | 35.1 $\pm$ 0.2 |
| LAME | 29.9 $\pm$ 3.5 | 58.7 $\pm$ 4.0 | 36.7 $\pm$ 6.5 | 29.0 $\pm$ 0.1 | 38.0 $\pm$ 0.4 | 35.1 $\pm$ 0.2 |
| AdapTable | 59.7 $\pm$ 0.8 | 62.0 $\pm$ 4.6 | 63.9 $\pm$ 1.0 | 56.1 $\pm$ 0.3 | 64.5 $\pm$ 0.0 | 64.8 $\pm$ 0.3 |

4.4 Further Analysis

Computational efficiency.

The leftmost part of Figure 8 compares the computational efficiency of AdapTable with TTA baselines. On the HELOC dataset, AdapTable’s total elapsed time is approximately 1.54 seconds, translating to about 0.0002 seconds per sample, which is highly desirable. Moreover, AdapTable achieves an optimal efficiency-efficacy trade-off.

Hyperparameter sensitivity.

Figure 8 further analyzes the hyperparameter sensitivity of AdapTable on the Childhood Lead dataset. As shown in the figure, AdapTable is largely insensitive to changes in the smoothing factor $\alpha$, the low uncertainty quantile $q_{\text{low}}$, and the high uncertainty quantile $q_{\text{high}}$.

5 Related Work

Machine learning for tabular data.

The distinct nature of tabular data reduces the effectiveness of deep neural networks, making gradient-boosted decision trees (Chen and Guestrin 2016; Dorogush, Ershov, and Gulin 2017) more suitable. However, research continues to develop deep learning models tailored for tabular data (Murtagh 1991; Song et al. 2019; Arik and Pfister 2021; Gorishniy et al. 2021), including recent efforts involving large language models (Fang et al. 2024; Hegselmann et al. 2023; Hollmann et al. 2023; Dinh et al. 2022) that leverage textual prior knowledge. Notably, our method is architecture-agnostic and can be applied to any model.

Distribution shifts in the tabular domain.

Recently, distribution shift benchmarks for tabular data have been introduced (Liu et al. 2023; Gardner, Popovic, and Schmidt 2023). WhyShift (Liu et al. 2023) reveals that concept shifts ($Y|X$-shifts) are more prevalent and detrimental than covariate shifts ($X$-shifts). TableShift (Gardner, Popovic, and Schmidt 2023) offers a benchmark with 15 classification tasks, highlighting a strong correlation between shift gaps and label distribution shifts ($Y$-shifts), which supports the validity of our method.

Test-time adaptation.

Over the past years, test-time adaptation (TTA) methods have been proposed across various domains, such as computer vision (Wang et al. 2021a; Gong et al. 2022; Niu et al. 2023; Shim, Kim, and Yang 2024), natural language processing (Shi et al. 2024; Liang, He, and Tan 2024), and speech processing (Kim et al. 2023). These methods adapt pre-trained models to unlabeled target domains without requiring access to source data, making them well-suited for sensitive tabular data. TabLog (Ren et al. 2024) is a recent TTA method specifically for tabular data, but it has architectural constraints and lacks a comprehensive analysis of distribution shifts. This underscores the need for model-agnostic TTA methods with a deeper understanding of tabular data, which we address in this paper.

6 Conclusion

In this paper, we have introduced AdapTable, a test-time adaptation framework tailored for tabular data. AdapTable overcomes the limitations of previous methods, which fail to address label distribution shifts and lack versatility across architectures. Our approach combines a shift-aware uncertainty calibrator, which enhances calibration by modeling column shifts, with a label distribution handler, which adjusts the output distribution based on real-time estimates of the label distribution in the current batch. Extensive experiments show that AdapTable achieves state-of-the-art performance across various datasets and architectures, effectively managing both natural distribution shifts and common corruptions.

References

Arik and Pfister (2021) Arik, S. Ö.; and Pfister, T. 2021. TabNet: Attentive interpretable tabular learning. In AAAI Conference on Artificial Intelligence (AAAI).
Association (2018) Association, A. D. 2018. Economic costs of diabetes in the US in 2017. Diabetes care.
Berthelot et al. (2020) Berthelot, D.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Sohn, K.; Zhang, H.; and Raffel, C. 2020. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. In International Conference on Learning Representations (ICLR).
Beyazit et al. (2024) Beyazit, E.; Kozaczuk, J.; Li, B.; Wallace, V.; and Fadlallah, B. 2024. An inductive bias for tabular deep learning. In Conference on Neural Information Processing Systems (NeurIPS).
Bischl et al. (2021) Bischl, B.; Casalicchio, G.; Feurer, M.; Hutter, F.; Lang, M.; Mantovani, R. G.; van Rijn, J. N.; and Vanschoren, J. 2021. OpenML Benchmarking Suites. In Conference on Neural Information Processing Systems (NeurIPS).
Boudiaf et al. (2022) Boudiaf, M.; Mueller, R.; Ayed, I. B.; and Bertinetto, L. 2022. Parameter-free Online Test-time Adaptation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Brown et al. (2018) Brown, K.; Doran, D.; Kramer, R.; and Reynolds, B. 2018. HELOC Applicant Risk Performance Evaluation by Topological Hierarchical Decomposition. In NeurIPS Workshop on Challenges and Opportunities for AI in Financial Services.
Chen and Guestrin (2016) Chen, T.; and Guestrin, C. 2016. XGBoost: A Scalable Tree Boosting System. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Clore et al. (2014) Clore, J.; Cios, K.; DeShazo, J.; and Strack, B. 2014. Diabetes 130-US hospitals for years 1999-2008. UCI Machine Learning Repository.
DeFilippis and Van Spall (2021) DeFilippis, E. M.; and Van Spall, H. G. 2021. Is it time for sex-specific guidelines for cardiovascular disease? Journal of the American College of Cardiology (JACC).
Dinh et al. (2022) Dinh, T.; Zeng, Y.; Zhang, R.; Lin, Z.; Gira, M.; Rajput, S.; Sohn, J.-y.; Papailiopoulos, D.; and Lee, K. 2022. Lift: Language-interfaced fine-tuning for non-language machine learning tasks. In Conference on Neural Information Processing Systems (NeurIPS).
Dorogush, Ershov, and Gulin (2017) Dorogush, A. V.; Ershov, V.; and Gulin, A. 2017. CatBoost: gradient boosting with categorical features support. In NeurIPS Workshop on ML Systems.
Fang et al. (2024) Fang, X.; Xu, W.; Tan, F. A.; Zhang, J.; Hu, Z.; Qi, Y. J.; Nickleach, S.; Socolinsky, D.; Sengamedu, S.; Faloutsos, C.; et al. 2024. Large language models (LLMs) on tabular data: Prediction, generation, and understanding-a survey. Transactions on Machine Learning Research (TMLR).
Fey and Lenssen (2019) Fey, M.; and Lenssen, J. E. 2019. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
for Disease Control, Prevention et al. (2003) for Disease Control, C.; Prevention; et al. 2003. National Health and Nutrition Examination Survey (NHANES) Data. NCfHS, editor. NCHS.
Gandelsman et al. (2022) Gandelsman, Y.; Sun, Y.; Chen, X.; and Efros, A. 2022. Test-time training with masked autoencoders. In Conference on Neural Information Processing Systems (NeurIPS).
Gardner, Popovic, and Schmidt (2023) Gardner, J.; Popovic, Z.; and Schmidt, L. 2023. Benchmarking Distribution Shift in Tabular Data with TableShift. In Conference on Neural Information Processing Systems (NeurIPS).
Gong et al. (2022) Gong, T.; Jeong, J.; Kim, T.; Kim, Y.; Shin, J.; and Lee, S.-J. 2022. Note: Robust continual test-time adaptation against temporal correlation. In Conference on Neural Information Processing Systems (NeurIPS).
Gorishniy et al. (2021) Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; and Babenko, A. 2021. Revisiting Deep Learning Models for Tabular Data. In Conference on Neural Information Processing Systems (NeurIPS).
Grinsztajn, Oyallon, and Varoquaux (2022) Grinsztajn, L.; Oyallon, E.; and Varoquaux, G. 2022. Why do tree-based models still outperform deep learning on tabular data? In Conference on Neural Information Processing Systems (NeurIPS).
He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Hegselmann et al. (2023) Hegselmann, S.; Buendia, A.; Lang, H.; Agrawal, M.; Jiang, X.; and Sontag, D. 2023. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Hein et al. (2017) Hein, D.; Depeweg, S.; Tokic, M.; Udluft, S.; Hentschel, A.; Runkler, T. A.; and Sterzing, V. 2017. A benchmark environment motivated by industrial control problems. In IEEE Symposium Series on Computational Intelligence (SSCI).
Hollmann et al. (2023) Hollmann, N.; Müller, S.; Eggensperger, K.; and Hutter, F. 2023. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. In International Conference on Learning Representations (ICLR).
Hwang et al. (2022) Hwang, S.; Lee, S.; Kim, S.; Ok, J.; and Kwak, S. 2022. Combating label distribution shift for active domain adaptation. In European Conference on Computer Vision (ECCV).
Johnson et al. (2021) Johnson, A.; Bulgarelli, L.; Pollard, T.; Horng, S.; Celi, L. A.; and Mark, R. 2021. MIMIC-IV.
Johnson et al. (2016) Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific data.
Kim et al. (2023) Kim, C.; Park, J.; Shim, H.; and Yang, E. 2023. SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization. In Conference of the International Speech Communication Association (INTERSPEECH).
Lee (2013) Lee, D.-H. 2013. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. In ICML Workshop on Challenges in Representation Learning.
Liang, He, and Tan (2024) Liang, J.; He, R.; and Tan, T. 2024. A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision (IJCV).
Lin et al. (2017) Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In IEEE/CVF International Conference on Computer Vision (ICCV).
Liu et al. (2023) Liu, J.; Wang, T.; Cui, P.; and Namkoong, H. 2023. On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets. In Conference on Neural Information Processing Systems (NeurIPS).
Liu et al. (2021) Liu, Y.; Kothari, P.; Van Delft, B.; Bellot-Gurlet, B.; Mordan, T.; and Alahi, A. 2021. TTT++: When does self-supervised test-time training fail or thrive? In Conference on Neural Information Processing Systems (NeurIPS).
Menon et al. (2021) Menon, A. K.; Jayasumana, S.; Rawat, A. S.; Jain, H.; Veit, A.; and Kumar, S. 2021. Long-tail learning via logit adjustment. In International Conference on Learning Representations (ICLR).
Mosca, Barrett-Connor, and Kass Wenger (2011) Mosca, L.; Barrett-Connor, E.; and Kass Wenger, N. 2011. Sex/gender differences in cardiovascular disease prevention: what a difference a decade makes. Circulation.
Murtagh (1991) Murtagh, F. 1991. Multilayer perceptrons for classification and regression. Neurocomputing.
Niu et al. (2022) Niu, S.; Wu, J.; Zhang, Y.; Chen, Y.; Zheng, S.; Zhao, P.; and Tan, M. 2022. Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning (ICML).
Niu et al. (2023) Niu, S.; Wu, J.; Zhang, Y.; Wen, Z.; Chen, Y.; Zhao, P.; and Tan, M. 2023. Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations (ICLR).
Park, Seo, and Yang (2023) Park, J.; Seo, H.; and Yang, E. 2023. Pc-adapter: Topology-aware adapter for efficient domain adaption on point clouds with rectified pseudo-label. In IEEE/CVF International Conference on Computer Vision (ICCV).
Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems (NeurIPS).
Platt (2000) Platt, J. 2000. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers.
Ren et al. (2024) Ren, W.; Li, X.; Chen, H.; Rakesh, V.; Wang, Z.; Das, M.; and Honavar, V. G. 2024. TabLog: Test-Time Adaptation for Tabular Data Using Logic Rules. In International Conference on Machine Learning (ICML).
Riegel et al. (2020) Riegel, R.; Gray, A.; Luus, F.; Khan, N.; Makondo, N.; Akhalwaya, I. Y.; Qian, H.; Fagin, R.; Barahona, F.; Sharma, U.; et al. 2020. Logical neural networks. arXiv preprint arXiv:2006.13155.
Schneider et al. (2020) Schneider, S.; Rusak, E.; Eck, L.; Bringmann, O.; Brendel, W.; and Bethge, M. 2020. Improving robustness against common corruptions by covariate shift adaptation. In Conference on Neural Information Processing Systems (NeurIPS).
Shi et al. (2024) Shi, W.; Xu, R.; Zhuang, Y.; Yu, Y.; Wu, H.; Yang, C.; and Wang, M. D. 2024. MedAdapter: Efficient Test-Time Adaptation of Large Language Models towards Medical Reasoning. arXiv preprint arXiv:2405.03000.
Shim, Kim, and Yang (2024) Shim, H.; Kim, C.; and Yang, E. 2024. CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation. In European Conference on Computer Vision (ECCV).
Song et al. (2019) Song, W.; Shi, C.; Xiao, Z.; Duan, Z.; Xu, Y.; Zhang, M.; and Tang, J. 2019. Autoint: Automatic feature interaction learning via self-attentive neural networks. In ACM International Conference on Information and Knowledge Management (CIKM).
Studies (2019) Studies, A. N. E. 2019. FICO. The Explainable Machine Learning Challenge.
Studies (2022) Studies, A. N. E. 2022. ANES Time Series Cumulative Data File [dataset and documentation]. September 16, 2022 version.
Stylianou and Flournoy (2002) Stylianou, M.; and Flournoy, N. 2002. Dose Finding Using the Biased Coin Up-and-Down Design and Isotonic Regression. In Biometrics.
Sun et al. (2020) Sun, Y.; Wang, X.; Liu, Z.; Miller, J.; Efros, A.; and Hardt, M. 2020. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning (ICML).
Tachet des Combes et al. (2020) Tachet des Combes, R.; Zhao, H.; Wang, Y.-X.; and Gordon, G. J. 2020. Domain adaptation with conditional distribution matching and generalized label shift. In Conference on Neural Information Processing Systems (NeurIPS).
Wang et al. (2021a) Wang, D.; Shelhamer, E.; Liu, S.; Olshausen, B.; and Darrell, T. 2021a. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations (ICLR).
Wang et al. (2021b) Wang, X.; Liu, H.; Shi, C.; and Yang, C. 2021b. Be confident! towards trustworthy graph neural networks via confidence calibration. In Conference on Neural Information Processing Systems (NeurIPS).
Wu et al. (2021) Wu, R.; Guo, C.; Su, Y.; and Weinberger, K. Q. 2021. Online adaptation to label distribution shift. In Conference on Neural Information Processing Systems (NeurIPS).
Zhou et al. (2023) Zhou, Z.; Guo, L.-Z.; Jia, L.-H.; Zhang, D.; and Li, Y.-F. 2023. ODS: Test-Time Adaptation in the Presence of Open-World Data Shift. In International Conference on Machine Learning (ICML).

Appendix

Appendix A Detailed Algorithm of AdapTable

Post-training shift-aware uncertainty calibrator.

Given a pre-trained tabular classifier $f_{\theta}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{C}$ on the source domain $\mathcal{D}_{s}=\{(\mathbf{x}_{i}^{s},y_{i}^{s})\}_{i}$ , we introduce a post-training phase for a shift-aware uncertainty calibrator $g_{\phi}:\mathbb{R}^{C}\times\mathbb{R}^{D\times N}\rightarrow\mathbb{R}^{+}$ . This calibrator is trained after the initial training of $f_{\theta}$ using the same training dataset $\mathcal{D}_{s}$ . For a given training batch $\{(\mathbf{x}_{i}^{s},y_{i}^{s})\}_{i=1}^{N}$ , we compute the shift trend $\mathbf{s}^{s}=(\mathbf{s}_{u}^{s})_{u=1}^{D}$ for a specific column index $u$ as follows:

\mathbf{s}_{u}^{s}=\big(\mathbf{x}^{s}_{iu}-\frac{1}{|\mathcal{D}_{s}|}\sum_{i^{\prime}=1}^{|\mathcal{D}_{s}|}\mathbf{x}_{i^{\prime}u}^{s}\big)_{i=1}^{N},

where, for each categorical column $u$, a linear layer maps $\mathbf{s}_{u}^{s}$ to a one-dimensional representation so that it aligns with the numerical columns. Using $\mathbf{s}^{s}$, we construct a shift trend graph in which each node $u$ represents a column and edges capture the relationships between columns. The node features are given by $\mathbf{s}_{u}^{s}$, and the graph is fully connected through an all-ones adjacency matrix. A graph neural network (GNN) is applied to this graph so that columns exchange shift trends through message passing, yielding a contextualized column-wise representation ${\bm{h}}_{u}^{s}$. These representations are averaged into a global feature representation ${\bm{h}}^{s}=\frac{1}{D}\sum_{u=1}^{D}{{\bm{h}}_{u}^{s}}$, which is concatenated with the initial model prediction $f_{\theta}(\mathbf{x}_{i}^{s})$ to produce the per-sample temperature $T_{i}$. Given the calibrated probability $p_{i}=\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{s})/T_{i}\big)$, we define the most plausible and second most plausible class indices $j^{*}$ and $j^{**}$ as follows:

j^{*}=\operatorname*{arg\,max}_{j\in\mathcal{Y}}p_{ij}\quad\text{and}\quad j^{**}=\operatorname*{arg\,max}_{j\in\mathcal{Y},j\neq j^{*}}p_{ij}.
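The temperature scaling and the selection of $j^{*}$ and $j^{**}$ can be sketched as follows. This is a minimal illustration: the logits and temperatures are hypothetical values standing in for the outputs of $f_{\theta}$ and $g_{\phi}$, and the function names are not from the original implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_top2(logits, temperatures):
    """Return calibrated probabilities and the indices j*, j** for each row."""
    p = softmax(logits / temperatures[:, None])
    order = np.argsort(-p, axis=1)  # classes sorted by descending probability
    return p, order[:, 0], order[:, 1]

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]])  # hypothetical f_theta outputs
temps = np.array([1.0, 2.0])                             # hypothetical T_i from g_phi
p, j_star, j_star2 = calibrated_top2(logits, temps)
```

Because dividing by a positive temperature preserves the ordering of the logits, the two indices are the same whether they are computed from the logits or from the calibrated probabilities.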

The focal loss $\mathcal{L}_{\text{FL}}$ (Lin et al. 2017) and the calibration loss $\mathcal{L}_{\text{CAL}}$ (Wang et al. 2021b) are used to train the shift-aware uncertainty calibrator $g_{\phi}$ , defined as:

	$\displaystyle\mathcal{L}_{\text{FL}}(\mathbf{x}_{i}^{s},y_{i}^{s})$	$\displaystyle=\sum_{j=1}^{C}\mathbbold{1}_{\{y_{i}^{s}\}}(j)(1-p_{ij})^{\gamma}\log{p_{ij}},$		(10)
	$\displaystyle\mathcal{L}_{\text{CAL}}(\mathbf{x}_{i}^{s},y_{i}^{s})$	$\displaystyle=\mathbbold{1}_{\{y_{i}^{s}\}}(j^{*})(1-p_{ij^{*}}+p_{ij^{**}})+\mathbbold{1}_{\mathcal{Y}\backslash\{y_{i}^{s}\}}(j^{*})(p_{ij^{*}}-p_{ij^{**}}),$		(11)

where $\mathbbold{1}_{{\bm{A}}}(x)$ is an indicator function:

\mathbbold{1}_{{\bm{A}}}(x)=\begin{cases}1&\text{if }x\in{\bm{A}}\\ 0&\text{otherwise}.\end{cases}

$\mathcal{L}_{\text{FL}}$ addresses class imbalance by reducing the impact of easily classified examples. $\mathcal{L}_{\text{CAL}}$ penalizes a small margin between $p_{ij^{*}}$ and $p_{ij^{**}}$ when the prediction is correct, and encourages the two probabilities to converge when the prediction is incorrect. For all experiments, we set $\gamma=2$ and the calibration loss weight $\lambda_{\text{CAL}}=0.1$.
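The two loss terms can be sketched in NumPy as below. The focal loss is written in its usual minimized form with a leading minus sign, following Lin et al. (2017). The example probabilities and labels are hypothetical, `eps` is an added numerical safeguard, and combining the terms as $\mathcal{L}_{\text{FL}}+\lambda_{\text{CAL}}\mathcal{L}_{\text{CAL}}$ is an assumption about how the weight is applied.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """-(1 - p_y)^gamma * log(p_y) per sample, for probabilities p of shape (N, C)."""
    p_true = p[np.arange(len(y)), y]
    return -((1.0 - p_true) ** gamma) * np.log(p_true + eps)

def calibration_loss(p, y):
    """1 - p_top + p_second if the prediction is correct, p_top - p_second otherwise."""
    order = np.argsort(-p, axis=1)
    rows = np.arange(len(y))
    p_top, p_second = p[rows, order[:, 0]], p[rows, order[:, 1]]
    correct = order[:, 0] == y
    return np.where(correct, 1.0 - p_top + p_second, p_top - p_second)

p = np.array([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])  # hypothetical calibrated probabilities
y = np.array([0, 1])                               # first sample correct, second incorrect
# Assumed combination of the two terms with lambda_CAL = 0.1.
total = focal_loss(p, y).mean() + 0.1 * calibration_loss(p, y).mean()
```

For the correct first sample the calibration term is $1-0.7+0.2=0.5$, which shrinks as the margin grows; for the incorrect second sample it is $0.5-0.4=0.1$, which shrinks as the top two probabilities approach each other.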

Label distribution handler.

During the test phase after post-training $g_{\phi}$ , we introduce a label distribution handler using an estimator $\bar{p}_{i}(y|\mathbf{x}_{i}^{t})$ , defined as:

\bar{p}_{i}(y|\mathbf{x}_{i}^{t})=\frac{\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})+\text{norm}\big(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})p_{t}(y)/p_{s}(y)\big)}{2},

where $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ represents the calibrated prediction. This approach enhances uncertainty quantification and combines the calibrated estimation with the distributionally aligned prediction for more robust estimation. To compute $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})$ , we perform a two-stage uncertainty calibration. Specifically, for a given test batch $\{\mathbf{x}_{i}^{t}\}_{i=1}^{N}$ , we calculate the shift trend $\mathbf{s}^{t}=(\mathbf{s}_{u}^{t})_{u=1}^{D}\in\mathbb{R}^{D\times N}$ as:

\mathbf{s}_{u}^{t}=\big(\mathbf{x}^{t}_{iu}-\frac{1}{|\mathcal{D}_{s}|}\sum_{i^{\prime}=1}^{|\mathcal{D}_{s}|}\mathbf{x}_{i^{\prime}u}^{s}\big)_{i=1}^{N}\in\mathbb{R}^{N}.
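For numerical columns, the shift trend amounts to subtracting the source column means from the test batch, as in the sketch below. The arrays are hypothetical toy data, and the handling of categorical columns (which pass through a linear layer) is omitted.

```python
import numpy as np

def shift_trend(test_batch, source_data):
    """Return s^t of shape (D, N): per-column deviation from the source column mean."""
    source_mean = source_data.mean(axis=0)  # shape (D,)
    return (test_batch - source_mean).T     # row u holds s_u^t

source = np.array([[1.0, 10.0], [3.0, 30.0]])               # hypothetical source data, D = 2
batch = np.array([[2.0, 25.0], [4.0, 20.0], [0.0, 20.0]])   # hypothetical test batch, N = 3
s_t = shift_trend(batch, source)
```

Since the source means are fixed after training, they can be stored once, so computing the shift trend at test time does not require access to the source samples themselves.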

Then, the per-sample temperature $T_{i}=g_{\phi}(f_{\theta}(\mathbf{x}_{i}^{t}),\mathbf{s}^{t})$ defined in Equation 1 is computed. The uncertainty $\delta_{i}$ of $f_{\theta}(\mathbf{x}_{i}^{t})$ is defined as the reciprocal of the margin of the calibrated probability distribution $\text{softmax}(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i})$:

\delta_{i}=\frac{1}{\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}\big)_{j^{*}}-\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}\big)_{j^{**}}},

where $j^{*}$ and $j^{**}$ are the most plausible and second plausible class indices:

j^{*}=\operatorname*{arg\,max}_{j\in\mathcal{Y}}f_{\theta}(\mathbf{x}_{i}^{t})_{j}\quad\text{and}\quad j^{**}=\operatorname*{arg\,max}_{j\in\mathcal{Y},j\neq j^{*}}f_{\theta}(\mathbf{x}_{i}^{t})_{j}.
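The reciprocal-margin uncertainty can be sketched as follows. The function name and the `eps` term, which guards against exact ties between the top two classes, are additions made for illustration.

```python
import numpy as np

def reciprocal_margin(logits, temperatures, eps=1e-12):
    """Uncertainty delta_i = 1 / (p_top - p_second) of the temperature-scaled softmax."""
    z = logits / temperatures[:, None]
    z = z - z.max(axis=1, keepdims=True)         # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    top2 = np.sort(p, axis=1)[:, -2:]            # columns: second-largest, largest
    margin = top2[:, 1] - top2[:, 0]
    return 1.0 / (margin + eps)                  # eps is an added safeguard against ties
```

A confident prediction has a large margin and therefore a small $\delta_{i}$, whereas a prediction whose top two classes are nearly tied has a large $\delta_{i}$.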

Based on $\delta_{i}$ , the recalibrated temperature $\tilde{T}_{i}$ is applied:

\tilde{T}_{i}=\begin{cases}T&\text{if }\delta_{i}\geq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{high}}\big)\\ 1/T&\text{if }\delta_{i}\leq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{low}}\big)\\ 1&\text{otherwise},\end{cases}

where $T=1.5\max_{j}p_{s}(y)_{j}/\min_{j}p_{s}(y)_{j}$ , and $q_{\text{low}}$ and $q_{\text{high}}$ are the low and high uncertainty quantiles, respectively. The target label distribution $p_{t}(y)$ is then estimated using the following formula:

p_{t}(y)=(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y),

where $p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})=\text{norm}\big(p_{t}(y|\mathbf{x}_{i}^{t})/p_{s}(y)\big)$ serves as a debiased target label estimator, deviating from the source label distribution $p_{s}(y)$. The online target label estimator $p_{t}^{\text{oe}}(y)$ is initialized with a uniform distribution and updated with each new batch as follows:

p^{\text{oe}}_{t}(y)=(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}\bar{p}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y),

where $\alpha$ is a smoothing factor. This update process leverages information from the current batch to refine the target label distribution estimation over time. The overall procedure of the proposed AdapTable method is summarized in Algorithm 1.

Algorithm 1 AdapTable

1: Input: Pre-trained classifier $f_{\theta}(\cdot)$, post-trained shift-aware uncertainty calibrator $g_{\phi}(\cdot,\cdot)$, indicator function $\mathbbold{1}_{(\cdot)}(\cdot)$, quantile function $Q(\cdot,\cdot)$, softmax function $\text{softmax}(\cdot)$, normalization function $\text{norm}(\cdot)$, source data $\mathcal{D}_{s}=\{(\mathbf{x}_{i}^{s},y_{i}^{s})\}_{i}$, current test batch $\{\mathbf{x}_{i}^{t}\}_{i=1}^{N}$
2: Parameters: Smoothing factor $\alpha$, low uncertainty quantile $q_{\text{low}}$, high uncertainty quantile $q_{\text{high}}$
3: $p_{s}(y),\ T\leftarrow\big(\frac{1}{|\mathcal{D}_{s}|}\sum_{i=1}^{|\mathcal{D}_{s}|}\mathbbold{1}_{\{j\}}(y_{i}^{s})\big)_{j=1}^{C},\ 1.5\max_{j}p_{s}(y)_{j}/\min_{j}p_{s}(y)_{j}$
4: for $u=1$ to $D$ do
5:     $\mathbf{s}_{u}^{t}\leftarrow\big(\mathbf{x}^{t}_{iu}-\frac{1}{|\mathcal{D}_{s}|}\sum_{i^{\prime}=1}^{|\mathcal{D}_{s}|}\mathbf{x}_{i^{\prime}u}^{s}\big)_{i=1}^{N}$  ▷ Compute shift trend $\mathbf{s}^{t}$
6: end for
7: for $i=1$ to $N$ do
8:     $p_{t}(y|\mathbf{x}_{i}^{t})\leftarrow\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})\big)$
9:     $T_{i}\leftarrow g_{\phi}\big(f_{\theta}(\mathbf{x}_{i}^{t}),\mathbf{s}^{t}\big)$  ▷ Determine per-sample temperature of $\mathbf{x}_{i}^{t}$
10:     $j^{*},\ j^{**}\leftarrow\operatorname*{arg\,max}_{1\leq j\leq C}p_{t}(y|\mathbf{x}_{i}^{t})_{j},\ \operatorname*{arg\,max}_{1\leq j\leq C,j\neq j^{*}}p_{t}(y|\mathbf{x}_{i}^{t})_{j}$
11:     $\delta_{i}\leftarrow\big(\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}\big)_{j^{*}}-\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}\big)_{j^{**}}\big)^{-1}$  ▷ Define uncertainty of $\mathbf{x}_{i}^{t}$ as a margin of $f_{\theta}(\mathbf{x}_{i}^{t})/T_{i}$
12:     $p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})\leftarrow\text{norm}\big(p_{t}(y|\mathbf{x}_{i}^{t})/p_{s}(y)\big)$  ▷ Compute debiased target label estimator
13: end for
14: $p_{t}(y)\leftarrow(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}p^{\text{de}}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y)$  ▷ Estimate target label distribution
15: for $i=1$ to $N$ do
16:     if $\delta_{i}\geq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{high}}\big)$ then
17:         $\tilde{T}_{i}\leftarrow T$
18:     else if $\delta_{i}\leq Q\big(\{\delta_{i^{\prime}}\}_{i^{\prime}=1}^{N},q_{\text{low}}\big)$ then
19:         $\tilde{T}_{i}\leftarrow 1/T$  ▷ Calculate temperature $\tilde{T}_{i}$ using uncertainty $\delta_{i}$
20:     else
21:         $\tilde{T}_{i}\leftarrow 1$
22:     end if
23:     $\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})\leftarrow\text{softmax}\big(f_{\theta}(\mathbf{x}_{i}^{t})/\tilde{T}_{i}\big)$  ▷ Perform temperature scaling with $\tilde{T}_{i}$
24:     $\bar{p}_{t}(y|\mathbf{x}_{i}^{t})\leftarrow\big(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})+\text{norm}(\tilde{p}_{t}(y|\mathbf{x}_{i}^{t})p_{t}(y)/p_{s}(y))\big)/2$  ▷ Perform self-ensembling
25: end for
26: $p^{\text{oe}}_{t}(y)\leftarrow(1-\alpha)\cdot\frac{1}{N}\sum_{i=1}^{N}\bar{p}_{t}(y|\mathbf{x}_{i}^{t})+\alpha\cdot p^{\text{oe}}_{t}(y)$  ▷ Update online target label estimator
27: Output: Final predictions $\{\bar{p}_{i}(y)\}_{i=1}^{N}$
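The test-time steps of Algorithm 1 for a single batch can be sketched in NumPy as follows. This is an illustrative sketch rather than the reference implementation: the per-sample temperatures `temps` are assumed to have been produced by $g_{\phi}$ beforehand, the default values of `alpha`, `q_low`, and `q_high` are hypothetical, and the small constant added to the margin is a safeguard against ties.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def normalize(v):
    return v / v.sum(axis=-1, keepdims=True)

def adapt_batch(logits, temps, p_s, p_oe, alpha=0.1, q_low=0.25, q_high=0.75):
    """One test batch: returns final predictions and the updated online estimator."""
    T = 1.5 * p_s.max() / p_s.min()
    p_t_x = softmax(logits)                                # uncalibrated predictions
    # Top-2 calibrated probabilities; ordering matches that of the raw logits.
    top2 = np.sort(softmax(logits / temps[:, None]), axis=1)[:, -2:]
    delta = 1.0 / (top2[:, 1] - top2[:, 0] + 1e-12)        # reciprocal margin
    p_de = normalize(p_t_x / p_s)                          # debiased estimator
    p_t = (1 - alpha) * p_de.mean(axis=0) + alpha * p_oe   # target label distribution
    t_tilde = np.ones_like(delta)
    t_tilde[delta <= np.quantile(delta, q_low)] = 1.0 / T  # low uncertainty: sharpen
    t_tilde[delta >= np.quantile(delta, q_high)] = T       # high uncertainty: soften
    p_tilde = softmax(logits / t_tilde[:, None])
    p_bar = 0.5 * (p_tilde + normalize(p_tilde * p_t / p_s))  # self-ensembling
    p_oe_new = (1 - alpha) * p_bar.mean(axis=0) + alpha * p_oe
    return p_bar, p_oe_new
```

Calling `adapt_batch` on successive batches while passing `p_oe_new` back in as `p_oe` reproduces the online update of the target label estimator, which starts from a uniform distribution.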

Appendix B Proof of Theorem 3.1

We first define the balanced source error $BSE(\hat{Y})$ on the source dataset and the conditional error gap $\Delta_{CE}(\hat{Y})$ between $\mathbb{P}(\hat{Y}\neq Y|X_{s})$ and $\mathbb{P}(\hat{Y}\neq Y|X_{t})$ as follows:

	$\displaystyle BSE(\hat{Y})$	$\displaystyle=\max_{i\in\mathcal{Y}}\mathbb{P}(\hat{Y}\neq i|Y=i,X_{s}),$		(12)
	$\displaystyle\Delta_{CE}(\hat{Y})$	$\displaystyle=\max_{i\neq i^{\prime}\in\mathcal{Y}}\Big|\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})-\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\Big|.$		(13)

Definition B.1.

(Generalized Label Shift in Tachet des Combes et al. (2020)). Both input covariate distribution $\mathbb{P}(X_{s})\neq\mathbb{P}(X_{t})$ and output label distribution $\mathbb{P}(Y|X_{s})\neq\mathbb{P}(Y|X_{t})$ change. Yet, there exists a hidden representation $H=g^{*}(X)$ such that the conditional distribution of $H$ given $Y$ remains the same across both domains, i.e., $\forall i\in\mathcal{Y}$ ,

\mathbb{P}(H|Y=i,X_{s})=\mathbb{P}(H|Y=i,X_{t}).

(14)

Theorem B.2.

Let $\hat{Y}|X$ and $\hat{Y}_{o}|X$ be defined as follows:

	$\displaystyle\hat{Y}|X$	$\displaystyle=\{\operatorname*{arg\,max}_{j\in\mathcal{Y}}{f_{\theta}(\mathbf{x})_{j}}|\mathbf{x}\in X\},$		(15)
	$\displaystyle\hat{Y}_{o}|X$	$\displaystyle=\{\operatorname*{arg\,max}_{j\in\mathcal{Y}}{f_{\theta}(\mathbf{x})_{j}+\log p_{t}^{oe}(y)_{j}}|\mathbf{x}\in X\}.$		(16)

Given the error $\epsilon(\hat{Y}|X)=\mathbb{P}(\hat{Y}\neq Y|X)$ , with true labels $Y$ of inputs $X$ , the error gap $|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})|$ is upper bounded by

K_{1}\Big\|1-\frac{p_{t}^{oe}(y)}{p_{t}(y)}\Big\|_{1}BSE(\hat{Y})+K_{2}\Delta_{CE}(\hat{Y}),

(17)

where $K_{1}$ and $K_{2}$ are constants that depend on $p_{t}(y)$ and $p_{s}(y)$, respectively.
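To illustrate the two predictors in Equations 15 and 16, the sketch below contrasts the plain prediction $\hat{Y}$ with the prior-adjusted prediction $\hat{Y}_{o}$. The logits and the online estimate are hypothetical values, and `eps` is an added safeguard against taking the logarithm of zero.

```python
import numpy as np

def predict_plain(logits):
    # Equation 15: argmax over the raw classifier outputs.
    return logits.argmax(axis=1)

def predict_with_prior(logits, p_oe, eps=1e-12):
    # Equation 16: add the log of the online label estimate to each class logit.
    return (logits + np.log(p_oe + eps)).argmax(axis=1)

logits = np.array([[1.0, 0.8], [0.2, 1.5]])  # hypothetical f_theta outputs
p_oe = np.array([0.2, 0.8])                  # hypothetical online estimate p_t^oe(y)
```

In this example the first sample is narrowly assigned to class 0 by the plain predictor, and the adjustment moves it to class 1 because the online estimate places most of the target mass on that class.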

Proof.

We start by applying the law of total probability and triangle inequality to derive the following inequality:

\begin{split}&\Big|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})\Big|\\
&=\Big|\mathbb{P}(\hat{Y}\neq Y|X_{s})-\mathbb{P}(\hat{Y}_{o}\neq Y|X_{t})\Big|\\
&=\Big|\sum_{i\neq i^{\prime}}\mathbb{P}(\hat{Y}=i,Y=i^{\prime}|X_{s})-\sum_{i\neq i^{\prime}}\mathbb{P}(\hat{Y}_{o}=i,Y=i^{\prime}|X_{t})\Big|\\
&=\Big|\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{s})\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})-\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\mathbb{P}(\hat{Y}_{o}=i|Y=i^{\prime},X_{t})\Big|\\
&\leq\sum_{i\neq i^{\prime}}\Big|\mathbb{P}(Y=i^{\prime}|X_{s})\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})-\mathbb{P}(Y=i^{\prime}|X_{t})\mathbb{P}(\hat{Y}_{o}=i|Y=i^{\prime},X_{t})\Big|.\end{split}

(18)

According to Equation 8 in (Menon et al. 2021), $\hat{Y}_{o}$ satisfies the following condition under generalized label shift condition in Definition B.1:

\mathbb{P}(\hat{Y}_{o}=i|H,X_{t})=\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\mathbb{P}(\hat{Y}=i|H,X_{t}).

(19)

By multiplying both sides of Equation 19 by $\mathbb{P}(H|Y,X_{t})$ and marginalizing over $H$, we obtain:

\begin{split}\mathbb{P}(\hat{Y}_{o}=i|H,X_{t})\mathbb{P}(H|Y,X_{t})&=\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\mathbb{P}(\hat{Y}=i|H,X_{t})\mathbb{P}(H|Y,X_{t})\\
\mathbb{P}(\hat{Y}_{o}=i|Y,X_{t})&=\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\mathbb{P}(\hat{Y}=i|Y,X_{t}).\end{split}

(20)

Next, by substituting Equation 20 into Equation 18, and letting $Y=i^{\prime}$ , we have:

\begin{split}&\Big|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})\Big|\\
&\leq\sum_{i\neq i^{\prime}}\Big|\mathbb{P}(Y=i^{\prime}|X_{s})\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})-\mathbb{P}(Y=i^{\prime}|X_{t})\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\Big|.\end{split}

(21)

Using Lemma A.2 from (Tachet des Combes et al. 2020), we can further estimate the upper bound of Equation 21 as follows:

\begin{split}&\Big|\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}|X_{t})\Big|\\
&\leq\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\Bigg|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg|\left(\alpha_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})+\beta_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\right)\\
&\quad+\mathbb{P}(Y=i^{\prime}|X_{s})\Delta_{CE}(\hat{Y})+\mathbb{P}(Y=i^{\prime}|X_{t})\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Delta_{CE}(\hat{Y})\\
&\stackrel{(i)}{\leq}\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\Bigg|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg|\left(\alpha_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})+\beta_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\right)\\
&\quad+(C-1)\Delta_{CE}(\hat{Y})+\left(\sum_{i\neq i^{\prime}}\frac{\mathbb{P}(Y=i^{\prime}|X_{t})}{\mathbb{P}(Y=i|X_{s})}\right)\left(\sum_{i\neq i^{\prime}}p_{t}^{oe}(y)_{i}\right)\Delta_{CE}(\hat{Y})\\
&\leq\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\Bigg|1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg|\left(\alpha_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})+\beta_{i^{\prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{t})\right)\\
&\quad+(C-1)\Delta_{CE}(\hat{Y})+\frac{(C-1)^{2}}{\min_{i\in\mathcal{Y}}\mathbb{P}(Y=i|X_{s})}\Delta_{CE}(\hat{Y}),\end{split}

(22)

where $\alpha_{i^{\prime}},\beta_{i^{\prime}}\geq 0$ with $\alpha_{i^{\prime}}+\beta_{i^{\prime}}=1$, and $(i)$ holds by Hölder’s inequality. By letting $\alpha_{i^{\prime}}=1$ and $\beta_{i^{\prime}}=0$ for all $i^{\prime}\in\mathcal{Y}$, and defining $K_{1}$ and $K_{2}$ as:

\displaystyle\begin{split}K_{1}&=C(C-1)^{2}\max_{i\in\mathcal{Y}}\mathbb{P}(Y=i|X_{t}),\\ K_{2}&=(C-1)+\frac{(C-1)^{2}}{\min_{i\in\mathcal{Y}}\mathbb{P}(Y=i|X_{s})},\end{split}

we finally get:

\displaystyle\begin{split}&\Big{|}\epsilon(\hat{Y}|X_{s})-\epsilon(\hat{Y}_{o}% |X_{t})\Big{|}\\ &\leq\sum_{i\neq i^{\prime}}\mathbb{P}(Y=i^{\prime}|X_{t})\Bigg{|}1-\frac{p_{t% }^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg{|}\mathbb{P}(\hat{Y}=i|Y=i^{\prime}% ,X_{s})+K_{2}\Delta_{CE}(\hat{Y})\\ &\leq\max_{i^{\prime}\in\mathcal{Y}}\mathbb{P}(Y=i^{\prime}|X_{t})\sum_{i\neq i% ^{\prime}}\Bigg{|}1-\frac{p_{t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg{|}% \mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})+K_{2}\Delta_{CE}(\hat{Y})\\ &\stackrel{{\scriptstyle\mathclap{(i)}}}{{\leq}}\max_{i^{\prime}\in\mathcal{Y}% }\mathbb{P}(Y=i^{\prime}|X_{t})\left(\sum_{i\neq i^{\prime}}\Bigg{|}1-\frac{p_% {t}^{oe}(y)_{i}}{\mathbb{P}(Y=i|X_{s})}\Bigg{|}\right)\left(\sum_{i\neq i^{% \prime}}\mathbb{P}(\hat{Y}=i|Y=i^{\prime},X_{s})\right)+K_{2}\Delta_{CE}(\hat{% Y})\\ &\stackrel{{\scriptstyle\mathclap{(ii)}}}{{\leq}}\max_{i^{\prime}\in\mathcal{Y% }}\mathbb{P}(Y=i^{\prime}|X_{t})(C-1)\sum_{i=1}^{C}\Big{|}1-\frac{p_{t}^{oe}(y% )_{i}}{\mathbb{P}(Y=i|X_{s})}\Big{|}C(C-1)BSE(\hat{Y})+K_{2}\Delta_{CE}(\hat{Y% })\\ &=\max_{i^{\prime}\in\mathcal{Y}}\mathbb{P}(Y=i^{\prime}|X_{t})C(C-1)^{2}\Big{% \|}1-\frac{p_{t}^{oe}(y)}{p_{t}(y)}\Big{\|}_{1}BSE(\hat{Y})+K_{2}\Delta_{CE}(% \hat{Y})\\ &\stackrel{{\scriptstyle\mathclap{(iii)}}}{{=}}K_{1}\Big{\|}1-\frac{p_{t}^{oe}% (y)}{p_{t}(y)}\Big{\|}_{1}BSE(\hat{Y})+K_{2}\Delta_{CE}(\hat{Y}),\end{split}

(23)

where $(i)$ holds by Hölder’s inequality, $(ii)$ holds by the definition of $BSE(\hat{Y})$ , and $(iii)$ holds by the definition of $K_{1}$ . ∎

We observe that in practice, using $\hat{Y}_{o}|X=\{\operatorname*{arg\,max}_{j\in\mathcal{Y}}{f_{\theta}(\mathbf{x})_{j}+\log p_{t}^{oe}(y)_{j}}|\mathbf{x}\in X\}$ can degrade performance because errors accumulate in $p_{t}^{oe}(y)$ . However, our approach, which integrates a two-stage uncertainty calibration with $g_{\phi}$ and a debiased target label estimator $p_{t}^{de}(y)$ , demonstrates empirical efficacy across various experiments.
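The adjusted prediction rule above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the released implementation: the function name `adjusted_prediction`, the toy logits, and the toy prior estimate are hypothetical, and every prior entry is assumed to be strictly positive so that the logarithm is defined.

```python
import math

def adjusted_prediction(logits, target_prior):
    """Return argmax_j of logits[j] + log(target_prior[j])."""
    scores = [z + math.log(p) for z, p in zip(logits, target_prior)]
    return max(range(len(scores)), key=scores.__getitem__)

logits = [1.2, 1.0]          # hypothetical f_theta(x) for a two-class task
prior_estimate = [0.2, 0.8]  # hypothetical estimate of the target label distribution

plain = max(range(len(logits)), key=logits.__getitem__)  # unadjusted argmax
shifted = adjusted_prediction(logits, prior_estimate)    # prior-adjusted argmax
```

In this toy case the unadjusted prediction is class 0, while the adjusted one is class 1, which shows how strongly an inaccurate prior estimate can move the decision.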

Appendix C Dataset Descriptions

C.1 Natural Distribution Shifts

In our experiments, we verify our method on six datasets from the TableShift benchmark (Gardner, Popovic, and Schmidt 2023): HELOC, Voting, Hospital Readmission, ICU Mortality, Childhood Lead, and Diabetes. All of them include natural distribution shifts between training and test data. For all datasets, numerical features are standardized (the mean is subtracted and the result is divided by the standard deviation), while categorical features are one-hot encoded. We find that different encoding types do not play a significant role in terms of accuracy, as noted in Grinsztajn, Oyallon, and Varoquaux (2022). Detailed statistics for each dataset are listed in Table 5.
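The preprocessing described above can be sketched in a few lines. This is a minimal sketch under assumptions: the helper names are hypothetical, statistics are fitted on the training column and reused for test data, and the population standard deviation is used (the text does not state which variant is applied).

```python
import statistics

def fit_standardizer(train_column):
    """Return (mean, std) of a training column; std falls back to 1.0 for constant columns."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)  # assumption: population standard deviation
    return mean, (std if std > 0 else 1.0)

def standardize(column, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in column]

def one_hot(column, categories):
    """Encode each value as an indicator vector over the given category list."""
    index = {c: i for i, c in enumerate(categories)}
    return [[1.0 if index[v] == i else 0.0 for i in range(len(categories))] for v in column]

train_numeric = [1.0, 2.0, 3.0]
mean, std = fit_standardizer(train_numeric)
z = standardize(train_numeric, mean, std)
encoded = one_hot(["a", "b", "a"], ["a", "b"])
```

Note that `one_hot` raises a `KeyError` for categories unseen during training; how such values are handled in practice is not specified here.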

•

HELOC: This task predicts Home Equity Line of Credit (HELOC) (Brown et al. 2018) repayment using FICO data (Studies 2019), focusing on shifts in third-party risk estimates. The dataset includes 10,459 observations, and a distribution shift occurs by using the ‘External Risk Estimate’ feature as a domain split. Estimates above 63 are used for training, while those 63 or below are held out for testing, illustrating potential biases in credit assessments.
•

Voting: Using ANES (Studies 2022) data, this task predicts U.S. presidential election voting behavior with 8,280 observations. Distribution shift is introduced by splitting the data based on geographic region, with the southern U.S. serving as the out-of-domain region. This simulates how voter behavior predictions might vary when polling data is collected in one region and used to predict outcomes in another.
•

Hospital Readmission: Hospital Readmission (Clore et al. 2014) predicts 30-day readmission of diabetic patients using data from 130 U.S. hospitals over 10 years. The distribution shift occurs by splitting the data based on admission source, with emergency room admissions held out as the target domain. This tests how well models trained on other sources perform when applied to patients admitted through the emergency room.
•

ICU Mortality: This task predicts ICU patient mortality using MIMIC-III data (Johnson et al. 2016), focusing on shifts related to insurance type. The dataset includes 23,944 observations, and a distribution shift is created by excluding Medicare and Medicaid patients from the training set, designating them as the target domain. This highlights how insurance type can affect mortality predictions.
•

Childhood Lead: This task predicts elevated blood lead levels in children using NHANES data (for Disease Control, Prevention et al. 2003), with 27,499 observations. A distribution shift is introduced by splitting the data based on poverty using the poverty-income ratio (PIR) as a threshold. Those with a PIR of 1.3 or lower are held out for testing, simulating risk assessment in lower-income households.
•

Diabetes: This task predicts diabetes using BRFSS data (Association 2018), focusing on racial shifts across 1.4 million observations. Distribution shift occurs by focusing on the differences in diabetes risk between racial and ethnic groups, particularly highlighting the higher risk faced by non-white groups compared to White non-Hispanic individuals.
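Each of the splits above partitions the data by a domain variable rather than at random. As a minimal sketch of the HELOC-style threshold split (the row layout, the column name `external_risk_estimate`, and the helper `domain_split` are hypothetical and only illustrate the idea):

```python
def domain_split(rows, key, threshold):
    """Rows with rows[key] > threshold form the source domain; the rest are held out as target."""
    source = [r for r in rows if r[key] > threshold]
    target = [r for r in rows if r[key] <= threshold]
    return source, target

# Toy rows; real data would carry the full feature set and the repayment label.
rows = [{"external_risk_estimate": v, "label": v % 2} for v in (55, 63, 64, 80)]
source, target = domain_split(rows, "external_risk_estimate", 63)
```

With a threshold of 63, estimates of exactly 63 fall into the held-out target domain, matching the description of the HELOC split.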

Table 5: Summary of the datasets used in our experiments, including the total number of instances (Total Instances), the number of instances allocated to training, validation, and test sets (Training Set, Validation Set, Test Set), the total number of features (Total Features), and a breakdown into numerical and categorical features (Numerical Features, Categorical Features). All tasks involve binary classification.

Statistic	HELOC	Voting	Hospital Readmission	ICU Mortality	Childhood Lead	Diabetes
Total Instances	9,412	60,376	89,542	21,549	24,749	1,299,758
Training Set	2,220	34,796	34,288	7,116	11,807	969,229
Validation Set	278	4,349	4,286	889	1,476	121,154
Test Set	6,914	21,231	50,968	13,544	11,466	209,375
Total Features	22	54	46	7,491	7	25
Numerical Features	20	8	12	7,490	4	6
Categorical Features	2	46	34	1	3	19

C.2 Common Corruptions

Let $\mathbf{x}_{i}^{t}=(\mathbf{x}_{ij}^{t})_{j=1}^{D}\in\mathbb{R}^{D}$ be the $i$ -th row of a table with $D$ columns in the test data. We define $\bar{\mathbf{x}}_{j}^{s}$ as a random variable that follows the empirical marginal distribution of the $j$ -th column in the training set $\mathcal{D}_{s}$ , given by:

\mathbb{P}(\bar{\mathbf{x}}_{j}^{s}=k)=\frac{1}{|\mathcal{D}_{s}|}\sum_{i=1}^{|\mathcal{D}_{s}|}\mathbbold{1}_{\{k\}}(\mathbf{x}_{ij}^{s}),

where $k\in\mathbb{R}$ . Additionally, let $\mu_{j}^{s}=\mathbb{E}[\bar{\mathbf{x}}_{j}^{s}]$ and $\sigma_{j}^{s}=\sqrt{\text{Var}(\bar{\mathbf{x}}_{j}^{s})}$ be the mean and standard deviation of the random variable $\bar{\mathbf{x}}_{j}^{s}$ , respectively. To effectively simulate natural distribution shifts that commonly occur beyond label distribution shifts, we introduce six types of corruptions—Gaussian noise (Gaussian), uniform noise (Uniform), random missing values (Random Drop), common column missing across all test data (Column Drop), important numerical column shift (Numerical), and important categorical column shift (Categorical)—as follows:

•

Gaussian: For $\mathbf{x}_{ij}^{t}$ , Gaussian noise $z\sim\mathcal{N}(0,0.1^{2})$ is independently injected as:

\mathbf{x}_{ij}^{t}\leftarrow\mathbf{x}_{ij}^{t}+z\cdot\sigma_{j}^{s}.

•

Uniform: For $\mathbf{x}_{ij}^{t}$ , uniform noise $u\sim\mathcal{U}(-0.1,0.1)$ is independently injected as:

\mathbf{x}_{ij}^{t}\leftarrow\mathbf{x}_{ij}^{t}+u\cdot\sigma_{j}^{s}.

•

Random Drop: For each entry $\mathbf{x}_{ij}^{t}$ , a random mask $m_{ij}\sim\text{Bernoulli}(0.2)$ is applied, and the feature is replaced by a random sample $\bar{\mathbf{x}}_{j}^{s}$ drawn from the empirical marginal distribution of the $j$ -th column of the training set:

\mathbf{x}_{ij}^{t}\leftarrow(1-m_{ij})\cdot\mathbf{x}_{ij}^{t}+m_{ij}\cdot\bar{\mathbf{x}}_{j}^{s}.

•

Column Drop: For each column $j$ , a random mask $m_{j}\sim\text{Bernoulli}(0.2)$ is applied, and the feature $\mathbf{x}_{ij}^{t}$ is replaced by a random sample $\bar{\mathbf{x}}_{j}^{s}$ as follows:

\mathbf{x}_{ij}^{t}\leftarrow(1-m_{j})\cdot\mathbf{x}_{ij}^{t}+m_{j}\cdot\bar{\mathbf{x}}_{j}^{s}.

Unlike random drop corruption, where the mask $m_{ij}$ is resampled for each $j$ -th column of the $i$ -th test instance $\mathbf{x}_{ij}^{t}$ , a single random mask $m_{j}$ is sampled for each $j$ -th column and applied uniformly across all test data.

•

Numerical: Important numerical column shift simulates natural domain shifts where the test distribution of the most important numerical column deviates significantly from the training distribution. We first identify the most important numerical column, $j^{*}$ , using a pre-trained XGBoost (Chen and Guestrin 2016). A Gaussian distribution

\mathcal{N}(z|\mu_{j^{*}}^{s},\sigma_{j^{*}}^{s})=\frac{1}{\sqrt{2\pi(\sigma_{j^{*}}^{s})^{2}}}\exp\left(-\frac{(z-\mu_{j^{*}}^{s})^{2}}{2(\sigma_{j^{*}}^{s})^{2}}\right)

is then fitted to the $j^{*}$ -th column of the training data, using $\mu_{j^{*}}^{s}$ and $\sigma_{j^{*}}^{s}$ . The likelihood of each test sample $\mathbf{x}_{i}^{t}$ is then computed as $\mathcal{N}(\mathbf{x}_{ij^{*}}^{t}|\mu_{j^{*}}^{s},\sigma_{j^{*}}^{s})$ . Finally, test samples are drawn with probability inversely proportional to their likelihood, with the sampling probability $\mathbb{P}(\mathbf{x}_{i}^{t})$ of $\mathbf{x}_{i}^{t}$ defined as:

\mathbb{P}(\mathbf{x}_{i}^{t})=\frac{\mathcal{N}(\mathbf{x}_{ij^{*}}^{t}|\mu_{j^{*}}^{s},\sigma_{j^{*}}^{s})^{-1}}{\sum_{i^{\prime}=1}^{|\mathcal{D}_{t}|}\mathcal{N}(\mathbf{x}_{i^{\prime}j^{*}}^{t}|\mu_{j^{*}}^{s},\sigma_{j^{*}}^{s})^{-1}}.

•

Categorical: Important categorical column shift simulates natural domain shifts where the test distribution of the most important categorical column deviates significantly from the training distribution. Again, we first identify the most important categorical column, $j^{*}$ , using a pre-trained XGBoost (Chen and Guestrin 2016). A categorical distribution, which generalizes the Bernoulli distribution,

\mathcal{C}(z|p_{1},\cdots,p_{K})=p_{1}^{\mathbbold{1}_{\{1\}}(z)}\cdots p_{K}^{\mathbbold{1}_{\{K\}}(z)},

is then fitted to the $j^{*}$ -th column of the training data, where $K$ is the number of distinct categories in the $j^{*}$ -th column, and $p_{k}=\mathbb{P}(\bar{\mathbf{x}}_{j^{*}}^{s}=k)$ for $k=1,\cdots,K$ . The likelihood of each test sample $\mathbf{x}_{i}^{t}$ is then computed as $\mathcal{C}(\mathbf{x}_{ij^{*}}^{t}|p_{1},\cdots,p_{K})$ . Finally, test samples are drawn with probability inversely proportional to their likelihood, with the sampling probability $\mathbb{P}(\mathbf{x}_{i}^{t})$ of $\mathbf{x}_{i}^{t}$ defined as:

\mathbb{P}(\mathbf{x}_{i}^{t})=\frac{\mathcal{C}(\mathbf{x}_{ij^{*}}^{t}|p_{1},\cdots,p_{K})^{-1}}{\sum_{i^{\prime}=1}^{|\mathcal{D}_{t}|}\mathcal{C}(\mathbf{x}_{i^{\prime}j^{*}}^{t}|p_{1},\cdots,p_{K})^{-1}}.
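The corruptions above can be sketched with the Python standard library. This is an illustrative sketch, not the released code: all function names are hypothetical, rows are plain lists of floats, and two choices are assumptions of the sketch, namely that Column Drop draws a fresh training value for every row of a masked column and that the inverse-likelihood weights are computed from the Gaussian exponent alone (the normalizing constant cancels in the ratio).

```python
import math
import random

def gaussian_noise(rows, sigma, rng, scale=0.1):
    """x_ij <- x_ij + z * sigma_j with z ~ N(0, scale^2), drawn per entry."""
    return [[v + rng.gauss(0.0, scale) * s for v, s in zip(row, sigma)] for row in rows]

def uniform_noise(rows, sigma, rng, scale=0.1):
    """x_ij <- x_ij + u * sigma_j with u ~ U(-scale, scale), drawn per entry."""
    return [[v + rng.uniform(-scale, scale) * s for v, s in zip(row, sigma)] for row in rows]

def random_drop(rows, train_columns, rng, p=0.2):
    """Mask m_ij ~ Bernoulli(p) per entry; masked entries are redrawn from training column j."""
    return [[rng.choice(train_columns[j]) if rng.random() < p else v
             for j, v in enumerate(row)] for row in rows]

def column_drop(rows, train_columns, rng, p=0.2):
    """Mask m_j ~ Bernoulli(p) per column, shared by every test row."""
    mask = [rng.random() < p for _ in train_columns]
    # Assumption: a fresh training value is drawn for every row of a masked column.
    return [[rng.choice(train_columns[j]) if mask[j] else v
             for j, v in enumerate(row)] for row in rows]

def inverse_likelihood_weights(values, mu, sigma):
    """Sampling probabilities proportional to 1 / N(value | mu, sigma)."""
    exponents = [(v - mu) ** 2 / (2 * sigma ** 2) for v in values]
    top = max(exponents)  # subtracted for numerical stability
    inverse = [math.exp(e - top) for e in exponents]
    total = sum(inverse)
    return [w / total for w in inverse]

rng = random.Random(0)
train_columns = [[0.0, 1.0, 2.0], [10.0, 20.0, 30.0]]  # toy training columns
test_rows = [[0.5, 12.0], [1.5, 28.0]]
sigma = [0.8, 8.0]  # hypothetical per-column training standard deviations
noisy = gaussian_noise(test_rows, sigma, rng)
weights = inverse_likelihood_weights([0.0, 3.0], mu=0.0, sigma=1.0)
```

The resulting weights could then be passed to a weighted sampler such as `random.Random.choices` to resample the test set; whether sampling is done with or without replacement is not specified in the text. The categorical variant follows the same pattern with $1/p_{k}$ as the unnormalized weight.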

C.3 Label Distribution Shifts

•

Class Imbalance: This label distribution shift simulates a highly class-imbalanced test stream, where labels that are rare in the training set are more likely to appear frequently in the test set. Given a class imbalance ratio $\rho=10$ , we first rank the output labels $y_{i}^{t}\in\mathcal{Y}$ for each test sample $\mathbf{x}_{i}^{t}$ in ascending order of their frequency in the training set, assigning ranks from 1 to $C$ , where $C$ is the number of classes. Specifically, $\text{rank}(y_{i}^{t})=1$ indicates that $y_{i}^{t}$ is the least frequent label in the training set, while $\text{rank}(y_{i}^{t})=C$ indicates that $y_{i}^{t}$ is the most frequent. We then define the unnormalized sampling probability for each test sample $\mathbf{x}_{i}^{t}$ as:

\tilde{\mathbb{P}}(\mathbf{x}_{i}^{t})=\frac{\text{rank}(y_{i}^{t})}{C}(\rho-1)+1.

The normalized sampling probability $\mathbb{P}(\mathbf{x}_{i}^{t})$ for each test sample $\mathbf{x}_{i}^{t}$ is then defined as:

\mathbb{P}(\mathbf{x}_{i}^{t})=\frac{\tilde{\mathbb{P}}(\mathbf{x}_{i}^{t})}{\sum_{i^{\prime}=1}^{|\mathcal{D}_{t}|}\tilde{\mathbb{P}}(\mathbf{x}_{i^{\prime}}^{t})}.

•

Temporal Correlation: To simulate temporal correlations in test data, we employ a custom sampling strategy using the Dirichlet distribution. This approach effectively captures temporal dependencies by dynamically adjusting the label distribution over time. We begin with a uniform probability distribution $\mathbb{P}_{0}=\left(1/C\right)_{j=1}^{C}$ , where $C$ is the number of classes. For sampling the $i$ -th test instance, a probability distribution $\bm{\pi}_{i}$ is drawn from the Dirichlet distribution:

\bm{\pi}_{i}\sim\text{Dirichlet}(\mathbb{P}_{i-1}),

and then smoothed using $\eta=10^{-6}$ to avoid zero probabilities for any class $j$ :

\bm{\pi}_{i}=\frac{\max(\eta,\bm{\pi}_{i})}{\sum_{j=1}^{C}\max(\eta,\bm{\pi}_{ij})}.

A label $y_{i}^{t}$ is subsequently sampled according to $\bm{\pi}_{i}$ , and the corresponding test instance $\mathbf{x}_{i}^{t}$ is randomly selected from the test data with label $y_{i}^{t}$ . After the $i$ -th sampling, the distribution $\mathbb{P}_{i}$ is updated using the recent history of sampled labels within a sliding window of size $w=5$ :

\mathbb{P}_{i}\leftarrow\bigg{(}\frac{1}{w}\sum_{i^{\prime}=i-w+1}^{i}\mathbbold{1}_{\{j\}}(y_{i^{\prime}}^{t})\bigg{)}_{j=1}^{C}.
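Both label shift protocols can be sketched as follows. This is a hedged illustration with hypothetical function names, not the released code. Two details are assumptions of the sketch: the Dirichlet draw is implemented as normalized Gamma variates with concentrations floored at $\eta$ (a Gamma shape must be positive, while the window update can produce zeros), and labels before the first draw are treated as absent when the window is not yet full.

```python
import random
from collections import Counter

def class_imbalance_weights(test_labels, train_labels, rho=10.0):
    """Normalized sampling probabilities rank(y) / C * (rho - 1) + 1, with rank 1 = rarest class."""
    counts = Counter(train_labels)
    by_frequency = sorted(counts, key=lambda c: counts[c])  # rarest class first
    rank = {c: r for r, c in enumerate(by_frequency, start=1)}
    num_classes = len(by_frequency)
    raw = [rank[y] / num_classes * (rho - 1) + 1 for y in test_labels]
    total = sum(raw)
    return [w / total for w in raw]

def temporally_correlated_labels(n, num_classes, rng, window=5, eta=1e-6):
    """Sample a label stream whose distribution follows the recent label history."""
    concentration = [1.0 / num_classes] * num_classes  # P_0 is uniform
    labels = []
    for _ in range(n):
        # Dirichlet draw as normalized Gamma variates (shape floored at eta: assumption).
        gammas = [rng.gammavariate(max(a, eta), 1.0) for a in concentration]
        total = sum(gammas) or 1.0
        smoothed = [max(eta, g / total) for g in gammas]  # avoid zero probabilities
        norm = sum(smoothed)
        pi = [s / norm for s in smoothed]
        labels.append(rng.choices(range(num_classes), weights=pi)[0])
        recent = labels[-window:]  # sliding window over sampled labels
        concentration = [recent.count(j) / window for j in range(num_classes)]
    return labels

weights = class_imbalance_weights([0, 1], [0] * 8 + [1] * 2)
stream = temporally_correlated_labels(200, 3, random.Random(0))
```

As the formula is written, the unnormalized weight increases with the rank. In the temporal sketch, each sampled label would then be paired with a randomly chosen test instance carrying that label.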

Appendix D Baseline Details

D.1 Deep Tabular Learning Architectures

•

MLP: Multi-Layer Perceptron (MLP) (Murtagh 1991) is a foundational deep learning architecture characterized by multiple layers of interconnected nodes, where each node applies a non-linear activation function to a weighted sum of its inputs. In the tabular domain, MLP is often employed as a default deep learning model, with each input feature corresponding to a node in the input layer.
•

AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks (AutoInt) (Song et al. 2019) is a model that automatically learns complex feature interactions in tasks like click-through rate (CTR) prediction, where features are typically sparse and high-dimensional. It uses a multi-head self-attentive neural network to map features into a low-dimensional space and capture high-order combinations, eliminating the need for manual feature engineering. AutoInt efficiently handles large datasets, outperforms existing methods, and provides good explainability.
•

ResNet: ResNet for tabular data (Gorishniy et al. 2021) is a modified version of the original ResNet architecture (He et al. 2016), tailored to capture intricate patterns within structured datasets. Although earlier efforts yielded modest results, recent studies have re-explored ResNet’s capabilities, inspired by its success in computer vision and NLP. This ResNet-like model for tabular data is characterized by a streamlined design that facilitates optimization through nearly direct paths from input to output, enabling the effective learning of deeper feature representations.
•

FT-Transformer: Feature Tokenizer along with Transformer (FT-Transformer) (Gorishniy et al. 2021) is a straightforward modification of the Transformer architecture tailored for tabular data. In this model, the feature tokenizer converts all features, whether categorical or numerical, into tokens. A series of Transformer layers is then applied to these tokens together with an added [CLS] token. The representation of the [CLS] token in the final Transformer layer is used for the prediction.

D.2 Supervised Baselines

•

$k$ -NN: $k$ -Nearest Neighbors ( $k$ -NN) is a fundamental model in tabular learning that identifies the $k$ closest data points based on a chosen metric. It makes predictions through majority voting for classification or weighted averaging for regression. The hyperparameter $k$ influences the model’s sensitivity.
•

LogReg: Logistic Regression (LogReg) is a linear classification model that estimates the probability of class membership using a logistic function, which maps the linear combination of features to a range of $[0,1]$ . With proper regularization, LogReg can achieve performance comparable to state-of-the-art tabular models.
•

RandomForest: Random Forest is an ensemble learning algorithm that builds multiple decision trees to improve accuracy and reduce overfitting. It is particularly effective at capturing non-linear patterns and is robust against outliers.
•

XGBoost: Extreme Gradient Boosting (XGBoost) (Chen and Guestrin 2016) is a boosting algorithm that sequentially builds weak learners, typically decision trees, to correct errors made by previous models. XGBoost is known for its high predictive performance and ability to handle complex relationships through regularization.
•

CatBoost: CatBoost (Dorogush, Ershov, and Gulin 2017), like XGBoost, is a boosting algorithm that excels in handling categorical features without extensive preprocessing. It is highly effective in real-world datasets, offering strong performance, albeit at the cost of increased computational resources and the need for parameter tuning.
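To make the simplest of these baselines concrete, a $k$ -NN classifier with majority voting can be sketched as follows. The sketch assumes Euclidean distance and unweighted votes; the function name and toy data are hypothetical, and the baseline used in the experiments may rely on a library implementation instead.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the majority label among the k training points closest to the query."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [0, 0, 1, 1]
near_origin = knn_predict(train_x, train_y, [0.2, 0.1], k=3)
near_cluster = knn_predict(train_x, train_y, [5.5, 5.0], k=3)
```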

D.3 Test-Time Adaptation Baselines

•

PL: Pseudo-Labeling (PL) (Lee 2013) leverages a pseudo-labeling strategy to update model parameters during test time.
•

TTT++: Improved Test-Time Training (TTT++) (Liu et al. 2021) enhances test-time adaptation by using feature alignment strategies and regularization, eliminating the need to access source data during adaptation.
•

TENT: Test ENTropy minimization (TENT) (Wang et al. 2021a) updates the scale and bias parameters in the batch normalization layer during test time by minimizing entropy within a given test batch.
•

EATA: Efficient Anti-forgetting Test-time Adaptation (EATA) (Niu et al. 2022) mitigates the risk of unreliable gradients by filtering out high-entropy samples and applying a Fisher regularizer to constrain key model parameters during adaptation.
•

SAR: Sharpness-Aware and Reliable optimization (SAR) (Niu et al. 2023) builds on TENT by filtering samples with large entropy, which can cause model collapse during test time, using a predefined threshold.
•

LAME: Laplacian Adjusted Maximum-likelihood Estimation (LAME) (Boudiaf et al. 2022) employs an output adaptation strategy during test-time, focusing on adjusting the model’s output probabilities rather than tuning its parameters.
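Several of these baselines rely on the entropy of the predicted class distribution, either as the quantity being minimized (TENT) or as a criterion for discarding unreliable samples (EATA, SAR). The filtering step can be sketched as follows. This is a simplified illustration with hypothetical names; the threshold of $0.4\ln C$ is used here only as an example value and is an assumption of the sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def filter_low_entropy(batch_logits, threshold):
    """Return indices of samples whose prediction entropy is below the threshold."""
    return [i for i, z in enumerate(batch_logits) if entropy(softmax(z)) < threshold]

batch = [[4.0, 0.0], [0.1, 0.0], [0.0, 3.0]]  # toy logits for a binary task
keep = filter_low_entropy(batch, threshold=0.4 * math.log(2))
```

Only the retained samples would contribute to the adaptation loss; the second sample, whose prediction is close to uniform, is discarded.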

Appendix E Further Experimental Details

E.1 Further Implementation Details

All experiments are conducted on two servers. The first server is equipped with a 40-core Intel Xeon E5-2630 v4 CPU, 252GB RAM, 4 NVIDIA TITAN Xp GPUs, and runs Ubuntu 18.04.4. The second server has a 40-core Intel Xeon E5-2640 v4 CPU, 128GB RAM, 8 NVIDIA TITAN Xp GPUs, and runs Ubuntu 22.04.4. All architectures are implemented using Python 3.8.16 with PyTorch (Paszke et al. 2019) and PyTorch Geometric (Fey and Lenssen 2019). The specific versions of all software libraries and frameworks used are provided in the AdapTable/requirements.txt file of the supplementary materials. We also include our source code in the AdapTable folder of the supplementary materials. Please refer to it for all experimental details.

E.2 Hyperparameters for Supervised Baselines

For $k$ -NN, LogReg, RandomForest, XGBoost, and CatBoost, optimal parameters are determined for each dataset using a random search with 10 iterations on the validation set. The search space for each method is specified in Table 6.

Table 6: Hyperparameter search space of supervised baselines. # neighbors denotes the number of neighbors, # estim denotes the number of estimators, depth denotes the maximum depth, and lr denotes the learning rate, respectively.

Method	Search Space
$k$ -NN	# neighbors: $\{2,\cdots,12\}$
RandomForest	# estim: $\{50,100,150,200\}$ , depth: $\{2,3,\cdots,12\}$
XGBoost	# estim: $\{50,100,150,200\}$ , depth: $\{2,3,\cdots,12\}$ , lr: $\{0.01,0.01+(1-0.01)/19,\cdots,1\}$ , gamma: $\{0,0.05,\cdots,0.5\}$
CatBoost	# iterations: $\{50,100,\cdots,2000\}$ , lr: $\{0.01,0.01+(1-0.01)/19,\cdots,1\}$ , depth: $\{5,\cdots,40\}$
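A random search with 10 iterations over such a space can be sketched as follows, using the XGBoost row as an example. The dictionary keys, the function names, and the toy scoring function are hypothetical; in practice the score would be a validation metric of a model trained with the sampled configuration.

```python
import random

# Search space mirroring the XGBoost row of Table 6 (key names are illustrative).
SEARCH_SPACE = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": list(range(2, 13)),
    "learning_rate": [0.01 + i * (1 - 0.01) / 19 for i in range(20)],
    "gamma": [round(0.05 * i, 2) for i in range(11)],
}

def random_search(space, score_fn, n_iter=10, seed=0):
    """Sample n_iter configurations uniformly and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_iter):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_score(config):
    """Stand-in for a validation score; peaks at depth 6 and learning rate 0.1."""
    return -abs(config["max_depth"] - 6) - abs(config["learning_rate"] - 0.1)

best_config, best_score = random_search(SEARCH_SPACE, toy_score)
```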

E.3 Hyperparameters for TTA Baselines

In scenarios where the test set is unknown, tuning the hyperparameters of TTA methods on the test set would be considered cheating. Therefore, we tune all hyperparameters for each TTA method and backbone classifier architecture using the Numerical common corruption on the CMC tabular dataset from the OpenML-CC18 benchmark (Bischl et al. 2021), which we do not use as test data. PL (Lee 2013), TENT (Wang et al. 2021a), and SAR (Niu et al. 2023) require three main hyperparameters: the learning rate, the number of adaptation steps per batch, and the option for episodic adaptation, where the model is reset after each batch. PL and TENT use a learning rate of 0.0001 with 1 adaptation step and episodic updates. SAR additionally requires a threshold to filter high-entropy samples and is configured with a learning rate of 0.001, 1 adaptation step, and episodic updates. For TTT++ (Liu et al. 2021), EATA (Niu et al. 2022), and LAME (Boudiaf et al. 2022), we follow the authors’ hyperparameter settings, except for the learning rate and adaptation steps. TTT++ and EATA are configured with a learning rate of 0.00001, 10 adaptation steps, and episodic updates. LAME, which only adjusts output logits, does not require hyperparameters related to gradient updates. For all baselines, hyperparameter choices remain consistent across different architectures, including MLP, AutoInt, ResNet, and FT-Transformer. The hyperparameter search space for each method is detailed in Table 7.

Table 7: Hyperparameter search space of test-time adaptation baselines. Only the common hyperparameters are listed; method-specific hyperparameters are described in Section E.3.

Hyperparameter	Search Space
lr	$\{10^{-3},10^{-4},10^{-5},10^{-6}\}$
# steps	$\{1,5,10,15,20\}$
episodic	{True, False}
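The common search space in Table 7 contains $4\times 5\times 2=40$ configurations, which can be enumerated as in the following sketch. The names are hypothetical, and whether the space was searched exhaustively is not stated here.

```python
from itertools import product

# Common TTA hyperparameters from Table 7.
TTA_GRID = {
    "lr": [1e-3, 1e-4, 1e-5, 1e-6],
    "steps": [1, 5, 10, 15, 20],
    "episodic": [True, False],
}

def enumerate_grid(grid):
    """Return every combination of the grid values as a list of dictionaries."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

configs = enumerate_grid(TTA_GRID)
```

For example, the setting reported for PL and TENT corresponds to the entry `{"lr": 1e-4, "steps": 1, "episodic": True}`.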

E.4 Hyperparameters for AdapTable

AdapTable requires three test-time hyperparameters: the smoothing factor $\alpha$ and the low and high uncertainty quantiles $q_{\text{low}}$ and $q_{\text{high}}$ . For fairness, we tune all AdapTable hyperparameters across different backbone architectures using the Numerical common corruption on the CMC dataset from the OpenML-CC18 benchmark (Bischl et al. 2021), which is not used as test data. We observe that AdapTable’s hyperparameter choices remain consistent across various architectures, including MLP, AutoInt, ResNet, and FT-Transformer. Notably, AdapTable is largely insensitive to variations in $\alpha$ , $q_{\text{low}}$ , and $q_{\text{high}}$ , which are set to 0.1, 0.25, and 0.75, respectively, across all datasets and architectures.
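To illustrate what quantile-based thresholds mean operationally, the following sketch partitions a batch by per-sample uncertainty using $q_{\text{low}}=0.25$ and $q_{\text{high}}=0.75$ . It only shows how such thresholds could be computed; the function names and the linear-interpolation quantile are assumptions, and how the resulting groups are used is defined by the method description in Section 3, not by this sketch.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics (assumed convention)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])

def partition_by_uncertainty(uncertainty, q_low=0.25, q_high=0.75):
    """Return indices of low-uncertainty and high-uncertainty samples in a batch."""
    low_threshold = quantile(uncertainty, q_low)
    high_threshold = quantile(uncertainty, q_high)
    low = [i for i, u in enumerate(uncertainty) if u <= low_threshold]
    high = [i for i, u in enumerate(uncertainty) if u >= high_threshold]
    return low, high

batch_uncertainty = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical per-sample uncertainties
low_idx, high_idx = partition_by_uncertainty(batch_uncertainty)
```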

Appendix F Additional Analysis

F.1 Latent Space Visualizations

In Figure 9, we further visualize latent spaces of test instances using t-SNE across six different datasets and four representative deep tabular learning architectures to illustrate the observation discussed in Section 2.1. This visualization highlights the complex decision boundaries within the latent space of tabular data, which are significantly more intricate than those observed in other domains. By comparing the upper four rows—HELOC, Voting, Hospital Readmission, and Childhood Lead—with the lower two rows—linearized image data (MFEAT-PIXEL) and homogeneous DNA string sequences (DNA)—it becomes evident that the latent space decision boundaries in the tabular domain are particularly complex. According to WhyShift (Liu et al. 2023), this complexity is primarily due to latent confounders inherent in tabular data and concept shifts, where such confounders cause output labels to vary greatly for nearly identical inputs. As discussed in Section 2.1, this further underscores the limitations of existing TTA methods (Sun et al. 2020; Gandelsman et al. 2022; Liu et al. 2021; Boudiaf et al. 2022; Zhou et al. 2023), which often depend on the cluster assumption.

F.2 Reliability Diagrams

Figure 10 presents additional reliability diagrams across five different datasets and four representative deep tabular learning architectures, illustrating that tabular data often displays a mix of overconfident and underconfident prediction patterns. This contrasts with the consistent overconfidence observed in the image domain (Stylianou and Flournoy 2002) and underconfidence in the graph domain (Wang et al. 2021b). As shown in Figure 10, the Voting and Hospital Readmission datasets consistently exhibit overconfident behavior across all architectures, while the HELOC, Childhood Lead, and Diabetes datasets demonstrate underconfident tendencies. These observations underscore the need for a tabular-specific uncertainty calibration method.
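The quantities plotted in a reliability diagram can be computed as in the following sketch, which bins predictions by confidence and compares the mean confidence with the empirical accuracy in each bin. The function name, the equal-width binning, and the toy inputs are assumptions of the sketch. A bin is overconfident when its mean confidence exceeds its accuracy and underconfident in the opposite case.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (mean confidence, accuracy, count) for every non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[index].append((conf, ok))
    rows = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            rows.append((mean_conf, accuracy, len(members)))
    return rows

confidences = [0.95, 0.92, 0.55, 0.58]  # toy maximum softmax probabilities
correct = [1, 0, 1, 1]                  # 1 if the prediction was right
rows = reliability_bins(confidences, correct)
```

In this toy batch the lower bin is underconfident (accuracy above confidence) and the upper bin is overconfident, which is the kind of mixed pattern described above.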

F.3 Label Distribution Shifts and Prediction Bias Towards Source Label Distributions

We demonstrate that the data distribution shift we primarily target in the tabular domain—label distribution shift—occurs frequently in practice. Figure 11 presents the source label distribution (a), target label distribution (b), pseudo label distribution for test data using the source model (c), and the estimated target label distribution after applying our label distribution handler (d) across the five datasets. Comparing (a) and (b) in each row, it is evident that label distribution shift occurs across all datasets. In (c), we observe that the marginal label distribution predicted by the source model is commonly biased towards the source label distribution. Lastly, (d) illustrates that our label distribution handler effectively estimates the target label distribution, guiding the pseudo label distribution towards the target label distribution.
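The comparison between panels (a) to (c) amounts to comparing empirical label distributions. As a small sketch with hypothetical names and toy labels, the bias of pseudo labels towards the source distribution can be quantified with the total variation distance, which is our choice of measure for this illustration and not a quantity reported in the figure.

```python
from collections import Counter

def label_distribution(labels, num_classes):
    """Empirical marginal distribution over class indices 0..num_classes-1."""
    counts = Counter(labels)
    return [counts[j] / len(labels) for j in range(num_classes)]

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

source = label_distribution([0] * 8 + [1] * 2, 2)  # toy source labels
target = label_distribution([0] * 4 + [1] * 6, 2)  # toy target labels
pseudo = label_distribution([0] * 7 + [1] * 3, 2)  # toy source-model predictions on target data

bias_to_source = total_variation(pseudo, source)
gap_to_target = total_variation(pseudo, target)
```

Here the pseudo label distribution is closer to the source distribution than to the target one, which mirrors the bias described for panel (c).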

F.4 Entropy Distributions

We highlight a unique characteristic of tabular data: model prediction entropy consistently shows a strong bias toward underconfidence. To illustrate this, we present entropy distribution histograms for test instances across six datasets and four representative deep tabular learning architectures in Figure 12. A clear pattern emerges when comparing the upper four rows (HELOC, Voting, Hospital Readmission, Childhood Lead) with the lower two (Optdigits, DNA). The upper rows exhibit consistently high entropy, indicating a skew toward underconfidence, except for Childhood Lead, where extreme class imbalance causes the model to collapse to the major class. The lower rows do not show this skew. This analysis highlights the distinct bias of tabular data toward underconfident predictions, a pattern less common in other domains. It also aligns with findings that applying unsupervised objectives like entropy minimization to high-entropy samples can result in gradient explosions and model collapse (Niu et al. 2023).

Appendix G Additional Experiments

G.1 Detailed Results Across Common Corruptions and Datasets

Figure 4 presents the average F1 score across six types of common corruption and three datasets. Here, we provide more detailed results, including the standard errors. As shown in Table 8, AdapTable outperforms baseline TTA methods by a large margin across all datasets and corruption types. This further highlights the empirical efficacy of AdapTable, not only in handling label distribution shifts but also in addressing various common corruptions.
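For reference, the reported statistics can be computed as in the following sketch: the macro F1 score averages per-class F1 scores with equal weight, and the standard error is the sample standard deviation divided by the square root of the number of repetitions. The function names and toy values are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.

```python
import math
import statistics

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores; a class with no support or predictions scores 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / num_classes

def mean_and_standard_error(values):
    """Mean and standard error over repeated runs (assumes at least two runs)."""
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(len(values))

# A model that always predicts the majority class scores poorly on macro F1.
collapsed = macro_f1([0, 0, 1, 1], [0, 0, 0, 0], num_classes=2)
mean, stderr = mean_and_standard_error([0.5, 0.6, 0.7])  # three toy repetitions
```

This also explains why a model that collapses to the major class can obtain a macro F1 score below 50% on an imbalanced binary task.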

G.2 All Results Across Datasets and Model Architectures

In Figure 5, we demonstrate the effectiveness of AdapTable across various tabular model architectures by reporting the average performance across three datasets. Here, we provide the mean and standard error for each dataset and architecture. As shown in Table 9, AdapTable consistently achieves state-of-the-art performance with significant improvements across all model architectures and datasets. This further underscores the versatility and robustness of AdapTable.

G.3 Additional Computational Efficiency Analysis

One may wonder whether the post-training time required for AdapTable’s shift-aware uncertainty calibrator is prohibitively long. To address this concern, we measure and report the elapsed real time for post-training our shift-aware uncertainty calibrator on the medium-scale Hospital Readmission dataset using the FT-Transformer architecture. The post-training process takes approximately 9.2 seconds. For small- and medium-scale datasets, the post-training process typically requires only a few seconds, and even in our largest experimental setting, the time remains minimal, taking at most a few minutes.

Appendix H Limitations and Broader Impacts

H.1 Limitations

Similar to other test-time training (TTT) methods (Sun et al. 2020; Liu et al. 2021; Gandelsman et al. 2022), AdapTable requires an additional post-training stage to integrate a shift-aware uncertainty calibrator during the source model’s training phase. While full test-time adaptation methods (Wang et al. 2021a; Niu et al. 2022, 2023) avoid this, our analysis in Section 2.1 and experiments in Section 4 show that they fail in the tabular domain due to their focus on input covariate shifts, which are often entangled with concept shifts. According to WhyShift (Liu et al. 2023), concept shifts, driven by changes in latent confounders, require natural language descriptions of the shift conditions, necessitating a data-centric approach. Additionally, while AdapTable performs well across various corruptions beyond label distribution shifts (Figure 4), it is primarily focused on addressing label distribution shifts. Further exploration is needed to assess its effectiveness in handling input covariate shifts or concept shifts.

H.2 Broader Impacts

Tabular data is prevalent across industries such as healthcare (Johnson et al. 2016, 2021), finance (Studies 2019, 2022), manufacturing (Hein et al. 2017), and public administration (Gardner, Popovic, and Schmidt 2023). Our research addresses the critical yet underexplored challenge of distribution shifts in tabular data, a problem that has not received sufficient attention. We believe that our approach can significantly enhance the performance of machine learning models in various industries by improving model adaptation to tabular data, thereby creating meaningful value in practical applications. Through our data-centric analysis in Section 2, we identify why existing TTA methods fail in the tabular domain and introduce a tabular-specific approach for handling label distribution shifts in Section 3. We hope this work will provide valuable insights for future research on test-time adaptation in tabular data. Additionally, by making our source code publicly available, we aim to support real-world applications across various fields, benefiting both academia and industry.

Table 8: Average macro F1 scores (%) with standard errors for the TTA baselines across six common corruptions (Gaussian, Uniform, Random Drop, Column Drop, Numerical, and Categorical) and three datasets (HELOC, Voting, and Childhood Lead). Results are averaged over three random repetitions.

	Method	Gaussian	Uniform	Random Drop	Column Drop	Numerical	Categorical
HELOC	Source	33.1 $\pm$ 0.0	33.0 $\pm$ 0.0	31.4 $\pm$ 0.1	32.3 $\pm$ 1.4	33.8 $\pm$ 0.2	32.3 $\pm$ 0.3
	PL	31.2 $\pm$ 0.0	31.2 $\pm$ 0.0	30.6 $\pm$ 0.0	31.1 $\pm$ 0.7	32.1 $\pm$ 0.2	30.4 $\pm$ 0.2
	TENT	33.1 $\pm$ 0.0	33.0 $\pm$ 0.0	31.4 $\pm$ 0.1	32.3 $\pm$ 1.4	33.8 $\pm$ 0.2	32.3 $\pm$ 0.3
	EATA	33.1 $\pm$ 0.0	33.0 $\pm$ 0.0	31.4 $\pm$ 0.1	32.3 $\pm$ 1.4	33.8 $\pm$ 0.2	32.3 $\pm$ 0.3
	SAR	31.9 $\pm$ 0.1	32.0 $\pm$ 0.1	30.7 $\pm$ 0.2	31.3 $\pm$ 0.8	32.4 $\pm$ 0.4	31.4 $\pm$ 0.3
	LAME	30.1 $\pm$ 0.0	30.1 $\pm$ 0.0	30.1 $\pm$ 0.0	30.1 $\pm$ 0.0	30.9 $\pm$ 0.1	29.4 $\pm$ 0.2
	AdapTable	57.6 $\pm$ 0.1	57.8 $\pm$ 0.0	53.0 $\pm$ 0.1	52.1 $\pm$ 3.2	58.1 $\pm$ 0.1	58.9 $\pm$ 0.4
Voting	Source	76.6 $\pm$ 0.0	76.5 $\pm$ 0.0	72.5 $\pm$ 0.2	72.8 $\pm$ 0.4	76.3 $\pm$ 0.1	85.2 $\pm$ 0.1
	PL	75.6 $\pm$ 0.3	75.2 $\pm$ 0.3	71.1 $\pm$ 0.5	70.6 $\pm$ 0.5	75.9 $\pm$ 0.1	85.1 $\pm$ 0.1
	TENT	76.6 $\pm$ 0.0	76.5 $\pm$ 0.0	72.5 $\pm$ 0.2	72.8 $\pm$ 0.4	76.3 $\pm$ 0.1	85.2 $\pm$ 0.1
	EATA	76.6 $\pm$ 0.0	76.5 $\pm$ 0.0	72.5 $\pm$ 0.2	72.8 $\pm$ 0.4	76.3 $\pm$ 0.1	85.2 $\pm$ 0.1
	SAR	67.2 $\pm$ 1.0	64.0 $\pm$ 0.2	61.8 $\pm$ 1.0	60.8 $\pm$ 0.8	69.9 $\pm$ 0.1	84.2 $\pm$ 0.1
	LAME	39.4 $\pm$ 0.2	39.4 $\pm$ 0.1	37.3 $\pm$ 0.0	37.8 $\pm$ 0.2	39.4 $\pm$ 0.2	81.4 $\pm$ 0.2
	AdapTable	78.9 $\pm$ 0.0	78.6 $\pm$ 0.1	74.9 $\pm$ 0.1	75.5 $\pm$ 0.5	78.0 $\pm$ 0.1	85.0 $\pm$ 0.4
Childhood Lead	Source	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	48.1 $\pm$ 0.0	48.8 $\pm$ 0.0
	PL	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	48.1 $\pm$ 0.0	48.8 $\pm$ 0.0
	TENT	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	48.1 $\pm$ 0.0	48.8 $\pm$ 0.0
	EATA	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	48.1 $\pm$ 0.0	48.8 $\pm$ 0.0
	SAR	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	48.1 $\pm$ 0.0	48.8 $\pm$ 0.0
	LAME	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	47.9 $\pm$ 0.0	48.1 $\pm$ 0.0	48.8 $\pm$ 0.0
	AdapTable	61.4 $\pm$ 0.1	61.5 $\pm$ 0.0	58.0 $\pm$ 0.1	55.9 $\pm$ 1.6	62.8 $\pm$ 0.2	53.1 $\pm$ 0.2
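
As a reading aid for the table entries, the following minimal sketch shows how a macro F1 score and a "mean $\pm$ standard error" entry could be computed. The function names, the toy labels, and the three per-run scores are illustrative assumptions, not values or code from the paper; the use of the sample standard deviation (dividing by $n-1$) is also an assumption, since the text does not state which estimator was used.

```python
import math

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mean_and_standard_error(values):
    """Mean and standard error (sample std / sqrt(n)) over repeated runs."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # assumed: n - 1
    return mean, math.sqrt(var / n)

# Toy labels (hypothetical): one minority-class sample is misclassified.
f1 = macro_f1([0, 0, 1, 1], [0, 1, 1, 1], classes=[0, 1])

# Hypothetical macro F1 scores (%) from three random repetitions.
mean, se = mean_and_standard_error([57.5, 57.6, 57.7])
print(f"macro F1 = {f1:.3f}; reported entry = {mean:.1f} +/- {se:.1f}")
```

Because macro averaging weights every class equally, a model that collapses onto the majority class receives a low score even when its accuracy is high, which is why the metric suits label-shifted settings.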

Table 9: Average macro F1 scores (%) with standard errors for TTA baselines on three datasets (HELOC, Voting, and Childhood Lead) using three model architectures (AutoInt, ResNet, and FT-Transformer). Results are averaged over three random repetitions.

Architecture	Method	HELOC	Voting	Childhood Lead
AutoInt	Source	34.9 $\pm$ 0.0	77.5 $\pm$ 0.0	47.9 $\pm$ 0.0
	PL	31.6 $\pm$ 0.0	76.5 $\pm$ 0.1	47.9 $\pm$ 0.0
	TENT	34.9 $\pm$ 0.0	77.5 $\pm$ 0.0	47.9 $\pm$ 0.0
	EATA	34.9 $\pm$ 0.0	77.5 $\pm$ 0.0	47.9 $\pm$ 0.0
	SAR	62.0 $\pm$ 0.4	31.2 $\pm$ 0.7	47.9 $\pm$ 0.0
	LAME	30.1 $\pm$ 0.0	37.3 $\pm$ 0.0	47.9 $\pm$ 0.0
	AdapTable	56.3 $\pm$ 0.1	79.2 $\pm$ 0.0	61.8 $\pm$ 0.1
ResNet	Source	52.0 $\pm$ 0.0	76.6 $\pm$ 0.0	47.9 $\pm$ 0.0
	PL	34.3 $\pm$ 0.1	73.3 $\pm$ 0.1	47.9 $\pm$ 0.0
	TENT	52.0 $\pm$ 0.0	76.6 $\pm$ 0.0	47.9 $\pm$ 0.0
	EATA	52.0 $\pm$ 0.0	76.7 $\pm$ 0.0	47.9 $\pm$ 0.0
	SAR	55.1 $\pm$ 0.5	52.2 $\pm$ 0.5	47.9 $\pm$ 0.0
	LAME	30.1 $\pm$ 0.0	75.1 $\pm$ 0.1	47.9 $\pm$ 0.0
	AdapTable	61.9 $\pm$ 0.0	78.7 $\pm$ 0.0	61.3 $\pm$ 0.1
FT-Transformer	Source	33.0 $\pm$ 0.0	77.3 $\pm$ 0.0	47.9 $\pm$ 0.0
	PL	30.6 $\pm$ 0.0	76.0 $\pm$ 0.1	47.9 $\pm$ 0.0
	TENT	33.0 $\pm$ 0.0	77.3 $\pm$ 0.0	47.9 $\pm$ 0.0
	EATA	33.0 $\pm$ 0.0	77.3 $\pm$ 0.0	47.9 $\pm$ 0.0
	SAR	35.3 $\pm$ 0.1	73.6 $\pm$ 0.3	47.9 $\pm$ 0.0
	LAME	30.7 $\pm$ 0.1	71.5 $\pm$ 0.1	47.9 $\pm$ 0.0
	AdapTable	55.0 $\pm$ 0.0	79.2 $\pm$ 0.1	61.7 $\pm$ 0.1

$$\hat{Y} \mid X = \{\operatorname*{arg\,max}_{j\in\mathcal{Y}} f_{\theta}(\mathbf{x})_{j} \mid \mathbf{x}\in X\}, \tag{7}$$
$$\hat{Y}_{o} \mid X = \{\operatorname*{arg\,max}_{j\in\mathcal{Y}} f_{\theta}(\mathbf{x})_{j}+\log p_{t}^{oe}(y)_{j} \mid \mathbf{x}\in X\}. \tag{8}$$

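
The two prediction rules in Eqs. (7) and (8) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: `f_theta(x)` is treated as a list of per-class scores, the label-distribution term $p_{t}^{oe}(y)$ is passed in as a fixed probability vector (how it is estimated is not shown here), and the function names and toy numbers are hypothetical.

```python
import math

def predict(scores):
    """Eq. (7): for each sample, pick argmax_j f_theta(x)_j."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in scores]

def predict_with_label_prior(scores, label_prior):
    """Eq. (8): for each sample, pick argmax_j f_theta(x)_j + log p_t^{oe}(y)_j."""
    log_prior = [math.log(p) for p in label_prior]
    return [
        max(range(len(row)), key=lambda j: row[j] + log_prior[j])
        for row in scores
    ]

# Toy per-class scores for three samples and two classes (hypothetical values).
scores = [[2.0, 1.5], [0.2, 0.1], [1.0, 3.0]]
# Hypothetical label distribution that favors class 1.
label_prior = [0.2, 0.8]

plain = predict(scores)
adjusted = predict_with_label_prior(scores, label_prior)
print(plain, adjusted)
```

In this toy case the first two samples have nearly tied scores, so the added log-prior term moves their predictions toward the class that the label distribution favors, while the confidently scored third sample is unchanged.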
