Information-theoretic evaluation of covariate distributions models

Niklas Hartung^1,∗, Aleksandra Khatova^1,2,∗

(¹ Institute of Mathematics, University of Potsdam, Germany
² Institute of BTU Cottbus-Senftenberg, Germany
*: equal contribution )

Abstract

Statistical modelling of covariate distributions allows to generate virtual populations or to impute missing values in a covariate dataset. Covariate distributions typically have non-Gaussian margins and show nonlinear correlation structures, which simple descriptions like multivariate Gaussian distributions fail to represent. Prominent non-Gaussian frameworks for covariate distribution modelling are copula-based models and models based on multiple imputation by chained equations (MICE). While both frameworks have already found applications in the life sciences, a systematic investigation of their goodness-of-fit to the theoretical underlying distribution, indicating strengths and weaknesses under different conditions, is still lacking. To bridge this gap, we thoroughly evaluated covariate distribution models in terms of Kullback-Leibler (KL) divergence, a scale-invariant information-theoretic goodness-of-fit criterion for distributions. Methodologically, we proposed a new approach to construct confidence intervals for KL divergence by combining nearest neighbour-based KL divergence estimators with subsampling-based uncertainty quantification. In relevant data sets of different sizes and dimensionalities with both continuous and discrete covariates, non-Gaussian models showed consistent improvements in KL divergence, compared to simpler Gaussian or scale transform approximations. KL divergence estimates were also robust to the inclusion of latent variables and large fractions of missing values. While good generalization behaviour to new data could be seen in copula-based models, MICE shows a trend for overfitting and its performance should always be evaluated on separate test data. Parametric copula models and MICE were found to scale much better with the dimension of the dataset than nonparametric copula models. These findings corroborate the potential of non-Gaussian models for modelling realistic life science covariate distributions.

1 Introduction

The impact of covariates on an outcome of interest is of relevance in many modelling exercises. To name a few examples, age, weight, sex as well as biomarkers and genotypic information enter many pharmacokinetic models (Hamberg et al., 2010; Bajaj et al., 2019) and physiologically-based pharmacokinetic (PBPK) models even rely on a large set of organ weights and blood flows (Jones and Rowland‐Yeo, 2013). Often, a set of observed covariates is used for such simulations, but this approach comes with important limitations: (i) Mostly, only summary statistics are shared — other researchers cannot reproduce the results since the information on the correlation structure is lost; (ii) Specific subpopulations may be too sparsely observed for meaningful simulations; (iii) Missing values in a covariate dataset need to be dealt with. In contrast, statistical models for covariate distributions allow to generate virtual populations of arbitrary sizes and can often naturally deal with missing values. Since covariate distributions typically have non-Gaussian margins and show nonlinear correlation structures, simple approaches based on multivariate Gaussian distributions may fail to represent them accurately.

A natural approach to overcome a Gaussian assumption for margins of continuous covariates is to apply a scale transform (e.g., log-transform for intrinsically positive covariates). An effort to include discrete variables into such a lognormal model has been made by Tannenbaum et al. (2006). Recently, two non-Gaussian methods, namely Multivariate Imputation by Chained Equations (MICE) and copula-based models, have also been proposed for the simulation of covariate distributions. MICE, an imputation method for missing data (Buuren and Groothuis-Oudshoorn, 2011), has been applied for the simulation of covariate distributions in Smania and Jonsson (2021). Copula modeling, which has a history in financial mathematics, has been introduced to life sciences by Zwep et al. (2023). Beyond these data-driven approaches, mechanistic scaling methods have been proposed for some domains like PBPK model parametrization (organ weights/blood flows) (Huisinga et al., 2012). Also, regression models have been developed for some examples, like the distribution of body weight as a function of age and sex (Sumpter and Holford, 2011). Despite these efforts, multivariate covariate distribution modelling remains the most generally applicable method.

Existing studies have evaluated model performance in terms of univariate or linear measures of association (mean and (co-)variance), which do not discriminate between Gaussian and non-Gaussian models, or by visual inspection (Smania and Jonsson, 2021; Zwep et al., 2023). However, a goodness-of-fit metric for the evaluation of covariate distribution models needs to be able to detect any deviation, including non-Gaussian, from the data generating distribution. Kullback-Leibler (KL) divergence fulfils these demands; furthermore it is scale-invariant and can be interpreted well, namely as the information loss when approximating the true population distribution by a surrogate model (Kullback and Leibler, 1951). Also, typical complicating factors like mixed discrete/continuous data, missing data, latent variables and overfitting have received little attention. We therefore aimed at evaluating goodness-of-fit via KL divergence and in a variety of situations, also covering the above-mentioned issues.

Since in practice, only a sample from the data generating distribution is available, sample-based estimation of KL divergence is required. For high-dimensional continuous data, methods based on nearest neighbour density estimation, combined with a finite sample bias correction, have been found to perform well (Perez-Cruz, 2008; Wang et al., 2009). However, both the extension to mixed discrete/continuous data as well as uncertainty quantification have not been discussed in the literature. Therefore, a secondary goal of the present work is to extend the available methodology to accommodate mixed data types (a common case in covariate data sets) and to determine confidence intervals for KL divergence estimates (which allows to judge the significance of differences in KL divergence estimates between models/scenarios). An implementation of this approach, combining KL divergence estimators with subsampling-based uncertainty quantification, has recently been distributed as R package kldest (Hartung, 2024).

2 Methods

Datasets

We considered three different datasets that cover both data-rich and -sparse, as a well low- and high-dimensional situations, thereby providing a broad range for the benchmarking covariate distribution models.

NHANES-3/-11.

The US national health and nutrition examination survey (NHANES) contains a collection of demographic, physical, health-related and lifestyle-related variables from a representative sample of the US population (NHA, 2009-2012). The dataset distributed in R package NHANES contains a resampled version of $n=10000$ individuals from the 2009-2012 NHANES survey, to counterbalance oversampling of certain subpopulations (Pruim, 2015). From this dataset, we ignored individuals with age of 80 years or above (for whom exact age is not given) and those with a diastolic blood pressure of 0 (likely a database error). Subsequently, we derived two NHANES analysis datasets by selecting a subset of covariates and removing duplicates and missing values:

NHANES-3: (data-rich low-dimensional setting). 6230 unique measurements of 3 continuous covariates: age, weight and height.
NHANES-11: (medium-sparse high-dimensional setting). 2133 unique measurements of 11 covariates, of which 10 are continuous (age, weight, height, heart rate, systolic blood pressure, diastolic blood pressure, testosterone concentrations, total cholesterol, urine volume, urine flow rate) and one is discrete (sex).

ORGANWT

(very sparse high-dimensional setting). A dataset on organ weights of two different mouse strains has been reported by Marxfeld et al. (2019). From this dataset, we selected 6 continuous covariates (body weight and the 5 most relevant organs weights for PBPK modelling (Jones and Rowland‐Yeo, 2013) that are available in the dataset, namely brain, heart, kidneys, liver, spleen) and 2 discrete variables (sex and strain), totalling 145 measurements of 8 covariates (after removing one animal in which brain weight was missing).

The sizes of the datasets are summarised in Table 1.

	Number of covariates		Number of
	continuous	discrete	measurements
NHANES-3	3	0	6230
NHANES-11	10	1	2133
ORGANWT	6	2	145

Table 1: Dimensions of the three datasets.

General notation and scope

We consider the observed covariates $\mathbf{x}=(x^{(1)},...,x^{(n)})$ to be an i.i.d. sample from an (unknown) $d$ -dimensional probability distribution with density $p$ and (cumulative) distribution function $F$ . For example, in dataset NHANES-3, $x^{(i)}=(x^{(i)}_{1},x^{(i)}_{2},x^{(i)}_{3})$ is age, weight and height of the $i$ -th observed individual ( $d=3$ ). The aim of covariate distribution modelling is to estimate a surrogate model $q$ which approximates $p$ , in order to be able to simulate i.i.d. data $\mathbf{y}$ from this surrogate model. Alternatively, a method may directly provide such data $\mathbf{y}$ .

Copula models

Here we give a short introduction to copula modelling for continuous distributions. Since the notation required for a general presentation of theory and estimation (especially for vine copulas) is complex, we refer to Czado (2019) for further details. Any multivariate density $p$ can be uniquely decomposed into a product of marginals and their dependency structure (known as Sklar’s theorem),

p(x_{1},...,x_{d})=p_{1}(x_{1})\cdot\ldots\cdot p_{d}(x_{d})\cdot c\big{(}F_{1% }(x_{1}),...,F_{d}(x_{d})\big{)},

(1)

where $p_{i}$ and $F_{i}$ are the $i$ -th marginal density and distribution functions, respectively, and where $c$ is a copula density, supported on the unit cube. This decomposition allows to estimate the marginals and the dependency structure in two separate steps.

First, the marginals are modelled, giving rise to estimated marginal distribution functions $\hat{F}_{1},...,\hat{F}_{d}$ . In this step, standard univariate distribution fitting methods can be used. Here, we employed univariate local polynomial kernel density estimators as implemented in R package kde1d (Nagler and Vatter, 2022). If the margins are regular enough, parametric methods can be used as well.

Next, using the estimated marginals, the covariates are transformed to uniform scale via

x\mapsto\hat{u}(x):=\big{(}\hat{F}_{1}(x_{1}),...,\hat{F}_{d}(x_{d})\big{)}.

(2)

A copula model can then be estimated on the transformed data $\mathbf{u}=\hat{u}(\mathbf{x})$ . One possible approach consists of using multivariate copula models such as Gaussian copulas. While this approach is conceptually simple, it assumes the same type of dependency across all variables. A more flexible copula modelling approach consists of combining bivariate copula models in a pair copula decomposition of the multivariate copula. These copula models, termed vine copulas, use a graph-theoretical structure (vine) describing the conditional dependency structure of the pair copula decomposition. Modelling with vine copulas is supported by libraries such as rvinecopulib (Nagler and Vatter, 2023), which automatize the selection of both the vine structure and the pair copulas, and allow for efficient simulation.

Multiple imputation by chained equations (MICE)

MICE is, at its origin, a method for imputing missing values in a datasets. Also known as conditional distribution modelling, it aims at predicting a missing value given all other (non-missing) observations. Within MICE, different algorithms are available. We used the defaults in the R implementation in package mice namely predictive mean matching (Little, 1988) for continuous variables and logistic regression for categorical variables, for which robust performance has been reported (Buuren and Groothuis-Oudshoorn, 2011). To use this method for the simulation of covariate distributions, we used the method described in Smania and Jonsson (2021): (i) if missing data are present, a single standard MICE step is used to fill the missing data; (ii) then, all observed covariates are again labeled as missing and drawn in turn according to MICE in order to obtain a sample.

Covariate distribution models

Six different covariate distribution models are considered in this article, which differ in how marginals are modelled, the choice of dependency structure and the use of parametric or nonparametric elements.

GaussDist.: A multivariate Gaussian distribution (with density $q$ ) is fitted to the data $\mathbf{x}$ on original scale. Hence, all marginals are also Gaussian. This method only applies to continuous covariates.
IndepCop.: An independence copula is used for the transformed data $\mathbf{u}$ . This means that marginals are modelled, but no dependency structure.
GaussCop.: A multivariate Gaussian copula is fitted to the transformed data $\mathbf{u}$ . This scenario combines GaussDist and IndepCop. It can be considered an optimized version of the “continuous method” by Tannenbaum et al. (2006), in which the log-transform is replaced by an kernel-based transform.
ParVine.: A vine copula is fitted to the transformed data $\mathbf{u}$ . For each dependency element within the vine copula structure, the best bivariate copula is selected from a family of parametric copulas.
NonparVine.: A vine copula is fitted to the transformed data $\mathbf{u}$ . For each dependency element, a nonparametric bivariate copula model is constructed based on a transform kernel approach (Nagler et al., 2017).
MICE.: The MICE method is applied to the entire dataset, as described in Smania and Jonsson (2021).

Kullback-Leibler divergence estimation

For ease of notation, we assume here that all considered covariates are continuous. A combined continuous/discrete framework is described in Appendix A. We assume a surrogate model with density $q$ (i.e., a multivariate Gaussian distribution or a copula model) has been estimated from the data $\mathbf{x}$ . The KL divergence between $p$ and $q$ is then defined as

D_{\text{KL}}(p||q):=\int_{{\mathbb{R}}^{d}}\log\left(\frac{p(x)}{q(x)}\right)% p(x)dx.

(3)

To estimate $D_{\text{KL}}(p||q)$ , we used a nearest neighbour density-based method proposed in Wang et al. (2009). In their approach, $D_{\text{KL}}(p||q)$ is estimated from two samples, $\mathbf{x}$ from $p$ and $\mathbf{y}$ from $q$ . While $\mathbf{x}$ is already available, a (large) sample $\mathbf{y}=(y^{(1)},...,y^{(m)})$ is simulated from the surrogate model $q$ . For copula-based models, $\mathbf{y}$ is obtained by transforming a copula sample $\mathbf{v}$ back to the original scale, $\mathbf{y}=\hat{u}^{-1}(\mathbf{v})$ . Working with a second sample $\mathbf{y}$ rather than directly using the surrogate model density $q$ allowed for the use of a two-sample bias reduction method. Also, it allows a natural comparison to the MICE method, in which a sample $\mathbf{y}$ is obtained directly without estimating a density first. Since samples obtained by the MICE method may contain duplicates but the nearest neighbour-based KL divergence estimator (4) cannot deal with duplicates in the data, we apply post-processing to the MICE samples by selecting only unique items. To define this KL divergence estimator, we denote by $\rho_{k}(i)$ and $\nu_{k}(i)$ the distance to the $k$ -th nearest neighbour of $x^{(i)}$ in $\mathbf{x}\setminus x^{(i)}$ and in $\mathbf{y}$ , respectively. Furthermore, let $\varepsilon_{i}=\max(\rho_{1}(i),\nu_{1}(i))$ be the larger of the distances to the first nearest neighbour of $x^{(i)}$ in $\mathbf{x}$ and $\mathbf{y}$ , and $k_{i},l_{i}$ (one of $k_{i},l_{i}$ is always 1) be the largest possible values such that $\rho_{k_{i}}(i)\leq\varepsilon_{i}$ and $\nu_{l_{i}}(i)\leq\varepsilon_{i}$ . Then, the bias-corrected KL divergence estimator by Wang et al. (2009) is given by

\widehat{D_{\text{KL}}}(\mathbf{x},\mathbf{y})=\frac{d}{n}\sum_{i=1}^{n}\log% \frac{\nu_{l_{i}}(i)}{\rho_{k_{i}}(i)}+\log\frac{m}{n-1}+\frac{1}{n}\sum_{i=1}% ^{n}\Big{(}\psi(l_{i})-\psi(k_{i})\Big{)}.

(4)

The first two summands are a Monte Carlo approximation of KL divergence using nearest neighbour density estimates, the last term (involving the digamma function $\psi(x)=\frac{d}{dx}\log\Gamma(x)$ ) compensates the asymptotic bias of nearest neighbour density estimation, and the use of different numbers of neighbours in $\rho$ and $\nu$ counterbalances finite sample bias originating from different convergence speed of nearest neighbour density estimates (for details, see Wang et al., 2009). The estimator is implemented and documented in the R package kldest (Hartung, 2024).

While KL divergence is scale-invariant, the estimator (4) might differ between scales. In the main text, all results are shown on the original scale. An investigation of differences between original and uniform scale (where the copula models are fitted) is shown in Figure S1 (Appendix B).

Uncertainty quantification

Uncertainty quantification of KL divergence estimators has not been addressed in the literature beyond the 1-D discrete case, and in particular, no asymptotic distributions of nearest neighbour-based KL divergence estimators are available. Also, as mentioned, the nearest neighbour-based KL divergence estimator (4) cannot deal with samples containing duplicates. This means that standard bootstrapping, which relies on resampling with replacement, cannot be used for uncertainty quantification either. Instead, we opted for using subsampling without replacement as described by Politis and Romano (1994).

The following description of the generic subsampling procedure is modified after Geyer (2013). Let $\hat{\theta}_{n}=\hat{\theta}_{n}(x^{(1)},...,x^{(n)})$ be an estimator for a parameter of interest $\theta$ such that $\tau_{n}(\hat{\theta}_{n}-\theta)$ converges in distribution to some cumulative distribution function $G$ . Asymptotically, it follows that $\mathbb{P}\big{[}G^{-1}(\alpha/2)<\tau_{n}(\hat{\theta}_{n}-\theta)<G^{-1}(1-% \alpha/2)\big{]}\rightarrow 1-\alpha$ . If $\theta^{*}_{b}$ denotes the estimator for a subsample from $\mathbf{x}$ of size $b$ such that $b\rightarrow\infty$ and $b/n\rightarrow 0$ as $n\rightarrow\infty$ , then $\tau_{b}(\theta^{*}_{b}-\hat{\theta}_{n})$ converges in distribution to the same limiting distribution $G$ . Also, if $G_{s,b}$ denotes the empirical cumulative distribution function of $s$ subsamples of size $b$ , with $s\rightarrow\infty$ as $n\rightarrow\infty$ , then $G_{s,b}\rightarrow G$ . Combining these equations, it follows that

\mathbb{P}\big{[}G_{s,b}^{-1}(\alpha/2)<\tau_{n}(\hat{\theta}_{n}-\theta)<G^{-% 1}_{s,b}(1-\alpha/2)\big{]}\rightarrow 1-\alpha

(5)

as $n\rightarrow\infty$ . Hence, by solving the inequality in brackets in (5) for $\theta$ , an asymptotic $(1-\alpha)$ confidence interval for $\theta$ is obtained.

In our setting, $\theta=D_{\text{KL}}(p||q)$ and $\hat{\theta}=\widehat{D_{\text{KL}}}(\mathbf{x},\mathbf{y})$ is the estimator (4). We chose $b=n^{2/3}$ and large fixed $s=1000$ and $m=10000$ . The convergence rate $\tau_{n}$ of this estimator has not been studied so far, but results are available for a non-bias-corrected 1-nearest neighbour density estimator. Depending on the regularity of $p$ and $q$ as well as the dimension $d$ , a rate $\tau_{n}=\sqrt{n}$ can be obtained (Noh et al., 2014; Singh and Póczos, 2016; Zhao and Lai, 2020). We confirmed this convergence rate empirically for the estimator (4) using the approach described in Politis et al. (1999, Ch. 8).

Simulation framework

All simulations were performed in R version 4.2.2 (R Core Team, 2022). The source code for reproducing the results, including the three datasets, is available on Zenodo (Hartung and Khatova, 2024).

3 Results

Non-Gaussian methods lead to a considerably better fit

We start by considering dataset NHANES-3. As can be seen in Fig. 1, the three covariates age, weight and height have a non-Gaussian dependency structure, which a parametric vine copula model (ParVine) is able to capture. Importantly, the clear visual improvement in goodness-of-fit of ParVine over GaussDist is not reflected in linear metrics: the mean and (co-)variance of GaussDist coincides with the respective empirical metric in the dataset (e.g., mean age $\pm$ SD $\approx 35.0\pm 21.1$ years), while that of ParVine differs somewhat (mean age $\pm$ SD $\approx 32.3\pm 21.0$ years). In contrast, KL divergence is much lower for ParVine than for GaussDist (see also Table 2), as would be expected from graphics.

Similar non-Gaussian relationships are present in the higher-dimensional datasets NHANES-11 and ORGANWT, see Figures S2 and S3 (Appendix C). KL divergence estimates for all three datasets are summarized in Table 2. For the large and low-dimensional dataset NHANES-3, MICE performed best, followed by NonparVine and ParVine. In the high-dimensional dataset NHANES-11, MICE still had the best performance, but ParVine was better than NonparVine, showing that the two vine copula-based methods scale quite differently with the dataset dimension. Lastly, in the sparse dataset ORGANWT, ParVine showed better performance than MICE and NonparVine. In all three datasets, the two vine copula-based methods and MICE outperformed the simpler (e.g. Gaussian) approaches.

Refer to caption — Figure 1: Scatterplot matrix and 90% highest density regions (HDR), obtained by bivariate kernel density estimation, for dataset NHANES-3 and two fitted covariate distribution models, ParVine and GaussDist.

	NHANES-3	NHANES-11	ORGANWT
GaussDist	1.12 (1.11–1.25)	–	–
IndepCop	1.30 (1.27–1.42)	1.72 (1.63–2.04)	3.62 (2.91–4.92)
GaussCop	0.99 (0.97–1.12)	1.45 (1.39–1.75)	2.73 (2.00–3.83)
ParVine	0.56 (0.52–0.66)	0.62 (0.51–0.85)	1.96 (1.16–2.77)
NonparVine	0.46 (0.42–0.57)	1.03 (0.95–1.27)	2.33 (1.61–3.24)
MICE	0.21 (0.17–0.30)	0.36 (0.25–0.58)	1.96 (1.77–3.27)

Table 2: Estimated Kullback-Leibler divergence (with 95% CI) for all 6 covariate models and 3 datasets. The lowest KL divergence estimate per dataset is highlighted in bold.

MICE, but not copula models, shows overfitting to the data

Due to the flexibility in the shapes that non-Gaussian covariate distribution models can describe, the question naturally arises whether they tend to overfit the data. We investigate this question by splitting the NHANES-3, NHANES-11 and ORGANWT dataset in half at random, fitting models on one half and comparing KL divergence to the fitted (training) vs. the non-fitted (test) data. As shown in Fig. 2, Gaussian and copula-based models have very similar KL divergence estimates on the two data splits. In contrast, the MICE method tends to overfit the data, with much lower KL divergence estimates on the training compared to the test data. To account for this difference in model behaviour, all simulations results shown in this article employ this data split and evaluate KL divergence on the test data only. While Fig. 2 only shows one data split, choosing other splits does not qualitatively change the results. Fig. 2 also illustrates that the estimator can take negative values. In practice, this would be interpreted as 0.

All models tolerate a moderate degree of missing data

Both the copula framework and MICE can handle missing data, and we investigated the fit to the data for different degrees of missingness. Methods GaussDist and IndepCop were excluded from this analysis since GaussDist would need additional post-processing and IndepCop assumes covariates to be independent, which prohibits useful inference for missing values from the model. Two scenarios were considered:

•

NHANES-11-M: datasets with different fractions $p$ of data missing completely at random were simulated. Of note, due to the high dimensionality of this dataset, the fraction of complete observations $f_{\text{complete}}=(1-p)^{11}$ quickly decreases with $p$ (e.g., $f_{\text{complete}}<0.02$ for $p=0.3$ ).
•

ORGANWT-M: datasets with complete observations on sex, strain and body weight, but fractions $p$ of missing organ weights are simulated. This corresponds to the common case where some covariates are much easier to measure than others.

Subsequently, covariate distribution models were fitted to the incomplete datasets, samples were generated from the estimated covariate distributions and KL divergence estimated to an independent (complete) test dataset. The results of this analysis are depicted in Fig. 3. All investigated methods were able to handle large fractions of missing data, with almost unchanged performance for up to $p=0.3$ and mostly moderate increases in KL divergence thereafter.

Latent variables do not compromise model performance

The variables included during covariate distribution model development may not all be needed in a given application scenario. Thus, the question naturally arises how fitting a high-dimensional covariate distribution impacts on the fit to a subset of these covariates. To formally address this question, we split a covariate vector $x$ into observed and latent covariates, $x=(x_{o},x_{l})$ , and investigate the KL divergence between the distribution of observed covariates and the joint covariate distribution model, subsequently marginalized over the latent covariates. We considered two scenarios:

•

NHANES-L: impact of including all variables in NHANES-11 while only those in NHANES-3 are of interest.
•

ORGANWT-L: impact of including body weight, sex and strain as latent variables when the variables of interest are the organ weights.

In both scenarios, the inclusion of latent variables only had a minor impact on estimated KL divergence (Fig. 4). Therefore, including additional covariates during development of a covariate model allows to increase its range of applicability without deteriorating model performance.

4 Conclusion

In this work, the performance of non-Gaussian covariate distribution models was evaluated systematically by means of KL divergence between the underlying distribution and the estimated covariate distribution model. Non-Gaussian methods showed considerably improved goodness-of-fit over Gaussian ones, and packages are available that greatly facilitate the use of these more advanced methods.

Parametric vine copulas (ParVine) and MICE performed best overall. When deciding between these two approaches, two main points need to be considered. First, MICE generally works better than ParVine in a rich data scenario, while ParVine performs as good as MICE for sparse data. The second aspect concerns data sharing: while parametric model fits can be shared easily (e.g., as a table) without having to disclose original data, MICE always relies on the underlying dataset. Therefore, if a dataset cannot be shared, e.g. for intellectual property reasons, a parametric vine copula model like ParVine still allows for a reproducible analysis and easier adoption by other researchers. The nonparametric vine copula model did not perform as well as one might have expected. While it performed better than ParVine in the low-dimensional and observation-rich dataset NHANES-3, the other nonparametric method, MICE, was even better. Also, increasing dimensionality affected the nonparametric vine copula method much more than the other methods.

All copula-based models and MICE were robust to quite large fractions of missing values and inclusion of latent variables. Therefore, building a high-dimensional and/or sparse dataset for covariate modelling does not invalidate the use of such models on a subset of covariates. Since different process models require different covariates as input and complete observations may be difficult to obtain, this is a desirable feature.

The MICE methodology is quite different to the other (distribution-based) models, which has several implications. First, MICE may produce (generally, a small fraction) of simulated values that are identical to observed sets of covariates since the imputed values are drawn from the set of observed values. In this respect, MICE behaves like a (more refined) bootstrapping procedure. Since nearest neighbour-based KL divergence estimator cannot deal with ties in the data, these were removed from the simulated covariate set. Also, sampling from observed values makes MICE much more prone to overfitting, which needs to be accounted for when comparing different models. Here, this was realized by a training/test data split approach.

KL divergence was chosen as a measure to quantify the information loss in an estimated covariate model compared to the data-generating process. While this measure is frequently used and has good theoretical properties, alternatives do exist. For example, other types of $f$ -divergences, such as Hellinger distance or total variation distance, could be used (Rényi, 1961). An alternative of a different type would be the Wasserstein distance (Kantorovich, 1960), but it is not scale-invariant.

Sample-based KL divergence estimation, including uncertainty quantification and mixed continuous/discrete data, was rendered possible by combining nearest neighbour density estimation with subsampling and a reformulation conditional on the discrete covariates. While this approach extends already existing methods significantly, some caveats do apply. First, the conditional formulation does not scale well with the number of categories and discrete variables. Also, KL divergence estimation by nearest neighbour-based methods cannot deal with ties in the dataset. Such ties could appear at several levels, namely the data itself (which might be resampled for representativity reasons, as the NHANES data), the covariate distribution method (as it may happen in MICE), or resampling-based uncertainty quantification. In this work, all ties were avoided, either by the choice of method (subsampling instead of bootstrapping) or as a post-processing step (for NHANES and MICE). Jittering could be an alternative to this approach, but the jittering strength is another parameter which is expected to have a significant impact. Therefore, this option was not considered in the present work.

While the investigated covariate distribution models cover a broad spectrum of modelling practice, some additional approaches are worth mentioning. First, bootstrapping from the covariate set $\mathbf{x}$ is often done (Teutonico et al., 2015). Since all samples $y^{(i)}$ produced this way have an identical counterpart in the dataset $\mathbf{x}$ , the KL divergence estimator cannot deal with this method – again, jittering would be an option. Although bootstrapping was excluded from our analysis, we did consider MICE, which has many similarities with bootstrapping (and, in addition, can readily deal with missing data). Second, generative machine learning models like diffusion models have been applied with success to the simulation of, e.g., realistic images (Cao et al., 2024). In principle, such methods could also be applied to covariate distribution modelling, but since covariate datasets are differently structured (non-spatial) and much more sparse when compared to image data, considerable adjustments would certainly be required for these methods, which were out of scope of this work.

Furthermore, all covariates were considered static, i.e. time-varying covariates were out of scope. First results on copula modelling for time-varying covariates were given in Zwep et al. (2023). Also, the impact of covariate distribution model misspecification on (process) model predictions was not investigated. While the question is of much relevance and could be investigated in future work, we decided to focus on the covariate distribution model itself, in order to sharpen the focus and not to add another model type (a process model) to the present study.

In summary, our findings demonstrate that non-Gaussian covariate distribution modelling can be successfully applied to realistic life science covariate datasets.

Acknowledgments

We thank Laura Zwep and Yushen Guo (University of Leiden, Netherlands) for fruitful discussions on copula modelling, as well as the Mathematical Modelling and Systems Biology group at University of Potsdam, Germany for helpful comments on the manuscript.

References

NHA (2009-2012) Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Questionnaire (or Examination Protocol, or Laboratory Protocol). Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention., 2009-2012. URL https://www.cdc.gov/nchs/nhanes/index.htm.
Bajaj et al. (2019) Gaurav Bajaj, Satyendra Suryawanshi, Amit Roy, and Manish Gupta. Evaluation of covariate effects on pharmacokinetics of monoclonal antibodies in oncology. British Journal of Clinical Pharmacology, 85(9):2045–2058, July 2019. ISSN 1365-2125. doi: 10.1111/bcp.13996.
Buuren and Groothuis-Oudshoorn (2011) Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 2011. ISSN 1548-7660. doi: 10.18637/jss.v045.i03.
Cao et al. (2024) Hanqun Cao, Cheng Tan, Zhangyang Gao, Yilun Xu, Guangyong Chen, Pheng-Ann Heng, and Stan Z. Li. A survey on generative diffusion models. IEEE Transactions on Knowledge and Data Engineering, pages 1–20, 2024. ISSN 2326-3865. doi: 10.1109/tkde.2024.3361474.
Czado (2019) Claudia Czado. Analyzing Dependent Data with Vine Copulas: A Practical Guide With R. Springer International Publishing, 2019. ISBN 9783030137854. doi: 10.1007/978-3-030-13785-4.
Geyer (2013) Charles J. Geyer. 5601 notes: The subsampling bootstrap, 2013. URL https://www.stat.umn.edu/geyer/5601/notes/sub.pdf.
Hamberg et al. (2010) A.-K. Hamberg, M Wadelius, J D Lindh, M L Dahl, R Padrini, P Deloukas, A Rane, and E N Jonsson. A Pharmacometric Model Describing the Relationship Between Warfarin Dose and INR Response With Respect to Variations in CYP2C9, VKORC1, and Age. Clinical Pharmacology &https://doi.org/10.1111/bcp.13996 Therapeutics, 87(6):727–734, April 2010. ISSN 1532-6535. doi: 10.1038/clpt.2010.37.
Hartung (2024) Niklas Hartung. kldest: Sample-Based Estimation of Kullback-Leibler Divergence, 2024. URL https://cran.r-project.org/package=kldest. R package version 1.0.0.
Hartung and Khatova (2024) Niklas Hartung and Aleksandra Khatova. Supplementary Code for article ”Information-theoretic evaluation of covariate distributions models”, 2024. URL https://doi.org/10.5281/zenodo.11656643. Zenodo (version 1.0.0).
Huisinga et al. (2012) W. Huisinga, A. Solms, L. Fronton, and S. Pilari. Modeling Interindividual Variability in Physiologically Based Pharmacokinetics and Its Link to Mechanistic Covariate Modeling. CPT: Pharmacometrics & Systems Pharmacology, 1(9):1–10, September 2012. ISSN 2163-8306. doi: 10.1038/psp.2012.3.
Jones and Rowland‐Yeo (2013) H. M. Jones and K. Rowland‐Yeo. Basic Concepts in Physiologically Based Pharmacokinetic Modeling in Drug Discovery and Development. CPT: Pharmacometrics & Systems Pharmacology, 2(8):1–12, August 2013. ISSN 2163-8306. doi: 10.1038/psp.2013.41.
Kantorovich (1960) L. V. Kantorovich. Mathematical methods of organizing and planning production. Management Science, 6(4):366–422, July 1960. ISSN 1526-5501. doi: 10.1287/mnsc.6.4.366.
Kullback and Leibler (1951) S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, March 1951. ISSN 0003-4851. doi: 10.1214/aoms/1177729694.
Little (1988) Roderick J. A. Little. Missing-Data Adjustments in Large Surveys. Journal of Business & Economic Statistics, 6(3):287–296, July 1988. ISSN 1537-2707. doi: 10.1080/07350015.1988.10509663.
Marxfeld et al. (2019) Heike Antje Marxfeld, Karin Küttler, Martina Dammann, Sibylle Gröters, and Bennard van Ravenzwaay. Body and organ weight data in 28-day toxicological studies in two mouse strains. Data in Brief, 27:104632, December 2019. ISSN 2352-3409. doi: 10.1016/j.dib.2019.104632.
Nagler and Vatter (2022) Thomas Nagler and Thibault Vatter. kde1d: Univariate Kernel Density Estimation, 2022. URL https://CRAN.R-project.org/package=kde1d. R package version 1.0.5.
Nagler and Vatter (2023) Thomas Nagler and Thibault Vatter. rvinecopulib: High Performance Algorithms for Vine Copula Modeling, 2023. URL https://CRAN.R-project.org/package=rvinecopulib. R package version 0.6.3.1.1.
Nagler et al. (2017) Thomas Nagler, Christian Schellhase, and Claudia Czado. Nonparametric estimation of simplified vine copula models: comparison of methods. Dependence Modeling, 5(1):99–120, January 2017. ISSN 2300-2298. doi: 10.1515/demo-2017-0007.
Noh et al. (2014) Yung-Kyun Noh, Masashi Sugiyama, Song Liu, Marthinus C. Plessis, Frank Chongwoo Park, and Daniel D. Lee. Bias Reduction and Metric Learning for Nearest-Neighbor Estimation of Kullback-Leibler Divergence. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 669–677, Reykjavik, Iceland, 22–25 Apr 2014. PMLR. URL https://proceedings.mlr.press/v33/noh14.html.
Perez-Cruz (2008) Fernando Perez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In 2008 IEEE International Symposium on Information Theory. IEEE, July 2008. doi: 10.1109/isit.2008.4595271.
Politis et al. (1999) D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer Series in Statistics. Springer New York, 1999. ISBN 9780387988542. URL https://books.google.de/books?id=nGu6rqjE6JoC.
Politis and Romano (1994) Dimitris N. Politis and Joseph P. Romano. Large Sample Confidence Regions Based on Subsamples under Minimal Assumptions. The Annals of Statistics, 22(4), December 1994. ISSN 0090-5364. doi: 10.1214/aos/1176325770.
Pruim (2015) Randall Pruim. NHANES: Data from the US National Health and Nutrition Examination Study, 2015. URL https://CRAN.R-project.org/package=NHANES. R package version 2.1.0.
R Core Team (2022) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2022. URL https://www.R-project.org/.
Rényi (1961) Alfréd Rényi. On Measures of Entropy and Information. Berkeley Symp. on Math. Statist. and Prob., pages 547–561, 1961.
Schervish (1995) Mark J. Schervish. Theory of Statistics. Springer New York, 1995. ISBN 9781461242505. doi: 10.1007/978-1-4612-4250-5.
Singh and Póczos (2016) Shashank Singh and Barnabás Póczos. Finite-Sample Analysis of Fixed-k Nearest NeighborDensity Functional Estimators. Conference on Neural Information Processing Systems, 2016.
Smania and Jonsson (2021) Giovanni Smania and E. Niclas Jonsson. Conditional distribution modeling as an alternative method for covariates simulation: Comparison with joint multivariate normal and bootstrap techniques. CPT: Pharmacometrics & Systems Pharmacology, 10(4):330–339, April 2021. ISSN 2163-8306. doi: 10.1002/psp4.12613.
Sumpter and Holford (2011) Anita L Sumpter and Nick H. G. Holford. Predicting weight using postmenstrual age – neonates to adults. Pediatric Anesthesia, 21(3):309–315, February 2011. ISSN 1460-9592. doi: 10.1111/j.1460-9592.2011.03534.x.
Tannenbaum et al. (2006) Stacey J. Tannenbaum, Nicholas H. G. Holford, Howard Lee, Carl C. Peck, and Diane R. Mould. Simulation of Correlated Continuous and Categorical Variables using a Single Multivariate Distribution. Journal of Pharmacokinetics and Pharmacodynamics, 33(6):773–794, October 2006. ISSN 1573-8744. doi: 10.1007/s10928-006-9033-1.
Teutonico et al. (2015) D. Teutonico, F. Musuamba, H. J. Maas, A. Facius, S. Yang, M. Danhof, and O. Della Pasqua. Generating Virtual Patients by Multivariate and Discrete Re-Sampling Techniques. Pharmaceutical Research, 32(10):3228–3237, May 2015. ISSN 1573-904X. doi: 10.1007/s11095-015-1699-x.
Wang et al. (2009) Qing Wang, Sanjeev R. Kulkarni, and Sergio Verdu. Divergence Estimation for Multidimensional Densities Via k-Nearest-Neighbor Distances. IEEE Transactions on Information Theory, 55(5):2392–2405, May 2009. ISSN 0018-9448. doi: 10.1109/tit.2009.2016060.
Zhao and Lai (2020) Puning Zhao and Lifeng Lai. Analysis of K Nearest Neighbor KL Divergence Estimation for Continuous Distributions. In 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, June 2020. doi: 10.1109/isit44484.2020.9174033.
Zwep et al. (2023) Laura B. Zwep, Tingjie Guo, Thomas Nagler, Catherijne A.J. Knibbe, Jacqueline J. Meulman, and J.G. Coen van Hasselt. Virtual patient simulation using copula modeling. Clinical Pharmacology & Therapeutics, November 2023. ISSN 1532-6535. doi: 10.1002/cpt.3099.

Appendix

Appendix A KL divergence estimation for combined continuous/discrete data

Here, we describe KL divergence estimation in the general case where some covariates are continuous and others discrete. Decomposing $x=(x_{c},x_{d})$ into the continuous and discrete parts $x_{c}$ and $x_{d}$ , respectively, KL divergence between $p$ and $q$ is then given by

D_{\text{KL}}(p||q):=\sum\limits_{x_{d}}\int\log\left(\frac{p(x_{c},x_{d})}{q(% x_{c},x_{d})}\right)p(x_{c},x_{d})dx_{c}.

(6)

Conditioning on the discrete variables and reordering, we can rewrite $D_{\text{KL}}(p||q)$ as

D_{\text{KL}}(p||q)=\sum\limits_{x_{d}}p(x_{d})\int\log\left(\frac{p(x_{c}|x_{% d})}{q(x_{c}|x_{d})}\right)dx_{c}+\sum\limits_{x_{d}}\log\left(\frac{p(x_{d})}% {q(x_{d})}\right)p(x_{d}),

(7)

known as the chain rule for KL divergence (Schervish, 1995, p.115f).

The integral in the first summand can be estimated on the stratified data $\mathbf{x}_{c}|x_{d}$ and $\mathbf{y}_{c}|x_{d}$ using the estimator (4) for continuous data described in the main text (where $x_{c}\in\mathbf{x}_{c}|x_{d}\Leftrightarrow(x_{c},x_{d})\in\mathbf{x}$ ). The terms $p(x_{d})$ and $q(x_{d})$ are estimated using relative frequencies of $x_{d}$ in $\mathbf{x}$ and $\mathbf{y}$ , respectively; for details, see Hartung (2024).

Appendix B Scale dependency of KL divergence estimates

While KL divergence is scale-invariant, the estimator (4) might differ between scales. In the main text, all KL divergence estimates are computed on the original data scale. We investigated a potential scale dependency by computing KL divergence on both the original scale (using samples $\mathbf{x}$ and $\mathbf{y}$ ) and the uniform scale (using the transformed samples $\hat{u}(\mathbf{x})$ and $\hat{u}(\mathbf{y})$ , cf. (2)) and found that the impact was very low for NHANES-3, low for NHANES-11 datasets, and at most moderate for the ORGANWT dataset (see Fig. S1).