Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Information-theoretic evaluation of covariate distributions models

Niklas Hartung1,∗, Aleksandra Khatova1,2,∗
(1 Institute of Mathematics, University of Potsdam, Germany
2 Institute of BTU Cottbus-Senftenberg, Germany
*: equal contribution
)
Abstract

Statistical modelling of covariate distributions allows to generate virtual populations or to impute missing values in a covariate dataset. Covariate distributions typically have non-Gaussian margins and show nonlinear correlation structures, which simple descriptions like multivariate Gaussian distributions fail to represent. Prominent non-Gaussian frameworks for covariate distribution modelling are copula-based models and models based on multiple imputation by chained equations (MICE). While both frameworks have already found applications in the life sciences, a systematic investigation of their goodness-of-fit to the theoretical underlying distribution, indicating strengths and weaknesses under different conditions, is still lacking. To bridge this gap, we thoroughly evaluated covariate distribution models in terms of Kullback-Leibler (KL) divergence, a scale-invariant information-theoretic goodness-of-fit criterion for distributions. Methodologically, we proposed a new approach to construct confidence intervals for KL divergence by combining nearest neighbour-based KL divergence estimators with subsampling-based uncertainty quantification. In relevant data sets of different sizes and dimensionalities with both continuous and discrete covariates, non-Gaussian models showed consistent improvements in KL divergence, compared to simpler Gaussian or scale transform approximations. KL divergence estimates were also robust to the inclusion of latent variables and large fractions of missing values. While good generalization behaviour to new data could be seen in copula-based models, MICE shows a trend for overfitting and its performance should always be evaluated on separate test data. Parametric copula models and MICE were found to scale much better with the dimension of the dataset than nonparametric copula models. These findings corroborate the potential of non-Gaussian models for modelling realistic life science covariate distributions.


1 Introduction

The impact of covariates on an outcome of interest is of relevance in many modelling exercises. To name a few examples, age, weight, sex as well as biomarkers and genotypic information enter many pharmacokinetic models (Hamberg et al., 2010; Bajaj et al., 2019) and physiologically-based pharmacokinetic (PBPK) models even rely on a large set of organ weights and blood flows (Jones and Rowland‐Yeo, 2013). Often, a set of observed covariates is used for such simulations, but this approach comes with important limitations: (i) Mostly, only summary statistics are shared — other researchers cannot reproduce the results since the information on the correlation structure is lost; (ii) Specific subpopulations may be too sparsely observed for meaningful simulations; (iii) Missing values in a covariate dataset need to be dealt with. In contrast, statistical models for covariate distributions allow to generate virtual populations of arbitrary sizes and can often naturally deal with missing values. Since covariate distributions typically have non-Gaussian margins and show nonlinear correlation structures, simple approaches based on multivariate Gaussian distributions may fail to represent them accurately.

A natural approach to overcome a Gaussian assumption for margins of continuous covariates is to apply a scale transform (e.g., log-transform for intrinsically positive covariates). An effort to include discrete variables into such a lognormal model has been made by Tannenbaum et al. (2006). Recently, two non-Gaussian methods, namely Multivariate Imputation by Chained Equations (MICE) and copula-based models, have also been proposed for the simulation of covariate distributions. MICE, an imputation method for missing data (Buuren and Groothuis-Oudshoorn, 2011), has been applied for the simulation of covariate distributions in Smania and Jonsson (2021). Copula modeling, which has a history in financial mathematics, has been introduced to life sciences by Zwep et al. (2023). Beyond these data-driven approaches, mechanistic scaling methods have been proposed for some domains like PBPK model parametrization (organ weights/blood flows) (Huisinga et al., 2012). Also, regression models have been developed for some examples, like the distribution of body weight as a function of age and sex (Sumpter and Holford, 2011). Despite these efforts, multivariate covariate distribution modelling remains the most generally applicable method.

Existing studies have evaluated model performance in terms of univariate or linear measures of association (mean and (co-)variance), which do not discriminate between Gaussian and non-Gaussian models, or by visual inspection (Smania and Jonsson, 2021; Zwep et al., 2023). However, a goodness-of-fit metric for the evaluation of covariate distribution models needs to be able to detect any deviation, including non-Gaussian, from the data generating distribution. Kullback-Leibler (KL) divergence fulfils these demands; furthermore it is scale-invariant and can be interpreted well, namely as the information loss when approximating the true population distribution by a surrogate model (Kullback and Leibler, 1951). Also, typical complicating factors like mixed discrete/continuous data, missing data, latent variables and overfitting have received little attention. We therefore aimed at evaluating goodness-of-fit via KL divergence and in a variety of situations, also covering the above-mentioned issues.

Since in practice, only a sample from the data generating distribution is available, sample-based estimation of KL divergence is required. For high-dimensional continuous data, methods based on nearest neighbour density estimation, combined with a finite sample bias correction, have been found to perform well (Perez-Cruz, 2008; Wang et al., 2009). However, both the extension to mixed discrete/continuous data as well as uncertainty quantification have not been discussed in the literature. Therefore, a secondary goal of the present work is to extend the available methodology to accommodate mixed data types (a common case in covariate data sets) and to determine confidence intervals for KL divergence estimates (which allows to judge the significance of differences in KL divergence estimates between models/scenarios). An implementation of this approach, combining KL divergence estimators with subsampling-based uncertainty quantification, has recently been distributed as R package kldest (Hartung, 2024).

2 Methods

Datasets

We considered three different datasets that cover both data-rich and -sparse, as a well low- and high-dimensional situations, thereby providing a broad range for the benchmarking covariate distribution models.

NHANES-3/-11.

The US national health and nutrition examination survey (NHANES) contains a collection of demographic, physical, health-related and lifestyle-related variables from a representative sample of the US population (NHA, 2009-2012). The dataset distributed in R package NHANES contains a resampled version of n=10000𝑛10000n=10000italic_n = 10000 individuals from the 2009-2012 NHANES survey, to counterbalance oversampling of certain subpopulations (Pruim, 2015). From this dataset, we ignored individuals with age of 80 years or above (for whom exact age is not given) and those with a diastolic blood pressure of 0 (likely a database error). Subsequently, we derived two NHANES analysis datasets by selecting a subset of covariates and removing duplicates and missing values:

NHANES-3

(data-rich low-dimensional setting). 6230 unique measurements of 3 continuous covariates: age, weight and height.

NHANES-11

(medium-sparse high-dimensional setting). 2133 unique measurements of 11 covariates, of which 10 are continuous (age, weight, height, heart rate, systolic blood pressure, diastolic blood pressure, testosterone concentrations, total cholesterol, urine volume, urine flow rate) and one is discrete (sex).

ORGANWT

(very sparse high-dimensional setting). A dataset on organ weights of two different mouse strains has been reported by Marxfeld et al. (2019). From this dataset, we selected 6 continuous covariates (body weight and the 5 most relevant organs weights for PBPK modelling (Jones and Rowland‐Yeo, 2013) that are available in the dataset, namely brain, heart, kidneys, liver, spleen) and 2 discrete variables (sex and strain), totalling 145 measurements of 8 covariates (after removing one animal in which brain weight was missing).

The sizes of the datasets are summarised in Table 1.

Number of covariates Number of
continuous discrete measurements
NHANES-3 3 0 6230
NHANES-11 10 1 2133
ORGANWT 6 2 145
Table 1: Dimensions of the three datasets.

General notation and scope

We consider the observed covariates 𝐱=(x(1),,x(n))𝐱superscript𝑥1superscript𝑥𝑛\mathbf{x}=(x^{(1)},...,x^{(n)})bold_x = ( italic_x start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , italic_x start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ) to be an i.i.d. sample from an (unknown) d𝑑ditalic_d-dimensional probability distribution with density p𝑝pitalic_p and (cumulative) distribution function F𝐹Fitalic_F. For example, in dataset NHANES-3, x(i)=(x1(i),x2(i),x3(i))superscript𝑥𝑖subscriptsuperscript𝑥𝑖1subscriptsuperscript𝑥𝑖2subscriptsuperscript𝑥𝑖3x^{(i)}=(x^{(i)}_{1},x^{(i)}_{2},x^{(i)}_{3})italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = ( italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) is age, weight and height of the i𝑖iitalic_i-th observed individual (d=3𝑑3d=3italic_d = 3). The aim of covariate distribution modelling is to estimate a surrogate model q𝑞qitalic_q which approximates p𝑝pitalic_p, in order to be able to simulate i.i.d. data 𝐲𝐲\mathbf{y}bold_y from this surrogate model. Alternatively, a method may directly provide such data 𝐲𝐲\mathbf{y}bold_y.

Copula models

Here we give a short introduction to copula modelling for continuous distributions. Since the notation required for a general presentation of theory and estimation (especially for vine copulas) is complex, we refer to Czado (2019) for further details. Any multivariate density p𝑝pitalic_p can be uniquely decomposed into a product of marginals and their dependency structure (known as Sklar’s theorem),

p(x1,,xd)=p1(x1)pd(xd)c(F1(x1),,Fd(xd)),𝑝subscript𝑥1subscript𝑥𝑑subscript𝑝1subscript𝑥1subscript𝑝𝑑subscript𝑥𝑑𝑐subscript𝐹1subscript𝑥1subscript𝐹𝑑subscript𝑥𝑑p(x_{1},...,x_{d})=p_{1}(x_{1})\cdot\ldots\cdot p_{d}(x_{d})\cdot c\big{(}F_{1% }(x_{1}),...,F_{d}(x_{d})\big{)},italic_p ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) = italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⋅ … ⋅ italic_p start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ⋅ italic_c ( italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , … , italic_F start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ) , (1)

where pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Fisubscript𝐹𝑖F_{i}italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are the i𝑖iitalic_i-th marginal density and distribution functions, respectively, and where c𝑐citalic_c is a copula density, supported on the unit cube. This decomposition allows to estimate the marginals and the dependency structure in two separate steps.

First, the marginals are modelled, giving rise to estimated marginal distribution functions F^1,,F^dsubscript^𝐹1subscript^𝐹𝑑\hat{F}_{1},...,\hat{F}_{d}over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT. In this step, standard univariate distribution fitting methods can be used. Here, we employed univariate local polynomial kernel density estimators as implemented in R package kde1d (Nagler and Vatter, 2022). If the margins are regular enough, parametric methods can be used as well.

Next, using the estimated marginals, the covariates are transformed to uniform scale via

xu^(x):=(F^1(x1),,F^d(xd)).maps-to𝑥^𝑢𝑥assignsubscript^𝐹1subscript𝑥1subscript^𝐹𝑑subscript𝑥𝑑x\mapsto\hat{u}(x):=\big{(}\hat{F}_{1}(x_{1}),...,\hat{F}_{d}(x_{d})\big{)}.italic_x ↦ over^ start_ARG italic_u end_ARG ( italic_x ) := ( over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , … , over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ) . (2)

A copula model can then be estimated on the transformed data 𝐮=u^(𝐱)𝐮^𝑢𝐱\mathbf{u}=\hat{u}(\mathbf{x})bold_u = over^ start_ARG italic_u end_ARG ( bold_x ). One possible approach consists of using multivariate copula models such as Gaussian copulas. While this approach is conceptually simple, it assumes the same type of dependency across all variables. A more flexible copula modelling approach consists of combining bivariate copula models in a pair copula decomposition of the multivariate copula. These copula models, termed vine copulas, use a graph-theoretical structure (vine) describing the conditional dependency structure of the pair copula decomposition. Modelling with vine copulas is supported by libraries such as rvinecopulib (Nagler and Vatter, 2023), which automatize the selection of both the vine structure and the pair copulas, and allow for efficient simulation.

Multiple imputation by chained equations (MICE)

MICE is, at its origin, a method for imputing missing values in a datasets. Also known as conditional distribution modelling, it aims at predicting a missing value given all other (non-missing) observations. Within MICE, different algorithms are available. We used the defaults in the R implementation in package mice namely predictive mean matching (Little, 1988) for continuous variables and logistic regression for categorical variables, for which robust performance has been reported (Buuren and Groothuis-Oudshoorn, 2011). To use this method for the simulation of covariate distributions, we used the method described in Smania and Jonsson (2021): (i) if missing data are present, a single standard MICE step is used to fill the missing data; (ii) then, all observed covariates are again labeled as missing and drawn in turn according to MICE in order to obtain a sample.

Covariate distribution models

Six different covariate distribution models are considered in this article, which differ in how marginals are modelled, the choice of dependency structure and the use of parametric or nonparametric elements.

GaussDist.

A multivariate Gaussian distribution (with density q𝑞qitalic_q) is fitted to the data 𝐱𝐱\mathbf{x}bold_x on original scale. Hence, all marginals are also Gaussian. This method only applies to continuous covariates.

IndepCop.

An independence copula is used for the transformed data 𝐮𝐮\mathbf{u}bold_u. This means that marginals are modelled, but no dependency structure.

GaussCop.

A multivariate Gaussian copula is fitted to the transformed data 𝐮𝐮\mathbf{u}bold_u. This scenario combines GaussDist and IndepCop. It can be considered an optimized version of the “continuous method” by Tannenbaum et al. (2006), in which the log-transform is replaced by an kernel-based transform.

ParVine.

A vine copula is fitted to the transformed data 𝐮𝐮\mathbf{u}bold_u. For each dependency element within the vine copula structure, the best bivariate copula is selected from a family of parametric copulas.

NonparVine.

A vine copula is fitted to the transformed data 𝐮𝐮\mathbf{u}bold_u. For each dependency element, a nonparametric bivariate copula model is constructed based on a transform kernel approach (Nagler et al., 2017).

MICE.

The MICE method is applied to the entire dataset, as described in Smania and Jonsson (2021).

Kullback-Leibler divergence estimation

For ease of notation, we assume here that all considered covariates are continuous. A combined continuous/discrete framework is described in Appendix A. We assume a surrogate model with density q𝑞qitalic_q (i.e., a multivariate Gaussian distribution or a copula model) has been estimated from the data 𝐱𝐱\mathbf{x}bold_x. The KL divergence between p𝑝pitalic_p and q𝑞qitalic_q is then defined as

DKL(p||q):=dlog(p(x)q(x))p(x)dx.D_{\text{KL}}(p||q):=\int_{{\mathbb{R}}^{d}}\log\left(\frac{p(x)}{q(x)}\right)% p(x)dx.italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_p | | italic_q ) := ∫ start_POSTSUBSCRIPT blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_POSTSUBSCRIPT roman_log ( divide start_ARG italic_p ( italic_x ) end_ARG start_ARG italic_q ( italic_x ) end_ARG ) italic_p ( italic_x ) italic_d italic_x . (3)

To estimate DKL(p||q)D_{\text{KL}}(p||q)italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_p | | italic_q ), we used a nearest neighbour density-based method proposed in Wang et al. (2009). In their approach, DKL(p||q)D_{\text{KL}}(p||q)italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_p | | italic_q ) is estimated from two samples, 𝐱𝐱\mathbf{x}bold_x from p𝑝pitalic_p and 𝐲𝐲\mathbf{y}bold_y from q𝑞qitalic_q. While 𝐱𝐱\mathbf{x}bold_x is already available, a (large) sample 𝐲=(y(1),,y(m))𝐲superscript𝑦1superscript𝑦𝑚\mathbf{y}=(y^{(1)},...,y^{(m)})bold_y = ( italic_y start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , italic_y start_POSTSUPERSCRIPT ( italic_m ) end_POSTSUPERSCRIPT ) is simulated from the surrogate model q𝑞qitalic_q. For copula-based models, 𝐲𝐲\mathbf{y}bold_y is obtained by transforming a copula sample 𝐯𝐯\mathbf{v}bold_v back to the original scale, 𝐲=u^1(𝐯)𝐲superscript^𝑢1𝐯\mathbf{y}=\hat{u}^{-1}(\mathbf{v})bold_y = over^ start_ARG italic_u end_ARG start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_v ). Working with a second sample 𝐲𝐲\mathbf{y}bold_y rather than directly using the surrogate model density q𝑞qitalic_q allowed for the use of a two-sample bias reduction method. Also, it allows a natural comparison to the MICE method, in which a sample 𝐲𝐲\mathbf{y}bold_y is obtained directly without estimating a density first. Since samples obtained by the MICE method may contain duplicates but the nearest neighbour-based KL divergence estimator (4) cannot deal with duplicates in the data, we apply post-processing to the MICE samples by selecting only unique items. To define this KL divergence estimator, we denote by ρk(i)subscript𝜌𝑘𝑖\rho_{k}(i)italic_ρ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_i ) and νk(i)subscript𝜈𝑘𝑖\nu_{k}(i)italic_ν start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_i ) the distance to the k𝑘kitalic_k-th nearest neighbour of x(i)superscript𝑥𝑖x^{(i)}italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT in 𝐱x(i)𝐱superscript𝑥𝑖\mathbf{x}\setminus x^{(i)}bold_x ∖ italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT and in 𝐲𝐲\mathbf{y}bold_y, respectively. Furthermore, let εi=max(ρ1(i),ν1(i))subscript𝜀𝑖subscript𝜌1𝑖subscript𝜈1𝑖\varepsilon_{i}=\max(\rho_{1}(i),\nu_{1}(i))italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_max ( italic_ρ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_i ) , italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_i ) ) be the larger of the distances to the first nearest neighbour of x(i)superscript𝑥𝑖x^{(i)}italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT in 𝐱𝐱\mathbf{x}bold_x and 𝐲𝐲\mathbf{y}bold_y, and ki,lisubscript𝑘𝑖subscript𝑙𝑖k_{i},l_{i}italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (one of ki,lisubscript𝑘𝑖subscript𝑙𝑖k_{i},l_{i}italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is always 1) be the largest possible values such that ρki(i)εisubscript𝜌subscript𝑘𝑖𝑖subscript𝜀𝑖\rho_{k_{i}}(i)\leq\varepsilon_{i}italic_ρ start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_i ) ≤ italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and νli(i)εisubscript𝜈subscript𝑙𝑖𝑖subscript𝜀𝑖\nu_{l_{i}}(i)\leq\varepsilon_{i}italic_ν start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_i ) ≤ italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Then, the bias-corrected KL divergence estimator by Wang et al. (2009) is given by

DKL^(𝐱,𝐲)=dni=1nlogνli(i)ρki(i)+logmn1+1ni=1n(ψ(li)ψ(ki)).^subscript𝐷KL𝐱𝐲𝑑𝑛superscriptsubscript𝑖1𝑛subscript𝜈subscript𝑙𝑖𝑖subscript𝜌subscript𝑘𝑖𝑖𝑚𝑛11𝑛superscriptsubscript𝑖1𝑛𝜓subscript𝑙𝑖𝜓subscript𝑘𝑖\widehat{D_{\text{KL}}}(\mathbf{x},\mathbf{y})=\frac{d}{n}\sum_{i=1}^{n}\log% \frac{\nu_{l_{i}}(i)}{\rho_{k_{i}}(i)}+\log\frac{m}{n-1}+\frac{1}{n}\sum_{i=1}% ^{n}\Big{(}\psi(l_{i})-\psi(k_{i})\Big{)}.over^ start_ARG italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT end_ARG ( bold_x , bold_y ) = divide start_ARG italic_d end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_log divide start_ARG italic_ν start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_i ) end_ARG start_ARG italic_ρ start_POSTSUBSCRIPT italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_i ) end_ARG + roman_log divide start_ARG italic_m end_ARG start_ARG italic_n - 1 end_ARG + divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( italic_ψ ( italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - italic_ψ ( italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) . (4)

The first two summands are a Monte Carlo approximation of KL divergence using nearest neighbour density estimates, the last term (involving the digamma function ψ(x)=ddxlogΓ(x)𝜓𝑥𝑑𝑑𝑥Γ𝑥\psi(x)=\frac{d}{dx}\log\Gamma(x)italic_ψ ( italic_x ) = divide start_ARG italic_d end_ARG start_ARG italic_d italic_x end_ARG roman_log roman_Γ ( italic_x )) compensates the asymptotic bias of nearest neighbour density estimation, and the use of different numbers of neighbours in ρ𝜌\rhoitalic_ρ and ν𝜈\nuitalic_ν counterbalances finite sample bias originating from different convergence speed of nearest neighbour density estimates (for details, see Wang et al., 2009). The estimator is implemented and documented in the R package kldest (Hartung, 2024).

While KL divergence is scale-invariant, the estimator (4) might differ between scales. In the main text, all results are shown on the original scale. An investigation of differences between original and uniform scale (where the copula models are fitted) is shown in Figure S1 (Appendix B).

Uncertainty quantification

Uncertainty quantification of KL divergence estimators has not been addressed in the literature beyond the 1-D discrete case, and in particular, no asymptotic distributions of nearest neighbour-based KL divergence estimators are available. Also, as mentioned, the nearest neighbour-based KL divergence estimator (4) cannot deal with samples containing duplicates. This means that standard bootstrapping, which relies on resampling with replacement, cannot be used for uncertainty quantification either. Instead, we opted for using subsampling without replacement as described by Politis and Romano (1994).

The following description of the generic subsampling procedure is modified after Geyer (2013). Let θ^n=θ^n(x(1),,x(n))subscript^𝜃𝑛subscript^𝜃𝑛superscript𝑥1superscript𝑥𝑛\hat{\theta}_{n}=\hat{\theta}_{n}(x^{(1)},...,x^{(n)})over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_x start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , italic_x start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ) be an estimator for a parameter of interest θ𝜃\thetaitalic_θ such that τn(θ^nθ)subscript𝜏𝑛subscript^𝜃𝑛𝜃\tau_{n}(\hat{\theta}_{n}-\theta)italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - italic_θ ) converges in distribution to some cumulative distribution function G𝐺Gitalic_G. Asymptotically, it follows that [G1(α/2)<τn(θ^nθ)<G1(1α/2)]1αdelimited-[]superscript𝐺1𝛼2subscript𝜏𝑛subscript^𝜃𝑛𝜃superscript𝐺11𝛼21𝛼\mathbb{P}\big{[}G^{-1}(\alpha/2)<\tau_{n}(\hat{\theta}_{n}-\theta)<G^{-1}(1-% \alpha/2)\big{]}\rightarrow 1-\alphablackboard_P [ italic_G start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_α / 2 ) < italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - italic_θ ) < italic_G start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( 1 - italic_α / 2 ) ] → 1 - italic_α. If θbsubscriptsuperscript𝜃𝑏\theta^{*}_{b}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT denotes the estimator for a subsample from 𝐱𝐱\mathbf{x}bold_x of size b𝑏bitalic_b such that b𝑏b\rightarrow\inftyitalic_b → ∞ and b/n0𝑏𝑛0b/n\rightarrow 0italic_b / italic_n → 0 as n𝑛n\rightarrow\inftyitalic_n → ∞, then τb(θbθ^n)subscript𝜏𝑏subscriptsuperscript𝜃𝑏subscript^𝜃𝑛\tau_{b}(\theta^{*}_{b}-\hat{\theta}_{n})italic_τ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT - over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) converges in distribution to the same limiting distribution G𝐺Gitalic_G. Also, if Gs,bsubscript𝐺𝑠𝑏G_{s,b}italic_G start_POSTSUBSCRIPT italic_s , italic_b end_POSTSUBSCRIPT denotes the empirical cumulative distribution function of s𝑠sitalic_s subsamples of size b𝑏bitalic_b, with s𝑠s\rightarrow\inftyitalic_s → ∞ as n𝑛n\rightarrow\inftyitalic_n → ∞, then Gs,bGsubscript𝐺𝑠𝑏𝐺G_{s,b}\rightarrow Gitalic_G start_POSTSUBSCRIPT italic_s , italic_b end_POSTSUBSCRIPT → italic_G. Combining these equations, it follows that

[Gs,b1(α/2)<τn(θ^nθ)<Gs,b1(1α/2)]1αdelimited-[]superscriptsubscript𝐺𝑠𝑏1𝛼2subscript𝜏𝑛subscript^𝜃𝑛𝜃subscriptsuperscript𝐺1𝑠𝑏1𝛼21𝛼\mathbb{P}\big{[}G_{s,b}^{-1}(\alpha/2)<\tau_{n}(\hat{\theta}_{n}-\theta)<G^{-% 1}_{s,b}(1-\alpha/2)\big{]}\rightarrow 1-\alphablackboard_P [ italic_G start_POSTSUBSCRIPT italic_s , italic_b end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_α / 2 ) < italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - italic_θ ) < italic_G start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s , italic_b end_POSTSUBSCRIPT ( 1 - italic_α / 2 ) ] → 1 - italic_α (5)

as n𝑛n\rightarrow\inftyitalic_n → ∞. Hence, by solving the inequality in brackets in (5) for θ𝜃\thetaitalic_θ, an asymptotic (1α)1𝛼(1-\alpha)( 1 - italic_α ) confidence interval for θ𝜃\thetaitalic_θ is obtained.

In our setting, θ=DKL(p||q)\theta=D_{\text{KL}}(p||q)italic_θ = italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_p | | italic_q ) and θ^=DKL^(𝐱,𝐲)^𝜃^subscript𝐷KL𝐱𝐲\hat{\theta}=\widehat{D_{\text{KL}}}(\mathbf{x},\mathbf{y})over^ start_ARG italic_θ end_ARG = over^ start_ARG italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT end_ARG ( bold_x , bold_y ) is the estimator (4). We chose b=n2/3𝑏superscript𝑛23b=n^{2/3}italic_b = italic_n start_POSTSUPERSCRIPT 2 / 3 end_POSTSUPERSCRIPT and large fixed s=1000𝑠1000s=1000italic_s = 1000 and m=10000𝑚10000m=10000italic_m = 10000. The convergence rate τnsubscript𝜏𝑛\tau_{n}italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT of this estimator has not been studied so far, but results are available for a non-bias-corrected 1-nearest neighbour density estimator. Depending on the regularity of p𝑝pitalic_p and q𝑞qitalic_q as well as the dimension d𝑑ditalic_d, a rate τn=nsubscript𝜏𝑛𝑛\tau_{n}=\sqrt{n}italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = square-root start_ARG italic_n end_ARG can be obtained (Noh et al., 2014; Singh and Póczos, 2016; Zhao and Lai, 2020). We confirmed this convergence rate empirically for the estimator (4) using the approach described in Politis et al. (1999, Ch. 8).

Simulation framework

All simulations were performed in R version 4.2.2 (R Core Team, 2022). The source code for reproducing the results, including the three datasets, is available on Zenodo (Hartung and Khatova, 2024).

3 Results

Non-Gaussian methods lead to a considerably better fit

We start by considering dataset NHANES-3. As can be seen in Fig. 1, the three covariates age, weight and height have a non-Gaussian dependency structure, which a parametric vine copula model (ParVine) is able to capture. Importantly, the clear visual improvement in goodness-of-fit of ParVine over GaussDist is not reflected in linear metrics: the mean and (co-)variance of GaussDist coincides with the respective empirical metric in the dataset (e.g., mean age ±plus-or-minus\pm± SD 35.0±21.1absentplus-or-minus35.021.1\approx 35.0\pm 21.1≈ 35.0 ± 21.1 years), while that of ParVine differs somewhat (mean age±plus-or-minus\pm± SD 32.3±21.0absentplus-or-minus32.321.0\approx 32.3\pm 21.0≈ 32.3 ± 21.0 years). In contrast, KL divergence is much lower for ParVine than for GaussDist (see also Table 2), as would be expected from graphics.

Similar non-Gaussian relationships are present in the higher-dimensional datasets NHANES-11 and ORGANWT, see Figures S2 and S3 (Appendix C). KL divergence estimates for all three datasets are summarized in Table 2. For the large and low-dimensional dataset NHANES-3, MICE performed best, followed by NonparVine and ParVine. In the high-dimensional dataset NHANES-11, MICE still had the best performance, but ParVine was better than NonparVine, showing that the two vine copula-based methods scale quite differently with the dataset dimension. Lastly, in the sparse dataset ORGANWT, ParVine showed better performance than MICE and NonparVine. In all three datasets, the two vine copula-based methods and MICE outperformed the simpler (e.g. Gaussian) approaches.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 1: Scatterplot matrix and 90% highest density regions (HDR), obtained by bivariate kernel density estimation, for dataset NHANES-3 and two fitted covariate distribution models, ParVine and GaussDist.
NHANES-3 NHANES-11 ORGANWT
GaussDist 1.12 (1.11–1.25)
IndepCop 1.30 (1.27–1.42) 1.72 (1.63–2.04) 3.62 (2.91–4.92)
GaussCop 0.99 (0.97–1.12) 1.45 (1.39–1.75) 2.73 (2.00–3.83)
ParVine 0.56 (0.52–0.66) 0.62 (0.51–0.85) 1.96 (1.16–2.77)
NonparVine 0.46 (0.42–0.57) 1.03 (0.95–1.27) 2.33 (1.61–3.24)
MICE 0.21 (0.17–0.30) 0.36 (0.25–0.58) 1.96 (1.77–3.27)
Table 2: Estimated Kullback-Leibler divergence (with 95% CI) for all 6 covariate models and 3 datasets. The lowest KL divergence estimate per dataset is highlighted in bold.

MICE, but not copula models, shows overfitting to the data

Due to the flexibility in the shapes that non-Gaussian covariate distribution models can describe, the question naturally arises whether they tend to overfit the data. We investigate this question by splitting the NHANES-3, NHANES-11 and ORGANWT dataset in half at random, fitting models on one half and comparing KL divergence to the fitted (training) vs. the non-fitted (test) data. As shown in Fig. 2, Gaussian and copula-based models have very similar KL divergence estimates on the two data splits. In contrast, the MICE method tends to overfit the data, with much lower KL divergence estimates on the training compared to the test data. To account for this difference in model behaviour, all simulations results shown in this article employ this data split and evaluate KL divergence on the test data only. While Fig. 2 only shows one data split, choosing other splits does not qualitatively change the results. Fig. 2 also illustrates that the estimator can take negative values. In practice, this would be interpreted as 0.

Refer to caption
Figure 2: Comparison of KL divergence estimates (with 95% CI) between training and test data, for all considered covariate distribution models and datasets.

All models tolerate a moderate degree of missing data

Both the copula framework and MICE can handle missing data, and we investigated the fit to the data for different degrees of missingness. Methods GaussDist and IndepCop were excluded from this analysis since GaussDist would need additional post-processing and IndepCop assumes covariates to be independent, which prohibits useful inference for missing values from the model. Two scenarios were considered:

  • NHANES-11-M: datasets with different fractions p𝑝pitalic_p of data missing completely at random were simulated. Of note, due to the high dimensionality of this dataset, the fraction of complete observations fcomplete=(1p)11subscript𝑓completesuperscript1𝑝11f_{\text{complete}}=(1-p)^{11}italic_f start_POSTSUBSCRIPT complete end_POSTSUBSCRIPT = ( 1 - italic_p ) start_POSTSUPERSCRIPT 11 end_POSTSUPERSCRIPT quickly decreases with p𝑝pitalic_p (e.g., fcomplete<0.02subscript𝑓complete0.02f_{\text{complete}}<0.02italic_f start_POSTSUBSCRIPT complete end_POSTSUBSCRIPT < 0.02 for p=0.3𝑝0.3p=0.3italic_p = 0.3).

  • ORGANWT-M: datasets with complete observations on sex, strain and body weight, but fractions p𝑝pitalic_p of missing organ weights are simulated. This corresponds to the common case where some covariates are much easier to measure than others.

Subsequently, covariate distribution models were fitted to the incomplete datasets, samples were generated from the estimated covariate distributions and KL divergence estimated to an independent (complete) test dataset. The results of this analysis are depicted in Fig. 3. All investigated methods were able to handle large fractions of missing data, with almost unchanged performance for up to p=0.3𝑝0.3p=0.3italic_p = 0.3 and mostly moderate increases in KL divergence thereafter.

Refer to caption
Figure 3: Comparison of KL divergence estimates (with 95% CI) for different fractions of missing data, for copula-based covariate distribution models (excluding IndepVine) and MICE, and two data scenarios.

Latent variables do not compromise model performance

The variables included during covariate distribution model development may not all be needed in a given application scenario. Thus, the question naturally arises how fitting a high-dimensional covariate distribution impacts on the fit to a subset of these covariates. To formally address this question, we split a covariate vector x𝑥xitalic_x into observed and latent covariates, x=(xo,xl)𝑥subscript𝑥𝑜subscript𝑥𝑙x=(x_{o},x_{l})italic_x = ( italic_x start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ), and investigate the KL divergence between the distribution of observed covariates and the joint covariate distribution model, subsequently marginalized over the latent covariates. We considered two scenarios:

  • NHANES-L: impact of including all variables in NHANES-11 while only those in NHANES-3 are of interest.

  • ORGANWT-L: impact of including body weight, sex and strain as latent variables when the variables of interest are the organ weights.

In both scenarios, the inclusion of latent variables only had a minor impact on estimated KL divergence (Fig. 4). Therefore, including additional covariates during development of a covariate model allows to increase its range of applicability without deteriorating model performance.

Refer to caption
Figure 4: Impact of latent variables in covariate distribution models on KL divergence estimates (with 95% CI). Since the multivariate Gaussian distribution model (GaussDist) can deal with the observed continuous covariates, but not with the discrete latent variables present in both datasets, the second error bar is missing for model GaussDist.

4 Conclusion

In this work, the performance of non-Gaussian covariate distribution models was evaluated systematically by means of KL divergence between the underlying distribution and the estimated covariate distribution model. Non-Gaussian methods showed considerably improved goodness-of-fit over Gaussian ones, and packages are available that greatly facilitate the use of these more advanced methods.

Parametric vine copulas (ParVine) and MICE performed best overall. When deciding between these two approaches, two main points need to be considered. First, MICE generally works better than ParVine in a rich data scenario, while ParVine performs as good as MICE for sparse data. The second aspect concerns data sharing: while parametric model fits can be shared easily (e.g., as a table) without having to disclose original data, MICE always relies on the underlying dataset. Therefore, if a dataset cannot be shared, e.g. for intellectual property reasons, a parametric vine copula model like ParVine still allows for a reproducible analysis and easier adoption by other researchers. The nonparametric vine copula model did not perform as well as one might have expected. While it performed better than ParVine in the low-dimensional and observation-rich dataset NHANES-3, the other nonparametric method, MICE, was even better. Also, increasing dimensionality affected the nonparametric vine copula method much more than the other methods.

All copula-based models and MICE were robust to quite large fractions of missing values and inclusion of latent variables. Therefore, building a high-dimensional and/or sparse dataset for covariate modelling does not invalidate the use of such models on a subset of covariates. Since different process models require different covariates as input and complete observations may be difficult to obtain, this is a desirable feature.

The MICE methodology is quite different to the other (distribution-based) models, which has several implications. First, MICE may produce (generally, a small fraction) of simulated values that are identical to observed sets of covariates since the imputed values are drawn from the set of observed values. In this respect, MICE behaves like a (more refined) bootstrapping procedure. Since nearest neighbour-based KL divergence estimator cannot deal with ties in the data, these were removed from the simulated covariate set. Also, sampling from observed values makes MICE much more prone to overfitting, which needs to be accounted for when comparing different models. Here, this was realized by a training/test data split approach.

KL divergence was chosen as a measure to quantify the information loss in an estimated covariate model compared to the data-generating process. While this measure is frequently used and has good theoretical properties, alternatives do exist. For example, other types of f𝑓fitalic_f-divergences, such as Hellinger distance or total variation distance, could be used (Rényi, 1961). An alternative of a different type would be the Wasserstein distance (Kantorovich, 1960), but it is not scale-invariant.

Sample-based KL divergence estimation, including uncertainty quantification and mixed continuous/discrete data, was rendered possible by combining nearest neighbour density estimation with subsampling and a reformulation conditional on the discrete covariates. While this approach extends already existing methods significantly, some caveats do apply. First, the conditional formulation does not scale well with the number of categories and discrete variables. Also, KL divergence estimation by nearest neighbour-based methods cannot deal with ties in the dataset. Such ties could appear at several levels, namely the data itself (which might be resampled for representativity reasons, as the NHANES data), the covariate distribution method (as it may happen in MICE), or resampling-based uncertainty quantification. In this work, all ties were avoided, either by the choice of method (subsampling instead of bootstrapping) or as a post-processing step (for NHANES and MICE). Jittering could be an alternative to this approach, but the jittering strength is another parameter which is expected to have a significant impact. Therefore, this option was not considered in the present work.

While the investigated covariate distribution models cover a broad spectrum of modelling practice, some additional approaches are worth mentioning. First, bootstrapping from the covariate set 𝐱𝐱\mathbf{x}bold_x is often done (Teutonico et al., 2015). Since all samples y(i)superscript𝑦𝑖y^{(i)}italic_y start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT produced this way have an identical counterpart in the dataset 𝐱𝐱\mathbf{x}bold_x, the KL divergence estimator cannot deal with this method – again, jittering would be an option. Although bootstrapping was excluded from our analysis, we did consider MICE, which has many similarities with bootstrapping (and, in addition, can readily deal with missing data). Second, generative machine learning models like diffusion models have been applied with success to the simulation of, e.g., realistic images (Cao et al., 2024). In principle, such methods could also be applied to covariate distribution modelling, but since covariate datasets are differently structured (non-spatial) and much more sparse when compared to image data, considerable adjustments would certainly be required for these methods, which were out of scope of this work.

Furthermore, all covariates were considered static, i.e. time-varying covariates were out of scope. First results on copula modelling for time-varying covariates were given in Zwep et al. (2023). Also, the impact of covariate distribution model misspecification on (process) model predictions was not investigated. While the question is of much relevance and could be investigated in future work, we decided to focus on the covariate distribution model itself, in order to sharpen the focus and not to add another model type (a process model) to the present study.

In summary, our findings demonstrate that non-Gaussian covariate distribution modelling can be successfully applied to realistic life science covariate datasets.

Acknowledgments

We thank Laura Zwep and Yushen Guo (University of Leiden, Netherlands) for fruitful discussions on copula modelling, as well as the Mathematical Modelling and Systems Biology group at University of Potsdam, Germany for helpful comments on the manuscript.

References

  • NHA (2009-2012) Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Questionnaire (or Examination Protocol, or Laboratory Protocol). Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention., 2009-2012. URL https://www.cdc.gov/nchs/nhanes/index.htm.
  • Bajaj et al. (2019) Gaurav Bajaj, Satyendra Suryawanshi, Amit Roy, and Manish Gupta. Evaluation of covariate effects on pharmacokinetics of monoclonal antibodies in oncology. British Journal of Clinical Pharmacology, 85(9):2045–2058, July 2019. ISSN 1365-2125. doi: 10.1111/bcp.13996.
  • Buuren and Groothuis-Oudshoorn (2011) Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 2011. ISSN 1548-7660. doi: 10.18637/jss.v045.i03.
  • Cao et al. (2024) Hanqun Cao, Cheng Tan, Zhangyang Gao, Yilun Xu, Guangyong Chen, Pheng-Ann Heng, and Stan Z. Li. A survey on generative diffusion models. IEEE Transactions on Knowledge and Data Engineering, pages 1–20, 2024. ISSN 2326-3865. doi: 10.1109/tkde.2024.3361474.
  • Czado (2019) Claudia Czado. Analyzing Dependent Data with Vine Copulas: A Practical Guide With R. Springer International Publishing, 2019. ISBN 9783030137854. doi: 10.1007/978-3-030-13785-4.
  • Geyer (2013) Charles J. Geyer. 5601 notes: The subsampling bootstrap, 2013. URL https://www.stat.umn.edu/geyer/5601/notes/sub.pdf.
  • Hamberg et al. (2010) A.-K. Hamberg, M Wadelius, J D Lindh, M L Dahl, R Padrini, P Deloukas, A Rane, and E N Jonsson. A Pharmacometric Model Describing the Relationship Between Warfarin Dose and INR Response With Respect to Variations in CYP2C9, VKORC1, and Age. Clinical Pharmacology &https://doi.org/10.1111/bcp.13996 Therapeutics, 87(6):727–734, April 2010. ISSN 1532-6535. doi: 10.1038/clpt.2010.37.
  • Hartung (2024) Niklas Hartung. kldest: Sample-Based Estimation of Kullback-Leibler Divergence, 2024. URL https://cran.r-project.org/package=kldest. R package version 1.0.0.
  • Hartung and Khatova (2024) Niklas Hartung and Aleksandra Khatova. Supplementary Code for article ”Information-theoretic evaluation of covariate distributions models”, 2024. URL https://doi.org/10.5281/zenodo.11656643. Zenodo (version 1.0.0).
  • Huisinga et al. (2012) W. Huisinga, A. Solms, L. Fronton, and S. Pilari. Modeling Interindividual Variability in Physiologically Based Pharmacokinetics and Its Link to Mechanistic Covariate Modeling. CPT: Pharmacometrics & Systems Pharmacology, 1(9):1–10, September 2012. ISSN 2163-8306. doi: 10.1038/psp.2012.3.
  • Jones and Rowland‐Yeo (2013) H. M. Jones and K. Rowland‐Yeo. Basic Concepts in Physiologically Based Pharmacokinetic Modeling in Drug Discovery and Development. CPT: Pharmacometrics & Systems Pharmacology, 2(8):1–12, August 2013. ISSN 2163-8306. doi: 10.1038/psp.2013.41.
  • Kantorovich (1960) L. V. Kantorovich. Mathematical methods of organizing and planning production. Management Science, 6(4):366–422, July 1960. ISSN 1526-5501. doi: 10.1287/mnsc.6.4.366.
  • Kullback and Leibler (1951) S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, March 1951. ISSN 0003-4851. doi: 10.1214/aoms/1177729694.
  • Little (1988) Roderick J. A. Little. Missing-Data Adjustments in Large Surveys. Journal of Business & Economic Statistics, 6(3):287–296, July 1988. ISSN 1537-2707. doi: 10.1080/07350015.1988.10509663.
  • Marxfeld et al. (2019) Heike Antje Marxfeld, Karin Küttler, Martina Dammann, Sibylle Gröters, and Bennard van Ravenzwaay. Body and organ weight data in 28-day toxicological studies in two mouse strains. Data in Brief, 27:104632, December 2019. ISSN 2352-3409. doi: 10.1016/j.dib.2019.104632.
  • Nagler and Vatter (2022) Thomas Nagler and Thibault Vatter. kde1d: Univariate Kernel Density Estimation, 2022. URL https://CRAN.R-project.org/package=kde1d. R package version 1.0.5.
  • Nagler and Vatter (2023) Thomas Nagler and Thibault Vatter. rvinecopulib: High Performance Algorithms for Vine Copula Modeling, 2023. URL https://CRAN.R-project.org/package=rvinecopulib. R package version 0.6.3.1.1.
  • Nagler et al. (2017) Thomas Nagler, Christian Schellhase, and Claudia Czado. Nonparametric estimation of simplified vine copula models: comparison of methods. Dependence Modeling, 5(1):99–120, January 2017. ISSN 2300-2298. doi: 10.1515/demo-2017-0007.
  • Noh et al. (2014) Yung-Kyun Noh, Masashi Sugiyama, Song Liu, Marthinus C. Plessis, Frank Chongwoo Park, and Daniel D. Lee. Bias Reduction and Metric Learning for Nearest-Neighbor Estimation of Kullback-Leibler Divergence. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 669–677, Reykjavik, Iceland, 22–25 Apr 2014. PMLR. URL https://proceedings.mlr.press/v33/noh14.html.
  • Perez-Cruz (2008) Fernando Perez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In 2008 IEEE International Symposium on Information Theory. IEEE, July 2008. doi: 10.1109/isit.2008.4595271.
  • Politis et al. (1999) D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer Series in Statistics. Springer New York, 1999. ISBN 9780387988542. URL https://books.google.de/books?id=nGu6rqjE6JoC.
  • Politis and Romano (1994) Dimitris N. Politis and Joseph P. Romano. Large Sample Confidence Regions Based on Subsamples under Minimal Assumptions. The Annals of Statistics, 22(4), December 1994. ISSN 0090-5364. doi: 10.1214/aos/1176325770.
  • Pruim (2015) Randall Pruim. NHANES: Data from the US National Health and Nutrition Examination Study, 2015. URL https://CRAN.R-project.org/package=NHANES. R package version 2.1.0.
  • R Core Team (2022) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2022. URL https://www.R-project.org/.
  • Rényi (1961) Alfréd Rényi. On Measures of Entropy and Information. Berkeley Symp. on Math. Statist. and Prob., pages 547–561, 1961.
  • Schervish (1995) Mark J. Schervish. Theory of Statistics. Springer New York, 1995. ISBN 9781461242505. doi: 10.1007/978-1-4612-4250-5.
  • Singh and Póczos (2016) Shashank Singh and Barnabás Póczos. Finite-Sample Analysis of Fixed-k Nearest NeighborDensity Functional Estimators. Conference on Neural Information Processing Systems, 2016.
  • Smania and Jonsson (2021) Giovanni Smania and E. Niclas Jonsson. Conditional distribution modeling as an alternative method for covariates simulation: Comparison with joint multivariate normal and bootstrap techniques. CPT: Pharmacometrics & Systems Pharmacology, 10(4):330–339, April 2021. ISSN 2163-8306. doi: 10.1002/psp4.12613.
  • Sumpter and Holford (2011) Anita L Sumpter and Nick H. G. Holford. Predicting weight using postmenstrual age – neonates to adults. Pediatric Anesthesia, 21(3):309–315, February 2011. ISSN 1460-9592. doi: 10.1111/j.1460-9592.2011.03534.x.
  • Tannenbaum et al. (2006) Stacey J. Tannenbaum, Nicholas H. G. Holford, Howard Lee, Carl C. Peck, and Diane R. Mould. Simulation of Correlated Continuous and Categorical Variables using a Single Multivariate Distribution. Journal of Pharmacokinetics and Pharmacodynamics, 33(6):773–794, October 2006. ISSN 1573-8744. doi: 10.1007/s10928-006-9033-1.
  • Teutonico et al. (2015) D. Teutonico, F. Musuamba, H. J. Maas, A. Facius, S. Yang, M. Danhof, and O. Della Pasqua. Generating Virtual Patients by Multivariate and Discrete Re-Sampling Techniques. Pharmaceutical Research, 32(10):3228–3237, May 2015. ISSN 1573-904X. doi: 10.1007/s11095-015-1699-x.
  • Wang et al. (2009) Qing Wang, Sanjeev R. Kulkarni, and Sergio Verdu. Divergence Estimation for Multidimensional Densities Via k-Nearest-Neighbor Distances. IEEE Transactions on Information Theory, 55(5):2392–2405, May 2009. ISSN 0018-9448. doi: 10.1109/tit.2009.2016060.
  • Zhao and Lai (2020) Puning Zhao and Lifeng Lai. Analysis of K Nearest Neighbor KL Divergence Estimation for Continuous Distributions. In 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, June 2020. doi: 10.1109/isit44484.2020.9174033.
  • Zwep et al. (2023) Laura B. Zwep, Tingjie Guo, Thomas Nagler, Catherijne A.J. Knibbe, Jacqueline J. Meulman, and J.G. Coen van Hasselt. Virtual patient simulation using copula modeling. Clinical Pharmacology & Therapeutics, November 2023. ISSN 1532-6535. doi: 10.1002/cpt.3099.

Appendix

Appendix A KL divergence estimation for combined continuous/discrete data

Here, we describe KL divergence estimation in the general case where some covariates are continuous and others discrete. Decomposing x=(xc,xd)𝑥subscript𝑥𝑐subscript𝑥𝑑x=(x_{c},x_{d})italic_x = ( italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) into the continuous and discrete parts xcsubscript𝑥𝑐x_{c}italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and xdsubscript𝑥𝑑x_{d}italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT, respectively, KL divergence between p𝑝pitalic_p and q𝑞qitalic_q is then given by

DKL(p||q):=xdlog(p(xc,xd)q(xc,xd))p(xc,xd)dxc.D_{\text{KL}}(p||q):=\sum\limits_{x_{d}}\int\log\left(\frac{p(x_{c},x_{d})}{q(% x_{c},x_{d})}\right)p(x_{c},x_{d})dx_{c}.italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_p | | italic_q ) := ∑ start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∫ roman_log ( divide start_ARG italic_p ( italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q ( italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) end_ARG ) italic_p ( italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) italic_d italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT . (6)

Conditioning on the discrete variables and reordering, we can rewrite DKL(p||q)D_{\text{KL}}(p||q)italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_p | | italic_q ) as

DKL(p||q)=xdp(xd)log(p(xc|xd)q(xc|xd))dxc+xdlog(p(xd)q(xd))p(xd),D_{\text{KL}}(p||q)=\sum\limits_{x_{d}}p(x_{d})\int\log\left(\frac{p(x_{c}|x_{% d})}{q(x_{c}|x_{d})}\right)dx_{c}+\sum\limits_{x_{d}}\log\left(\frac{p(x_{d})}% {q(x_{d})}\right)p(x_{d}),italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_p | | italic_q ) = ∑ start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ∫ roman_log ( divide start_ARG italic_p ( italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q ( italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) end_ARG ) italic_d italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log ( divide start_ARG italic_p ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) end_ARG ) italic_p ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) , (7)

known as the chain rule for KL divergence (Schervish, 1995, p.115f).

The integral in the first summand can be estimated on the stratified data 𝐱c|xdconditionalsubscript𝐱𝑐subscript𝑥𝑑\mathbf{x}_{c}|x_{d}bold_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT and 𝐲c|xdconditionalsubscript𝐲𝑐subscript𝑥𝑑\mathbf{y}_{c}|x_{d}bold_y start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT using the estimator (4) for continuous data described in the main text (where xc𝐱c|xd(xc,xd)𝐱subscript𝑥𝑐conditionalsubscript𝐱𝑐subscript𝑥𝑑subscript𝑥𝑐subscript𝑥𝑑𝐱x_{c}\in\mathbf{x}_{c}|x_{d}\Leftrightarrow(x_{c},x_{d})\in\mathbf{x}italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ∈ bold_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ⇔ ( italic_x start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ∈ bold_x). The terms p(xd)𝑝subscript𝑥𝑑p(x_{d})italic_p ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) and q(xd)𝑞subscript𝑥𝑑q(x_{d})italic_q ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) are estimated using relative frequencies of xdsubscript𝑥𝑑x_{d}italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT in 𝐱𝐱\mathbf{x}bold_x and 𝐲𝐲\mathbf{y}bold_y, respectively; for details, see Hartung (2024).

Appendix B Scale dependency of KL divergence estimates

While KL divergence is scale-invariant, the estimator (4) might differ between scales. In the main text, all KL divergence estimates are computed on the original data scale. We investigated a potential scale dependency by computing KL divergence on both the original scale (using samples 𝐱𝐱\mathbf{x}bold_x and 𝐲𝐲\mathbf{y}bold_y) and the uniform scale (using the transformed samples u^(𝐱)^𝑢𝐱\hat{u}(\mathbf{x})over^ start_ARG italic_u end_ARG ( bold_x ) and u^(𝐲)^𝑢𝐲\hat{u}(\mathbf{y})over^ start_ARG italic_u end_ARG ( bold_y ), cf. (2)) and found that the impact was very low for NHANES-3, low for NHANES-11 datasets, and at most moderate for the ORGANWT dataset (see Fig. S1).

Refer to caption
Figure S1: KL divergence estimates (with 95% CI) on original and uniform scale.

Appendix C Supplementary Figures

Refer to caption
Figure S2: Scatter plot matrix of the NHANES-11 data. Abbreviations [units]: Age [years]; height [cm]; BW, body weight [kg]; Pulse, heart rate [beats/minute]; BPSysAve, average systolic blood pressure [mmHg]; BPDiaAve, average diastolic blood pressure [mmHg]; Testosterone, testosterone concentrations [ng/dL]; TotChol, total cholesterol [mmol/L]; UrinVol1, urine volume [mL]; UrineFlow1, urine flow rate [mL/min].
Refer to caption
Figure S3: Scatter plot matrix of the ORGANWT data. Abbreviations [units]: BW, body weight [g]; Brain, brain organ weight [g]; Heart, heart organ weight [g]; Kidney, kidney organ weight [g]; Liver, liver organ weight [g]; Spleen, spleen organ weight [g].