Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios

Patricia A. Apellániz
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
patricia.alonsod@upm.es
Ana Jiménez∗
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
ana.jimenezb@upm.es
Borja Arroyo Galende
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
borja.arroyog@upm.es
Juan Parras
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
j.parras@upm.es
Santiago Zazo
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
santiago.zazo@upm.es
These authors contributed equally.

Abstract

While synthetic tabular data generation using Deep Generative Models (DGMs) offers a compelling solution to data scarcity and privacy concerns, the effectiveness of these models relies on substantial training data, which are often unavailable in real-world applications. This paper addresses this challenge by proposing a novel methodology for generating realistic and reliable synthetic tabular data with DGMs in limited real-data environments. Our approach proposes several ways to generate an artificial inductive bias in a DGM through transfer learning and meta-learning techniques. We explore and compare four different methods within this framework, demonstrating that transfer learning strategies, such as pre-training and model averaging, outperform meta-learning approaches, such as Model-Agnostic Meta-Learning and Domain Randomized Search. We validate our approach using two state-of-the-art DGMs, namely a Variational Autoencoder and a Generative Adversarial Network, and show that our artificial inductive bias improves synthetic data quality, as measured by Jensen-Shannon divergence, with relative gains of up to 50%. This methodology has broad applicability in various DGMs and machine learning tasks, particularly in areas like healthcare and finance, where data scarcity is often a critical issue.

Keywords Tabular Data $\cdot$ Deep Generative Model $\cdot$ Inductive Bias $\cdot$ Transfer Learning $\cdot$ Meta-Learning

1 Introduction

Recent advances in deep learning have led to the development of powerful Deep Generative Models (DGMs). These models excel at learning and representing complex, high-dimensional distributions, allowing them to sample new data points realistically. This capability has driven remarkable progress in various domains, including image generation [1], text generation [2], and video generation [3].

Tabular data have emerged as a data format of increasing interest within DGMs. Structured in a familiar spreadsheet-like format with rows and columns, tabular data are a fundamental cornerstone for information storage and analysis across diverse fields. Researchers are actively investigating techniques for generating realistic and informative tabular data, as evidenced by the growing body of work in this area [4, 5, 6]. Generative Adversarial Networks (GANs) are currently the most prominent architecture for tabular data generation. Within this domain, CTGAN [7] is a widely recognized model. However, the exploration of alternative methodologies is ongoing, as evidenced by surveys [8] that highlight diverse approaches [9, 10, 11]. Beyond GANs, promising approaches include Variational Autoencoders (VAEs) [12], methods based on diffusion models [13], and techniques that leverage language models [14], among others. Focusing on VAEs for tabular data generation, TVAE [7] stands as a foundational model. TVAE encodes tabular data into a latent space, a lower-dimensional representation that captures the essential features of the data. By sampling from this latent space and subsequently decoding the samples, TVAE facilitates the generation of new data points in the original tabular format. A recent advancement [15] builds upon TVAE by incorporating a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This approach outperforms state-of-the-art models, including CTGAN and TVAE, on different validation metrics. The superior performance comes from the BGM integration, which enables high-quality sampling from the VAE’s latent space. This addresses a limitation of TVAE, which assumes a Gaussian latent space for generation, an assumption that may not hold in practice [16]. As real-world data often deviate from this distribution, the BGM integration leads to improved performance.

The importance of sufficient data for DGM training cannot be overstated. Studies using popular DGMs, such as the CTGAN, use datasets ranging from 10k to 481k training instances. This contrasts starkly with the practices observed in numerous domains. For example, well-established datasets used to evaluate rudimentary models include the Iris dataset with 150 samples [17] and the Boston House Prices dataset with 506 instances [18]. Even within the realm of medical research, valuable datasets such as the breast cancer dataset encompass only around 300 patients [19]. Smaller datasets pose challenges for DGM training, including overfitting and difficulty in assessing the quality of generated data [20].

A critical challenge associated with using DGMs for tabular data generation lies in ensuring the quality and effectiveness of synthetic data. While standardized metrics exist for image [21, 22] and text data [23, 24], measuring the quality of synthetic tabular data presents unique challenges. Studies employ various metrics, including pairwise correlation difference, support coverage, likelihood fitness, and other statistics as described in [25]. However, a consistent method for holistic evaluation is lacking. Divergences, which quantify the discrepancy between probability distributions, offer a promising avenue for validation [20]. They can capture the overall differences between real and synthetic data by considering the joint distribution of all attributes. However, modeling joint distributions presents a trade-off between computational cost and accuracy. Large datasets, especially those with high dimensionality, require significant computational resources. With sufficient resources, accurate results can be achieved. In contrast, smaller datasets, common in real-world applications, present a challenge to accuracy. Limited data may hinder the capture of complex variable relationships, leading to models with poor generalization to unseen data. Consequently, even computationally efficient methods for joint distribution modeling can yield inaccurate results in small data settings.

The limitations of current validation techniques further compound the inherent limitations of small datasets. These techniques often focus on comparing synthetic data with real data used for training, failing to account for the limited scope of information on which the DGM was trained. This can lead to a false sense of security, where the synthetic data appear similar to the training data but may not generalize well. Deep learning models with high parameter counts are susceptible to overfitting too, especially on small datasets. Additionally, small datasets might not capture the full spectrum of real-world variations. Consequently, the synthetic data generated may not be representative of the underlying distribution, affecting its effectiveness for different tasks. Furthermore, smaller datasets are more prone to the influence of noise (random errors) and bias (systematic skews). These mislead DGMs into learning incorrect patterns, ultimately resulting in unrealistic synthetic data.

This work addresses the critical challenge of generating reliable synthetic tabular data from limited datasets, a prevalent scenario in many real-world applications. Traditional divergence metrics often struggle in these situations, which can lead to inaccurate assessments of the quality of synthetic data [20]. We propose a novel methodology specifically designed to address this issue by introducing a framework that leverages inductive biases to improve the performance of DGMs in small dataset environments. Inductive biases [26] are inherent preferences or assumptions built into a learning model. These biases can guide the learning process and improve model performance, particularly when data are scarce. Traditionally, inductive biases are introduced through domain knowledge or specific architectural choices. This work proposes an alternative approach that leverages the variability often found in the DGM training process to generate inductive biases through different learning techniques. As an example, we use VAEs, as they exhibit inherent variability between training seeds, allowing them to capture various aspects of the data during training, but we note that our approach could be used with other DGMs. We exploit this variability by employing various transfer learning and meta-learning techniques to generate the inductive bias, ultimately leading to improved synthetic data generation. Our contributions are threefold:

• We propose a novel generation methodology for synthetic tabular data generated by DGMs in a small dataset environment. This methodology leverages inductive bias generation through transfer learning and meta-learning techniques to achieve a more reliable generation process.
• We propose four different techniques (pre-training, model averaging, Model-Agnostic Meta-Learning (MAML), and Domain-Randomized Search (DRS)) to generate the inductive bias. We demonstrate the efficacy of our proposal using a common DGM, the VAE, and for the pre-training process, we also assess the performance with a CTGAN architecture, another common DGM.
• We demonstrate the effectiveness of our proposed methodology through extensive experiments on benchmark datasets. We provide a comprehensive analysis of the results using established divergence metrics such as Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences.

This contribution offers a significant advancement in the field of synthetic data generation. By enabling improved data quality in resource-constrained settings, our approach has the potential to broaden the applicability of synthetic tabular data across various disciplines.

2 Methodology: Generating Artificial Inductive Bias

Let us assume that we have a tabular dataset composed of $N$ entries $\{x_{r}^{i}\}_{i=1}^{N}$, where $N$ is the number of samples available and each entry $x_{r}^{i}$ has $C$ features. In other words, $C$ represents the number of attributes associated with each data point. Let us also define a DGM as a high-dimensional probability distribution $p_{\theta}$, where $\theta$ represents the learnable parameters of the model. The objective of the DGM is to learn a representation, $p_{\theta}$, that closely approximates the true underlying data distribution, denoted by $p(x_{r})$. Once trained, the DGM can generate new synthetic samples $x_{g}$ by drawing from its learned distribution:

$$x_{g}\sim p_{\theta}. \tag{1}$$

Ideally, a well-trained DGM should produce synthetic data $x_{g}$ that are statistically indistinguishable from real data $x_{r}$ .

In the prevalent big data setting, characterized by a large number of training samples ( $N\gg C$ ), DGMs with sufficient complexity can effectively capture the underlying data distribution $p(x_{r})$ . This is evidenced by the impressive results achieved in recent research, where high-dimensional synthetic samples are generated using vast amounts of training data [7, 1, 2, 3]. However, for scenarios with limited training data, which is common in tabular domains, DGMs struggle to accurately represent the complex inter-feature relationships. Consequently, the synthetic samples generated $x_{g}$ deviate significantly from the true data distribution $p(x_{r})$ , leading to high KL and JS divergences between real and synthetic data.
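To make the two divergences mentioned above concrete, the following minimal sketch computes them for discrete distributions given as histograms. It only illustrates the textbook definitions and is not the estimator used in the experiments of this paper; the function names and the example histograms are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of p and q
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical histograms of one categorical feature (values are made up).
p_real = [0.50, 0.30, 0.20]
p_synth = [0.40, 0.40, 0.20]

js = js_divergence(p_real, p_synth)
```

Unlike the KL divergence, the JS divergence stays finite when the two distributions have disjoint support, which is one reason it is convenient for comparing real and synthetic samples.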

To address this challenge, we propose an approach that uses artificially generated inductive biases. Figure 1 illustrates the overall architecture. In the standard big data setting, a DGM $p_{\theta}$ is directly trained using real data $x_{r}$ generating high-quality synthetic data $x_{g}\sim p_{\theta}$ . However, when the number of real samples $N$ is limited, the quality of the generated data $x_{g}$ deteriorates. To mitigate this issue, we introduce an artificial inductive bias generator. This module takes the initial synthetic data $x_{g}$ as input and outputs an initial set of weights $\theta_{0}$ . These weights are then used as the inductive bias to train a second DGM $p_{\hat{\theta}}$ using real data $x_{r}$ . This second DGM generates a new set of synthetic samples, $\hat{x}_{g}$ . Notably, the only distinction between $p_{\theta}$ and $p_{\hat{\theta}}$ lies in the initial weights: $p_{\hat{\theta}}$ leverages the inductive bias encoded in $\theta_{0}$ to potentially achieve faster convergence to a distribution that better resembles $p(x_{r})$ , while $p_{\theta}$ begins training with random weights. As our simulations will demonstrate, this seemingly minor difference translates into significant improvements in the quality of the generated synthetic data.

Figure 1: Block diagram for the proposed architecture. In a standard big data setting, the first generative model $p_{\theta}$ generates good enough samples. However, in cases where large amounts of data are not available ($N$ is limited), we propose to use the data generated by the first DGM as input to an artificial inductive bias generator, which in turn provides a set of initial weights $\theta_{0}$ for a DGM that contains the artificially generated inductive bias. These initial weights are then used to train a second DGM $p_{\hat{\theta}}$ using real data $x_{r}$, generating a second set of synthetic data $\hat{x}_{g}$, which is of higher quality than the synthetic data $x_{g}$.

The proposed approach hinges on two key concepts: the importance of inductive biases and the feasibility of their artificial generation. The importance of inductive biases in supervised learning is well established. The no-free-lunch theorems state that a universally optimal learner does not exist. Consequently, specific learning biases can lead to substantial performance gains for particular problem domains (see [27] and the references therein). Convolutional Neural Networks (CNNs) exemplify this principle. Their inherent inductive bias, the fact that the image information possesses spatial correlation, makes them the preferred architecture for image processing tasks. Similarly, as highlighted in [26], the use of inductive biases is a cornerstone of Deep Learning’s success. In scenarios with limited training data, regularizers are commonly employed as inductive biases to prevent overfitting. This underscores the dual role of inductive biases: not only do they contribute to Deep Learning’s effectiveness, but they are also crucial in preventing overfitting. However, effective use of inductive biases is often contingent on having specific knowledge about the problem at hand. In the aforementioned example of CNNs, we inherently understand the existence of spatial correlation in images. However, in tabular data, this domain-specific knowledge is often scarce. To address this challenge, recent efforts have focused on designing large models trained on artificially generated data as inductive biases. The underlying hope is that the actual problem to be solved exhibits similarities to those encountered during training of the large model (e.g., [28] and [29]).

Therefore, our proposed approach departs from existing methods for incorporating inductive biases in synthetic tabular data generation. Unlike the brute-force approach employed in [29], we use data generated by a potentially low-quality DGM $p_{\theta}$. This strategy aims to obtain an initial set of weights $\theta_{0}$, which act as an inductive bias. Ideally, these weights should guide the model to a region of the parameter space that facilitates convergence towards a high-quality solution. In particular, we assess our ideas using a state-of-the-art VAE architecture for the DGM, although we hypothesize that similar results could be achieved with other DGM architectures. Due to its demonstrated superiority against other leading models, we will use the architecture proposed in [15]. VAEs are known to be sensitive to the initial random conditions (seeds) used during training. This dependence on seeds requires training with multiple seeds and selecting the one(s) that exhibit the best performance based on a chosen metric, such as minimum validation loss. The remaining runs, often discarded, may still contain valuable problem-specific information despite not achieving optimal solutions according to traditional metrics. Our key idea lies in exploiting this potentially informative content from discarded VAE runs to create an artificial inductive bias for the final DGM trained with real data.

The following subsections explore two distinct paradigms for generating the initial set of weights, $\theta_{0}$ : transfer learning and meta-learning. Transfer learning techniques encompass pre-training and model averaging, while meta-learning techniques include MAML and DRS. Pre-training offers a versatile approach applicable to any DGM architecture, regardless of inherent characteristics. In contrast, model averaging and meta-learning techniques are particularly well suited for VAEs trained with multiple seeds due to their inherent variability in learned representations. Consequently, we will evaluate the latter two techniques within the chosen VAE architecture. Additionally, to assess the efficacy of pre-training across different DGM architectures, we will compare its performance on the CTGAN.

2.1 Transfer learning

Transfer learning is a machine learning paradigm that leverages knowledge acquired from a context domain (also called the source domain) to enhance learning performance in a new target domain [30]. This approach aims to improve the learning process in the target domain by capitalizing on the knowledge gained from solving related tasks in the context domain. This technique has demonstrated its efficacy in fields where data scarcity is a common challenge, such as the medical field [31].

Formally, based on the definition in [30], we can define a domain $\mathcal{D}$ by a feature space $\mathcal{X}$ and a marginal probability distribution $p(x_{r})$ . Two domains are considered distinct if their feature spaces $\mathcal{X}_{1},\mathcal{X}_{2}$ or marginal probability distributions $p(x_{1}),p(x_{2})$ differ, i.e., if $\mathcal{X}_{1}\neq\mathcal{X}_{2}$ or $p(x_{1})\neq p(x_{2})$ . The core objective of transfer learning is to leverage the knowledge learned in a context domain $\mathcal{D}_{context}$ to improve learning in a target domain $\mathcal{D}_{target}$ . This is typically achieved when the context and target domains differ, i.e., $\mathcal{D}_{context}\neq\mathcal{D}_{target}$ .

Our work focuses on a scenario where the context domain $\mathcal{D}_{context}$ consists of data $x_{g}$ generated by a DGM. The target domain $\mathcal{D}_{target}$ , on the other hand, consists of $x_{r}$ . Our approach leverages the representational power learned by the DGM $p_{\theta}$ on $x_{g}$ to provide a strong starting point for learning in the target domain with real data $x_{r}$ . This transfer of knowledge is achieved by initializing the model weights for the target domain with the weights learned from the DGM model trained on the generated data.

Transfer learning can be categorized into homogeneous and heterogeneous settings based on the feature spaces of the domains [32]. Homogeneous transfer learning applies when the context and target domains share the same feature space $\mathcal{X}_{1}=\mathcal{X}_{2}$ , while heterogeneous transfer learning deals with scenarios where feature spaces differ $\mathcal{X}_{1}\neq\mathcal{X}_{2}$ . This work focuses on homogeneous transfer learning, where the context domain serves as an augmented version of the target domain. The key difference between the domains in our case lies in the number of samples, leading to situations where the empirical distributions of the data differ, i.e., $p_{\theta}(x_{g})\neq p(x_{r})$ .

Within homogeneous transfer learning, various methodologies exist to improve target task performance by capitalizing on knowledge from a related source domain. These techniques encompass instance-based [33], relational knowledge transfer [34], feature-based [35], and, as employed in this work, parameter-based [36] transfer through shared model parameters or hyperparameter distributions. This study leverages a two-stage parameter-based transfer learning approach. The first stage involves either pre-training or model averaging, followed by fine-tuning in the second stage. Subsequent sections will delve deeper into both pre-training and model averaging techniques. Upon completion of one of these initial phases, fine-tuning serves to refine the model parameters, ultimately achieving optimal adaptation for the target domain.

2.1.1 Pre-training

Pre-training is a frequently adopted strategy for introducing an inductive bias into a model. By leveraging a pre-trained model on a context domain, the target model gains generalizable features that enhance its performance on a target domain. However, while pre-training is a standard in computer vision and natural language processing, achieving similar success with tabular data remains a challenge. This disparity arises from the inherent heterogeneity of the features of the tables, which creates substantial feature space shifts between pre-training and downstream datasets, hindering effective knowledge transfer. Despite these challenges, recent efforts like [37] and [38] explore tabular transfer learning with promising results. Although these studies demonstrate potential, achieving comprehensive parameter transfer in tabular data requires further research to establish best practices and unlock the full potential of pre-training in this domain.

In this work, pre-training involves the following steps. First, we train a separate DGM $p_{\theta_{pt}}$ using synthetic data $x_{g}$ as training data. Since $x_{g}$ is sampled from the initial DGM, $x_{g}\sim p_{\theta}$ , we can generate a vast amount of synthetic data. This abundance circumvents the limitations associated with training in small datasets, such as overfitting. Then, the optimal weights $\theta_{pt}^{*}$ from DGM $p_{\theta_{pt}}$ are used as initial weights $\theta_{0}$ to train the generative model $p_{\hat{\theta}}$ (see Figure 1).

In essence, our approach aligns with the well-established concept of data augmentation. We generate synthetic data $x_{g}$, which may not perfectly capture the intricacies of the original data $x_{r}$. However, we use these synthetic data to train another DGM $p_{\theta_{pt}}$. Although $p_{\theta_{pt}}$ might generate lower quality synthetic samples, our objective is to exploit the information encoded within this DGM to establish an initial set of weights for the DGM that will eventually be trained on $x_{r}$. In other words, we exploit the knowledge of the generative model $p_{\theta_{pt}}$, the context domain, to obtain a better generative model $p_{\hat{\theta}}$, which is our target domain. Figure 2 provides a visual representation of this pre-training procedure.
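The control flow of the pre-training procedure can be sketched as follows. This is a toy illustration under strong assumptions: a one-dimensional Gaussian fitted by gradient descent stands in for the DGM, its two parameters stand in for the network weights, and the helper names (`train`, `sample`, `nll`), the dataset, and all hyperparameters are hypothetical. It is not the VAE or CTGAN implementation used in this work.

```python
import math
import random


def train(data, init, lr=0.1, steps=500):
    """Gradient descent on the mean negative log-likelihood of a 1-D Gaussian.

    The parameters (mu, log_sigma) stand in for the DGM weights.
    """
    mu, log_sigma = init
    n = len(data)
    for _ in range(steps):
        var = math.exp(2.0 * log_sigma)
        grad_mu = sum(mu - x for x in data) / (n * var)
        grad_ls = 1.0 - sum((x - mu) ** 2 for x in data) / (n * var)
        mu -= lr * grad_mu
        log_sigma -= lr * grad_ls
    return mu, log_sigma


def sample(params, n, rng):
    """Draw n samples from the toy model."""
    mu, log_sigma = params
    return [rng.gauss(mu, math.exp(log_sigma)) for _ in range(n)]


def nll(data, params):
    """Mean negative log-likelihood of data under the toy model."""
    mu, log_sigma = params
    var = math.exp(2.0 * log_sigma)
    const = 0.5 * math.log(2.0 * math.pi)
    return sum(const + log_sigma + (x - mu) ** 2 / (2.0 * var) for x in data) / len(data)


rng = random.Random(0)
random_init = (0.0, 0.0)                         # stands in for random initial weights
x_r = [rng.gauss(5.0, 2.0) for _ in range(20)]   # small real dataset (made up)

theta = train(x_r, random_init)                  # first DGM p_theta, trained on x_r
x_g = sample(theta, 2000, rng)                   # abundant synthetic data x_g ~ p_theta
theta_pt = train(x_g, random_init)               # pre-training on x_g only
theta_0 = theta_pt                               # artificial inductive bias
theta_hat = train(x_r, theta_0, steps=50)        # fine-tuning on real data from theta_0
```

The only difference between the first and the last call to `train` is the initial parameter vector, which mirrors the distinction between $p_{\theta}$ and $p_{\hat{\theta}}$ described in Section 2.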

2.1.2 Model Averaging

The concept of model averaging emerged in the 1960s, primarily within the field of economics [39, 40]. Traditional empirical research often selects a single “best” model after searching a wide space of possibilities. However, this approach can underestimate the real uncertainty and lead to overly confident conclusions. Model averaging offers a compelling alternative. By combining multiple models, the resulting ensemble can outperform any individual model. This approach aligns with the core principles of statistical modeling: maximizing information use and balancing flexibility against overfitting. In essence, model averaging extends the concept of model selection by leveraging insights from all the models considered.

While pre-training can be incorporated with any DGM, our approach focuses on models where the training process is sensitive to initial conditions, such as VAEs. In such cases, it is common to train the DGM $p_{\theta}$ with multiple initial conditions (seeds) and potentially discard “bad” seeds based on a specific metric. We propose using these discarded seeds to create an artificial inductive bias. The simplest implementation involves averaging the model parameters. In this case, our context domains are the different results of each seed, and the target domain is obtained by averaging across the context domains. If we train $S$ different seeds for $p_{\theta}$, resulting in $S$ models with parameters $\theta_{s}$, we propose using the average of these weights as the inductive bias:

$$\theta_{0}=\frac{1}{S}\sum_{s=1}^{S}\theta_{s} \tag{2}$$

This straightforward approach is computationally efficient, requiring only the calculation of the average across the precomputed weights. It assumes that the average model may capture a robust inductive bias, leading to improved performance. Figure 3 summarizes this process.
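Equation 2 amounts to an element-wise mean over the parameter sets of the $S$ seeds. A minimal sketch follows, assuming each model is stored as a dictionary that maps a layer name to a flat list of floats; this storage format, the layer names, and the numeric values are hypothetical.

```python
def average_weights(models):
    """Element-wise mean of S parameter sets (Equation 2).

    Each model is a dict mapping a layer name to a flat list of floats.
    All models must share the same keys and the same list lengths.
    """
    num_models = len(models)
    return {
        name: [sum(vals) / num_models for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


# Made-up weights from S = 3 seeds of the same architecture.
seeds = [
    {"encoder": [1.0, -2.0], "decoder": [0.5]},
    {"encoder": [2.0, -1.0], "decoder": [1.5]},
    {"encoder": [3.0, 0.0], "decoder": [1.0]},
]

theta_0 = average_weights(seeds)
# theta_0 == {"encoder": [2.0, -1.0], "decoder": [1.0]}
```

The averaged dictionary would then be loaded as the initial weights $\theta_{0}$ of the DGM that is fine-tuned on the real data.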

2.2 Meta-learning

Traditional machine learning models often rely on large volumes of data to achieve optimal performance in specific tasks. In contrast, meta-learning introduces a distinct paradigm by training algorithms with the ability to “learn to learn” [41], enabling them to rapidly adapt to new tasks with minimal data. This departure from the conventional requirement of extensive datasets for each new task allows meta-learning algorithms to leverage knowledge gained from addressing numerous related tasks. Through introspective analysis of past experiences, these models dynamically adjust their learning strategies when confronted with novel situations, making them more efficient learners that require less data to perform well on tasks with similar characteristics.

In this work, we exploit the multi-seed training configuration of certain DGMs. By treating each of the $S$ different seeds obtained after training the DGM as a distinct task, we construct a meta-learning framework.

2.2.1 MAML

MAML is a prevalent approach within the field of meta-learning [42]. It identifies the initial set of weights denoted by $\theta_{MAML}$ by leveraging various tasks, enabling rapid and data-efficient adaptation to new tasks. This efficiency comes from the ability to fine-tune $\theta_{MAML}$ with minimal data for each new task. However, successful application of MAML requires access to a diverse collection of tasks for effective learning.

Formally, we can frame the problem by starting with a common single-task learning scenario and transforming it into the meta-learning framework. Consider a task $\mathcal{T}$ that consists of an input $x$ sampled from a probability distribution $\mathcal{D}$ . For simplicity, we define a task instance $\mathcal{T}$ as a tuple comprising a dataset $\mathcal{D}$ and its corresponding loss function $\mathcal{L}$ . To solve the task $\mathcal{T}$ we need to obtain an optimal model parameterized by a task-specific parameter $\omega^{*}$ , which minimizes a loss function $\mathcal{L}$ on the data of the task as follows:

$$\omega^{*}=\arg\min_{\omega}\underset{x\sim\mathcal{D}}{\mathbb{E}}\big[\mathcal{L}(\mathcal{D};\omega)\big]. \tag{3}$$

In single-task learning, hyperparameter optimization is achieved by splitting the dataset $\mathcal{D}$ into two disjoint subsets $\mathcal{D}=\mathcal{D}^{(t)}\cup\mathcal{D}^{(v)}$ , which are the training and validation sets, respectively. The meta-learning setting aims to develop a general-purpose learning algorithm that excels across a distribution of tasks represented by $p(\mathcal{T})$ [43]. The objective is to use training tasks to train a meta-learning model $\theta_{MAML}$ that can be fine-tuned to obtain $\omega$ to perform well on unseen tasks sampled from the same task environment $p(\mathcal{T})$ . Meta-learning methods utilize meta-parameters to model the common latent structure of the task distribution $p(\mathcal{T})$ . Therefore, we consider meta-learning as an extension of hyperparameter optimization, where the hyperparameter of interest – often called a meta-parameter – is shared across many tasks.

In this work, the distribution of tasks is defined by the set of $S$ training seeds obtained after training the DGM. Given a set of $S$ training seeds following $p(\mathcal{T})$ , each task $\mathcal{T}\sim p(\mathcal{T})$ is therefore formalized as $\mathcal{T}=\{\mathcal{D},\mathcal{L}\}$ . Here, each dataset $\mathcal{D}$ consists of synthetic data points $x_{g}^{s}$ drawn from the model for the different training seeds. The loss function $\mathcal{L}$ corresponds to the DGM loss function. The specific form of $\mathcal{L}$ depends on the chosen DGM. If the chosen DGM is a VAE the loss function $\mathcal{L}$ would be the negative of the Evidence Lower BOund (ELBO) [44]. In contrast, if a GAN is used, the loss function $\mathcal{L}$ would be the minimax loss function arising from the interplay between the generator and discriminator networks [45]. It is important to note that both VAEs and GANs use two neural networks within their architecture, different from the single network architectures commonly found in state-of-the-art applications [46, 47].
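For reference, the textbook forms of the two losses mentioned above can be written as follows. The symbols $q_{\phi}$ (encoder), $p_{\psi}$ (decoder), $p(z)$ (latent prior), $G$ (generator), and $D$ (discriminator) are notation assumed here for illustration; the exact losses used by the architecture in [15] and by CTGAN [7] may contain additional or modified terms.

```latex
% Negative ELBO for a single sample x (VAE)
\mathcal{L}_{VAE}(x) = -\,\mathbb{E}_{q_{\phi}(z\mid x)}\big[\log p_{\psi}(x\mid z)\big]
                       + \mathrm{KL}\big(q_{\phi}(z\mid x)\,\|\,p(z)\big)

% Minimax objective (GAN)
\min_{G}\max_{D}\;
  \mathbb{E}_{x\sim p(x_{r})}\big[\log D(x)\big]
  + \mathbb{E}_{z\sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```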

Solving this problem using the MAML approach requires access to a collection of $B$ tasks sampled from $p(\mathcal{T})$. We denote this set of tasks $\mathcal{T}_{b}$ used for training as $\mathcal{D}_{b}=\{(\mathcal{D}_{b}^{(t)},\mathcal{D}^{(v)}_{b})\}_{b=1}^{B}$, where each task $b$ has dedicated meta-training and meta-validation data, respectively. The goal of meta-training is to find the optimal $\omega^{*}_{b}$ for each task $b$ given $\theta_{MAML}$. This $\theta_{MAML}$ essentially captures the ability to learn effectively from new data. In this context, the task-related parameter $\omega_{b}$ denotes the parameters of the two different networks that comprise the VAE, i.e., the task-specific parameters of the encoder and the decoder. After meta-training, the learned $\omega^{*}_{b}$ is used to guide the training of a base model $\theta_{MAML}$. This procedure is called meta-testing. This essentially means that the model leverages the knowledge gained from previous tasks to improve the efficiency of learning on new tasks. This can be viewed as a bi-level optimization problem [48]:

$$\begin{split}\min_{\theta_{MAML}}\underset{\mathcal{T}_{b}\sim p(\mathcal{T})}{\mathbb{E}}\left[\underset{x_{g_{b}}^{(v)}\sim\mathcal{D}^{(v)}_{b}}{\mathbb{E}}\big[\mathcal{L}_{b}(\mathcal{D}^{(v)}_{b};\omega^{*}_{b}(\theta_{MAML}))\big]\right]\\ \textrm{s.t. }\omega^{*}_{b}(\theta_{MAML})=\arg\min_{\omega_{b}}\underset{x_{g_{b}}^{(t)}\sim\mathcal{D}^{(t)}_{b}}{\mathbb{E}}\big[\mathcal{L}_{b}(\mathcal{D}^{(t)}_{b};\omega_{b}(\theta_{MAML}))\big].\end{split} \tag{4}$$

This equation essentially minimizes the expected loss across all tasks on the meta-validation sets, subject to the constraint that for each task, the task-specific parameter $\omega$ is optimized on the corresponding meta-training data.

Since in our work we update the parameters using gradient descent, we can reformulate Equation 4 as follows:

$$\omega_{b}\leftarrow\theta-\alpha\nabla_{\omega_{b}}\mathcal{L}_{b}(\mathcal{D}^{(t)}_{b};\omega_{b}) \tag{5}$$
$$\theta_{MAML}\leftarrow\theta_{MAML}-\gamma\nabla_{\theta_{MAML}}\sum_{b=1}^{B}\mathcal{L}_{b}(\mathcal{D}^{(v)}_{b};\theta_{MAML}). \tag{6}$$

Here, $\alpha$ and $\gamma$ represent the learning rates for the inner and outer loops, respectively. The inner loop updates the task-specific parameters $\omega_{b}$ for each task $b$ using the gradient of the loss function $\mathcal{L}_{b}$ on the meta-training data. The outer loop updates the meta-parameters $\theta_{MAML}$ based on the accumulated meta-validation loss across all tasks.

Figure 4 illustrates the integration of the MAML procedure within the framework of our proposed methodology. In this context, the task space denoted by $p(\mathcal{T})$ corresponds to the various seeds $S$ obtained during the training process. Essentially, the task space encompasses the different probability distributions $p_{\theta_{s}}$ associated with each training seed. Ultimately, the meta-training steps lead to the identification of the desired parameters, denoted by $\theta_{MAML}$ . Note that what $\theta_{MAML}$ represents is a set of parameters that adapt fast to new data; in our case, it means that the DGM initial parameters are chosen so that they adapt fast to generate real data.

2.2.2 DRS

Although MAML offers the potential to leverage the underlying structure of learning problems through a powerful optimization framework, it introduces a significant computational cost. A trade-off between accuracy and computational efficiency should therefore be sought, but there is no general approach for managing it: doing so requires an understanding of the domain-specific characteristics inherent to the meta-problem itself.

DRS presents an alternative meta-learning approach that circumvents the computational burden associated with bilevel optimization problems. Unlike MAML, DRS trains a model on the combined data from all tasks. This eliminates the need for the complex optimization procedures present in MAML, leading to a more computationally efficient solution. However, it is important to acknowledge that DRS offers an approximation to the ideal solution [49].

Formally, DRS focuses on the meta-information, denoted by $\theta_{meta}$ , as the initialization of an iterative optimizer used in a new meta-testing task, $\mathcal{T}_{S}$ . In this context of meta-learning initialization, a straightforward alternative involves solving the following pseudo-meta problem:

$$\theta_{DRS}=\arg\min_{\omega}\underset{\mathcal{T}_{S}\sim p(\mathcal{T})}{\mathbb{E}}\mathcal{L}(\mathcal{D}^{*};\omega).\tag{7}$$

In this context, $\mathcal{D^{*}}$ represents the aggregated synthetic data collection, $x_{g}$ , obtained across all training seeds $S$ . We refer to this approach as Domain-Randomized Search due to its alignment with the domain randomization method presented in [50] and its core principle of directly searching over a distribution of domains (tasks).

Figure 5 shows the application of the DRS procedure within the framework of our proposed methodology. As in the MAML case, $\theta_{DRS}$ serves as the initialization weights $\theta_{0}$ that we aim to identify.

Both MAML and DRS offer complementary approaches with a trade-off between modeling complexity and optimization cost [49]. DRS delivers an approximate solution with lower computational demands, while MAML offers higher precision at the expense of greater computational resources. DRS is also advantageous when dealing with a limited number of learning tasks. In our case, where data generated by each seed ( $s=1,2,...,S$ ) is considered a task, and $S$ typically takes values around $10$ , DRS is expected to provide better solutions than MAML, aligning with the findings of [49]. Finally, note that DRS is similar to the pre-training approach. While both techniques aim to improve model performance, they utilize data differently. Pre-training leverages data from the single best VAE seed, whereas DRS capitalizes on data from all VAE seeds. This distinction reflects the core principle of DRS: exploring a wider range of possibilities by searching across a distribution of domains (tasks) represented by the various seed variations.
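The difference in data usage between pre-training and DRS can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the synthetic samples are random placeholders, and the per-seed score used to pick the best seed (lower is better) is an assumption, since the selection criterion is not detailed in this section.

```python
import numpy as np


def pretraining_data(samples_per_seed, score_per_seed):
    """Pre-training: keep only the synthetic samples of the best seed.

    Assumption: score_per_seed holds one quality score per seed, lower is better.
    """
    best = int(np.argmin(score_per_seed))
    return samples_per_seed[best]


def drs_data(samples_per_seed):
    """DRS: pool the synthetic samples of all seeds into a single dataset D*."""
    return np.concatenate(samples_per_seed, axis=0)


rng = np.random.default_rng(0)
S = 10  # number of seeds, each one treated as a task
samples_per_seed = [rng.normal(size=(100, 4)) for _ in range(S)]  # placeholder x_g per seed
score_per_seed = rng.uniform(size=S)  # placeholder scores

d_pretrain = pretraining_data(samples_per_seed, score_per_seed)
d_star = drs_data(samples_per_seed)
print(d_pretrain.shape, d_star.shape)
```

In both cases a single DGM would then be trained on the resulting dataset to obtain the initialization weights; only the breadth of the training data changes.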

3 Experiments

3.1 Data

The experiments were carried out on four public datasets obtained from the SDV environment [51], which also implements various data generation models, including the CTGAN implementation we use. The experiment design prioritized datasets with a sufficient number of samples. This allows us to create multiple data splits for various configurations of the different training and validation parameters. This approach comprehensively evaluates the proposed method under different parameter settings.

• Adult: The Adult Census Income dataset [52] is a mixed-data dataset extracted from the 1994 U.S. Census. It comprises $32,561$ data points, each described by 14 features that encompass integer, categorical, and binary values. The dataset is used to predict whether an individual’s annual income exceeds \$50,000. It should be noted that the dataset contains 13% missing values, concentrated exclusively in two variables: “workclass” and “occupation”.

• News: The News Popularity Prediction dataset [53] consists of 39,644 samples and 58 columns containing information about articles published on the Mashable news blog over two years. The objective is to predict the popularity of an article, measured by the number of social media shares. The dataset is multivariate, including both continuous and categorical variables.

• King (https://www.kaggle.com/datasets/harlfoxem/housesalesprediction): The King County House Sales dataset is a regression dataset containing 21,613 house sale prices for King County, Washington, including Seattle. It includes homes sold between May 2014 and May 2015. The 20 features provide multivariate information in numerical (integer and decimal) and categorical forms.

• Intrusion: The KDD Cup 1999 Data [54] was used for the Third Knowledge Discovery and Data Mining Competition. It comprises 494,021 samples and 39 features used to classify connections in a military network environment. The dataset contains categorical, integer, and binary features.

This selection of datasets with varying sample sizes, feature dimensions, and domains allows us to assess the performance and generalizability of the proposed method in diverse data scenarios. As will be demonstrated in the following sections, the proposed method is effective in generating synthetic data that retain the key characteristics of the original datasets across a wide range of applications.

3.2 Validation metrics

To rigorously evaluate how well our proposed method captures the real data distribution, we adhere to the validation approach described in [20]. This approach leverages a probabilistic classifier (discriminator) to estimate the ratio of probability densities between the real and synthetic distributions, and then computes the KL and JS divergences from this ratio. Traditional validation methods often focus on individual data points and the marginal distribution of each separate feature. In contrast, this approach considers the entire data distribution, including complex relationships between features. Additionally, divergences are robust to noise and offer clear interpretations, making them ideal for evaluating the DGM’s effectiveness. They provide a comprehensive measure of the discrepancy between two probability distributions, which makes them suitable for assessing the similarity between the real data distribution $p(x_{r})$ and the distribution of the synthetic data generated by the DGM, $p_{\theta}$ .

The discriminator network plays a crucial role in the validation process. This neural network is trained to distinguish between real and synthetic data samples. The network receives two sets of samples as input. The first consists of $M$ samples from the real data distribution $p(x_{r})$ labeled as class $1$ , and the second consists of $M$ samples labeled as class $0$ from the synthetic data distribution generated by the DGM, $p_{\theta}$ or $p_{\hat{\theta}}$ , depending on the number of samples $N$ in the dataset. During training, the discriminator aims to learn a decision boundary that effectively separates these two sets of samples. This process forces the discriminator to capture the underlying differences between the real and synthetic distributions. Once the discriminator network is trained, it is used to estimate the KL and JS divergences between the real and synthetic probability distributions. This estimation involves using $L$ samples from each distribution and feeding them to the trained discriminator. The output probabilities of the discriminator for these samples are then used to compute the KL and JS divergence metrics.

Figure 6 illustrates the overall scheme of the approach. The figure emphasizes the separate but related components: the inductive bias generator for synthetic data creation and the validation process with the discriminator network. It also highlights the number of samples used from each distribution for each step of the process ( $N$ to generate samples, $M$ to train the discriminator, and $L$ to estimate divergence). By adhering to this rigorous validation approach, we ensure that our proposed methodology for generating synthetic data from small datasets is thoroughly evaluated and its effectiveness is quantitatively demonstrated using established metrics like KL and JS divergences. Note that, as stated in [20], $M$ and $L$ need to be large enough to prevent inaccurate divergence estimations, so we consider this in our experiments.
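The last step, turning discriminator outputs into divergence estimates, can be sketched as follows. This is a minimal illustration of the standard density-ratio identity, not necessarily the exact estimator of [20]: it assumes a discriminator trained on balanced classes, so that its output $d(x)$ approximates $p(x_{r})/(p(x_{r})+p_{\theta})$, a KL direction from real to synthetic, and a JS divergence expressed in bits so that it lies in $[0,1]$. The function name and the clipping constant are hypothetical.

```python
import numpy as np


def estimate_divergences(d_real, d_synth, eps=1e-7):
    """Estimate KL(real || synthetic) and JS from discriminator probabilities.

    d_real:  discriminator outputs P(class = real) on L real samples.
    d_synth: discriminator outputs P(class = real) on L synthetic samples.
    """
    # Clip to avoid log(0) when the discriminator is fully confident.
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_synth = np.clip(np.asarray(d_synth, dtype=float), eps, 1.0 - eps)

    # Density ratio p_real / p_synth = d / (1 - d); KL is its expected log under p_real.
    kl = np.mean(np.log(d_real) - np.log1p(-d_real))

    # JS in nats: log 2 + 0.5 E_real[log d] + 0.5 E_synth[log(1 - d)].
    js_nats = (np.log(2.0)
               + 0.5 * np.mean(np.log(d_real))
               + 0.5 * np.mean(np.log1p(-d_synth)))
    return float(kl), float(js_nats / np.log(2.0))  # JS converted to bits


# A discriminator that cannot tell the two sets apart outputs 0.5 everywhere.
kl_same, js_same = estimate_divergences([0.5] * 4, [0.5] * 4)
# A discriminator that separates them perfectly outputs 1 on real and 0 on synthetic.
kl_far, js_far = estimate_divergences([1.0] * 4, [0.0] * 4)
```

Because these are sample averages, the estimates can be noisy and even slightly negative when $L$ is small, which is consistent with the need for sufficiently large $M$ and $L$.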

3.3 Experimental design

To evaluate the proposed method, and as described in the methodology section, we use the state-of-the-art VAE architecture [15] and train 10 models with different seeds. Subsequently, we apply the methodologies based on transfer learning and meta-learning. For pre-training, we also include results using another state-of-the-art model, CTGAN. It is important to note that while we maintain the default parameters for CTGAN, we adjust the dimensionality of the latent space in the VAE depending on the dataset. This allows the VAE to capture the specific characteristics of each dataset more effectively. In particular, we use a latent space dimension of $10$ for the Adult and Intrusion datasets, $20$ for the News dataset, and $15$ for the King dataset. We maintain a consistent hidden size of $256$ neurons for all VAE models. Regarding the configuration of parameters $M$ and $L$ , we define two different validation configurations for each methodology: a reliable case and a more realistic case. The parameter $N$ remains unchanged, as it reflects the actual number of samples available to train the DGM, which is beyond our control. However, we can vary the number of generated samples for validation, i.e., $M$ and $L$ . Specifically, the results presented for each dataset are as follows.

• “Big data”: First, we present the optimistic scenario with a sufficient $N$ of $10,000$ samples, where no methodology is needed to calculate the inductive bias. This provides divergence results that serve as an “upper bound” or reference for the best possible outcome. For $M$ and $L$ , we maintain high values of the validation samples, $7500$ and $1000$ , respectively.

• “Low data”: Next, we show results for a realistic scenario with few samples ( $N=300$ ) without applying our methodology. This allows us to quantify the gain obtained with the methodology and determine its benefits. For this case, we use two configurations for parameters $M$ and $L$ : [ $M=7500$ , $L=1000$ ] and [ $M=100$ , $L=100$ ]. The second configuration is more realistic for few-data scenarios. When limited training data are available, there is also a limitation on the amount of data that can be effectively used for validation. The first configuration, with much larger values for $M$ and $L$ , serves as a rigorous evaluation of the impact of our methodology. However, it is acknowledged that the use of small values for $M$ and $L$ can lead to unreliable metric estimations [20].

• “Pre-train”: In this case, we begin to apply the proposed methodology using the pre-training technique. Results are presented for both CTGAN and VAE. The parameter configurations chosen are: [ $N=300$ , $M=7500$ , $L=1000$ ] and [ $N=300$ , $M=100$ , $L=100$ ].

• “AVG”, “MAML”, “DRS”: These scenarios apply the model average (“AVG”) and the meta-learning techniques (“MAML”, “DRS”). We solely utilize the VAE architecture for multiple training runs. The following parameter configurations will be presented: [ $N=300$ , $M=7500$ , $L=1000$ ] and [ $N=300$ , $M=100$ , $L=100$ ].

This setup aims to thoroughly evaluate the performance and robustness of the proposed methodologies in various data availability scenarios and parameter configurations.
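For reference, the scenarios above can be collected into a small configuration table. The sketch below is only one possible encoding: the class and field names are hypothetical, while the numerical values are those stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationConfig:
    scenario: str
    n_train: int  # N: real samples available to train the DGM
    m_disc: int   # M: samples per class used to train the discriminator
    l_div: int    # L: samples per distribution used to estimate the divergences


# VAE settings stated in the text; CTGAN keeps its default parameters.
LATENT_DIM = {"Adult": 10, "Intrusion": 10, "News": 20, "King": 15}
HIDDEN_SIZE = 256
N_SEEDS = 10


def build_configs():
    """List every (scenario, N, M, L) combination evaluated for each dataset."""
    configs = [ValidationConfig("Big data", 10_000, 7_500, 1_000)]
    for scenario in ("Low data", "Pre-train", "AVG", "MAML", "DRS"):
        configs.append(ValidationConfig(scenario, 300, 7_500, 1_000))  # reliable case
        configs.append(ValidationConfig(scenario, 300, 100, 100))      # realistic case
    return configs


CONFIGS = build_configs()
for cfg in CONFIGS:
    print(cfg)
```

Each dataset is therefore evaluated under eleven configurations: one reference scenario and two validation settings for each of the five low-data scenarios.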

3.4 Results

In this section, we present the results obtained from the experiments, focusing on the scenarios that allow a reliable validation of the synthetic data, characterized by higher values of $M$ and $L$ . The results for scenarios with $M=100$ and $L=100$ , considered unreliable due to their low information content, are presented in the Appendix for comparison purposes. For each dataset, we present a table summarizing the scenarios defined previously and their respective KL and JS divergence values. The results for each metric are displayed in the following format: mean (std) (lower is better). The code to replicate our results, along with the data used, can be found in our repository.

Scenario	N	VAE JS	CTGAN JS	VAE KL	CTGAN KL
Big data	10000	0.079 (0.001)	0.150 (0.002)	0.153 (0.019)	0.420 (0.025)
Low data	300	0.331 (0.004)	0.563 (0.002)	0.697 (0.018)	1.653 (0.015)
Pre-train	300	0.171 (0.004)	0.563 (0.002)	0.427 (0.021)	1.753 (0.040)
AVG	300	0.157 (0.004)	N/A	0.380 (0.043)	N/A
MAML	300	0.300 (0.002)	N/A	0.686 (0.037)	N/A
DRS	300	0.189 (0.006)	N/A	0.427 (0.043)	N/A

Table 1: Adult dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N=10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N=300$) are available, posing a challenge for synthetic data generation. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Values in bold indicate an improvement due to the technique used. Results are represented as mean (std). Lower is better.

Table 1 shows the validation results in terms of divergence obtained for the Adult dataset. We focus primarily on the JS divergence due to its interpretability as a bounded metric (ranging from 0 to 1). The table shows the upper and lower bounds used to assess the efficacy of the proposed methodology in the reliable case of higher validation samples ( $M=7500$ and $L=1000$ ). For the JS divergence of the VAE, these bounds are $0.079$ (upper bound, Big data) and $0.331$ (lower bound, Low data), highlighting a significant gap and room for improvement in the base VAE model (without any techniques applied). Examining the JS divergence results for the different proposed techniques, a consistent decrease in divergence is observed. The smallest improvement is obtained for MAML ( $0.300$ ) and the largest for AVG ( $0.157$ ). This implies that, for the VAE, an improvement in JS is always present and, in the best cases, substantial. A similar pattern is observed for KL divergence in the VAE: the best results are obtained with the transfer learning techniques, but every technique improves on the lower bound. However, for the CTGAN model (where only pre-training results are available), we see no significant improvement in either JS or KL divergence.

Table 2 summarizes the results obtained for the News dataset. The values demonstrate the effectiveness of the proposed techniques in improving divergence metrics compared to the established lower bounds. Notably, the VAE model shows improvement in divergence metrics for most methodologies, with the exception of the MAML technique. We hypothesize that the MAML technique might require a larger number of tasks to achieve comparable performance. Consistent with the findings for the previous dataset, model averaging emerges as the methodology that generally achieves the best results. When considering CTGAN, pre-training demonstrates improvement in the JS divergence metric. The KL divergence of the pre-trained CTGAN does not differ significantly from the lower bound, as the difference remains within the reported standard deviations.

Scenario	N	VAE JS	CTGAN JS	VAE KL	CTGAN KL
Big data	10000	0.253 (0.009)	0.463 (0.003)	0.647 (0.045)	1.506 (0.031)
Low data	300	0.840 (0.003)	0.962 (0.002)	4.582 (0.136)	8.994 (0.909)
Pre-train	300	0.746 (0.003)	0.937 (0.003)	3.516 (0.082)	8.603 (0.463)
AVG	300	0.609 (0.003)	N/A	2.596 (0.060)	N/A
MAML	300	0.851 (0.001)	N/A	5.176 (0.242)	N/A
DRS	300	0.645 (0.006)	N/A	2.449 (0.057)	N/A

Table 2: News dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N=10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N=300$) are available, posing a challenge for synthetic data generation.

Table 3 further reinforces the efficacy of the proposed methodologies on the King dataset. The VAE model consistently improves on the lower bounds established for the JS and KL divergences across all techniques. For CTGAN, the results for the JS and KL divergences are consistent with the lower bounds. This suggests that while pre-training does not produce significant reductions in the divergence metrics for CTGAN, it appears to maintain the quality of the data distribution compared to the lower bounds. However, it should be noted that CTGAN exhibits consistently lower gains across datasets compared to the VAE. This may be attributed to instabilities inherent to the GAN framework.

Scenario	N	VAE JS	CTGAN JS	VAE KL	CTGAN KL
Big data	10000	0.862 (0.002)	0.777 (0.003)	4.768 (0.072)	3.124 (0.115)
Low data	300	0.927 (0.002)	0.940 (0.003)	13.763 (0.696)	7.470 (0.392)
Pre-train	300	0.862 (0.002)	0.945 (0.002)	5.286 (0.327)	9.533 (0.453)
AVG	300	0.740 (0.002)	N/A	3.489 (0.209)	N/A
MAML	300	0.910 (0.002)	N/A	6.436 (0.496)	N/A
DRS	300	0.809 (0.003)	N/A	4.321 (0.215)	N/A

Table 3: King dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N=10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N=300$) are available, posing a challenge for synthetic data generation.

The results obtained using the Intrusion dataset are summarized in Table 4. These values align with the findings for the News dataset, demonstrating consistent improvements in divergence metrics for the reliable case ( $M=7500$ and $L=1000$ ) across different methodologies. Like the News dataset, Intrusion has a high number of features, resulting in higher dimensionality. This increased dimensionality poses a greater challenge in generating synthetic data that closely resembles the real-world data distribution. Consequently, the divergence results for Intrusion either remain unchanged or exhibit smaller improvements compared to lower-dimensional datasets. Despite the challenges posed by the high dimensionality of Intrusion, the proposed methodologies still demonstrate their effectiveness in improving divergence metrics, particularly for reliable cases.

Scenario	N	VAE JS	CTGAN JS	VAE KL	CTGAN KL
Big data	10000	0.760 (0.013)	0.531 (0.033)	2.744 (0.084)	2.623 (0.537)
Low data	300	0.920 (0.003)	0.961 (0.002)	6.216 (0.154)	8.841 (0.710)
Pre-train	300	0.793 (0.004)	0.959 (0.001)	3.831 (0.151)	8.443 (0.630)
AVG	300	0.867 (0.007)	N/A	5.798 (0.295)	N/A
MAML	300	0.913 (0.003)	N/A	6.359 (0.054)	N/A
DRS	300	0.835 (0.009)	N/A	4.587 (0.166)	N/A

Table 4: Intrusion dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N=10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N=300$) are available, posing a challenge for synthetic data generation.

Finally, Table 5 summarizes the gains obtained by applying each methodology to generate synthetic tabular data using the VAE model. These gains are computed as the difference between the metric at the lower bound (Low data) and the metric obtained with each proposed method, so positive values indicate an improvement. The results show positive gains for all methodologies except MAML, whose gains are negative for the News dataset (JS and KL) and for the Intrusion dataset (KL). This indicates that the proposed techniques generate synthetic data that are closer to the real-world data distribution than when they are not used. This improvement is particularly evident for the model averaging methodology, which achieves the highest average JS gain, although pre-training performs best on the Intrusion dataset. MAML’s performance might be limited in this scenario due to its reliance on a large number of tasks: since each seed defines a task and we train only 10 seeds, MAML might not have had enough tasks to learn effectively [49]. The consistent improvement in divergence metrics highlights the robustness and generalizability of the proposed techniques. These findings suggest that our approach is an effective methodology for generating high-quality synthetic tabular data that can be used for various applications.

Dataset	Pre-train JS gain	Pre-train KL gain	AVG JS gain	AVG KL gain	MAML JS gain	MAML KL gain	DRS JS gain	DRS KL gain
Adult	0.159 (0.482)	0.271 (0.388)	0.173 (0.525)	0.317 (0.455)	0.030 (0.092)	0.011 (0.016)	0.142 (0.430)	0.271 (0.388)
News	0.093 (0.111)	1.065 (0.233)	0.230 (0.274)	1.985 (0.433)	-0.011 (-0.013)	-0.594 (-0.130)	0.194 (0.231)	2.133 (0.465)
King	0.064 (0.070)	8.477 (0.616)	0.187 (0.202)	10.274 (0.746)	0.017 (0.018)	7.327 (0.532)	0.118 (0.128)	9.442 (0.686)
Intrusion	0.127 (0.138)	2.385 (0.384)	0.053 (0.057)	0.419 (0.067)	0.006 (0.007)	-0.143 (-0.023)	0.085 (0.092)	1.629 (0.262)
Average	0.111 (0.200)	3.050 (0.405)	0.161 (0.265)	3.249 (0.425)	0.011 (0.026)	1.650 (0.099)	0.135 (0.220)	3.369 (0.450)

Table 5: Gains using the proposed methodology for the VAE. Gains are represented in the following format: absolute gain (relative gain). The methodology achieves relative gains of up to 50% in JS divergence, which is bounded, and up to 75% in KL divergence. Bold values indicate positive gain. Higher is better.
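As a worked example of how the entries of Table 5 relate to the previous tables, the sketch below recomputes the VAE JS gains for the Adult dataset from the rounded means in Table 1. It assumes that the relative gain is the absolute gain divided by the Low data value, which matches the reported figures; small discrepancies in the last digit are expected because the inputs are already rounded. The function and variable names are hypothetical.

```python
def gain(low_data, method):
    """Absolute and relative gain of a method with respect to the Low data bound."""
    absolute = low_data - method
    relative = absolute / low_data
    return absolute, relative


# VAE JS means for the Adult dataset, taken from Table 1.
LOW_DATA_JS = 0.331
ADULT_VAE_JS = {"Pre-train": 0.171, "AVG": 0.157, "MAML": 0.300, "DRS": 0.189}

adult_gains = {name: gain(LOW_DATA_JS, js) for name, js in ADULT_VAE_JS.items()}
for name, (absolute, relative) in adult_gains.items():
    print(f"{name}: {absolute:.3f} ({relative:.3f})")
```

A negative gain, as observed for MAML on the News dataset, simply means that the method yields a higher divergence than the Low data baseline.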

It is important to acknowledge the varying computational demands of the compared methods. Model averaging emerges as the most efficient approach, as inductive bias generation only involves calculating a mean, resulting in minimal computational overhead. In contrast, MAML exhibits the highest computational load due to its intricate optimization procedure. Pre-training and DRS fall between these two extremes, both requiring the training of a DGM to establish the inductive bias. Considering these findings alongside the results presented in Table 5, we recommend against using MAML. It offers minimal performance gains with significant computational cost. The other methods, on the other hand, provide a more favorable trade-off between computational efficiency and performance. Furthermore, as detailed in the Appendix, reliably quantifying the benefits of our methodology in a realistic, limited-data setting is challenging. This implies that validation with a large number of samples is necessary to definitively assess which of our proposed inductive bias techniques yields superior results for a specific dataset. However, the experimental results strongly suggest potential gains that warrant further exploration.

4 Conclusions

This research proposed a novel approach to generate synthetic tabular data using DGMs in the context of limited datasets. Our approach leverages four distinct techniques to artificially introduce an inductive bias that guides the DGM towards generating more realistic and informative synthetic data samples. These techniques encompass two transfer learning approaches, pre-training and model averaging, and two meta-learning approaches, MAML and DRS. To facilitate the application of model averaging, MAML, and DRS, we employ the VAE model from [15] and train multiple instances with different random seeds. This allows us to leverage the ensemble properties of the VAE models for techniques like model averaging and further enables the application of meta-learning algorithms like MAML and DRS. We also used CTGAN [7] to assess pre-training on a second well-known architecture for synthetic tabular data generation. We used divergence metrics, in particular JS and KL divergences, to compare the real and synthetic data distributions.

The experimental results consistently demonstrate the effectiveness of our proposed approach in generating high-quality synthetic tabular data, particularly when using transfer learning techniques. These techniques significantly improve the divergence metrics, indicating a closer resemblance between the synthetic and real data distributions. Our approach offers several advantages over existing methods. Firstly, it effectively addresses the challenge of generating realistic synthetic data from small datasets, a common limitation in many real-world applications. Secondly, the use of transfer learning and meta-learning techniques enhances the inductive bias of the DGM, leading to more meaningful and informative synthetic data samples.

However, it is also important to acknowledge the trade-offs associated with our methodology. Applying these techniques requires training multiple VAE models with different random seeds, which can lead to a significant increase in computational cost compared to simpler DGM training methods. While divergence metrics provide a valuable measure of distributional similarity, their ability to reliably assess the improvement in synthetic data quality for specific downstream tasks can be limited, especially with small datasets, as detailed in the Appendix. Nonetheless, our methodology may provide significant gains in JS divergence, of up to 50% according to our experimental results.

In conclusion, our proposed approach provides a promising solution for generating high-quality synthetic tabular data from small datasets, particularly when VAEs are employed to apply transfer learning techniques. We believe that this work has the potential to make a significant contribution to the field of synthetic data generation and to machine learning applications that rely on small datasets. However, several research lines remain to be addressed. While the current study focuses on VAEs and GANs, investigating the applicability of our framework to other DGM architectures could provide valuable insights. In addition, our current approach does not explicitly incorporate domain knowledge. Future research could explore mechanisms to integrate domain-specific information from an expert into the inductive bias generation process, potentially leading to even more realistic and informative synthetic data. Lastly, although divergence metrics offer a valuable measure of distributional similarity, exploring additional evaluation techniques that assess the quality and usefulness of synthetic data for specific downstream tasks would provide a more comprehensive understanding of the effectiveness of our methodology. Such techniques should also remain valid when only a small amount of data is available for validation, which is a current limitation of this work.

Acknowledgements

This research was supported by GenoMed4All and SYNTHEMA projects. Both have received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 101017549 and 101095530, respectively. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Declarations

The authors have no relevant financial or non-financial interests to disclose.

Appendix A Appendix

To further investigate the impact of sample size on divergence metric performance, we conducted an additional validation experiment using a reduced number of samples for both $M$ and $L$ . Specifically, we set $M=100$ and $L=100$ , representing a scenario with limited data availability. As noted in [20], having enough samples is crucial for accurate distribution comparisons. Therefore, we anticipated that this low sample count scenario would yield less reliable divergence results.

The results obtained for this validation are presented in Tables 6, 7, 8 and 9. The improvements observed in the previous experiments are not consistent across all datasets under these constrained conditions. Moreover, the divergence values are generally quite small. This suggests that when sample counts are low, the underlying distributions may not be adequately captured, so the estimated divergence underestimates the true disparity between the distributions and yields seemingly small values. These findings underscore the importance of having an adequate number of samples when evaluating divergence metrics. With limited data, the reliability of these measures decreases, which can lead to misleading conclusions about the similarity between distributions. Therefore, it is essential to consider sample size as a factor when interpreting divergence metric results, particularly in scenarios with limited data availability.

Scenario	N	M	L	VAE JS	CTGAN JS	VAE KL	CTGAN KL
Big data	10000	7500	1000	0.079 (0.001)	0.150 (0.002)	0.153 (0.019)	0.420 (0.025)
Low data	300	7500	1000	0.331 (0.004)	0.563 (0.002)	0.697 (0.018)	1.653 (0.015)
Low data	300	100	100	0.030 (0.012)	0.333 (0.027)	0.084 (0.101)	2.197 (0.257)
Pre-train	300	7500	1000	0.171 (0.004)	0.563 (0.002)	0.427 (0.021)	1.753 (0.040)
Pre-train	300	100	100	-0.002 (0.004)	0.237 (0.014)	0.056 (0.103)	0.978 (0.307)
AVG	300	7500	1000	0.157 (0.004)	N/A	0.380 (0.043)	N/A
AVG	300	100	100	0.002 (0.002)	N/A	0.049 (0.107)	N/A
MAML	300	7500	1000	0.300 (0.002)	N/A	0.686 (0.037)	N/A
MAML	300	100	100	0.002 (0.008)	N/A	0.054 (0.116)	N/A
DRS	300	7500	1000	0.189 (0.006)	N/A	0.427 (0.043)	N/A
DRS	300	100	100	0.015 (0.008)	N/A	0.089 (0.051)	N/A

Table 6: Adult dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N=10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N=300$) are available, posing a challenge for synthetic data generation. There are two different Low data rows, depending on the number of samples $M$ and $L$ used for validating. A more reliable case is presented when $M=7,500$ and $L=1,000$ and a less reliable case is presented when $M=100$ and $L=100$. These two cases apply to the following rows too. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Results are represented as mean (std). Lower is better. Values in bold indicate an improvement due to the technique used.

Scenario	N	M	L	VAE JS	CTGAN JS	VAE KL	CTGAN KL
Big data	10000	7500	1000	0.253 (0.009)	0.463 (0.003)	0.647 (0.045)	1.506 (0.031)
Low data	300	7500	1000	0.840 (0.003)	0.962 (0.002)	4.582 (0.136)	8.994 (0.909)
Low data	300	100	100	0.003 (0.004)	0.482 (0.093)	-0.062 (0.113)	2.290 (1.100)
Pre-train	300	7500	1000	0.746 (0.003)	0.937 (0.003)	3.516 (0.082)	8.603 (0.463)
Pre-train	300	100	100	0.010 (0.006)	0.843 (0.013)	0.061 (0.051)	4.599 (0.629)
AVG	300	7500	1000	0.609 (0.003)	N/A	2.596 (0.060)	N/A
AVG	300	100	100	-0.001 (0.004)	N/A	0.073 (0.088)	N/A
MAML	300	7500	1000	0.851 (0.001)	N/A	5.176 (0.242)	N/A
MAML	300	100	100	0.179 (0.071)	N/A	2.256 (0.927)	N/A
DRS	300	7500	1000	0.645 (0.006)	N/A	2.449 (0.057)	N/A
DRS	300	100	100	0.002 (0.004)	N/A	-0.010 (0.114)	N/A

Table 7: News dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples (N=10,000) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples (N=300) are available, posing a challenge for synthetic data generation. There are two different Low data rows, depending on the number of samples M and L used for validating. A more reliable case is presented when M=7,500 and L=1,000 and a less reliable case is presented when M=100 and L=100.
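Since the tables report each result as mean (std) and lower divergences are better, a technique can be compared with the Low data baseline evaluated under the same M and L. The following is a minimal sketch of that comparison, not the authors' evaluation code. The cell strings are copied from the VAE JS column of the table printed directly above (rows with M=7,500 and L=1,000); the helper names `parse_cell` and `relative_gain` are hypothetical, and defining the relative gain as (baseline - method) / baseline is an assumption made here for illustration.

```python
import re

# VAE JS cells copied from the table directly above (M = 7,500, L = 1,000).
vae_js = {
    "Low data": "0.840 (0.003)",
    "Pre-train": "0.746 (0.003)",
    "AVG": "0.609 (0.003)",
    "MAML": "0.851 (0.001)",
    "DRS": "0.645 (0.006)",
}

# Pattern for a "mean (std)" cell; the mean may be negative because the
# tables contain slightly negative divergence estimates.
CELL_PATTERN = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*\((\d+(?:\.\d+)?)\)\s*")


def parse_cell(cell):
    """Split a 'mean (std)' table cell into a (mean, std) pair of floats."""
    match = CELL_PATTERN.fullmatch(cell)
    if match is None:
        raise ValueError(f"not a 'mean (std)' cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def relative_gain(baseline_mean, method_mean):
    """Assumed definition: fraction by which the baseline divergence is reduced."""
    return (baseline_mean - method_mean) / baseline_mean


baseline_mean, _ = parse_cell(vae_js["Low data"])
gains = {
    name: relative_gain(baseline_mean, parse_cell(cell)[0])
    for name, cell in vae_js.items()
    if name != "Low data"
}

# A positive gain means the technique lowered the divergence.
for name, gain in gains.items():
    print(f"{name}: {gain:+.1%}")
```

A positive value indicates that the technique reduced the divergence with respect to the baseline. This ratio is only informative when the baseline mean is clearly positive; for the M=100, L=100 rows, where some estimates are close to zero or negative, comparing the means together with their standard deviations is likely more meaningful.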

Scenario	N	M	L	VAE JS	CTGAN JS	VAE KL	CTGAN KL
Big data	10000	7500	1000	0.862 (0.002)	0.777 (0.003)	4.768 (0.072)	3.124 (0.115)
Low data	300	7500	1000	0.927 (0.002)	0.940 (0.003)	13.763 (0.696)	7.470 (0.392)
Low data	300	100	100	0.533 (0.062)	0.682 (0.029)	3.264 (0.555)	3.731 (0.279)
Pre-train	300	7500	1000	0.862 (0.002)	0.945 (0.002)	5.286 (0.327)	9.533 (0.453)
Pre-train	300	100	100	0.228 (0.018)	0.622 (0.070)	0.698 (0.197)	3.455 (0.426)
AVG	300	7500	1000	0.740 (0.002)	N/A	3.489 (0.209)	N/A
AVG	300	100	100	-0.001 (0.002)	N/A	0.020 (0.108)	N/A
MAML	300	7500	1000	0.910 (0.002)	N/A	6.436 (0.496)	N/A
MAML	300	100	100	0.322 (0.046)	N/A	1.030 (0.127)	N/A
DRS	300	7500	1000	0.809 (0.003)	N/A	4.321 (0.215)	N/A
DRS	300	100	100	0.068 (0.006)	N/A	0.447 (0.093)	N/A

Table 8: King dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples (N=10,000) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples (N=300) are available, posing a challenge for synthetic data generation. There are two different Low data rows, depending on the number of samples M and L used for validating. A more reliable case is presented when M=7,500 and L=1,000 and a less reliable case is presented when M=100 and L=100.

Scenario	N	M	L	VAE JS	CTGAN JS	VAE KL	CTGAN KL
Big data	10000	7500	1000	0.760 (0.013)	0.531 (0.033)	2.744 (0.084)	2.623 (0.537)
Low data	300	7500	1000	0.920 (0.003)	0.961 (0.002)	6.216 (0.154)	8.841 (0.710)
Low data	300	100	100	0.050 (0.009)	0.681 (0.052)	0.182 (0.150)	4.365 (0.175)
Pre-train	300	7500	1000	0.793 (0.004)	0.959 (0.001)	3.831 (0.151)	8.443 (0.630)
Pre-train	300	100	100	0.067 (0.004)	0.604 (0.039)	0.286 (0.091)	3.818 (0.189)
AVG	300	7500	1000	0.867 (0.007)	N/A	5.798 (0.295)	N/A
AVG	300	100	100	0.055 (0.003)	N/A	0.167 (0.132)	N/A
MAML	300	7500	1000	0.913 (0.003)	N/A	6.359 (0.054)	N/A
MAML	300	100	100	0.049 (0.002)	N/A	0.261 (0.093)	N/A
DRS	300	7500	1000	0.835 (0.009)	N/A	4.587 (0.166)	N/A
DRS	300	100	100	0.066 (0.005)	N/A	0.310 (0.173)	N/A

Table 9: Intrusion dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples (N=10,000) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples (N=300) are available, posing a challenge for synthetic data generation. There are two different Low data rows, depending on the number of samples M and L used for validating. A more reliable case is presented when M=7,500 and L=1,000 and a less reliable case is presented when M=100 and L=100.

References

[1] Mohamed Elasri, Omar Elharrouss, Somaya Al-Maadeed, and Hamid Tairi. Image generation: A review. Neural Processing Letters, 54(5):4609–4646, 2022.
[2] Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, and Dawei Song. A survey of controllable text generation using transformer-based pre-trained language models. ACM Computing Surveys, 56(3):1–37, 2023.
[3] Javier Selva, Anders S Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B Moeslund, and Albert Clapés. Video transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[4] Rick Sauber-Cole and Taghi M Khoshgoftaar. The use of generative adversarial networks to alleviate class imbalance in tabular data: a survey. Journal of Big Data, 9(1):98, 2022.
[5] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[6] Joao Fonseca and Fernando Bacao. Tabular and latent space synthetic data generation: a literature review. Journal of Big Data, 10(1):115, 2023.
[7] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional GAN. Advances in Neural Information Processing Systems, 32, 2019.
[8] Alvaro Figueira and Bruno Vaz. Survey on synthetic data generation, evaluation methods and GANs. Mathematics, 10(15):2733, 2022.
[9] Zilong Zhao, Aditya Kunar, Robert Birke, and Lydia Y Chen. CTAB-GAN: Effective table data synthesizing. In Asian Conference on Machine Learning, pages 97–112. PMLR, 2021.
[10] Amirarsalan Rajabi and Ozlem Ozmen Garibay. TabFairGAN: Fair tabular data generation with generative adversarial networks. Machine Learning and Knowledge Extraction, 4(2):488–501, 2022.
[11] Justin Engelmann and Stefan Lessmann. Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning. Expert Systems with Applications, 174:114582, 2021.
[12] Ally Salim Jr. Synthetic patient generation: A deep learning approach using variational autoencoders. arXiv preprint arXiv:1808.06444, 2018.
[13] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. TabDDPM: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pages 17564–17579. PMLR, 2023.
[14] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language models are realistic tabular data generators. arXiv preprint arXiv:2210.06280, 2022.
[15] Patricia A Apellániz, Juan Parras, and Santiago Zazo. An improved tabular data generator with VAE-GMM integration. arXiv preprint arXiv:2404.08434, 2024.
[16] Chris Cremer, Xuechen Li, and David Kristjanson Duvenaud. Inference suboptimality in variational autoencoders. In International Conference on Machine Learning, 2018.
[17] R. A. Fisher. Iris. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C56C76.
[18] David Harrison and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.
[19] Matjaz Zwitter and Milan Soklic. Breast Cancer. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C51P4M.
[20] Patricia A Apellániz, Ana Jiménez, Borja Arroyo Galende, Juan Parras, and Santiago Zazo. Synthetic tabular data validation: A divergence-based approach. arXiv preprint arXiv:2405.07822, 2024.
[21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[22] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 2234–2242, Red Hook, NY, USA, 2016. Curran Associates Inc.
[23] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020.
[24] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, 2002.
[25] Mikel Hernandez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. Synthetic tabular data evaluation in the health domain covering resemblance, utility, and privacy dimensions. Methods of Information in Medicine, 62, 2023.
[26] Anirudh Goyal and Yoshua Bengio. Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A, 478(2266):20210068, 2022.
[27] Micah Goldblum, Marc Finzi, Keefer Rowan, and Andrew Gordon Wilson. The no free lunch theorem, Kolmogorov complexity, and the role of inductive biases in machine learning. arXiv preprint arXiv:2304.05366, 2023.
[28] Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers can do Bayesian inference. In International Conference on Learning Representations, 2022.
[29] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. In The Eleventh International Conference on Learning Representations, 2023.
[30] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[31] Hee E. Kim, Alejandro Cosa-Linan, Nandhini Santhanam, Mahboubeh Jannesari, Mate E. Maros, and Thomas Ganslandt. Transfer learning for medical image classification: a literature review. BMC Medical Imaging, 22(1):1–13, 2022.
[32] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):1–40, 2016.
[33] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[34] Fangtao Li, Sinno Jialin Pan, Ou Jin, Qiang Yang, and Xiaoyan Zhu. Cross-domain co-extraction of sentiment and topic lexicons. In Haizhou Li, Chin-Yew Lin, Miles Osborne, Gary Geunbae Lee, and Jong C. Park, editors, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 410–419, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
[35] Mingsheng Long, Jianmin Wang, Guiguang Ding, Sinno Jialin Pan, and Philip S. Yu. Adaptation regularization: A general framework for transfer learning. IEEE Transactions on Knowledge and Data Engineering, 26:1076–1089, 2014.
[36] Lixin Duan, Dong Xu, and Shih-Fu Chang. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1338–1345, 2012.
[37] Zifeng Wang and Jimeng Sun. TransTab: Learning transferable tabular transformers across tables. Advances in Neural Information Processing Systems, 35:2902–2915, 2022.
[38] Bingzhao Zhu, Xingjian Shi, Nick Erickson, Mu Li, George Karypis, and Mahsa Shoaran. XTab: Cross-table pretraining for tabular transformers. In International Conference on Machine Learning, 2023.
[39] Robert Winkler and Spyros Makridakis. The combination of forecasts. Journal of the Royal Statistical Society. Series A (General), 146:150–157, 1983.
[40] Robert T. Clemen and Robert L. Winkler. Combining economic forecasts. Journal of Business & Economic Statistics, 4(1):39–46, 1986.
[41] Sebastian Thrun and Lorien Pratt. Learning to Learn: Introduction and Overview, pages 3–17. Springer US, Boston, MA, 1998.
[42] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
[43] Timothy M. Hospedales, Antreas Antoniou, Paul Micaelli, and Amos J. Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:5149–5169, 2020.
[44] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
[45] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, 2014.
[46] Jonathan Gordon, John F. Bronskill, M. Bauer, Sebastian Nowozin, and Richard E. Turner. Meta-learning probabilistic inference for prediction. In International Conference on Learning Representations, 2018.
[47] Kate Smith-Miles. Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput. Surv., 41:6:1–6:25, 2009.
[48] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, 2018.
[49] Katelyn Gao and Ozan Sener. Modeling and optimization trade-off in meta-learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 11154–11165. Curran Associates, Inc., 2020.
[50] Joshua Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
[51] Neha Patki, Roy Wedge, and Kalyan Veeramachaneni. The synthetic data vault. In IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 399–410, Oct 2016.
[52] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.
[53] Kelwin Fernandes, Pedro Vinagre, Paulo Cortez, and Pedro Sernadela. Online News Popularity. UCI Machine Learning Repository, 2015. DOI: https://doi.org/10.24432/C5NS3V.
[54] Salvatore Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Philip Chan. KDD Cup 1999 Data. UCI Machine Learning Repository, 1999. DOI: https://doi.org/10.24432/C51C7N.