Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios

Patricia A. Apellániz
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
patricia.alonsod@upm.es
Ana Jiménez
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
ana.jimenezb@upm.es
Borja Arroyo Galende
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
borja.arroyog@upm.es
Juan Parras
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
j.parras@upm.es
Santiago Zazo
ETS Ingenieros de Telecomunicación
Universidad Politécnica de Madrid
Madrid
santiago.zazo@upm.es
These authors contributed equally.
Abstract

While synthetic tabular data generation using Deep Generative Models (DGMs) offers a compelling solution to data scarcity and privacy concerns, the effectiveness of these models relies on substantial training data, which is often unavailable in real-world applications. This paper addresses this challenge by proposing a novel methodology for generating realistic and reliable synthetic tabular data with DGMs in limited real-data environments. Our approach proposes several ways to generate an artificial inductive bias in a DGM through transfer learning and meta-learning techniques. We explore and compare four different methods within this framework, demonstrating that transfer learning strategies such as pre-training and model averaging outperform meta-learning approaches such as Model-Agnostic Meta-Learning and Domain-Randomized Search. We validate our approach using two state-of-the-art DGMs, namely a Variational Autoencoder and a Generative Adversarial Network, to show that our artificial inductive bias improves synthetic data quality, as measured by Jensen-Shannon divergence, with relative gains of up to 50%. This methodology has broad applicability in various DGMs and machine learning tasks, particularly in areas like healthcare and finance, where data scarcity is often a critical issue.

Keywords Tabular Data · Deep Generative Model · Inductive Bias · Transfer Learning · Meta-Learning

1 Introduction

Recent advances in deep learning have led to the development of powerful Deep Generative Models (DGMs). These models excel at learning and representing complex, high-dimensional distributions, allowing them to sample new data points realistically. This capability has driven remarkable progress in various domains, including image generation [1], text generation [2], and video generation [3].

Tabular data has emerged as a data format of increasing interest within DGMs. Structured in a familiar spreadsheet-like format with rows and columns, tabular data are a fundamental cornerstone for information storage and analysis across diverse fields. Researchers are actively investigating techniques for generating realistic and informative tabular data, as evidenced by the growing body of work in this area [4, 5, 6]. Generative Adversarial Networks (GANs) are currently the most prominent architecture for tabular data generation. Within this domain, CTGAN [7] is a widely recognized model. However, the exploration of alternative methodologies is ongoing, as evidenced by surveys [8] that highlight diverse approaches [9, 10, 11]. Beyond GANs, there are promising approaches including Variational Autoencoders (VAEs) [12], methods based on diffusion models [13], and techniques that leverage language models [14], to mention some. Focusing on VAEs for tabular data generation, TVAE [7] stands as a foundational model. TVAE encodes tabular data into a latent space, a lower-dimensional representation that captures the essential features of the data. By sampling from this latent space and subsequently decoding the samples, TVAE facilitates the generation of new data points in the original tabular format. A recent advancement [15] builds upon TVAE by incorporating a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This novel approach outperforms the state-of-the-art models, including CTGAN and TVAE, on different validation metrics. This superior performance comes from the BGM integration enabling high-quality sampling from the VAE’s latent space. This addresses a limitation of TVAE, which assumes a Gaussian latent space for generation, which may not happen in practice [16]. As real-world data often deviates from this distribution, the BGM integration leads to improved performance.

The importance of sufficient data for DGM training cannot be overstated. Studies using popular DGMs, such as the CTGAN, use datasets ranging from 10k to 481k training instances. This contrasts starkly with the practices observed in numerous domains. For example, well-established datasets used to evaluate rudimentary models include the Iris dataset with 150 samples [17] and the Boston House Prices dataset with 506 instances [18]. Even within the realm of medical research, valuable datasets such as the breast cancer dataset encompass only around 300 patients [19]. Smaller datasets pose challenges for DGM training, including overfitting and difficulty in assessing the quality of generated data [20].

A critical challenge associated with using DGMs for tabular data generation lies in ensuring the quality and effectiveness of synthetic data. While standardized metrics exist for image [21, 22] and text data [23, 24], measuring the quality of synthetic tabular data presents unique challenges. Studies employ various metrics, including pairwise correlation difference, support coverage, likelihood fitness, and other statistics as described in [25]. However, a consistent method for holistic evaluation is lacking. Divergences, which quantify the discrepancy between probability distributions, offer a promising avenue for validation [20]. They can capture the overall differences between real and synthetic data by considering the joint distribution of all attributes. However, modeling joint distributions presents a trade-off between computational cost and accuracy. Large datasets, especially those with high dimensionality, require significant computational resources. With sufficient resources, accurate results can be achieved. In contrast, smaller datasets, common in real-world applications, present a challenge to accuracy. Limited data may hinder the capture of complex variable relationships, leading to models with poor generalization to unseen data. Consequently, even computationally efficient methods for joint distribution modeling can yield inaccurate results in small data settings.

The limitations of current validation techniques further compound the inherent limitations of small datasets. These techniques often focus on comparing synthetic data with real data used for training, failing to account for the limited scope of information on which the DGM was trained. This can lead to a false sense of security, where the synthetic data appear similar to the training data but may not generalize well. Deep learning models with high parameter counts are susceptible to overfitting too, especially on small datasets. Additionally, small datasets might not capture the full spectrum of real-world variations. Consequently, the synthetic data generated may not be representative of the underlying distribution, affecting its effectiveness for different tasks. Furthermore, smaller datasets are more prone to the influence of noise (random errors) and bias (systematic skews). These mislead DGMs into learning incorrect patterns, ultimately resulting in unrealistic synthetic data.

This work addresses the critical challenge of generating reliable synthetic tabular data from limited datasets, a prevalent scenario in many real-world applications. Traditional divergence metrics often struggle in these situations, which can lead to inaccurate assessments of the quality of synthetic data [20]. We propose a novel methodology specifically designed to address this issue by introducing a framework that leverages inductive biases to improve the performance of DGMs in small dataset environments. Inductive biases [26] are inherent preferences or assumptions built into a learning model. These biases can guide the learning process and improve model performance, particularly when data is scarce. Traditionally, inductive biases are introduced through domain knowledge or specific architectural choices. This work proposes an alternative approach that leverages the variability that is often found in the DGM training process to generate inductive biases through different learning techniques. As an example, we use VAEs, as they exhibit inherent variability between training seeds, allowing them to capture various aspects of data during training, but we note that our approach could be used with other DGMs. We exploit this variability by employing various transfer learning and meta-learning techniques to generate the inductive bias, ultimately leading to improved synthetic data generation. Our key contribution is threefold:

  • We propose a novel generation methodology for synthetic tabular data generated by DGMs in a small dataset environment. This methodology leverages inductive bias generation through transfer learning and meta-learning techniques to achieve a more reliable generation process.

  • We propose four different techniques (pre-training, model averaging, Model-Agnostic Meta-Learning (MAML), and Domain-Randomized Search (DRS)) to generate the inductive bias. We demonstrate the efficacy of our proposal using a common DGM, the VAE, and for the pre-training process we also assess the performance with a CTGAN architecture, another common DGM.

  • We demonstrate the effectiveness of our proposed methodology through extensive experiments on benchmark datasets. We provide a comprehensive analysis of the results using established divergence metrics such as Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences.

This contribution offers a significant advancement in the field of synthetic data generation. By enabling improved data quality in resource-constrained settings, our approach has the potential to broaden the applicability of synthetic tabular data across various disciplines.

2 Methodology: Generating Artificial Inductive Bias

Let us assume that we have a tabular dataset composed of $N$ entries $\{x_r^i\}_{i=1}^{N}$. $N$ represents the number of samples available and each entry $x_r^i$ has a dimensionality of $C$ features. In other words, $C$ represents the number of attributes associated with each data point. Let us also define a DGM as a high-dimensional probability distribution $p_\theta$, where $\theta$ represents the learnable parameters of the model. The objective of the DGM is to learn a representation, $p_\theta$, that closely approximates the true underlying data distribution, denoted by $p(x_r)$. Once trained, the DGM can generate new synthetic samples $x_g$ by drawing from its learned distribution:

$$x_g \sim p_\theta. \tag{1}$$

Ideally, a well-trained DGM should produce synthetic data $x_g$ that are statistically indistinguishable from real data $x_r$.

In the prevalent big data setting, characterized by a large number of training samples ($N \gg C$), DGMs with sufficient complexity can effectively capture the underlying data distribution $p(x_r)$. This is evidenced by the impressive results achieved in recent research, where high-dimensional synthetic samples are generated using vast amounts of training data [7, 1, 2, 3]. However, for scenarios with limited training data, which is common in tabular domains, DGMs struggle to accurately represent the complex inter-feature relationships. Consequently, the generated synthetic samples $x_g$ deviate significantly from the true data distribution $p(x_r)$, leading to high KL and JS divergences between real and synthetic data.
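To make the two divergences concrete, the following minimal sketch computes KL and JS for two discrete distributions defined on the same bins. It is an illustration only: the paper evaluates divergences over the joint distribution of all attributes using the estimator of [20], which is not reproduced here, and the function names and the example histograms are assumptions made for this sketch.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same bins."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical normalized histograms of one attribute (illustrative values).
real_hist = [0.5, 0.3, 0.2]
synthetic_hist = [0.4, 0.4, 0.2]
print("KL:", kl_divergence(real_hist, synthetic_hist))
print("JS:", js_divergence(real_hist, synthetic_hist))
```

Unlike KL, the JS divergence is symmetric and bounded above by $\ln 2$ (in nats), which makes it convenient for comparing generators.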

To address this challenge, we propose an approach that uses artificially generated inductive biases. Figure 1 illustrates the overall architecture. In the standard big data setting, a DGM $p_\theta$ is directly trained using real data $x_r$, generating high-quality synthetic data $x_g \sim p_\theta$. However, when the number of real samples $N$ is limited, the quality of the generated data $x_g$ deteriorates. To mitigate this issue, we introduce an artificial inductive bias generator. This module takes the initial synthetic data $x_g$ as input and outputs an initial set of weights $\theta_0$. These weights are then used as the inductive bias to train a second DGM $p_{\hat{\theta}}$ using real data $x_r$. This second DGM generates a new set of synthetic samples, $\hat{x}_g$. Notably, the only distinction between $p_\theta$ and $p_{\hat{\theta}}$ lies in the initial weights: $p_{\hat{\theta}}$ leverages the inductive bias encoded in $\theta_0$ to potentially achieve faster convergence to a distribution that better resembles $p(x_r)$, while $p_\theta$ begins training with random weights. As our simulations will demonstrate, this seemingly minor difference translates into significant improvements in the quality of the generated synthetic data.

Figure 1: Block diagram for the proposed architecture. In a standard big data setting, the first generative model $p_\theta$ generates good enough samples. However, in cases where large amounts of data are not available ($N$ is limited), we propose to use the data generated by the first DGM as input to an artificial inductive bias generator, which in return provides a set of initial weights $\theta_0$ for a DGM that contains the artificially generated inductive bias. These initial weights are then used to train a second DGM $p_{\hat{\theta}}$ using real data $x_r$, generating a second set of synthetic data $\hat{x}_g$, which is of higher quality than the synthetic data $x_g$.

The proposed approach hinges on two key concepts: the importance of inductive biases and the feasibility of their artificial generation. The importance of inductive biases in supervised learning is well established. The no-free-lunch theorems state that a universally optimal learner does not exist. Consequently, specific learning biases can lead to substantial performance gains for particular problem domains (see [27] and the references therein). Convolutional Neural Networks (CNNs) exemplify this principle. Their inherent inductive bias, the fact that the image information possesses spatial correlation, makes them the preferred architecture for image processing tasks. Similarly, as highlighted in [26], the use of inductive biases is a cornerstone of Deep Learning’s success. In scenarios with limited training data, regularizers are commonly employed as inductive biases to prevent overfitting. This underscores the dual role of inductive biases: not only do they contribute to Deep Learning’s effectiveness, but they are also crucial in preventing overfitting. However, effective use of inductive biases is often contingent on having specific knowledge about the problem at hand. In the aforementioned example of CNNs, we inherently understand the existence of spatial correlation in images. However, in tabular data, this domain-specific knowledge is often scarce. To address this challenge, recent efforts have focused on designing large models trained on artificially generated data as inductive biases. The underlying hope is that the actual problem to be solved exhibits similarities to those encountered during training of the large model (e.g., [28] and [29]).

Therefore, our proposed approach departs from existing methods for incorporating inductive biases in synthetic tabular data generation. Unlike the brute-force approach employed in [29], we use data generated by a potentially low-quality DGM $p_\theta$. This strategy aims to obtain an initial set of weights $\theta_0$, which act as an inductive bias. Ideally, these weights should guide the model to a region of the parameter space that facilitates convergence towards a high-quality solution. In particular, we assess our ideas using a state-of-the-art VAE architecture for the DGM, although we hypothesize that similar results could be achieved with other DGM architectures. Due to its demonstrated superiority against other leading models, we will use the architecture proposed in [15]. VAEs are known to be sensitive to the initial random conditions (seeds) used during training. This dependence on seeds requires training with multiple seeds and selecting the one(s) that exhibit the best performance based on a chosen metric, such as minimum validation loss. The remaining runs, often discarded, may still contain valuable problem-specific information despite not achieving optimal solutions using traditional metrics. Our key idea lies in exploiting these potentially informative data from discarded VAE runs to create an artificial inductive bias for the final DGM trained with real data.

The following subsections explore two distinct paradigms for generating the initial set of weights, $\theta_0$: transfer learning and meta-learning. Transfer learning techniques encompass pre-training and model averaging, while meta-learning techniques include MAML and DRS. Pre-training offers a versatile approach applicable to any DGM architecture, regardless of inherent characteristics. In contrast, model averaging and meta-learning techniques are particularly well suited for VAEs trained with multiple seeds due to their inherent variability in learned representations. Consequently, we will evaluate the latter two techniques within the chosen VAE architecture. Additionally, to assess the efficacy of pre-training across different DGM architectures, we will compare its performance on the CTGAN.

2.1 Transfer learning

Transfer learning is a machine learning paradigm that leverages knowledge acquired from a context domain (also called the source domain) to enhance learning performance in a new target domain [30]. This approach aims to improve the learning process in the target domain by capitalizing on the knowledge gained from solving related tasks in the context domain. This technique has demonstrated its efficacy in fields where data scarcity is a common challenge, such as the medical field [31].

Formally, based on the definition in [30], we can define a domain $\mathcal{D}$ by a feature space $\mathcal{X}$ and a marginal probability distribution $p(x_r)$. Two domains are considered distinct if their feature spaces $\mathcal{X}_1, \mathcal{X}_2$ or marginal probability distributions $p(x_1), p(x_2)$ differ, i.e., if $\mathcal{X}_1 \neq \mathcal{X}_2$ or $p(x_1) \neq p(x_2)$. The core objective of transfer learning is to leverage the knowledge learned in a context domain $\mathcal{D}_{context}$ to improve learning in a target domain $\mathcal{D}_{target}$. This is typically achieved when the context and target domains differ, i.e., $\mathcal{D}_{context} \neq \mathcal{D}_{target}$.

Our work focuses on a scenario where the context domain $\mathcal{D}_{context}$ consists of data $x_g$ generated by a DGM. The target domain $\mathcal{D}_{target}$, on the other hand, consists of $x_r$. Our approach leverages the representational power learned by the DGM $p_\theta$ on $x_g$ to provide a strong starting point for learning in the target domain with real data $x_r$. This transfer of knowledge is achieved by initializing the model weights for the target domain with the weights learned from the DGM model trained on the generated data.

Transfer learning can be categorized into homogeneous and heterogeneous settings based on the feature spaces of the domains [32]. Homogeneous transfer learning applies when the context and target domains share the same feature space, $\mathcal{X}_1 = \mathcal{X}_2$, while heterogeneous transfer learning deals with scenarios where feature spaces differ, $\mathcal{X}_1 \neq \mathcal{X}_2$. This work focuses on homogeneous transfer learning, where the context domain serves as an augmented version of the target domain. The key difference between the domains in our case lies in the number of samples, leading to situations where the empirical distributions of the data differ, i.e., $p_\theta(x_g) \neq p(x_r)$.

Within homogeneous transfer learning, various methodologies exist to improve target task performance by capitalizing on knowledge from a related source domain. These techniques encompass instance-based [33], relational knowledge transfer [34], feature-based [35], and, as employed in this work, parameter-based [36] transfer through shared model parameters or hyperparameter distributions. This study leverages a two-stage parameter-based transfer learning approach. The first stage involves either pre-training or model averaging, followed by fine-tuning in the second stage. Subsequent sections will delve deeper into both pre-training and model averaging techniques. Upon completion of one of these initial phases, fine-tuning serves to refine the model parameters, ultimately achieving optimal adaptation for the target domain.

2.1.1 Pre-training

Figure 2: Block diagram for the pre-training case. The inductive bias is introduced by training a DGM $p_{\theta_{pt}}$ on a large collection of $x_g$ samples. The weights learned from this training process with abundant samples serve as the initial parameters $\hat{\theta}$ for the fine-tuning process using the real data $x_r$ to obtain $p_{\hat{\theta}}$.

Pre-training is a frequently adopted strategy for introducing an inductive bias into a model. By leveraging a pre-trained model on a context domain, the target model gains generalizable features that enhance its performance on a target domain. However, while pre-training is a standard in computer vision and natural language processing, achieving similar success with tabular data remains a challenge. This disparity arises from the inherent heterogeneity of the features of the tables, which creates substantial feature space shifts between pre-training and downstream datasets, hindering effective knowledge transfer. Despite these challenges, recent efforts like [37] and [38] explore tabular transfer learning with promising results. Although these studies demonstrate potential, achieving comprehensive parameter transfer in tabular data requires further research to establish best practices and unlock the full potential of pre-training in this domain.

In this work, pre-training involves the following steps. First, we train a separate DGM $p_{\theta_{pt}}$ using synthetic data $x_g$ as training data. Since $x_g$ is sampled from the initial DGM, $x_g \sim p_\theta$, we can generate a vast amount of synthetic data. This abundance circumvents the limitations associated with training in small datasets, such as overfitting. Then, the optimal weights $\theta_{pt}^{*}$ from DGM $p_{\theta_{pt}}$ are used as initial weights $\theta_0$ to train the generative model $p_{\hat{\theta}}$ (see Figure 1).

In essence, our approach aligns with the well-established concept of data augmentation. We generate synthetic data $x_g$, which may not perfectly capture the intricacies of the original data $x_r$. However, we use these synthetic data to train another DGM $p_{\theta_{pt}}$. Although $p_{\theta_{pt}}$ might generate lower quality synthetic samples, our objective is to exploit the information encoded within this DGM to establish an initial set of weights for the DGM that will eventually be trained on $x_r$. In other words, we exploit the knowledge of the generative model $p_{\theta_{pt}}$, the context domain, to obtain a better generative model $p_{\hat{\theta}}$, which is our target domain. Figure 2 provides a visual representation of this pre-training procedure.
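The pre-training procedure can be sketched on a toy problem. The example below is a minimal illustration under strong assumptions, not the VAE or CTGAN setup used in the paper: the "DGM" is a one-dimensional Gaussian with parameters (mu, log_sigma) trained by gradient descent on the negative log-likelihood, and every name, sample size, and step count is hypothetical. It only shows the flow of weights: train a first model on the small real set, sample abundant synthetic data from it, pre-train a second model on those samples, and fine-tune from the pre-trained weights on the real data.

```python
import math
import random

def nll(data, theta):
    """Mean negative log-likelihood of data under N(mu, sigma^2)."""
    mu, log_sigma = theta
    var = math.exp(2 * log_sigma)
    per_sample = sum(log_sigma + (x - mu) ** 2 / (2 * var) for x in data)
    return per_sample / len(data) + 0.5 * math.log(2 * math.pi)

def train(data, theta0, steps, lr=0.05):
    """Gradient descent on the mean NLL, starting from theta0 = (mu, log_sigma)."""
    mu, log_sigma = theta0
    n = len(data)
    for _ in range(steps):
        var = math.exp(2 * log_sigma)
        grad_mu = sum(-(x - mu) / var for x in data) / n
        grad_ls = sum(1 - (x - mu) ** 2 / var for x in data) / n
        mu, log_sigma = mu - lr * grad_mu, log_sigma - lr * grad_ls
    return mu, log_sigma

def sample(theta, n, rng):
    """Draw n samples from the toy generative model."""
    mu, log_sigma = theta
    return [rng.gauss(mu, math.exp(log_sigma)) for _ in range(n)]

rng = random.Random(0)
random_init = (0.0, 0.0)  # stands in for a random weight initialization
x_r = [rng.gauss(3.0, 1.0) for _ in range(20)]  # small "real" dataset

theta = train(x_r, random_init, steps=300)      # first model p_theta
x_g = sample(theta, 2000, rng)                  # abundant synthetic data
theta_pt = train(x_g, random_init, steps=300)   # pre-training on x_g

budget = 20                                     # short fine-tuning budget
theta_hat = train(x_r, theta_pt, steps=budget)  # warm start: theta_0 = theta_pt
theta_cold = train(x_r, random_init, steps=budget)  # baseline: random start

print("NLL with pre-trained init:", nll(x_r, theta_hat))
print("NLL with random init:    ", nll(x_r, theta_cold))
```

In this toy setting the benefit is limited to faster convergence under a fixed fine-tuning budget; it does not reproduce the divergence improvements reported for the paper's architectures.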

2.1.2 Model Averaging

Figure 3: Block diagram for the model averaging case. The inductive bias is introduced by training a DGM $p_\theta$ on $x_r$ using $S$ different seeds. The average of the weights learned from these training processes serves as the initial parameters $\hat{\theta}$ for the fine-tuning process using the real data $x_r$ to obtain $p_{\hat{\theta}}$.

The concept of model averaging emerged in the 1960s, primarily within the field of economics [39, 40]. Traditional empirical research often selects a single “best” model after searching a wide space of possibilities. However, this approach can underestimate the real uncertainty and lead to overly confident conclusions. Model averaging offers a compelling alternative. By combining multiple models, the resulting ensemble can outperform any individual model. This approach aligns with the core principles of statistical modeling: maximizing information use and balancing flexibility with overfitting. In essence, model averaging extends the concept of model selection by leveraging insights from all the models considered.

While pre-training can be incorporated with any DGM, model averaging focuses on models where the training process is sensitive to initial conditions, such as VAEs. In such cases, it is common to train the DGM $p_\theta$ with multiple initial conditions (seeds) and potentially discard “bad” seeds based on a specific metric. We propose using these discarded seeds to create an artificial inductive bias. The simplest implementation involves averaging the model parameters. In this case, our context domains are the different results of each seed, and the target domain is obtained by averaging across the context domains. If we train $S$ different seeds for $p_\theta$, resulting in $S$ models with parameters $\theta_s$, we propose using the average of these weights as the inductive bias:

$$\theta_0 = \frac{1}{S}\sum_{s=1}^{S}\theta_s \tag{2}$$

This straightforward approach is computationally efficient, requiring only the calculation of the average across the precomputed weights. It assumes that the average model may capture a robust inductive bias, leading to improved performance. Figure 3 summarizes this process.
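As a minimal sketch of Equation 2, the averaging step can be written as follows. The flat dictionary layout and the layer names are assumptions made for illustration; a real VAE would store tensors rather than lists.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical flat container: parameter name -> values


def average_parameters(seed_params: List[Params]) -> Params:
    """Element-wise average of S parameter sets: theta_0 = (1/S) * sum_s theta_s."""
    if not seed_params:
        raise ValueError("need at least one trained seed")
    num_seeds = len(seed_params)
    averaged: Params = {}
    for name in seed_params[0]:
        # Group the s-th copy of every coordinate together, then average.
        columns = zip(*(params[name] for params in seed_params))
        averaged[name] = [sum(values) / num_seeds for values in columns]
    return averaged


# Toy example with S = 3 seeds and two hypothetical layers.
seeds = [
    {"encoder.w": [1.0, 2.0], "decoder.w": [0.0]},
    {"encoder.w": [3.0, 4.0], "decoder.w": [3.0]},
    {"encoder.w": [5.0, 6.0], "decoder.w": [6.0]},
]
theta_0 = average_parameters(seeds)
```

The averaging requires all seeds to share the same architecture, which holds here because every seed trains the same DGM $p_\theta$.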

2.2 Meta-learning

Traditional machine learning models often rely on large volumes of data to achieve optimal performance in specific tasks. In contrast, meta-learning introduces a distinct paradigm by training algorithms with the ability to “learn to learn” [41], enabling them to rapidly adapt to new tasks with minimal data. This departure from the conventional requirement of extensive datasets for each new task allows meta-learning algorithms to leverage knowledge gained from addressing numerous related tasks. Through introspective analysis of past experiences, these models dynamically adjust their learning strategies when confronted with novel situations, making them more efficient learners and requiring less data to perform well on tasks with similar characteristics.

In this work, we exploit the multi-seed training configuration of certain DGMs. By treating each of the $S$ different seeds obtained after training the DGM as a distinct task, we construct a meta-learning framework.

2.2.1 MAML

Figure 4: Block diagram for the MAML case. The inductive bias is introduced by training a DGM $p_\theta$ on $x_r$ using $S$ different seeds. The synthetic dataset $x_g^s$ generated by each seed serves as a task for MAML. The starting point to fine-tune using the real data $x_r$ and obtain $p_{\hat{\theta}}$ is the MAML solution obtained, $\theta_{MAML}$.

MAML is a prevalent approach within the field of meta-learning [42]. It identifies the initial set of weights denoted by $\theta_{MAML}$ by leveraging various tasks, enabling rapid and data-efficient adaptation to new tasks. This efficiency comes from the ability to fine-tune $\theta_{MAML}$ with minimal data for each new task. However, successful application of MAML requires access to a diverse collection of tasks for effective learning.

Formally, we can frame the problem by starting with a common single-task learning scenario and transforming it into the meta-learning framework. Consider a task $\mathcal{T}$ that consists of an input $x$ sampled from a probability distribution $\mathcal{D}$. For simplicity, we define a task instance $\mathcal{T}$ as a tuple comprising a dataset $\mathcal{D}$ and its corresponding loss function $\mathcal{L}$. To solve the task $\mathcal{T}$, we need to obtain an optimal model parameterized by a task-specific parameter $\omega^{*}$, which minimizes a loss function $\mathcal{L}$ on the data of the task as follows:

$$\omega^{*} = \arg\min_{\omega} \underset{x\sim\mathcal{D}}{\mathbb{E}}\big[\mathcal{L}(\mathcal{D};\omega)\big]. \tag{3}$$

In single-task learning, hyperparameter optimization is achieved by splitting the dataset $\mathcal{D}$ into two disjoint subsets $\mathcal{D} = \mathcal{D}^{(t)} \cup \mathcal{D}^{(v)}$, which are the training and validation sets, respectively. The meta-learning setting aims to develop a general-purpose learning algorithm that excels across a distribution of tasks represented by $p(\mathcal{T})$ [43]. The objective is to use training tasks to train a meta-learning model $\theta_{MAML}$ that can be fine-tuned to obtain $\omega$ to perform well on unseen tasks sampled from the same task environment $p(\mathcal{T})$. Meta-learning methods utilize meta-parameters to model the common latent structure of the task distribution $p(\mathcal{T})$. Therefore, we consider meta-learning as an extension of hyperparameter optimization, where the hyperparameter of interest (often called a meta-parameter) is shared across many tasks.

In this work, the distribution of tasks is defined by the set of $S$ training seeds obtained after training the DGM. Given a set of $S$ training seeds following $p(\mathcal{T})$, each task $\mathcal{T} \sim p(\mathcal{T})$ is therefore formalized as $\mathcal{T} = \{\mathcal{D}, \mathcal{L}\}$. Here, each dataset $\mathcal{D}$ consists of synthetic data points $x_g^s$ drawn from the model for the different training seeds. The loss function $\mathcal{L}$ corresponds to the DGM loss function, and its specific form depends on the chosen DGM. If the chosen DGM is a VAE, the loss function $\mathcal{L}$ would be the negative of the Evidence Lower BOund (ELBO) [44]. In contrast, if a GAN is used, the loss function $\mathcal{L}$ would be the minimax loss function arising from the interplay between the generator and discriminator networks [45]. It is important to note that both VAEs and GANs use two neural networks within their architecture, unlike the single-network architectures commonly found in state-of-the-art applications [46, 47].

Solving this problem using the MAML approach requires access to a collection of $B$ tasks sampled from $p(\mathcal{T})$. We denote this set of tasks $\mathcal{T}_b$ used for training as $\mathcal{D}_b = \{(\mathcal{D}_b^{(t)}, \mathcal{D}_b^{(v)})\}_{b=1}^{B}$, where each task $b$ has dedicated meta-training and meta-validation data, respectively. The goal of meta-training is to find the optimal $\omega^{*}_b$ for each task $b$ given $\theta_{MAML}$. This $\theta_{MAML}$ essentially captures the ability to learn effectively from new data. In this context, the task-related parameter $\omega_b$ denotes the parameters of the two different networks that comprise the VAE, i.e., the encoder's and decoder's task-specific parameters. After meta-training, the learned $\omega^{*}_b$ is used to guide the training of a base model $\theta_{MAML}$. This procedure is called meta-testing. This essentially means that the model leverages the knowledge gained from previous tasks to improve the efficiency of learning on new tasks. This can be viewed as a bi-level optimization problem [48]:

$$\begin{split}
\min_{\theta_{MAML}} \underset{\mathcal{T}_b \sim p(\mathcal{T})}{\mathbb{E}}\left[\underset{x_{g_b}^{(v)} \sim \mathcal{D}^{(v)}_b}{\mathbb{E}}\big[\mathcal{L}_b(\mathcal{D}^{(v)}_b; \omega^{*}_b(\theta_{MAML}))\big]\right]\\
\textrm{s.t: } \omega^{*}_b(\theta_{MAML}) = \arg\min_{\omega_b} \underset{x_{g_b}^{(t)} \sim \mathcal{D}^{(t)}_b}{\mathbb{E}}\big[\mathcal{L}_b(\mathcal{D}^{(t)}_b; \omega_b(\theta_{MAML}))\big].
\end{split} \tag{4}$$

This equation essentially minimizes the expected loss across all tasks on the meta-validation sets, subject to the constraint that for each task, the task-specific parameter $\omega$ is optimized on the corresponding meta-training data.

Since in our work we update the parameters using gradient descent, we can reformulate Equation 4 as follows:

$$\omega_b \leftarrow \theta - \alpha \nabla_{\omega_b} \mathcal{L}_b(\mathcal{D}^{(t)}_b; \omega_b) \tag{5}$$
$$\theta_{MAML} \leftarrow \theta_{MAML} - \gamma \nabla_{\theta_{MAML}} \sum_{b=1}^{B} \mathcal{L}_b(\mathcal{D}^{(v)}_b; \theta_{MAML}). \tag{6}$$

Here, $\alpha$ and $\gamma$ represent the learning rates for the inner and outer loops, respectively. The inner loop updates the task-specific parameters $\omega$ for each task $b$ using the gradient of the loss function $\mathcal{L}_b$ on the meta-training data. The outer loop updates the meta-parameters $\theta_{MAML}$ based on the accumulated meta-validation loss across all tasks.
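The two loops of Equations 5 and 6 can be sketched on a toy problem. The sketch assumes a scalar parameter and a quadratic per-task loss $\mathcal{L}_b(\mathcal{D};\omega) = \frac{1}{2}\,\mathrm{mean}_{x\in\mathcal{D}}(\omega - x)^2$ in place of the DGM loss, so that all gradients are available in closed form. It also assumes that the validation loss in the outer loop is evaluated at the adapted parameters $\omega_b(\theta)$, as in the bi-level form of Equation 4 and in standard MAML [42]. The function names and the toy tasks are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

Task = Tuple[List[float], List[float]]  # (meta-training data, meta-validation data)


def inner_update(theta: float, train: List[float], alpha: float) -> float:
    """Inner loop: one gradient step on the meta-training data of a task."""
    grad = theta - mean(train)  # gradient of 0.5 * mean((theta - x)^2)
    return theta - alpha * grad


def maml(tasks: List[Task], alpha: float = 0.1, gamma: float = 0.1,
         steps: int = 500, theta: float = 0.0) -> float:
    """Outer loop: gradient descent on the summed meta-validation loss."""
    for _ in range(steps):
        meta_grad = 0.0
        for train, valid in tasks:
            omega_b = inner_update(theta, train, alpha)
            # Chain rule through the inner step: d(omega_b)/d(theta) = 1 - alpha.
            meta_grad += (omega_b - mean(valid)) * (1.0 - alpha)
        theta -= gamma * meta_grad
    return theta


# Three toy tasks whose training and validation means coincide (1.0, 2.0, 3.0).
tasks = [
    ([0.5, 1.5], [1.0]),
    ([1.5, 2.5], [2.0]),
    ([2.5, 3.5], [3.0]),
]
theta_maml = maml(tasks)
```

For a real DGM the inner gradient and the derivative through the inner step would come from automatic differentiation rather than from closed-form expressions, which is where the computational cost of MAML arises.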

Figure 4 illustrates the integration of the MAML procedure within the framework of our proposed methodology. In this context, the task space denoted by $p(\mathcal{T})$ corresponds to the $S$ seeds obtained during the training process. Essentially, the task space encompasses the different probability distributions $p_{\theta_s}$ associated with each training seed. Ultimately, the meta-training steps lead to the identification of the desired parameters, denoted by $\theta_{MAML}$. Note that $\theta_{MAML}$ represents a set of parameters that adapt fast to new data; in our case, this means that the DGM initial parameters are chosen so that they adapt fast to generate real data.

2.2.2 DRS

Figure 5: Block diagram for the DRS case. The inductive bias is introduced by training a DGM $p_\theta$ on $x_r$ using $S$ different seeds. The synthetic dataset $x_g$ contains data generated by each seed and serves as input to DRS. The starting point to fine-tune using the real data $x_r$ and obtain $p_{\hat{\theta}}$ is the DRS solution obtained, $\theta_{DRS}$.

Although MAML offers the potential to leverage the underlying structure of learning problems through a powerful optimization framework, it introduces a significant computational cost. Therefore, while we should seek a trade-off between accuracy and computational efficiency, there is no general approach for managing this trade-off: it requires an understanding of the domain-specific characteristics inherent to the meta-problem itself.

DRS presents an alternative meta-learning approach that circumvents the computational burden associated with bilevel optimization problems. Unlike MAML, DRS trains a model on the combined data from all tasks. This eliminates the need for the complex optimization procedures present in MAML, leading to a more computationally efficient solution. However, it is important to acknowledge that DRS offers an approximation to the ideal solution [49].

Formally, DRS focuses on the meta-information, denoted by $\theta_{meta}$, as the initialization of an iterative optimizer used in a new meta-testing task, $\mathcal{T}_S$. In this context of meta-learning initialization, a straightforward alternative involves solving the following pseudo-meta problem:

$$\theta_{DRS} = \arg\min_{\omega} \underset{\mathcal{T}_S \sim p(\mathcal{T})}{\mathbb{E}} \mathcal{L}(\mathcal{D}^{*}; \omega). \tag{7}$$

In this context, $\mathcal{D}^{*}$ represents the aggregated synthetic data collection, $x_g$, obtained across all $S$ training seeds. We refer to this approach as Domain-Randomized Search due to its alignment with the domain randomization method presented in [50] and its core principle of directly searching over a distribution of domains (tasks).
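A minimal sketch of the pseudo-meta problem in Equation 7 follows. It assumes a scalar parameter and a squared-error loss as stand-ins for the DGM and its loss; the function name, learning rate, and toy datasets are hypothetical. The point is that DRS reduces to ordinary single-level training on the pooled data, with no inner loop.

```python
from typing import List


def drs_initialization(seed_datasets: List[List[float]], lr: float = 0.1,
                       steps: int = 200, omega: float = 0.0) -> float:
    """Single-level gradient descent on the pooled synthetic data from all seeds."""
    # D* is the concatenation of the synthetic datasets generated by every seed.
    pooled = [x for dataset in seed_datasets for x in dataset]
    for _ in range(steps):
        # Gradient of 0.5 * mean((omega - x)^2) over the pooled data.
        grad = sum(omega - x for x in pooled) / len(pooled)
        omega -= lr * grad
    return omega


# Toy synthetic datasets from S = 3 seeds (sizes may differ across seeds).
x_g_per_seed = [[1.0, 1.0], [2.0], [4.0, 4.0, 6.0]]
theta_drs = drs_initialization(x_g_per_seed)
```

In this toy case the solution is simply the mean of the pooled data, so seeds that contribute more samples weigh more heavily in the initialization.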

Figure 5 shows the application of the DRS procedure within the framework of our proposed methodology. Similar to the MAML case, $\theta_{DRS}$ serves as the initialization weights $\theta_0$ that we aim to identify.

Both MAML and DRS offer complementary approaches with a trade-off between modeling complexity and optimization cost [49]. DRS delivers an approximate solution with lower computational demands, while MAML offers higher precision at the expense of greater computational resources. DRS is also advantageous when dealing with a limited number of learning tasks. In our case, where data generated by each seed ($s = 1, 2, \ldots, S$) is considered a task, and $S$ typically takes values around $10$, DRS is expected to provide better solutions than MAML, aligning with the findings of [49]. Finally, note that DRS is similar to the pre-training approach. While both techniques aim to improve model performance, they utilize data differently. Pre-training leverages data from the single best VAE seed, whereas DRS capitalizes on data from all VAE seeds. This distinction reflects the core principle of DRS: exploring a wider range of possibilities by searching across a distribution of domains (tasks) represented by the various seed variations.

3 Experiments

3.1 Data

The experiments were carried out on four public datasets obtained from the SDV environment [51], which also implements various data generation models, including the CTGAN implementation we use. The experiment design prioritized datasets with a sufficient number of samples. This allows us to create multiple data splits for various configurations of the different training and validation parameters. This approach comprehensively evaluates the proposed method under different parameter settings.

  • Adult: The Adult Census Income dataset [52] is a mixed-data dataset extracted from the 1994 U.S. Census. It comprises 32,561 data points, each described by 14 features that encompass integer, categorical, and binary values. The dataset is used to predict whether an individual’s annual income exceeds $50,000. It should be noted that the dataset contains 13% missing values, concentrated exclusively within two specific variables: “workclass” and “occupation”.

  • News: The News Popularity Prediction dataset [53] consists of 39,644 samples and 58 columns containing information about articles published on the Mashable news blog over two years. The objective is to predict the popularity of an article, measured by the number of social media shares. The dataset is multivariate, including both continuous and categorical variables.

  • King (https://www.kaggle.com/datasets/harlfoxem/housesalesprediction): The King County House Sales dataset is a regression dataset containing 21,613 house sale prices for King County, Washington, including Seattle. It includes homes sold between May 2014 and May 2015. The 20 features provide multivariate information in numerical (integer and decimal) and categorical forms.

  • Intrusion: The KDD Cup 1999 Data [54] was used for the Third Knowledge Discovery and Data Mining Competition. It comprises 494,021 samples and 39 features used to classify connections in a military network environment. The dataset contains categorical, integer, and binary features.

This selection of datasets with varying sample sizes, feature dimensions, and domains allows us to assess the performance and generalizability of the proposed method in diverse data scenarios. As will be demonstrated in the following sections, the proposed method is effective in generating synthetic data that retain the key characteristics of the original datasets across a wide range of applications.

3.2 Validation metrics

To rigorously evaluate the effectiveness of our proposed method in capturing the real data distribution, we strictly adhere to the validation approach described in [20]. This approach leverages a probabilistic classifier (discriminator) to estimate the ratio of probability densities between the real and synthetic distributions, subsequently calculating the KL and JS divergences. Traditional validation methods often focus on individual data points and the marginal distribution of each separate feature. In contrast, this approach considers the entire data distribution, including complex relationships between features. Additionally, divergences are robust to noise and offer clear interpretations, making them well suited for evaluating the DGM’s effectiveness. They provide a comprehensive measure of the discrepancy between two probability distributions, which makes them suitable for assessing the similarity between the real data distribution $p(x_r)$ and the distribution of the synthetic data generated by the DGM $p_\theta$.

The discriminator network plays a crucial role in the validation process. This neural network architecture is trained to distinguish between real and synthetic data samples. The network receives two sets of samples as input. The first consists of $M$ samples from the real data distribution $p(x_r)$ labeled as class $1$, and the second consists of $M$ samples labeled as class $0$ from the synthetic data distribution generated by the DGM $p_\theta$ or $p_{\hat{\theta}}$, depending on the number of samples $N$ in the dataset. During training, the discriminator aims to learn a decision boundary that effectively separates these two sets of samples. This process forces the discriminator to capture the underlying differences between the real and synthetic distributions. Once the discriminator network is trained, it is used to estimate the KL and JS divergences between the real and synthetic probability distributions. This estimation involves using $L$ samples from each distribution and feeding them to the trained discriminator. The output probabilities of the discriminator for these samples are then used to compute the KL and JS divergence metrics.
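The last step can be sketched with the standard density-ratio identities for a balanced classifier, where $D(x) \approx p_{real}(x) / (p_{real}(x) + p_{synth}(x))$. This is an assumption about the estimator: the exact procedure of [20] may differ in details such as clipping, the direction of the KL divergence, or the logarithm base. Base 2 is used for JS here so that the value lies in $[0, 1]$, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def _clip(p: float, eps: float) -> float:
    """Keep probabilities away from 0 and 1 so the logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def estimate_divergences(d_real: List[float], d_synth: List[float],
                         eps: float = 1e-7) -> Tuple[float, float]:
    """Estimate (KL, JS) from discriminator outputs P(real | x).

    d_real:  outputs on L samples drawn from the real distribution.
    d_synth: outputs on L samples drawn from the synthetic distribution.
    """
    d_real = [_clip(p, eps) for p in d_real]
    d_synth = [_clip(p, eps) for p in d_synth]
    # KL(real || synth) = E_real[log(p_real / p_synth)] = E_real[log(D / (1 - D))], in nats.
    kl = sum(math.log(p / (1.0 - p)) for p in d_real) / len(d_real)
    # JS = 0.5 * E_real[log2(2 D)] + 0.5 * E_synth[log2(2 (1 - D))], in bits.
    js_real = sum(math.log2(2.0 * p) for p in d_real) / len(d_real)
    js_synth = sum(math.log2(2.0 * (1.0 - p)) for p in d_synth) / len(d_synth)
    return kl, 0.5 * js_real + 0.5 * js_synth


# A discriminator that cannot tell the two sets apart outputs 0.5 everywhere.
kl_same, js_same = estimate_divergences([0.5] * 4, [0.5] * 4)
# A discriminator that separates them perfectly drives JS to its upper bound.
kl_far, js_far = estimate_divergences([1.0] * 4, [0.0] * 4)
```

Both estimates are averages over the $L$ evaluation samples, so a small $L$ or a poorly trained discriminator (small $M$) directly degrades them.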

Figure 6 illustrates the overall scheme of the approach. The figure emphasizes the separate but related components: the inductive bias generator for synthetic data creation and the validation process with the discriminator network. It also highlights the number of samples used from each distribution for each step of the process ($N$ to generate samples, $M$ to train the discriminator, and $L$ to estimate divergence). By adhering to this rigorous validation approach, we ensure that our proposed methodology for generating synthetic data from small datasets is thoroughly evaluated and its effectiveness is quantitatively demonstrated using established metrics like KL and JS divergences. Note that, as stated in [20], $M$ and $L$ need to be large enough to prevent inaccurate divergence estimations, so we consider this in our experiments.

Figure 6: General scheme of the proposed approach and its validation process from [20]. The latter consists of a discriminator and a divergence estimator. The number of samples used from each distribution for each step is also highlighted: $N$ to generate samples, $M$ to train the discriminator, and $L$ to estimate the divergences.

3.3 Experimental design

In terms of the experimental design to evaluate the proposed method, and as described in the methodology section, we use the state-of-the-art VAE architecture [15] to train 10 different seeds. Subsequently, we apply methodologies based on transfer learning and meta-learning. For pre-training, we also include results using another state-of-the-art model, CTGAN. It is important to note that while we maintain the default parameters for CTGAN, we adjust the dimensionality of the latent space in the VAE depending on the dataset. This allows the VAE to capture the specific characteristics of each dataset more effectively. In particular, we use a latent space dimension of $10$ for the Adult and Intrusion datasets, $20$ for the News dataset, and $15$ for the King dataset. We maintain a consistent hidden size of $256$ neurons for all VAE models. Regarding the configuration of parameters $M$ and $L$, we define two different validation configurations for each methodology: a reliable case and a more realistic case. The parameter $N$ remains unchanged, as it reflects the actual number of samples available to train the DGM, which is beyond our control. However, we can vary the number of generated samples for validation, i.e., $M$ and $L$. Specifically, the results presented for each dataset are as follows.

  • “Big data”: First, we present the optimistic scenario with a sufficient $N$ of $10,000$ samples, where no methodology is needed to calculate the inductive bias. This provides results of the divergences that serve as an “upper bound” or reference for the best possible outcome. For $M$ and $L$, we maintain high values of the validation samples, $7500$ and $1000$, respectively.

  • “Low data”: Next, we show results for a realistic scenario with few samples ($N=300$) without applying our methodology. This allows us to quantify the gain using the methodology and determine its benefits. For this case, we use two configurations for parameters $M$ and $L$: [$M=7500$, $L=1000$] and [$M=100$, $L=100$]. The second configuration is more realistic for few-data scenarios. When limited training data are available, there is also a limitation on the amount of data that can be effectively used for validation. The first configuration, with much larger values for $M$ and $L$, serves as a rigorous evaluation of the impact of our methodology. However, it is acknowledged that the use of small values for $M$ and $L$ can lead to unreliable metric estimations [20].

  • “Pre-train”: In this case, we begin to apply the proposed methodology using the pre-training technique. Results are presented for both CTGAN and VAE. The parameter configurations chosen are: [$N=300$, $M=7500$, $L=1000$] and [$N=300$, $M=100$, $L=100$].

  • “AVG”, “MAML”, “DRS”: These scenarios apply the model average (“AVG”) and the meta-learning techniques (“MAML”, “DRS”). We solely utilize the VAE architecture for multiple training runs. The following parameter configurations will be presented: [$N=300$, $M=7500$, $L=1000$] and [$N=300$, $M=100$, $L=100$].

This setup aims to thoroughly evaluate the performance and robustness of the proposed methodologies in various data availability scenarios and parameter configurations.
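The scenario grid described in the list above can be written down compactly. The following is a minimal sketch, not the authors' code: the `Scenario` class, its field names, and the `build_scenarios` helper are hypothetical names introduced here for illustration, while the values of $N$, $M$, $L$ and the model assignments are taken from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One experimental configuration (hypothetical container, for illustration)."""
    name: str
    n_train: int    # N: number of real samples available for training
    m: int          # M: first validation sample count used by the divergence estimator
    l: int          # L: second validation sample count used by the divergence estimator
    models: tuple   # generative models evaluated in this scenario


RELIABLE = (7500, 1000)   # [M, L] configuration with many validation samples
REALISTIC = (100, 100)    # [M, L] configuration with few validation samples


def build_scenarios():
    """Enumerate the scenarios described in the text."""
    both = ("VAE", "CTGAN")
    scenarios = [Scenario("Big data", 10_000, *RELIABLE, both)]
    for m, l in (RELIABLE, REALISTIC):
        scenarios.append(Scenario("Low data", 300, m, l, both))
        scenarios.append(Scenario("Pre-train", 300, m, l, both))
        # Model averaging and the meta-learning techniques are only run with the VAE.
        for name in ("AVG", "MAML", "DRS"):
            scenarios.append(Scenario(name, 300, m, l, ("VAE",)))
    return scenarios


if __name__ == "__main__":
    for s in build_scenarios():
        print(f"{s.name:<10} N={s.n_train:<6} M={s.m:<5} L={s.l:<5} models={s.models}")
```

Enumerating the grid this way makes it easy to check that every technique is evaluated under both validation configurations.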

3.4 Results

In this section, we present the results obtained from the experiments, focusing on scenarios that promote the validation of reliable synthetic data, characterized by higher values of $M$ and $L$. The results for scenarios with $M = 100$ and $L = 100$, considered unreliable due to their low information content, are presented in the Appendix for comparison purposes. For each dataset, we present a table summarizing the scenarios defined previously and their respective KL and JS divergence values. The results for each metric are displayed in the following format: mean (std) (lower is better). The code to replicate our results, along with the data used, can be found in our repository.

| Scenario | N | VAE JS | CTGAN JS | VAE KL | CTGAN KL |
|---|---|---|---|---|---|
| Big data | 10000 | 0.079 (0.001) | 0.150 (0.002) | 0.153 (0.019) | 0.420 (0.025) |
| Low data | 300 | 0.331 (0.004) | 0.563 (0.002) | 0.697 (0.018) | 1.653 (0.015) |
| Pre-train | 300 | 0.171 (0.004) | 0.563 (0.002) | 0.427 (0.021) | 1.753 (0.040) |
| AVG | 300 | 0.157 (0.004) | N/A | 0.380 (0.043) | N/A |
| MAML | 300 | 0.300 (0.002) | N/A | 0.686 (0.037) | N/A |
| DRS | 300 | 0.189 (0.006) | N/A | 0.427 (0.043) | N/A |

Table 1: Adult dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N = 10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N = 300$) are available, posing a challenge for synthetic data generation. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Values in bold indicate an improvement due to the technique used. Results are represented as mean (std). Lower is better.

Table 1 shows the validation results in terms of divergence obtained for the Adult dataset. We focus primarily on the JS divergence due to its interpretability as a bounded metric (ranging from 0 to 1). The table shows the upper and lower bounds used to assess the efficacy of the proposed methodology in the reliable case of higher validation samples ($M = 7500$ and $L = 1000$). These bounds are $0.079$ (upper, Big data) and $0.331$ (lower, Low data), highlighting a significant gap and room for improvement in the base VAE model (without any techniques applied). Examining the JS divergence results for the different proposed techniques, a consistent decrease in divergence is observed. The smallest improvement is obtained for MAML ($0.300$) and the largest for AVG ($0.157$). This implies that improvement is always present and, in the best cases, substantial in terms of JS for VAE. A similar pattern is observed for KL divergence in VAE: the transfer learning techniques obtain the best divergence results, but improvements are always achieved. However, for the CTGAN model (where only pre-training results are available), we see no significant improvement in either JS or KL divergence.
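To illustrate why the JS divergence is easier to interpret than the KL divergence, the following minimal sketch computes both for discrete distributions with known probabilities. It is only an illustration of the definitions: the results in this paper are estimated from samples, not computed in closed form, and the use of base-2 logarithms (which yields the 0 to 1 range for JS) is an assumption of this sketch. The function names are hypothetical.

```python
import numpy as np


def kl_divergence(p, q, base=2.0):
    """KL(p || q) for discrete distributions given as probability vectors.

    Assumes q > 0 wherever p > 0; terms with p == 0 contribute zero.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base))


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mixture = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, mixture, base) + 0.5 * kl_divergence(q, mixture, base)


if __name__ == "__main__":
    same = js_divergence([0.5, 0.5], [0.5, 0.5])       # identical distributions
    disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # non-overlapping supports
    print(f"JS(identical) = {same:.3f}, JS(disjoint) = {disjoint:.3f}")
    # KL has no upper bound: it grows as q assigns less mass where p has mass.
    print(f"KL = {kl_divergence([0.5, 0.5], [0.999, 0.001]):.3f}")
```

With this convention, identical distributions give a JS divergence of 0 and distributions with disjoint supports give 1, whereas the KL divergence can grow without bound.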

Table 2 summarizes the results obtained for the News dataset. The values demonstrate the effectiveness of the proposed techniques in improving divergence metrics compared to the established lower bounds. Notably, the VAE model shows improvement in divergence metrics with most methodologies, the exception being MAML. We hypothesize that MAML might require a larger number of tasks to achieve comparable performance. Consistent with the findings for the previous dataset, model averaging emerges as the methodology that generally achieves the best results. For CTGAN, pre-training improves the JS divergence. The KL divergence results for pre-trained CTGAN show no significant change with respect to the lower bound, as they remain within the established confidence intervals.

| Scenario | N | VAE JS | CTGAN JS | VAE KL | CTGAN KL |
|---|---|---|---|---|---|
| Big data | 10000 | 0.253 (0.009) | 0.463 (0.003) | 0.647 (0.045) | 1.506 (0.031) |
| Low data | 300 | 0.840 (0.003) | 0.962 (0.002) | 4.582 (0.136) | 8.994 (0.909) |
| Pre-train | 300 | 0.746 (0.003) | 0.937 (0.003) | 3.516 (0.082) | 8.603 (0.463) |
| AVG | 300 | 0.609 (0.003) | N/A | 2.596 (0.060) | N/A |
| MAML | 300 | 0.851 (0.001) | N/A | 5.176 (0.242) | N/A |
| DRS | 300 | 0.645 (0.006) | N/A | 2.449 (0.057) | N/A |

Table 2: News dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N = 10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N = 300$) are available, posing a challenge for synthetic data generation. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Values in bold indicate an improvement due to the technique used. Results are represented as mean (std). Lower is better.

Table 3 further reinforces the efficacy of the proposed methodologies on the King dataset. The VAE model consistently improves on the lower bounds established for the JS and KL divergences across all techniques. For CTGAN, the results for the JS and KL divergences are consistent with the lower bounds. This suggests that while pre-training does not produce significant reductions in divergence metrics for CTGAN, it appears to maintain the quality of the data distribution compared to the lower bounds. However, it should be noted that CTGAN exhibits consistently lower gains across datasets compared to the VAE. This may be attributed to inherent GAN framework instabilities.

| Scenario | N | VAE JS | CTGAN JS | VAE KL | CTGAN KL |
|---|---|---|---|---|---|
| Big data | 10000 | 0.862 (0.002) | 0.777 (0.003) | 4.768 (0.072) | 3.124 (0.115) |
| Low data | 300 | 0.927 (0.002) | 0.940 (0.003) | 13.763 (0.696) | 7.470 (0.392) |
| Pre-train | 300 | 0.862 (0.002) | 0.945 (0.002) | 5.286 (0.327) | 9.533 (0.453) |
| AVG | 300 | 0.740 (0.002) | N/A | 3.489 (0.209) | N/A |
| MAML | 300 | 0.910 (0.002) | N/A | 6.436 (0.496) | N/A |
| DRS | 300 | 0.809 (0.003) | N/A | 4.321 (0.215) | N/A |

Table 3: King dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N = 10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N = 300$) are available, posing a challenge for synthetic data generation. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Values in bold indicate an improvement due to the technique used. Results are represented as mean (std). Lower is better.

The results obtained using the Intrusion dataset are summarized in Table 4. These values align with the findings for the News dataset, demonstrating consistent improvements in divergence metrics for the reliable case ($M = 7500$ and $L = 1000$) across different methodologies. Similarly to the News dataset, Intrusion has a high number of features and therefore higher dimensionality. This increased dimensionality poses a greater challenge in generating synthetic data that closely resembles the real-world data distribution. Consequently, the divergence results for Intrusion either remain unchanged or exhibit smaller improvements compared to lower-dimensional datasets. Despite the challenges posed by the high dimensionality of Intrusion, the proposed methodologies still demonstrate their effectiveness in improving divergence metrics, particularly for reliable cases.

| Scenario | N | VAE JS | CTGAN JS | VAE KL | CTGAN KL |
|---|---|---|---|---|---|
| Big data | 10000 | 0.760 (0.013) | 0.531 (0.033) | 2.744 (0.084) | 2.623 (0.537) |
| Low data | 300 | 0.920 (0.003) | 0.961 (0.002) | 6.216 (0.154) | 8.841 (0.710) |
| Pre-train | 300 | 0.793 (0.004) | 0.959 (0.001) | 3.831 (0.151) | 8.443 (0.630) |
| AVG | 300 | 0.867 (0.007) | N/A | 5.798 (0.295) | N/A |
| MAML | 300 | 0.913 (0.003) | N/A | 6.359 (0.054) | N/A |
| DRS | 300 | 0.835 (0.009) | N/A | 4.587 (0.166) | N/A |

Table 4: Intrusion dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N = 10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N = 300$) are available, posing a challenge for synthetic data generation. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Values in bold indicate an improvement due to the technique used. Results are represented as mean (std). Lower is better.

Finally, Table 5 summarizes the results of applying the various methodologies to generate synthetic tabular data using the VAE model. Across methodologies, the results consistently show positive gains, except for MAML. These gains are computed as the difference between the metric value at the lower bound (the Low data scenario) and the value obtained with each proposed method. Positive gains indicate that the proposed techniques generate synthetic data that are closer to the real-world data distribution than when they are not used. This improvement is particularly evident for the model averaging methodology, which on average achieves the largest JS divergence reduction. MAML’s performance might be limited in this scenario due to its reliance on a large number of seeds. Since the results presented do not specify the number of seeds used, MAML might not have had enough to learn effectively [49]. The consistent improvement in divergence metrics highlights the robustness and generalizability of the proposed techniques. These findings suggest that our approach is an effective methodology for generating high-quality synthetic tabular data that can be used for various applications.

| Dataset | Pre-train JS gain | Pre-train KL gain | AVG JS gain | AVG KL gain | MAML JS gain | MAML KL gain | DRS JS gain | DRS KL gain |
|---|---|---|---|---|---|---|---|---|
| Adult | 0.159 (0.482) | 0.271 (0.388) | 0.173 (0.525) | 0.317 (0.455) | 0.030 (0.092) | 0.011 (0.016) | 0.142 (0.430) | 0.271 (0.388) |
| News | 0.093 (0.111) | 1.065 (0.233) | 0.230 (0.274) | 1.985 (0.433) | -0.011 (-0.013) | -0.594 (-0.130) | 0.194 (0.231) | 2.133 (0.465) |
| King | 0.064 (0.070) | 8.477 (0.616) | 0.187 (0.202) | 10.274 (0.746) | 0.017 (0.018) | 7.327 (0.532) | 0.118 (0.128) | 9.442 (0.686) |
| Intrusion | 0.127 (0.138) | 2.385 (0.384) | 0.053 (0.057) | 0.419 (0.067) | 0.006 (0.007) | -0.143 (-0.023) | 0.085 (0.092) | 1.629 (0.262) |
| Average | 0.111 (0.200) | 3.050 (0.405) | 0.161 (0.265) | 3.249 (0.425) | 0.011 (0.026) | 1.650 (0.099) | 0.135 (0.220) | 3.369 (0.450) |

Table 5: Gains using the proposed methodology for the VAE. Gains are represented in the following format: absolute gain (relative gain). The methodology achieves relative gains of up to 50% in JS divergence, which is bounded, and up to 75% in KL divergence. Bold values indicate positive gain. Higher is better.
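The gain computation can be sketched as follows. The absolute gain is the difference stated in the text; the relative gain is assumed here to be the absolute gain divided by the Low data value, an assumption that is consistent with the figures reported in Table 5 up to rounding. The function name `divergence_gain` is hypothetical.

```python
def divergence_gain(low_data_value, method_value):
    """Return (absolute gain, relative gain) of a method over the Low data baseline.

    Divergences are lower-is-better, so a positive gain means the method
    reduced the divergence with respect to the baseline.
    """
    if low_data_value == 0:
        raise ValueError("baseline divergence must be non-zero")
    absolute = low_data_value - method_value
    relative = absolute / low_data_value   # assumed normalization, see lead-in
    return absolute, relative


if __name__ == "__main__":
    # Mean values taken from Table 1 (Adult, VAE KL): Low data vs. Pre-train.
    abs_gain, rel_gain = divergence_gain(0.697, 0.427)
    print(f"Adult VAE KL, Pre-train: {abs_gain:.3f} ({rel_gain:.3f})")

    # Mean values taken from Table 2 (News, VAE JS): Low data vs. MAML.
    abs_gain, rel_gain = divergence_gain(0.840, 0.851)
    print(f"News VAE JS, MAML: {abs_gain:.3f} ({rel_gain:.3f})")
```

Because the inputs above are the rounded means from the tables, the computed gains may differ from the reported ones in the last digit.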

It is important to acknowledge the varying computational demands of the compared methods. Model averaging emerges as the most efficient approach, as inductive bias generation only involves calculating a mean, resulting in minimal computational overhead. In contrast, MAML exhibits the highest computational load due to its intricate optimization procedure. Pre-training and DRS fall between these two extremes, both requiring the training of a DGM to establish the inductive bias. Considering these findings alongside the results presented in Table 5, we recommend against using MAML. It offers minimal performance gains with significant computational cost. The other methods, on the other hand, provide a more favorable trade-off between computational efficiency and performance. Furthermore, as detailed in the Appendix, reliably quantifying the benefits of our methodology in a realistic, limited-data setting is challenging. This implies that validation with a large number of samples is necessary to definitively assess which of our proposed inductive bias techniques yields superior results for a specific dataset. However, the experimental results strongly suggest potential gains that warrant further exploration.
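To make the cost argument for model averaging concrete, the following minimal sketch averages the parameters of several models. It assumes, for illustration only, that each trained model is represented as a dictionary of NumPy arrays with identical keys and shapes, and that the mean is taken element-wise over models trained with different seeds; the `average_parameters` helper and the parameter names are hypothetical and do not reproduce the authors' implementation.

```python
import numpy as np


def average_parameters(models):
    """Element-wise mean of the parameters of models sharing one architecture.

    `models` is a list of dicts mapping parameter names to NumPy arrays
    (an assumed representation, chosen for illustration).
    """
    if not models:
        raise ValueError("at least one model is required")
    keys = models[0].keys()
    for params in models[1:]:
        if params.keys() != keys:
            raise ValueError("all models must have the same parameter names")
    return {name: np.mean([params[name] for params in models], axis=0) for name in keys}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for models trained with different seeds (random values, not real weights).
    models = [
        {"encoder.weight": rng.normal(size=(4, 3)), "encoder.bias": rng.normal(size=4)}
        for _ in range(5)
    ]
    averaged = average_parameters(models)
    print({name: value.shape for name, value in averaged.items()})
```

The only operation beyond the individual training runs is a single pass over the parameters, which is why the overhead of this step is small compared with the optimization loops of the other techniques.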

4 Conclusions

This research proposed a novel approach to generate synthetic tabular data using DGMs in the context of limited datasets. Our approach leverages four distinct techniques to artificially introduce an inductive bias that guides the DGM towards generating more realistic and informative synthetic data samples. These techniques encompass two transfer learning approaches: pre-training and model averaging; and two meta-learning approaches: MAML and DRS. To facilitate the application of model averaging, MAML, and DRS, we employ the VAE model from [15] and train multiple instances with different random seeds. This allows us to leverage the ensemble properties of the VAE models for techniques like model averaging and further enables the application of meta-learning algorithms like MAML and DRS. We also used CTGAN [7] to assess pre-training, so that the comparison includes another well-known architecture for synthetic tabular data generation. We used divergence metrics, in particular JS and KL divergences, to compare the real and synthetic data distributions.

The experimental results consistently demonstrate the effectiveness of our proposed approach in generating high-quality synthetic tabular data, particularly when using transfer learning techniques. These techniques significantly improve the divergence metrics, indicating a closer resemblance between the synthetic and real data distributions. Our approach offers several advantages over existing methods. Firstly, it effectively addresses the challenge of generating realistic synthetic data from small datasets, a common limitation in many real-world applications. Secondly, the use of transfer learning and meta-learning techniques enhances the inductive bias of the DGM, leading to more meaningful and informative synthetic data samples.

However, it is also important to acknowledge the trade-offs associated with our methodology. Applying these techniques requires training multiple VAE models with different random seeds. This can lead to a significant increase in computational cost compared to simpler DGM training methods. While divergence metrics provide a valuable measure of distributional similarity, their ability to reliably assess the improvement in synthetic data quality for specific downstream tasks can be limited, especially with small datasets, as detailed in the Appendix. Nonetheless, our methodology may provide significant gains in JS divergence, of up to 50% according to our experimental results.

In conclusion, our proposed approach provides a promising solution for generating high-quality synthetic tabular data from small datasets, particularly when VAEs are employed to apply transfer learning techniques. We believe that this work has the potential to make a significant contribution to the field of synthetic data generation and machine learning applications that rely on small datasets. However, several research lines remain to be addressed. While the current study focuses on VAEs and GANs, investigating the applicability of our framework to other DGM architectures could provide valuable insights. In addition, our current approach does not explicitly incorporate domain knowledge. Future research could explore mechanisms to integrate domain-specific information from an expert into the inductive bias generation process, potentially leading to even more realistic and informative synthetic data. Lastly, although divergence metrics offer a valuable measure of distributional similarity, exploring additional evaluation techniques that assess the quality and usefulness of synthetic data for specific downstream tasks would provide a more comprehensive understanding of the effectiveness of our methodology. Such techniques should also remain valid when little data is available for validation, which is a current limitation of this work.

Acknowledgements

This research was supported by the GenoMed4All and SYNTHEMA projects. Both have received funding from the European Union’s Horizon 2020 research and innovation program under grant agreements No 101017549 and 101095530, respectively. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Declarations

The authors have no relevant financial or non-financial interests to disclose.

Appendix A Appendix

To further investigate the impact of sample size on divergence metric performance, we conducted an additional validation experiment using a reduced number of samples for both $M$ and $L$. Specifically, we set $M = 100$ and $L = 100$, representing a scenario with limited data availability. As noted in [20], having enough samples is crucial for accurate distribution comparisons. Therefore, we anticipated that this low sample count scenario would yield less reliable divergence results.

The results obtained for this particular validation are presented in Tables 6, 7, 8 and 9. The improvements observed in the previous experiments are not consistent across all datasets under these constrained conditions. Moreover, the divergence values are generally quite small. This suggests that when sample counts are low, the underlying distributions may not be adequately captured, so the estimate falls below the true disparity between the distributions and yields seemingly small divergence values. These findings underscore the importance of having an adequate number of samples when evaluating divergence metrics. With limited data, the reliability of these measures decreases, which can lead to misleading conclusions about the similarity between distributions. Therefore, it is essential to consider sample size as a factor when interpreting divergence metric results, particularly in scenarios with limited data availability.

| Scenario | N | M | L | VAE JS | CTGAN JS | VAE KL | CTGAN KL |
|---|---|---|---|---|---|---|---|
| Big data | 10000 | 7500 | 1000 | 0.079 (0.001) | 0.150 (0.002) | 0.153 (0.019) | 0.420 (0.025) |
| Low data | 300 | 7500 | 1000 | 0.331 (0.004) | 0.563 (0.002) | 0.697 (0.018) | 1.653 (0.015) |
| Low data | 300 | 100 | 100 | 0.030 (0.012) | 0.333 (0.027) | 0.084 (0.101) | 2.197 (0.257) |
| Pre-train | 300 | 7500 | 1000 | 0.171 (0.004) | 0.563 (0.002) | 0.427 (0.021) | 1.753 (0.040) |
| Pre-train | 300 | 100 | 100 | -0.002 (0.004) | 0.237 (0.014) | 0.056 (0.103) | 0.978 (0.307) |
| AVG | 300 | 7500 | 1000 | 0.157 (0.004) | N/A | 0.380 (0.043) | N/A |
| AVG | 300 | 100 | 100 | 0.002 (0.002) | N/A | 0.049 (0.107) | N/A |
| MAML | 300 | 7500 | 1000 | 0.300 (0.002) | N/A | 0.686 (0.037) | N/A |
| MAML | 300 | 100 | 100 | 0.002 (0.008) | N/A | 0.054 (0.116) | N/A |
| DRS | 300 | 7500 | 1000 | 0.189 (0.006) | N/A | 0.427 (0.043) | N/A |
| DRS | 300 | 100 | 100 | 0.015 (0.008) | N/A | 0.089 (0.051) | N/A |

Table 6: Adult dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N = 10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N = 300$) are available, posing a challenge for synthetic data generation. There are two different Low data rows, depending on the number of samples $M$ and $L$ used for validating. A more reliable case is presented when $M = 7,500$ and $L = 1,000$ and a less reliable case is presented when $M = 100$ and $L = 100$. These two cases apply to the following rows too. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Results are represented as mean (std). Lower is better. Values in bold indicate an improvement due to the technique used.

| Scenario | N | M | L | VAE JS | CTGAN JS | VAE KL | CTGAN KL |
|---|---|---|---|---|---|---|---|
| Big data | 10000 | 7500 | 1000 | 0.253 (0.009) | 0.463 (0.003) | 0.647 (0.045) | 1.506 (0.031) |
| Low data | 300 | 7500 | 1000 | 0.840 (0.003) | 0.962 (0.002) | 4.582 (0.136) | 8.994 (0.909) |
| Low data | 300 | 100 | 100 | 0.003 (0.004) | 0.482 (0.093) | -0.062 (0.113) | 2.290 (1.100) |
| Pre-train | 300 | 7500 | 1000 | 0.746 (0.003) | 0.937 (0.003) | 3.516 (0.082) | 8.603 (0.463) |
| Pre-train | 300 | 100 | 100 | 0.010 (0.006) | 0.843 (0.013) | 0.061 (0.051) | 4.599 (0.629) |
| AVG | 300 | 7500 | 1000 | 0.609 (0.003) | N/A | 2.596 (0.060) | N/A |
| AVG | 300 | 100 | 100 | -0.001 (0.004) | N/A | 0.073 (0.088) | N/A |
| MAML | 300 | 7500 | 1000 | 0.851 (0.001) | N/A | 5.176 (0.242) | N/A |
| MAML | 300 | 100 | 100 | 0.179 (0.071) | N/A | 2.256 (0.927) | N/A |
| DRS | 300 | 7500 | 1000 | 0.645 (0.006) | N/A | 2.449 (0.057) | N/A |
| DRS | 300 | 100 | 100 | 0.002 (0.004) | N/A | -0.010 (0.114) | N/A |

Table 7: News dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N = 10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N = 300$) are available, posing a challenge for synthetic data generation. There are two different Low data rows, depending on the number of samples $M$ and $L$ used for validating. A more reliable case is presented when $M = 7,500$ and $L = 1,000$ and a less reliable case is presented when $M = 100$ and $L = 100$. These two cases apply to the following rows too. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Results are represented as mean (std). Lower is better. Values in bold indicate an improvement due to the technique used.

| Scenario | N | M | L | VAE JS | CTGAN JS | VAE KL | CTGAN KL |
|---|---|---|---|---|---|---|---|
| Big data | 10000 | 7500 | 1000 | 0.862 (0.002) | 0.777 (0.003) | 4.768 (0.072) | 3.124 (0.115) |
| Low data | 300 | 7500 | 1000 | 0.927 (0.002) | 0.940 (0.003) | 13.763 (0.696) | 7.470 (0.392) |
| Low data | 300 | 100 | 100 | 0.533 (0.062) | 0.682 (0.029) | 3.264 (0.555) | 3.731 (0.279) |
| Pre-train | 300 | 7500 | 1000 | 0.862 (0.002) | 0.945 (0.002) | 5.286 (0.327) | 9.533 (0.453) |
| Pre-train | 300 | 100 | 100 | 0.228 (0.018) | 0.622 (0.070) | 0.698 (0.197) | 3.455 (0.426) |
| AVG | 300 | 7500 | 1000 | 0.740 (0.002) | N/A | 3.489 (0.209) | N/A |
| AVG | 300 | 100 | 100 | -0.001 (0.002) | N/A | 0.020 (0.108) | N/A |
| MAML | 300 | 7500 | 1000 | 0.910 (0.002) | N/A | 6.436 (0.496) | N/A |
| MAML | 300 | 100 | 100 | 0.322 (0.046) | N/A | 1.030 (0.127) | N/A |
| DRS | 300 | 7500 | 1000 | 0.809 (0.003) | N/A | 4.321 (0.215) | N/A |
| DRS | 300 | 100 | 100 | 0.068 (0.006) | N/A | 0.447 (0.093) | N/A |

Table 8: King dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N = 10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N = 300$) are available, posing a challenge for synthetic data generation. There are two different Low data rows, depending on the number of samples $M$ and $L$ used for validating. A more reliable case is presented when $M = 7,500$ and $L = 1,000$ and a less reliable case is presented when $M = 100$ and $L = 100$. These two cases apply to the following rows too. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Results are represented as mean (std). Lower is better. Values in bold indicate an improvement due to the technique used.

| Scenario | N | M | L | VAE JS | CTGAN JS | VAE KL | CTGAN KL |
|---|---|---|---|---|---|---|---|
| Big data | 10000 | 7500 | 1000 | 0.760 (0.013) | 0.531 (0.033) | 2.744 (0.084) | 2.623 (0.537) |
| Low data | 300 | 7500 | 1000 | 0.920 (0.003) | 0.961 (0.002) | 6.216 (0.154) | 8.841 (0.710) |
| Low data | 300 | 100 | 100 | 0.050 (0.009) | 0.681 (0.052) | 0.182 (0.150) | 4.365 (0.175) |
| Pre-train | 300 | 7500 | 1000 | 0.793 (0.004) | 0.959 (0.001) | 3.831 (0.151) | 8.443 (0.630) |
| Pre-train | 300 | 100 | 100 | 0.067 (0.004) | 0.604 (0.039) | 0.286 (0.091) | 3.818 (0.189) |
| AVG | 300 | 7500 | 1000 | 0.867 (0.007) | N/A | 5.798 (0.295) | N/A |
| AVG | 300 | 100 | 100 | 0.055 (0.003) | N/A | 0.167 (0.132) | N/A |
| MAML | 300 | 7500 | 1000 | 0.913 (0.003) | N/A | 6.359 (0.054) | N/A |
| MAML | 300 | 100 | 100 | 0.049 (0.002) | N/A | 0.261 (0.093) | N/A |
| DRS | 300 | 7500 | 1000 | 0.835 (0.009) | N/A | 4.587 (0.166) | N/A |
| DRS | 300 | 100 | 100 | 0.066 (0.005) | N/A | 0.310 (0.173) | N/A |

Table 9: Intrusion dataset JS and KL results for each Scenario. Big data represents the ideal case where many samples ($N = 10,000$) are available to generate reliable synthetic data. Low data represents a more realistic scenario in which a limited number of samples ($N = 300$) are available, posing a challenge for synthetic data generation. There are two different Low data rows, depending on the number of samples $M$ and $L$ used for validating. A more reliable case is presented when $M = 7,500$ and $L = 1,000$ and a less reliable case is presented when $M = 100$ and $L = 100$. These two cases apply to the following rows too. The next rows compare the divergences obtained by each methodology (pre-training, model averaging, MAML, and DRS) applied to the low data scenario. Results are represented as mean (std). Lower is better. Values in bold indicate an improvement due to the technique used.

References

  • [1] Mohamed Elasri, Omar Elharrouss, Somaya Al-Maadeed, and Hamid Tairi. Image generation: A review. Neural Processing Letters, 54(5):4609–4646, 2022.
  • [2] Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, and Dawei Song. A survey of controllable text generation using transformer-based pre-trained language models. ACM Computing Surveys, 56(3):1–37, 2023.
  • [3] Javier Selva, Anders S Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B Moeslund, and Albert Clapés. Video transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [4] Rick Sauber-Cole and Taghi M Khoshgoftaar. The use of generative adversarial networks to alleviate class imbalance in tabular data: a survey. Journal of Big Data, 9(1):98, 2022.
  • [5] Vadim Borisov, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [6] Joao Fonseca and Fernando Bacao. Tabular and latent space synthetic data generation: a literature review. Journal of Big Data, 10(1):115, 2023.
  • [7] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional gan. Advances in neural information processing systems, 32, 2019.
  • [8] Alvaro Figueira and Bruno Vaz. Survey on synthetic data generation, evaluation methods and gans. Mathematics, 10(15):2733, 2022.
  • [9] Zilong Zhao, Aditya Kunar, Robert Birke, and Lydia Y Chen. Ctab-gan: Effective table data synthesizing. In Asian Conference on Machine Learning, pages 97–112. PMLR, 2021.
  • [10] Amirarsalan Rajabi and Ozlem Ozmen Garibay. Tabfairgan: Fair tabular data generation with generative adversarial networks. Machine Learning and Knowledge Extraction, 4(2):488–501, 2022.
  • [11] Justin Engelmann and Stefan Lessmann. Conditional wasserstein gan-based oversampling of tabular data for imbalanced learning. Expert Systems with Applications, 174:114582, 2021.
  • [12] Ally Salim Jr. Synthetic patient generation: A deep learning approach using variational autoencoders. arXiv preprint arXiv:1808.06444, 2018.
  • [13] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. Tabddpm: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pages 17564–17579. PMLR, 2023.
  • [14] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Language models are realistic tabular data generators. arXiv preprint arXiv:2210.06280, 2022.
  • [15] Patricia A Apellániz, Juan Parras, and Santiago Zazo. An improved tabular data generator with vae-gmm integration. arXiv preprint arXiv:2404.08434, 2024.
  • [16] Chris Cremer, Xuechen Li, and David Kristjanson Duvenaud. Inference suboptimality in variational autoencoders. In International Conference on Machine Learning, 2018.
  • [17] R. A. Fisher. Iris. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C56C76.
  • [18] David Harrison and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.
  • [19] Matjaz Zwitter and Milan Soklic. Breast Cancer. UCI Machine Learning Repository, 1988. DOI: https://doi.org/10.24432/C51P4M.
  • [20] Patricia A Apellániz, Ana Jiménez, Borja Arroyo Galende, Juan Parras, and Santiago Zazo. Synthetic tabular data validation: A divergence-based approach. arXiv preprint arXiv:2405.07822, 2024.
  • [21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [22] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, page 2234–2242, Red Hook, NY, USA, 2016. Curran Associates Inc.
  • [23] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020.
  • [24] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, 2002.
  • [25] Mikel Hernandez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, and Debbie Rankin. Synthetic tabular data evaluation in the health domain covering resemblance, utility, and privacy dimensions. Methods of Information in Medicine, 62, 2023.
  • [26] Anirudh Goyal and Yoshua Bengio. Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A, 478(2266):20210068, 2022.
  • [27] Micah Goldblum, Marc Finzi, Keefer Rowan, and Andrew Gordon Wilson. The no free lunch theorem, kolmogorov complexity, and the role of inductive biases in machine learning. arXiv preprint arXiv:2304.05366, 2023.
  • [28] Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers can do bayesian inference. In International Conference on Learning Representations, 2022.
  • [29] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. In The Eleventh International Conference on Learning Representations, 2023.
  • [30] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
  • [31] Hee E. Kim, Alejandro Cosa-Linan, Nandhini Santhanam, Mahboubeh Jannesari, Mate E. Maros, and Thomas Ganslandt. Transfer learning for medical image classification: a literature review. BMC Medical Imaging, 22(1):1–13, 2022.
  • [32] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big data, 3(1):1–40, 2016.
  • [33] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, June 2002.
  • [34] Fangtao Li, Sinno Jialin Pan, Ou Jin, Qiang Yang, and Xiaoyan Zhu. Cross-domain co-extraction of sentiment and topic lexicons. In Haizhou Li, Chin-Yew Lin, Miles Osborne, Gary Geunbae Lee, and Jong C. Park, editors, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 410–419, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
  • [35] Mingsheng Long, Jianmin Wang, Guiguang Ding, Sinno Jialin Pan, and Philip S. Yu. Adaptation regularization: A general framework for transfer learning. IEEE Transactions on Knowledge and Data Engineering, 26:1076–1089, 2014.
  • [36] Lixin Duan, Dong Xu, and Shih-Fu Chang. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1338–1345, 2012.
  • [37] Zifeng Wang and Jimeng Sun. TransTab: Learning transferable tabular transformers across tables. Advances in Neural Information Processing Systems, 35:2902–2915, 2022.
  • [38] Bingzhao Zhu, Xingjian Shi, Nick Erickson, Mu Li, George Karypis, and Mahsa Shoaran. XTab: Cross-table pretraining for tabular transformers. In International Conference on Machine Learning, 2023.
  • [39] Robert Winkler and Spyros Makridakis. The combination of forecasts. Journal of the Royal Statistical Society. Series A (General), 146:150–157, Jan 1983.
  • [40] Robert T. Clemen and Robert L. Winkler. Combining economic forecasts. Journal of Business & Economic Statistics, 4(1):39–46, 1986.
  • [41] Sebastian Thrun and Lorien Pratt. Learning to Learn: Introduction and Overview, pages 3–17. Springer US, Boston, MA, 1998.
  • [42] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
  • [43] Timothy M. Hospedales, Antreas Antoniou, Paul Micaelli, and Amos J. Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:5149–5169, 2020.
  • [44] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
  • [45] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, 2014.
  • [46] Jonathan Gordon, John F. Bronskill, M. Bauer, Sebastian Nowozin, and Richard E. Turner. Meta-learning probabilistic inference for prediction. In International Conference on Learning Representations, 2018.
  • [47] Kate Smith-Miles. Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput. Surv., 41:6:1–6:25, 2009.
  • [48] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, 2018.
  • [49] Katelyn Gao and Ozan Sener. Modeling and optimization trade-off in meta-learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 11154–11165. Curran Associates, Inc., 2020.
  • [50] Joshua Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
  • [51] Neha Patki, Roy Wedge, and Kalyan Veeramachaneni. The synthetic data vault. In IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 399–410, Oct 2016.
  • [52] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20.
  • [53] Kelwin Fernandes, Pedro Vinagre, Paulo Cortez, and Pedro Sernadela. Online News Popularity. UCI Machine Learning Repository, 2015. DOI: https://doi.org/10.24432/C5NS3V.
  • [54] Salvatore Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Philip Chan. KDD Cup 1999 Data. UCI Machine Learning Repository, 1999. DOI: https://doi.org/10.24432/C51C7N.