research-article
Open access

Do We Really Need Imputation in AutoML Predictive Modeling?

Published: 12 April 2024
    Abstract

    Many real-world datasets contain missing values, whereas most Machine Learning (ML) algorithms assume complete datasets. For this reason, several imputation algorithms have been proposed to predict and fill in the missing values. Given the advances in predictive modeling algorithms tuned in an Automated Machine Learning (AutoML) setting, a question that naturally arises is to what extent sophisticated imputation algorithms (e.g., Neural Network based) are really needed, or whether we can obtain decent performance using simple methods like Mean/Mode (MM). In this article, we experimentally compare six state-of-the-art representatives of different imputation algorithmic families from an AutoML predictive modeling perspective, including a feature selection step and combined algorithm and hyper-parameter selection. We used a commercial AutoML tool for our experiments, in which we included the selected imputation methods. Experiments were run on 25 real-world binary classification datasets with missing values and 10 complete binary classification datasets in which synthetic missing values are introduced according to different missingness mechanisms, at varying missing frequencies. The main conclusion drawn from our experiments is that the best method on average is the Denoise AutoEncoder on real-world datasets and MissForest on simulated datasets, followed closely by MM. In addition, binary indicator variables encoding missingness patterns actually improve predictive performance, on average. Last, although there are cases where Neural-Network-based imputation significantly improves predictive performance, this comes at a great computational cost and requires measuring all feature values to impute new samples.

    1 Introduction

    Real-world data often contain missing values, stemming from faulty sensors, non-responders in questionnaires, incomplete data entry, or other reasons. For example, in the OpenML portal, as of March 2022, 364 of the 3,487 active datasets contain missing values. Unfortunately, most Machine Learning (ML) algorithms demand complete datasets on which to operate.1 To address this problem, a plethora of imputation algorithms, ranging from simple to very advanced, have been developed to predict the missing values and allow the remaining algorithms in the analysis pipeline to complete.
    The problem of imputation has been under study for decades [28, 47, 62]. Initially, it was studied in the context of estimating the coefficients of linear models, which we call the estimation perspective. In contrast, we study imputation from a predictive modeling perspective where the goal is to create an accurate model to predict a specific outcome of interest (target variable) in new samples. There are important differences in approaching the subject under these two perspectives. Under the estimation perspective, (a) some methods would impute the missing values in the training data but would not create an imputation model that is able to impute test data [15, 77]. Hence, these methods cannot be applied to predictive modeling. In addition, (b) standard guidelines [67] suggest using the outcome in imputing feature values, e.g., to differentiate imputation values in cases vs. controls. This technique is not applicable in predictive modeling where the outcome is unknown in test samples. Finally, (c) a useful metric of imputation efficacy under the estimation perspective is the imputation accuracy [29, 34], i.e., the accuracy of predicting the missing values. Imputation accuracy is important for estimation purposes but may not be indicative of the impact of imputation on predictive performance.
    Under the predictive modeling perspective, several interesting questions arise as follows:
    Are advanced predictive modeling algorithms in need of imputation beyond the simple Mean/Mode (MM) technique? A non-linear algorithm could potentially learn a rule of the sort “if a feature value equals its mean (i.e., it is missing), then do not use it but instead rely on other observed features values for prediction.” Hence, it is questionable whether imputation would provide an advantage to such an algorithm.
    Is the need for sophisticated imputation further reduced in an Automated Machine Learning (AutoML) context, where the most appropriate combination of algorithm and hyper-parameter values is selected (combined algorithm and hyper-parameter selection (CASH) optimization) [68]?
    Do Binary Indicator (BI) variables (1 if the value of a feature is missing and 0 otherwise) encoding the missingness patterns provide additional information to a classifier to learn a predictive model?
    How does the feature selection step interact with imputation? Feature selection aims to reduce the number of features that enter the model without sacrificing predictive performance and leads to more interpretable models by providing insights regarding the underlying data generation. It remains open how the benefits of feature selection are impacted when we impute the missing values.
    What is the tradeoff between the computational overhead of imputation and the improvement in predictive performance? Imputation algorithms impute all the missing values, independently of whether they contribute to the predictions of the model. In other words, imputation is unsupervised and not guided by the outcome to predict. Hence, they potentially perform a significant amount of unnecessary computations.
    If imputation algorithms indeed improve performance, then are there any characteristics of the datasets (called meta-features) that allow us to predict the value of imputation prior to their analysis and decide whether imputation is worth the computational overhead?
    To the best of our knowledge, this is the first empirical study that answers all the above research questions via an experimental evaluation over 25 binary classification real-world datasets, as well as 10 complete datasets in which synthetic missing values are introduced according to different missingness mechanisms, at varying missing frequencies. The MM imputation is used as a baseline and is compared against state-of-the-art representatives of different imputation algorithmic families, namely Discriminative, such as MissForest [66], and Generative, such as SoftImpute [44] and probabilistic principal component analysis (PPCA) [70] exploiting matrix factorization, or Generative Adversarial Imputation Nets (GAIN) [83] and Denoise AutoEncoder (DAE) [21] based on Neural Networks. The imputation algorithms are integrated into the Just Add Data Bio (JADBio) AutoML platform [73], which performs CASH and includes a feature selection step.
    In summary, the results show that the single best-performing algorithm is DAE and MissForest for the real and the simulated datasets, respectively. For five of the six imputation algorithms studied, the inclusion of BI variables is beneficial, on average. MM, when BI variables are included and CASH is taking place, is a close competitor and places as the second-best algorithm. Advanced imputation methods do offer a significant advantage but only in a few datasets. In contrast, they require the measurements of all feature values to impute new samples, which in some way invalidates the feature selection step and leads to models of high dimensionality. In addition, they require orders of magnitude more computational time. Meta-level analysis has indicated that only one feature is correlated with the relative performance of the algorithms; unfortunately, the correlation is not statistically significant when corrected for multiple testing. More datasets and new meta-features are needed to extract patterns of when sophisticated imputation should be used over the simple MM.
    Overall, in an AutoML setting where optimization is taking place and BI variables are included, MM is a reasonable option; other algorithms should be used only if feature selection is not required and computational time is of little importance relative to improving predictive performance.
    The article is organized as follows. Section 2 introduces missing data mechanisms and a taxonomy of imputation families. In Section 3, we present the experimental environment, the selected datasets for evaluation, and the metrics and hyper-parameters tuned. Section 4 describes the missing data generation procedure. The experimental results for real-world data with missing data and simulated missing data are presented in Sections 5 and 6, respectively. In Section 7, we discuss the results of the meta-level analysis on real-world datasets. Related work is discussed in Section 8, followed by the contributions and lessons learned in Section 9. Finally, Section 10 presents the conclusions and limitations of the study. The detailed information about the datasets, missing value simulation setup, and experimental results are provided in Appendices A, B, and C, respectively.

    2 Background and Context

    2.1 Missingness Mechanisms

    The concept of a missing mechanism [62] formalizes the generation process of missing data. In this respect, the BI are modeled as random variables and assigned a distribution. There are three types of underlying mechanisms that generate missing data, namely, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For formal definitions of these mechanisms, readers are referred to Reference [40]. Intuitively, MCAR implies that the probability of a value missing is independent of the actual value, the other observed quantities, and any latent variables. MAR implies that the missingness only depends on the observed data (so it can be predicted). MNAR refers to the case that the missing values are related to both the observed and unobserved variables, including the missing value itself. When missingness is MNAR, it is in principle and in general not possible to impute the missing values in a way that follows the unknown underlying data distribution.2
    An illustrative example is given in Figure 1, which is adapted from Reference [46]. The missingness mechanisms can be described using a causal graph. Let us assume A and B are observed random variables and O a latent variable. Each variable is depicted as a node of the graph. Assume that A and B have a direct connection to O, which is the variable (node) of interest. The \(R_o\) node is a mask variable that denotes the missingness inserted into O, which causes \(O^*\) . \(O^*\) is a surrogate of O but with missing values inserted in the positions specified by \(R_o\) . As seen in Figure 1(a), MCAR missing values do not depend on any of the variables A, B, O. In contrast, missingness depends on B for the MAR mechanism, and on O itself for the MNAR mechanism, as seen in Figure 1(b) and (c).
    Fig. 1.
    Fig. 1. Panel (a) illustrates the MCAR missingness mechanism, panel (b) the MAR mechanism, and panel (c) the MNAR mechanism. A, B are random variables, while O is the observed variable of interest. \(R_o\) is a variable that represents the missingness in O in the form of a mask. \(O^*\) is the result of applying the \(R_o\) mask to the O variable.
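    The three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings (a Gaussian B, a linear dependence of O on B, logistic missingness probabilities, and the helper names `apply_mask` and `mean_gap` are all hypothetical); it is not the missing data generation procedure used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
B = rng.normal(size=n)              # fully observed variable
O = 0.5 * B + rng.normal(size=n)    # variable of interest that will receive missing values

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# R_o = True marks an entry of O that becomes missing (assumed probabilities).
R_mcar = rng.random(n) < 0.3                    # MCAR: independent of A, B, and O
R_mar = rng.random(n) < sigmoid(2.0 * B - 1.0)  # MAR: depends only on the observed B
R_mnar = rng.random(n) < sigmoid(2.0 * O - 1.0) # MNAR: depends on O itself

def apply_mask(values, mask):
    """Return O*: a copy of the values with NaN wherever the mask is True."""
    out = values.astype(float).copy()
    out[mask] = np.nan
    return out

masks = {"MCAR": R_mcar, "MAR": R_mar, "MNAR": R_mnar}
O_star = {name: apply_mask(O, mask) for name, mask in masks.items()}

def mean_gap(mask):
    """Mean of the true O among masked entries minus the mean among unmasked ones."""
    return O[mask].mean() - O[~mask].mean()

for name, mask in masks.items():
    print(name, round(float(mask.mean()), 3), round(float(mean_gap(mask)), 3))
```

    Under MCAR the masked and unmasked values of O should have nearly the same mean, whereas under MNAR the masked values are systematically larger, which is exactly the information an imputer cannot see.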

    2.2 An Imputation Family Taxonomy

    There are numerous imputation algorithms and approaches in the literature, and we do not attempt a full review. Readers are encouraged to explore comprehensive surveys available in the field for a more in-depth understanding [2, 4, 12, 13, 24]. Imputation approaches can be partitioned into various distinct families/groups of methods. A taxonomy is attempted in Figure 2. First, imputed values can be decided based on only the feature with the missing value (Univariate imputation) or several features (multivariate imputation). The former methods include Mean/Median/Max imputation for continuous data and Mode imputation for categorical data. Multivariate methods can be partitioned into Iterative and Distance-Based, also known as Hot-Deck methods [6]. Distance-based methods employ a distance or a similarity metric for samples to find neighbors or cluster them. A commonly used algorithm in this category is the K-nearest neighbors imputation (KNNi) [71], which imputes values based on the neighbors of the sample with missing values. K-means-based methods cluster the samples before imputation [37].
    Fig. 2.
    Fig. 2. Taxonomy of imputation families: rectangular nodes represent families, while oval nodes represent algorithms of that family.
    Iterative methods start with a simple initial guess (e.g., using MM imputation) and, in each iteration, try to improve the imputed values. We further split iterative methods into Discriminative and Generative. Discriminative methods build a predictive model per feature with missing values, given the other features in the dataset. This model is used to predict the missing values of the corresponding feature in each iteration. The Discriminative family can utilize either a (generalized) linear model or a non-linear model. Linear discriminative methods include Multivariate Imputation by Chained Equations (MICE) [76]. Non-linear discriminative methods include the MissForest algorithm [66], which employs Random Forests, and Datawig [9], which can impute continuous, categorical, and text data by employing different loss functions according to the missing features’ datatype.
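    The iterative discriminative scheme can be sketched in a few lines. The example below is a simplified illustration that assumes purely numerical features and uses an ordinary least-squares model per feature; the function name `iterative_linear_impute` is hypothetical, and the sketch is neither MICE nor MissForest as implemented in the cited works.

```python
import numpy as np

def iterative_linear_impute(X, n_iter=10):
    """Iteratively impute NaNs by regressing each incomplete feature on the others."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means of the observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in np.flatnonzero(missing.any(axis=0)):
            observed_rows = ~missing[:, j]
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(filled)), others])  # intercept + other features
            coef, *_ = np.linalg.lstsq(design[observed_rows],
                                       filled[observed_rows, j], rcond=None)
            # Overwrite only the entries that were originally missing.
            filled[missing[:, j], j] = design[missing[:, j]] @ coef
    return filled

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
truth = np.column_stack([x1, 2.0 * x1 + 1.0])   # second feature is a linear function of the first
X = truth.copy()
X[rng.random(200) < 0.2, 1] = np.nan            # hide about 20% of the second feature
imputed = iterative_linear_impute(X)
```

    Replacing the least-squares step with a Random Forest per feature gives the general shape of a non-linear discriminative method, at the cost of training one forest per incomplete feature in every iteration.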
    Generative methods try to model the joint distribution of the data and use the generative model to impute values. They can be split into two categories: methods that employ matrix factorization and methods that use neural networks. The matrix-factorization family includes low-rank matrix decomposition methods: First, missing values are imputed with an initial guess, and the matrix is decomposed (factorized) and used to predict the missing values. Imputation is improved in each cycle via expectation-maximization steps. Examples of this family include PPCA [70], SVDImpute [71], bPCA [10], and SoftImpute [44]. Such algorithms scale better w.r.t. the number of features than MICE or MissForest, which train a different model for each feature with missing values in each iteration. Recently, neural networks have also been tried as generative models. These algorithms are essentially non-linear alternatives to matrix factorization. These methods start with an initial guess and then train a neural network that learns the joint distribution. This family includes methods based on AutoEncoders (AE), such as DAE [21, 35, 41] and Variational Autoencoder (VAE) [17, 45, 61]. It also includes methods based on generative adversarial networks, such as GAIN [64, 83]. Finally, HoloClean, a data cleaning tool, implements an attention-based neural network for imputation, named Aimnet [82]. A detailed comparison of imputation methods is given in Section 8. In the next subsection, we explain the rationale for our choice to include in our empirical study a subset of the aforementioned imputation methods.

    2.3 Description of the Selected Imputation Methods

    In this section, we present the main characteristics of the imputation methods given in Table 1 that we included in our testbed. In the analysis of their computational complexity, n denotes the number of samples, m the number of features, k the number of iterations, \(\#comp\) the number of principal components, \(\#sing\) the number of singular values, and \(\#trees\) the number of trees.
    Table 1.
    | Algorithm | Model Family | Base Model | Learning Procedure | Categorical Handling | Approx. Complexity |
    |---|---|---|---|---|---|
    | MM | Univariate | | Non-iterative | Native | \(O(n \cdot m)\) |
    | SOFT | Generative | SVD | Iterative | One-hot-encoding | \(O(k \cdot n \cdot m \cdot \#sing)\) |
    | PPCA | Generative | PCA | Iterative | One-hot-encoding | \(O(k \cdot n \cdot m \cdot \#comp)\) |
    | MF | Discriminative | RF | Iterative | Native | \(O(k \cdot m^2 \cdot n \cdot \log (n) \cdot \#\text{trees})\) |
    | GAIN | Generative | GAN | Iterative | Ordinal-Encoding | |
    | DAE | Generative | AE | Iterative | One-hot-encoding | |
    Table 1. Comparison of the Imputation Methods and Their Characteristics
    Abbreviations: n is the number of samples, m is the number of features, k is the number of iterations, \(\#trees\) stands for the number of trees in the forest (a hyper-parameter), \(\#comp\) stands for the number of principal components employed in the matrix factorization, and \(\#sing\) for the number of singular values of the SVD.

    2.3.1 Mean/Mode.

    MM is the most common imputation method in AutoML tools and is included as the baseline methodology. It is an instance of the univariate imputation family. In the MM algorithm, missing values are imputed with the mean in the training data of the corresponding feature if it is continuous and the mode (most frequent value) if it is discrete. MM is the most computationally efficient method as it needs only \(O(n \cdot m)\) to impute the whole dataset. A variation of MM imputation is mentioned in medical literature [51] where missing values of a sample are imputed based on the mean/mode of the class to which it belongs. However, in the case of predictive modeling, this approach becomes problematic as the class of a sample is unknown during inference, as discussed in Section 1.
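    A minimal sketch of MM imputation follows. It assumes that missing entries are encoded as `None` and that the caller states which columns are categorical; the function names `fit_mean_mode` and `transform` are hypothetical. The point it illustrates is that the fill values are learned on the training data only and then reused for new samples.

```python
from collections import Counter
from statistics import mean

def fit_mean_mode(rows, categorical):
    """Learn one fill value per column from the training rows (None = missing)."""
    fill = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        if j in categorical:
            fill.append(Counter(observed).most_common(1)[0][0])  # mode
        else:
            fill.append(mean(observed))                          # mean
    return fill

def transform(rows, fill):
    """Replace every missing entry with the fill value of its column."""
    return [[fill[j] if value is None else value for j, value in enumerate(row)]
            for row in rows]

train = [[1.0, "a"], [3.0, None], [None, "a"], [5.0, "b"]]
test = [[None, None], [2.0, "b"]]

fill = fit_mean_mode(train, categorical={1})
imputed_train = transform(train, fill)
imputed_test = transform(test, fill)   # test rows reuse the training statistics
```

    Each column is scanned once during fitting and once during transformation, which is consistent with the \(O(n \cdot m)\) cost stated above.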

    2.3.2 MissForest.

    MissForest (MF) is a discriminative iterative method based on Random Forests [66]. First, the missing values are imputed by Mean/Mode. Subsequently, for each feature with missing values serving as the outcome, the algorithm trains a random forest on the rest of the features and uses it to predict the outcome’s missing values. After imputing all missing values, the algorithm uses the (now) complete dataset to warm-start the new iteration until a stopping criterion is met. MF is one of the slowest methods as it requires building a forest per feature for a number of iterations. The approximate worst case is \(O(k \cdot m^2 \cdot n \cdot \log (n) \cdot \#\text{trees})\) . MF encounters scalability issues in datasets with more than 50 features. In addition, MF needs to store a forest for each feature, which creates model-storing issues. To avoid such issues, in our experiments we limit the maximum allowed depth of the Random Forests.

    2.3.3 Probabilistic PCA.

    PPCA is a statistical iterative method [39]. In each iteration, a principal component analysis (PCA) is performed, which is improved in the next step using maximum likelihood estimation [58] and assuming a multivariate Gaussian distribution of the data. To impute a new sample, the optimal set of principal components found from training is used to identify the missing values that maximize the joint probability of the sample. Categorical features are one-hot-encoded before applying PPCA and then are inverse transformed after PPCA returns the imputed data. PPCA is one of the fastest methods, scaling linearly with the number of samples, features, and principal components computed. The approximate complexity for PPCA is \(O(k\cdot n\cdot m\cdot \#\text{comp})\) .

    2.3.4 SoftImpute.

    SoftImpute (SOFT) is a statistical iterative method [44]. It starts by initializing the missing values with the mean. Then, it repeatedly solves the optimization problem on the completed matrix using a soft-thresholded SVD until a stopping criterion is met. Categorical features are one-hot-encoded before applying SOFT and then are inverse transformed after SOFT returns the imputed data. SOFT, like PPCA, is very fast, utilizing an EM approach. The approximate time complexity is \(O(k\cdot n \cdot m \cdot \#\text{sing})\) .
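    The soft-thresholded SVD iteration can be sketched as follows. This is a simplified illustration for a numerical matrix, not the implementation used in the experiments: the shrinkage parameter `lam`, the relative-change stopping rule, and the function name `soft_impute` are assumptions, and the experiments tune a variance-explained threshold rather than `lam`.

```python
import numpy as np

def soft_impute(X, lam=1.0, n_iter=100, tol=1e-5):
    """Fill NaNs by repeatedly applying a soft-thresholded SVD to the completed matrix."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    Z = np.where(missing, np.nanmean(X, axis=0), X)     # mean initialization
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)                    # soft-threshold the singular values
        low_rank = (U * s) @ Vt
        Z_new = np.where(missing, low_rank, X)          # observed entries stay fixed
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:                                # stopping criterion
            break
    return Z

rng = np.random.default_rng(0)
truth = np.outer(rng.normal(size=30), rng.normal(size=8))   # rank-1 matrix
mask = rng.random(truth.shape) < 0.15
X = truth.copy()
X[mask] = np.nan

completed = soft_impute(X, lam=0.1, n_iter=200)
mean_filled = np.where(mask, np.nanmean(X, axis=0), X)
err_soft = np.sqrt(np.mean((completed[mask] - truth[mask]) ** 2))
err_mean = np.sqrt(np.mean((mean_filled[mask] - truth[mask]) ** 2))
```

    On low-rank data such as this toy matrix, the imputation error of the sketch should fall well below that of its mean initialization; each iteration is dominated by one SVD of the completed matrix.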

    2.3.5 Denoise Autoencoder.

    DAE is a deep learning algorithm based on autoencoders [21]. The Denoise Autoencoder is based on an overcomplete implementation and a dropout layer. DAE projects the input data to a higher-dimensional subspace where the missing data are recovered by the decoder. The categorical data are one-hot-encoded before DAE is applied. Then the one-hot-encoded data are transformed back to the original representation. The complexity of the DAE is mostly measured by the number of epochs needed for the algorithm to impute the dataset accurately and the hidden layers’ size and depth. For further information see Section 3.8.

    2.3.6 Generative Adversarial Imputation Nets.

    GAIN is an adaptation of the GAN framework [83]. A generator is used to impute missing data based on the observed data. The discriminator tries to determine which data are observed and which are imputed. The goal of the generator is to provide an accurate imputation, whereas the goal of the discriminator is to distinguish between the observed and the imputed data. The two neural networks are trained in an adversarial process. Categorical data are turned into ordinal features and normalized between 0 and 1. After applying the GANs, we revert the categorical data to the original representation by applying the inverse procedure. The complexity of GAIN is mostly bottlenecked by the number of iterations needed to train the GANs. See Section 3.8 for more details.

    2.3.7 Binary Indicators.

    BI is not an imputation method but a feature construction method. Specifically, for each feature \(F_j\) with missing values, we construct a new feature \(I_j\) , whose value \(I_{jk}\) indicates whether the value \(F_{jk}\) of feature j at the kth sample is missing or not. The idea is to encode the missingness pattern in \(I_j\) . BIs may help the classifier learn whether to trust the value \(F_{jk}\) for prediction. BI can complement any imputation method. BI’s complexity is \(O(n \cdot m)\) ; however, it increases the complexity of subsequent stages of the ML pipeline by increasing the dataset’s dimensionality by a factor of at most 2. Importantly, BI-extended imputation methods do not use the BIs during the imputation phase; the BIs are appended to the imputed dataset afterwards.
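    Constructing the BIs can be sketched as below, assuming a numerical matrix with NaN marking missing values. The function names `indicator_columns` and `append_indicators` are hypothetical, and mean imputation stands in for whichever imputer is used; note that the indicators are computed from the raw data and never enter the imputation step.

```python
import numpy as np

def indicator_columns(X_train_raw):
    """Indices of the features with at least one missing value in the training data."""
    return np.flatnonzero(np.isnan(X_train_raw).any(axis=0))

def append_indicators(X_imputed, X_raw, cols):
    """Append one 0/1 column I_j per selected feature; imputed values are left untouched."""
    indicators = np.isnan(X_raw[:, cols]).astype(float)
    return np.hstack([X_imputed, indicators])

X_raw = np.array([[1.0, np.nan, 3.0],
                  [4.0, 5.0, np.nan],
                  [7.0, 8.0, 9.0]])

# Any imputation method could be used here; mean imputation keeps the sketch short.
X_imputed = np.where(np.isnan(X_raw), np.nanmean(X_raw, axis=0), X_raw)

cols = indicator_columns(X_raw)
X_with_bi = append_indicators(X_imputed, X_raw, cols)
```

    Only features that actually contain missing values receive an indicator, so the number of columns grows by at most a factor of 2.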

    2.4 Rationale of the Selection of Algorithms

    MM imputation is selected as a baseline because it is one of the most commonly used methods. MissForest is selected over MICE as the representative of multivariate iterative imputation, based on the imputation accuracy results presented in Reference [79]. Encoding missingness information using BI is also experimentally evaluated, as it performed better than other methods in Reference [43]. Distance-based methods are excluded for two reasons. First, they need to memorize the full dataset to produce imputations, as they do not learn a model. Second, according to other empirical studies, k-means and KNN imputation are outperformed by MissForest [30, 42, 66].
    As representatives of matrix completion-based methods, we chose PPCA, which has shown the best performance according to previous empirical studies [25, 31, 58]. SoftImpute was also selected based on the experimental results presented in Reference [85]. GAIN and DAE were also included as representatives of neural network–based methods as they excel in several studies [11, 54]. The former uses a generative adversarial network to learn the probability distribution from which to impute, and the latter uses an autoencoder. VAE and Aimnet were not included because their results were inferior or comparable to those of GAIN and MM, respectively [11, 38]. Finally, Datawig [9] is not included as, according to the results reported in Reference [9], it is outperformed by MissForest for both continuous and categorical data, while it incurs the high cost of fitting one neural network per feature with missing values.

    2.5 How Is Imputation Treated in AutoML Platforms

    AutoML platforms employ imputation methods, as well as modeling algorithms that directly treat missing values as a separate category. The current versions of JADBio (version v1.4) and AutoSklearn [16] employ MM by default, while DataRobot3 may also include BI variables. AutoSklearn allows the user to specify additional imputation methods to optimize over as part of the pipeline. TPOT employs median imputation for all missing features [36]. Auger.AI4 does iterative regression or mean imputation for numerical features depending on the dataset size and creates a new category for the categorical features. BigML5 by default does not impute the missing value; the missing values are handled internally by their predictive models, which are based on trees only. Autoprognosis optimizes the ML pipeline over a variety of missing data imputation algorithms. Specifically, it employs MICE, MissForest, Bootstrapped Expectation-Maximization imputation, Soft-Impute, and MM [3]. DriverlessAI by H2O creates a new value to express missingness when the XGBoost, LightGBM, and RuleFit algorithms are used. For generalized linear models, it performs MM imputation, while for tensorflow models missing values are treated as outliers [22]. GAMA [20] does not impute missing values by default. Autogluon [14] uses median imputation for continuous features and introduces a new “Unknown” category for categorical features.

    3 Experimental Setup

    We now present the design choices for the experimental setup and the comparative evaluation.

    3.1 Datasets

    Incomplete Real Datasets. There are currently 364 datasets with missing values in the OpenML repository [78]; we restricted our selection to binary classification datasets. We selected 25 binary classification datasets in an effort to cover a range of dataset characteristics. The datasets contain both continuous and discrete features. The number of features ranges from 7 to 69, the sample size ranges from 155 to 31,406, the prevalence of the minority class ranges from 0.06 to 0.48, the number of features with at least 1% missing values ranges from 1 to 32, and, finally, the percentage of missing values ranges from 1.11% to 71.64%. Table 6 presents the characteristics of the datasets, along with their OpenML id.
    Table 2.
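    | Algorithm | Hyper-parameter | Value |
    |---|---|---|
    | Mean/Mode | | |
    | MissForest | n-trees | 250 |
    | | maxDepth | 20, 30 |
    | | maxLeafNodes | 30 |
    | SoftImpute | variance-explained | 50%, 70%, 90% |
    | PPCA | variance-explained | 50%, 70%, 90% |
    | DAE | dropout | 0.25, 0.4, 0.5 |
    | | batch-size | 64 |
    | | \(\theta\) | 5, 7, 10 |
    | | epochs | 500 |
    | GAIN | alpha | 0.1, 1, 10 |
    | | hint-rate | 0.5, 0.9 |
    | | batch-size | 64 |
    | | epochs | 10,000 |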
    Table 2. The Set of Values Tried for Each Hyper-Parameter Tuned
    For each algorithm, all combinations of values for its hyper-parameters shown were tried and combined with all other choices for feature selection and modeling by JADBio. There are 48 combinations of imputation algorithms and hyper-parameter values. The default hyper-parameter values are underlined.
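    The count of 48 can be reproduced by enumerating the grid of Table 2. The sketch below assumes that each of the 24 algorithm and hyper-parameter combinations is tried both with and without BI variables; that doubling is our reading of how the total arises rather than something stated with the table, and the dictionary layout and the name `configurations` are hypothetical.

```python
from itertools import product

# Hyper-parameter values as listed in Table 2 (Mean/Mode has nothing to tune).
grid = {
    "MM": {},
    "MissForest": {"n_trees": [250], "max_depth": [20, 30], "max_leaf_nodes": [30]},
    "SoftImpute": {"variance_explained": [0.5, 0.7, 0.9]},
    "PPCA": {"variance_explained": [0.5, 0.7, 0.9]},
    "DAE": {"dropout": [0.25, 0.4, 0.5], "batch_size": [64],
            "theta": [5, 7, 10], "epochs": [500]},
    "GAIN": {"alpha": [0.1, 1, 10], "hint_rate": [0.5, 0.9],
             "batch_size": [64], "epochs": [10_000]},
}

def configurations(grid):
    """Yield (algorithm, hyper-parameters, with_bi) for every combination."""
    for algorithm, params in grid.items():
        names = list(params)
        for values in product(*(params[name] for name in names)):
            for with_bi in (False, True):   # assumed: each setting with and without BI
                yield algorithm, dict(zip(names, values)), with_bi

configs = list(configurations(grid))
per_algorithm = {a: sum(1 for c in configs if c[0] == a) // 2 for a in grid}
print(len(configs), per_algorithm)
```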
    Table 3.
    | Base Imp. Method | p-value | q-value |
    |---|---|---|
    | MF | 0.036 | 0.182 |
    | PPCA | 0.064 | 0.182 |
    | GAIN | 0.091 | 0.182 |
    | MM | 0.148 | 0.222 |
    | SOFT | 0.31 | 0.375 |
    | DAE | 0.552 | 0.552 |
    Table 3. p-values of the Matched t-test and the q-values after FDR Correction (Sorted)
    Only MF has p-value < 0.05. GAIN and PPCA have p-value < 0.1. Setting the q-value threshold to 0.25 leads to accepting the hypothesis that BIs are beneficial for four algorithms (MF, GAIN, PPCA, MM), with 25% (one of four) of these discoveries expected to be false on average.
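    The q-values in Table 3 are consistent with the Benjamini-Hochberg procedure applied to the six p-values. The table does not name the FDR procedure, so Benjamini-Hochberg is an assumption here, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min                  # enforce monotone q-values
    return q

p = {"MF": 0.036, "PPCA": 0.064, "GAIN": 0.091, "MM": 0.148, "SOFT": 0.31, "DAE": 0.552}
q = dict(zip(p, benjamini_hochberg(list(p.values()))))
accepted = [name for name, value in q.items() if value <= 0.25]

for name in p:
    print(name, round(q[name], 3))
```

    With the rounded p-values above, the sketch gives 0.182 for MF, PPCA, and GAIN, 0.222 for MM, and 0.552 for DAE, matching the table. For SOFT it gives 0.372 rather than the reported 0.375, presumably because the reported p-value of 0.31 is itself rounded.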
    Table 4.
    | Name | Category | Description |
    |---|---|---|
    | inst_to_attr | General | Samples to features ratio |
    | Minority Class % | General | % of minority class |
    | nr_attr | General | Number of features |
    | nr_inst | General | Number of samples |
    | n_num | General | Number of numerical features |
    | n_cat | General | Number of categorical features |
    | % NA | Missing | % of missing values in data |
    | % samples /w NA | Missing | % of samples with missing values |
    | % features /w NA | Missing | % of features with missing values |
    | % NA/Feat. /w (NA 1+%) | Missing | Mean % of missing values per feature with more than 1% missing |
    | # components 50% | Clustering | Number of components that explain 50% of data variance |
    | # components 70% | Clustering | Number of components that explain 70% of data variance |
    | # components 90% | Clustering | Number of components that explain 90% of data variance |
    | Slh (k=2) | Clustering | Mean Silhouette Coefficient of all samples when using 2 clusters |
    | Slh (k=3) | Clustering | Mean Silhouette Coefficient of all samples when using 3 clusters |
    | Slh (k=4) | Clustering | Mean Silhouette Coefficient of all samples when using 4 clusters |
    Table 4. Meta-features Used in the Meta-level Analysis
    The first column contains the name of the meta-feature, the second column denotes the category of the meta-feature, and the third column provides a brief explanation of the meta-feature.
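    The missingness-related meta-features can be computed directly from the mask of missing entries. The sketch below follows the descriptions in Table 4 for a numerical matrix with NaN marking missing values; the function name `missingness_meta_features` and the dictionary keys are hypothetical, and the clustering meta-features are omitted.

```python
import numpy as np

def missingness_meta_features(X):
    """Compute the 'Missing' meta-features and the samples-to-features ratio."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    per_feature = missing.mean(axis=0) * 100          # % missing in each feature
    above_1 = per_feature[per_feature > 1.0]          # features with more than 1% missing
    return {
        "inst_to_attr": X.shape[0] / X.shape[1],
        "pct_na": float(missing.mean() * 100),
        "pct_samples_with_na": float(missing.any(axis=1).mean() * 100),
        "pct_features_with_na": float(missing.any(axis=0).mean() * 100),
        "pct_na_per_feature_above_1": float(above_1.mean()) if above_1.size else 0.0,
    }

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, np.nan],
              [4.0, 5.0]])
meta = missingness_meta_features(X)
```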
    Table 5.
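    | Study | #Datasets | Mechanism | % Missing values | #Imp. methods | NNs | BI | System | FS | Tuning | #Models | Eval | Metric | Meta |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | [38] | 6B | Nat | 7–84% | 3N, 2C, 2M | No | No | Adhoc | No | Imp+Pred | 7 | R-TT (70-30) | ACC-F1 | No |
    | [81] | 13B | Nat | 0.6–33.6% | 2N, 1C, 4M | No | No | Adhoc | No | No | 5 | TT (80-20) | AUC-F1 | No |
    | [57] | 2B | MC, MR | 0–40% \(^{**}\) | 7C | No | No | Adhoc | No | No | 3 | TT (66.6-33.3) | ACC | No |
    | [8] | 5B, 5R | MC-MN | 10–50% | 8M | No | No | Adhoc | No | Imp | 4 | R-TT (50-50) | ACC-R2 | No |
    | [30] | 31B, 21R, 17M | MC, MR, MN | 1–50% \(^{*}\) | 6M | Yes | No | Adhoc | No | Imp | 2 | CV (3-5) | RMSE-F1 | No |
    | [55] | 10B, 3R | Nat | | 7M | No | Yes | Adhoc | No | Pred | 3 | NCV | ACC | No |
    | [19] | 23B,M | MC | 7% | 6M | No | No | AutoML | Yes | | 10 | R-TT (75-25) | ACC | No |
    | [49] | 5B | Nat | | 4N, 2C, 1M | No | No | AutoML | | Imp+Pred | Ensemble | CV (5) | B-ACC | No |
    | Ours | 35B | N, MC, MR | 1–72% | 6M | Yes | Yes | AutoML | Yes | Imp+Pred | 4 | TT (50-50) | ACC-F1-AUC | Yes |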
    Table 5. An Overview of Related Work on Predictive Modeling
    Most benchmarks use datasets with either simulated or native missing values, but not both. Abbreviations: the # symbol means number, and “—” denotes that the paper does not mention any details about the topic. In the #Datasets column, B, M, and R denote binary, multiclass, and regression datasets, respectively. In the Mechanism column, Nat, MC, MR, and MN denote Native, MCAR, MAR, and MNAR, respectively. In the #Imp. methods column, N, C, and M denote numerical, categorical, and mixed imputation methods, respectively. The NNs column denotes whether Neural Network imputation methods were included. BI means that methods were extended with BIs, FS means that feature selection was included in the pipeline, and Tuning represents whether the study tuned the imputation methods (Imp), the predictive models (Pred), or both (Imp+Pred). The Eval column presents the evaluation methodology: R denotes repeated; TT denotes a train-test split, with the numbers in parentheses giving the percentages of the train and test sets, respectively; CV denotes cross-validation, with the number in parentheses giving the number of folds; and NCV denotes Nested Cross-Validation. The Metric column denotes the metric used in the evaluation: ACC is classification accuracy, B-ACC is balanced accuracy, F1 is the F1-score, RMSE is the root mean squared error, and AUC is the Area Under the ROC Curve. Meta presents whether a study has conducted a meta-level analysis. \(^{*}\) One of the features was made missing. \(^{**}\) Missing values were only generated on the train data.
    Table 6.
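    | Dataset | ID | Samples | Features | #Numerical | #Categorical | Missing % | Imbalance ratio | #Feat with miss>1% | %Missing/Feature |
    |---|---|---|---|---|---|---|---|---|---|
    | analcatdata_reviewer | 1,008 | 379 | 7 | 0 | 7 | 51.56 | 0.43 | 7 | 51.56 |
    | audiology | 999 | 226 | 69 | 0 | 69 | 2.03 | 0.25 | 6 | 23.23 |
    | anneal | 989 | 898 | 38 | 16 | 22 | 64.98 | 0.24 | 29 | 85.15 |
    | autoHorse | 840 | 205 | 25 | 17 | 8 | 1.11 | 0.40 | 4 | 6.46 |
    | braziltourism | 957 | 412 | 8 | 7 | 1 | 2.91 | 0.23 | 2 | 10.68 |
    | bridges | 328 | 107 | 11 | 4 | 7 | 6.03 | 0.41 | 7 | 9.35 |
    | cjs | 1,024 | 2,796 | 34 | 32 | 2 | 71.64 | 0.24 | 28 | 86.97 |
    | colic | 27 | 368 | 22 | 7 | 15 | 23.80 | 0.37 | 19 | 27.50 |
    | colleges_aaup | 897 | 1,161 | 15 | 13 | 2 | 1.47 | 0.30 | 6 | 3.68 |
    | cylinder-bands | 6,332 | 540 | 39 | 24 | 15 | 4.74 | 0.42 | 23 | 7.93 |
    | dresses-sales | 23,381 | 500 | 12 | 1 | 11 | 13.92 | 0.42 | 5 | 33.04 |
    | eucalyptus | 990 | 736 | 19 | 14 | 5 | 3.21 | 0.29 | 6 | 9.95 |
    | hepatitis | 55 | 155 | 19 | 6 | 13 | 5.67 | 0.21 | 11 | 9.56 |
    | hungarian | 231 | 294 | 13 | 12 | 1 | 20.46 | 0.36 | 5 | 52.93 |
    | kdd_el_nino-small | 839 | 782 | 8 | 8 | 0 | 7.45 | 0.35 | 4 | 14.90 |
    | mushroom | 24 | 8,124 | 22 | 0 | 22 | 1.39 | 0.48 | 1 | 30.53 |
    | pbcseq | 802 | 1,945 | 17 | 13 | 4 | 3.43 | 0.50 | 6 | 9.71 |
    | primary-tumor | 1,003 | 339 | 17 | 0 | 17 | 3.90 | 0.25 | 2 | 32.74 |
    | profb | 470 | 672 | 9 | 5 | 4 | 19.84 | 0.33 | 2 | 89.29 |
    | schizo | 466 | 340 | 14 | 12 | 2 | 17.52 | 0.48 | 11 | 22.30 |
    | sick | 38 | 3,772 | 29 | 7 | 22 | 5.54 | 0.06 | 7 | 22.96 |
    | soybean | 1,023 | 683 | 35 | 0 | 35 | 9.78 | 0.13 | 32 | 10.68 |
    | stress | 42,167 | 199 | 12 | 8 | 4 | 8.29 | 0.20 | 7 | 14.22 |
    | vote | 56 | 435 | 16 | 0 | 16 | 5.63 | 0.39 | 16 | 5.63 |
    | water-treatment | 940 | 527 | 36 | 36 | 0 | 2.86 | 0.15 | 22 | 4.53 |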
    Table 6. Binary Classification Real-World Datasets Used in the Comparative Evaluation
    The table contains the dataset name, ID, number of samples, number of features, number of numerical and categorical features, the missingness percentage in the whole dataset, the imbalance ratio, the number of features with more than 1% missing values, and the missingness percentage over the features with missing values.
    Complete Datasets: We selected 10 complete datasets from OpenML, where we introduce and simulate missingness. The number of features ranges from 9 to 135, the sample size ranges from 101 to 5,473, and the prevalence of the minority class ranges from 10% to 49%. Table 7 contains these values for each dataset, along with their OpenML id.
    Table 7.
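    Dataset | ID | #Samples | #Features | #Numerical | #Categorical | Minority Class %
    --- | --- | --- | --- | --- | --- | ---
    Australian | 40,981 | 690 | 14 | 8 | 6 | 0.44
    boston | 853 | 506 | 13 | 12 | 1 | 0.41
    churn | 40,701 | 5,000 | 20 | 16 | 4 | 0.14
    compas-two-years | 42,193 | 5,278 | 13 | 7 | 6 | 0.47
    image | 40,592 | 2,000 | 135 | 135 | 0 | 0.21
    page-blocks | 1,021 | 5,473 | 10 | 10 | 0 | 0.1
    parkinsons | 1,488 | 195 | 22 | 22 | 0 | 0.25
    segment | 958 | 2,310 | 19 | 19 | 0 | 0.14
    stock | 841 | 950 | 9 | 9 | 0 | 0.49
    zoo | 965 | 101 | 16 | 1 | 15 | 0.41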
    Table 7. Binary Complete Datasets in Which We Inject Missing Values
    The table reports the dataset name, ID, the number of samples, number of features, the imbalance ratio, and the number of numerical and categorical variables.

    3.2 Evaluation Task and Metric

    We note that the evaluation concerns only binary classification. The main metric of predictive performance is the Area Under the ROC Curve (AUC). To save space and make interpretation easier, we report classification accuracy and F1-score results in Appendix C. Each dataset is split into a 50% training set and a 50% hold-out test set that is used only for performance evaluation. Our experiments were conducted only once, due to the computational cost of the experimental procedure (see Section 3.6). We applied statistical tests to compensate for the lack of repeated experiments, which allows reliable conclusions to be drawn from the experimental results.
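    The evaluation protocol above can be illustrated with a minimal, standard-library sketch: a random 50/50 split and an AUC computed as the fraction of (positive, negative) pairs that the scores order correctly. The function names `split_half` and `roc_auc` are hypothetical, the split is not stratified (the article does not say whether stratification was used), and this is not the code used in the experiments.

```python
import random

def split_half(rows, seed=0):
    """Shuffle the rows and return (train, test) halves of (almost) equal size."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    half = len(idx) // 2
    return [rows[i] for i in idx[:half]], [rows[i] for i in idx[half:]]

def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

train, test = split_half(list(range(10)))
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

    The pairwise formulation is quadratic in the number of samples, which is acceptable for an illustration but not for large test sets. It also assumes that both classes are present in the evaluated set.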

    3.3 AutoML Environment

    To experiment with different imputation algorithms when CASH optimization is taking place, we employed the JADBio AutoML platform [73]. JADBio is a commercial product (a version of JADBio with basic functionality is freely available) but was offered to us for research purposes. JADBio includes feature selection as part of the ML pipeline and, thus, it can be used to study the effect of feature selection on imputation.
    A quick description of JADBio’s architecture now follows. For each dataset to analyze, an internal knowledge base system, called Algorithm and Hyper-Parameter Space selection (AHPS) in Reference [73], selects the feature construction, preprocessing, feature selection, and modeling algorithms to try, along with a set of values for their hyper-parameters. The AHPS also selects the configuration evaluation protocol, e.g., 10-fold cross-validation, repeated cross-validation, or hold-out to estimate the performance of each configuration and select the winning one. The knowledge in AHPS is engineered by experienced analysts but also induced by meta-level learning algorithms.
    The choices of the AHPS are based on the meta-features of the dataset (e.g., sample size, number of features), as well as the user preferences. For example, an algorithm that does not scale to the number of samples in the current dataset will not be selected by AHPS. The choice of the evaluation protocol also depends on the meta-features: For a typical-sized dataset, JADBio may run a 10-fold cross-validation, for a large balanced dataset a hold-out, while for a small-sample or an imbalanced dataset, it may run a repeated cross-validation protocol.
    Subsequently, JADBio executes all configurations, effectively performing a grid search for CASH optimization. However, JADBio includes pruning heuristics that may drop a configuration in the early folds of cross-validation if it is not deemed promising, departing from a pure grid search strategy [74]. Once all configurations have been executed, the final model is built on all available data using the winning configuration.
    The final performance of the model produced with the winning configuration is the cross-validated AUC adjusted for the bias incurred due to multiple tries (called “winner’s curse” in statistics). This adjustment is conceptually equivalent to adjusting p-values in multiple hypotheses testing. JADBio uses the BBC-CV algorithm for the performance estimate adjustment [74]. In Reference [73], experiments on 360 omics datasets of small sample size show that this estimation protocol returns slightly conservative out-of-sample AUC performances of the returned model. Nevertheless, for the purposes of this article, JADBio’s performance estimation was not used; instead, the performances on the 50% held-out set are reported.
    Regarding the settings of JADBio employed in this set of experiments, we note the following. One of the user preferences indirectly controls the execution time and the number of configurations to try and has the settings Preliminary, Typical, Extensive, with Extensive trying more configurations and performing a more thorough optimization. All subsequent experiments were run using the Preliminary setting to make the computational requirements manageable. The number of configurations may vary between datasets depending on their meta-features, but in our experiments, it ranges from 900 to over 1,000. The training protocol of JADBio depends on the sample size, the class imbalance, and other factors. For typical-size datasets, JADBio uses a repeated 10-fold cross-validation with the number of repeats ranging from 1 to 20. A heuristic procedure stops repetitions of cross-validation if no progress is detected. Overall, JADBio uses estimation protocols that execute each configuration between 10 and 200 times per dataset to choose the winning configuration and produce a model.
    JADBio optimizes over the following set of algorithms. For feature selection, JADBio uses the Lasso [69] and a variant of the SES algorithm [75] with an upper bound on the number of conditional independence tests to perform. For classification, it optimizes over Ridge Logistic Regression, Decision Tree, Random Forests, and Support Vector Machines with polynomial, linear, and radial basis kernels.
    To evaluate imputation algorithms, we embedded them into the JADBio configurations as the second step, after the standardization of continuous features and before feature selection, using the API provided. It is important to note that configurations are cross-validated as an atom, and hence, learning to impute is based only on the training data. This is necessary to avoid overestimating the performances of configurations and, correspondingly, of the imputation methods. Each imputation method returns an imputation model that is used to impute the test data before modeling is applied. It is worth noting that, even if the feature selection step selects a small subset S of features, when some values of S are missing in the test set, the imputation model may impute them based on other features. Hence, even if the predictive model requires just the features in S, the predictive pipeline may require more features. Specifically, all multivariate algorithms selected in the article require all features to impute. Hence, the predictive pipeline always requires all features when these algorithms are employed, even with feature selection.
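    The point that imputation is learned on training data only can be sketched with Mean/Mode imputation plus binary indicators. This is an illustrative stand-in, not JADBio's implementation: the function names, the use of `None` for missing cells, and the choice to append one indicator per column are all assumptions.

```python
from statistics import mean, mode

def fit_mm_imputer(train_rows, numeric_cols):
    """Learn one fill value per column (mean or mode) from the training rows only."""
    fill = {}
    for j in range(len(train_rows[0])):
        observed = [r[j] for r in train_rows if r[j] is not None]
        fill[j] = mean(observed) if j in numeric_cols else mode(observed)
    return fill

def transform(rows, fill, add_indicators=True):
    """Impute with the learned values; optionally append a 0/1 missingness flag per column."""
    out = []
    for r in rows:
        imputed = [fill[j] if v is None else v for j, v in enumerate(r)]
        if add_indicators:
            imputed += [1 if v is None else 0 for v in r]
        out.append(imputed)
    return out

train = [[1.0, "a"], [3.0, None], [None, "a"], [5.0, "b"]]
test = [[None, None], [100.0, "b"]]
fill = fit_mm_imputer(train, numeric_cols={0})   # statistics come from train only
test_imputed = transform(test, fill)             # test values never influence `fill`
```

    In a cross-validated setting, `fit_mm_imputer` would be called once per training fold, so that the held-out fold never contributes to the fill values. The sketch assumes every column has at least one observed training value.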

    3.4 Imputation Algorithms Implementations

    We used JADBio version 1.4.0 for our experiments. The MM and BI methods were already implemented by the development team of the tool. For PPCA and SoftImpute, we relied on third-party implementations in R from the pcaMethods package version 1.64.0 [65] and the ‘softImpute’ package version 1.4.1, respectively. We implemented MissForest in Python 3.8.4 using the IterativeImputer and Random Forest models from scikit-learn 1.0.1 [53]. PyTorch version 1.7.1 [52] was used for the implementation of GAIN and DAE. We adapted the DAE implementation found at https://github.com/Harry24k/MIDA-pytorch to closely follow the description of DAE by the original authors in Reference [21]. We employed the GAIN implementation from https://github.com/dhanajitb/GAIN-Pytorch.

    3.5 Machine Specifications

    The predictive performance experiments of the article were conducted on a fedora-powered VM using 8-core AMD Threadripper 3970x at 3.7 GHz with 12 GB RAM. The neural networks were trained using CPUs. The execution time results reported were measured on an eight-core AMD Ryzen-3600x at 4.6 GHz with 16 GB RAM and Windows 11 OS.

    3.6 Computational Resources Employed

    During the experiments, more than 41 days of CPU time have been spent training more than 80,000 configurations to conduct the experiments mentioned in the article.

    3.7 Availability of Code

    The code is available in the GitHub repository: https://github.com/mensxmachina/Imputation_in_AutoML. The repository contains scripts for the plots, the datasets, the meta-level analysis, as well as the basic implementation of each imputation algorithm.

    3.8 Exploring the Hyper-parameter Space of Imputation Algorithms

    In the experiments, 24 hyper-parameter (hp) value sets were tried for the imputation algorithms: MM (1 hp set), MissForest (2 hp sets), SoftImpute (3 hp sets), PPCA (3 hp sets), DAE (9 hp sets), and GAIN (6 hp sets). The values tried for each hyper-parameter are shown in Table 2. These choices were based on the defaults and suggestions of each algorithm’s authors. These 24 hp sets were coupled with all other choices of JADBio, multiplying by 24 the number of configurations normally tried. In subsequent experiments, each of these 24 hp sets is run on the original dataset, as well as on the dataset with the BI features included, leading to 48 different combinations.
    MM has no hyper-parameters and therefore does not need tuning. For MissForest, we train RF models with 250 trees, which offers higher imputation accuracy according to Reference [66]. However, we restrict the maximum depth of each tree and the maximum number of leaf nodes, because the trained model otherwise ran into memory storage issues (see Section 2.3.2). SoftImpute and PPCA require selecting the number of principal components to use as a hyper-parameter. Most papers in the literature do not report how these methods were tuned, which led us to develop the following heuristic: We select as many components as are required to explain \(x\%\) of the data variance. The values of x are shown in Table 2 as the values of “variance-explained.”
    The default hyper-parameters are used for DAE, with the exception of the dropout layer and the hidden layers’ dimensions. The range of the dropout rate is based on Reference [63], while the theta value is tuned within a neighborhood of the authors’ suggested default value. In the current implementation, we have three hidden layers for the encoder and the decoder. For each successive layer in the encoder, \(\theta\) hidden layer nodes are added, and the hyperbolic tangent is used as the activation function, as it produces better results for small and medium-sized datasets [21]. The model is trained using Stochastic Gradient Descent with an adaptive learning rate with a time decay factor of 0.99 and Nesterov’s accelerated gradient.
    The GAIN architecture consists of three hidden layers for the discriminator and the generator and uses the Rectified Linear Unit as the activation function. For GAIN, we tune two hyper-parameters: alpha and hint rate. These hyper-parameters are considered the most important for GAIN. Alpha balances the loss between the discriminator and the generator, while the hint rate is responsible for the training of the discriminator. Both DAE and GAIN are trained for the number of epochs suggested by their authors and use the sigmoid activation function for the output layer.
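    The variance-explained heuristic for SoftImpute and PPCA can be written down in a few lines. The sketch below assumes that per-component variances (e.g., eigenvalues of a covariance estimate) are already available; how they are obtained in the presence of missing values is not specified in the text, and the function name is hypothetical.

```python
def n_components_for_variance(component_variances, target_fraction):
    """Smallest k such that the top-k components explain >= target_fraction of the variance."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, var in enumerate(ordered, start=1):
        cumulative += var
        if cumulative / total >= target_fraction:
            return k
    return len(ordered)

variances = [5.0, 3.0, 1.0, 0.5, 0.5]   # made-up values, not taken from the article
k_low = n_components_for_variance(variances, 0.50)
k_mid = n_components_for_variance(variances, 0.75)
k_high = n_components_for_variance(variances, 0.85)
```

    Each value of x listed under “variance-explained” would then translate into one candidate number of components per dataset.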

    4 Simulating Missing Data

    To experiment with varying percentages of missing values, as well as different missingness mechanisms, we simulated the presence of missing values in the complete datasets presented in Appendix A.2.

    4.1 Simulating Missing Completely at Random Data

    Under MCAR, values are missing with a given probability (percentage), independently of any other factors such as the value itself or the values of other features. To simulate missing values at realistic missingness percentages, we sampled 64 real-world datasets from the OpenML repository with varying characteristics (see Section B.1). We then computed the 25%, 50%, and 75% quantiles of their missingness percentages. Features with less than 1% of missing values were excluded from the calculation, as their missing values probably stem from typos or other non-systematic sources. The quantile values turn out to be about 10%, 25%, and 50% of missingness. These quantile values are then used as the missingness percentages in both the MCAR and MAR simulation experiments. We then introduced missing values with the given percentages into the 10 complete datasets described in Section 3.1. Even though it is trivial to introduce MCAR missing values, for consistency reasons, we employed the code available in Reference [48], which is also used for the MAR simulations below. To simulate the MCAR mechanism, the software discards values uniformly at random from the dataset at the specified missingness percentage.
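    A minimal sketch of MCAR injection, under the assumption that a fixed fraction of all cells is removed uniformly at random, is given below. It is not the software of Reference [48], whose exact sampling scheme may differ (for example, an independent coin flip per cell); `inject_mcar` is a hypothetical name and `None` marks a missing cell.

```python
import random

def inject_mcar(rows, missing_fraction, seed=0):
    """Return a copy of rows in which a fixed fraction of cells, chosen uniformly, is None."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    n_missing = round(missing_fraction * len(cells))
    masked = set(rng.sample(cells, n_missing))
    return [[None if (i, j) in masked else v for j, v in enumerate(r)]
            for i, r in enumerate(rows)]

data = [[float(i + j) for j in range(4)] for i in range(10)]   # 40 complete cells
masked_data = inject_mcar(data, 0.25)
n_none = sum(v is None for r in masked_data for v in r)
```

    Because the mask is drawn without looking at any value, the missingness is independent of both the removed values and the other features, which is the defining property of MCAR.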

    4.2 Simulating Missing at Random Data

    Under MAR, values are missing with a probability (percentage) that depends (is conditional) on other observed features, i.e., \(P(I_j = 1|F_{k_1}, \ldots , F_{k_m})\) . To realistically simulate data under MAR, one needs to decide (a) the number of features upon which the probability depends, (b) the functional form of the conditional probability function, and (c) the set \(\lbrace F_{k_1}, \ldots , F_{k_m}\rbrace\) . To answer (a) we needed a realistic estimate of the number m of features in the conditional probability. To that end, in the corpus of the 10 real-world binary datasets of Table 9, we randomly selected one feature with missing values as the target feature and then performed predictive modeling using JADBio, including feature selection. These experimental results suggest that, on average, a feature with missing values is dependent on 12 features, so we set \(m=12\) . Subsequently, for each \(F_j\) we randomly selected a set of m other features with uniform probability. Finally, for the functional form of P, we used a logistic regression model: \(P(I_j = 1|F_{k_1}=f_1, \ldots , F_{k_m}=f_m) = \frac{1}{1+e^{\langle -w, f\rangle }}\) , where w is a vector of coefficients drawn at random from a Gaussian distribution, f is the vector of values of the features \(F_{k_l}\) , and \(\langle \cdot , \cdot \rangle\) denotes the inner product. For the simulation, the software [48] was also used. The software allows the simulation of MAR missing data, as described above, with prespecified missingness percentages. The same percentages as in the MCAR case were used.
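    The logistic missingness model can be sketched as follows. This is only an illustration of the formula, not the software of Reference [48]: the function names are hypothetical, the toy example uses m = 2 conditioning features instead of 12, the weights are drawn from a standard normal as an assumption, and the calibration needed to hit a prespecified missingness percentage is not shown.

```python
import math
import random

def missing_probabilities(rows, cond_cols, weights):
    """P(I_j = 1 | f) = 1 / (1 + exp(-<w, f>)), with f read from the conditioning columns."""
    probs = []
    for r in rows:
        score = sum(w * r[k] for w, k in zip(weights, cond_cols))
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return probs

def inject_mar(rows, target_col, cond_cols, weights, seed=0):
    """Set the target column to None with a probability driven by other, observed columns."""
    rng = random.Random(seed)
    out = [list(r) for r in rows]
    for r, p in zip(out, missing_probabilities(rows, cond_cols, weights)):
        if rng.random() < p:
            r[target_col] = None
    return out

rng = random.Random(1)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
weights = [rng.gauss(0, 1) for _ in range(2)]   # assumed standard normal coefficients
masked_rows = inject_mar(rows, target_col=0, cond_cols=[1, 2], weights=weights)
```

    Only the target column is masked, so the conditioning features stay fully observed, which is what makes the mechanism MAR rather than MNAR.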

    5 Comparative Evaluation On Real-World Datasets with Missing Values

    The 25 real-world datasets with missing values were analyzed with JADBio, optimizing over configurations that include the imputation algorithms selected and their hyper-parameter values.

    5.1 Binary Indicators Improve the Predictive Performance

    First, we partition the results by imputation algorithm, i.e., we consider the results achieved when optimizing over any single imputation algorithm. Specifically, for each imputation algorithm, on a given dataset, the best AUC was selected over all configurations that include the specific algorithm. We will refer to this best AUC simply as the AUC of a given imputation algorithm in all subsequently reported results. Figure 3(a) shows the difference in AUC performance when Binary Indicators are included versus when they are excluded. As we can see, MM and GAIN have the largest average increase, by 0.0074 AUC and 0.0056 AUC, respectively. MF when extended with BI shows an average AUC increase of 0.0046, while PPCA shows an increase of 0.0038 AUC. The lowest average improvement is achieved by SOFT, which improves by 0.00022 AUC. Contrary to the above observations, DAE is the only method that does not benefit from the addition of BIs, with a negligible decrease of 0.0004 AUC when BIs are included. Figure 3(b) offers a complementary view. It illustrates the count of datasets per imputation method where the inclusion of BI is beneficial to the downstream performance. We observe that for every imputation method, including BI is beneficial in most instances. SOFT exhibits improvement across 19 of 25 datasets, and GAIN, MM, and PPCA across 17 datasets. Finally, including BI leads to enhancements in DAE and MF across 16 and 15 datasets, respectively.
    Fig. 3.
    Fig. 3. Panel (a) denotes the performance of each imputation method when including binary indicators minus the performance without binary indicators for each dataset. Panel (b) denotes the count of datasets per imputation method that indicators improve or deteriorate the performance.
    To determine the statistical significance of the results, we performed a paired t-test for each algorithm, with the null hypothesis H0 being that the BI+base has worse performance than the base method. The resulting p-values were converted to q-values with the Benjamini-Hochberg [7] method, to control for multiple testing. Table 3 shows the results. Using a q-value threshold of 0.25, there are four statistically significant results, resulting in accepting the alternative hypotheses that MF, GAIN, PPCA, and MM improve their performance when BIs are present. At the level of \(q=0.25=\frac{1}{4}\) this implies that, in the worst case, we expect one of these four discoveries to be false. While the inclusion of BIs may, in the worst case, double the dimensionality of the dataset, based on the above results, we would recommend their inclusion when the above imputation algorithms are employed.
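    The conversion of p-values to Benjamini-Hochberg q-values can be sketched in plain Python. The six p-values below are invented for illustration and are not those of Table 3; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

p = [0.01, 0.04, 0.03, 0.20, 0.50, 0.90]   # made-up p-values, one per method
q = benjamini_hochberg(p)
discoveries = [i for i, value in enumerate(q) if value <= 0.25]
```

    With a threshold of 0.25, the expected fraction of false discoveries among the selected hypotheses is at most one quarter, which is the reasoning behind the "one of these four" statement above.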

    5.2 BI+DAE Is the Best Imputation Method in Real-world Data

    Figure 4 shows the average ranking achieved by each algorithm, computed with the Autorank tool [26] (a lower rank is better). To avoid clutter, and based on the results of Section 5.1, we only show results when BIs are included. The horizontal black bars in the graph connect methods whose ranks are not statistically significantly different, according to a non-parametric Friedman test and a post hoc Nemenyi test.
    Fig. 4.
    Fig. 4. The average rank of each imputation method when binary indicators are used. BI+DAE has the lowest average ranking. Rank differences are not statistically significant except for the average rank between BI+DAE and BI+PPCA.
    Results show that BI+DAE is the highest ranking algorithm with an average rank of 2.84, followed by BI+MM with a 2.94 average ranking, BI+MF with 3.5, and BI+GAIN with 3.56, although their rank difference is not statistically significant at the 0.05 level. The two lowest-ranked methods are BI+SOFT and BI+PPCA, with 3.74 and 4.42 average rankings, respectively. BI+DAE’s rank is statistically significantly lower compared to BI+PPCA.
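    For readers unfamiliar with average ranks, the following sketch computes them from per-dataset AUCs, giving tied methods the mean of the rank positions they share. It does not reproduce Autorank or the Friedman and Nemenyi tests; the method names and AUC values are placeholders.

```python
def average_ranks(auc_by_method):
    """Map method -> mean rank over datasets, where rank 1 is the highest AUC."""
    methods = list(auc_by_method)
    n_datasets = len(auc_by_method[methods[0]])
    totals = {m: 0.0 for m in methods}
    for d in range(n_datasets):
        scores = {m: auc_by_method[m][d] for m in methods}
        for m in methods:
            better = sum(1 for o in methods if scores[o] > scores[m])
            tied = sum(1 for o in methods if scores[o] == scores[m])  # includes m itself
            # Tied methods share the mean of the positions they occupy.
            totals[m] += better + (tied + 1) / 2
    return {m: totals[m] / n_datasets for m in methods}

aucs = {                      # made-up AUCs on three datasets
    "A": [0.9, 0.8, 0.7],
    "B": [0.8, 0.8, 0.9],
    "C": [0.7, 0.6, 0.8],
}
ranks = average_ranks(aucs)
```

    A useful sanity check is that the average ranks of k methods always sum to k(k+1)/2; the six values reported above sum to 21, as expected for six methods.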

    5.3 BI+MM Is the Best Method When Considering the Efficiency–Effectiveness Tradeoff

    We now study the tradeoff between predictive effectiveness and computational efficiency of the algorithms. In Figure 5(a) we use BI+MM as the baseline. A point (execution run) corresponds to AutoML predictive modeling on a dataset with a given imputation algorithm. This results in \(5 \times 25 = 125\) points. The x-axis shows the effectiveness ratio, defined as the AUC corresponding to the point divided by the corresponding AUC of BI+MM. Similarly, the y-axis shows the efficiency ratio, defined as the training time of the point divided by the corresponding time of BI+MM. Hence, points in the top-left/bottom-right quadrant correspond to runs where BI+MM dominates/is dominated by another algorithm on the same dataset in both time and AUC. Notice that the scale of the y-axis is logarithmic. Larger points correspond to the mean value of an imputation method over all datasets.
    Fig. 5.
    Fig. 5. Panel (a) denotes the efficiency-effectiveness tradeoff by using BI+MM as reference algorithm. Panel (b) illustrates the efficiency-effectiveness tradeoff when BI+DAE is used as reference. Each point in panels (a) and (b) represents a dataset. The x-axis shows the effectiveness, defined as the ratio of the AUC achieved by the marked imputation method divided by the corresponding performance of the reference method, for a dataset. Similarly, the y-axis shows the efficiency that is defined as the training time of the marked imputation method divided by the time of the reference method for a dataset.
    In total, BI+MM is inferior in terms of predictive performance in 42 cases (16 of the 25 datasets) and, unsurprisingly, is never dominated in terms of AUC performance and efficiency at the same time. The training time of the other algorithms is orders of magnitude longer than that of BI+MM. Moreover, in 83 of the 125 points, BI+MM is both more efficient and more effective than the compared method. Only BI+DAE achieves, on average, a higher AUC than BI+MM. All the other imputation methods are, on average, slower to train and worse in terms of predictive performance.
    Figure 5(b) shows the same results with BI+DAE as the baseline. In contrast to BI+MM above, BI+DAE dominates the other imputation methods on both predictive performance and training time in only 35 of 125 combinations. In 41 of 125 cases, it provides better predictive performance but at a higher computational cost. In 15 points, BI+DAE is faster but has lower predictive performance than the compared imputation method. Finally, 34 times it is dominated in both metrics. In conclusion, if a single imputation algorithm is to be used, BI+MM arguably provides the best tradeoff between computational time and predictive performance.
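    The quadrant analysis of Figure 5 amounts to computing two ratios per run and checking on which side of 1 each falls. The sketch below makes this explicit; the function name, the category labels, and the handling of exact ties (counted against the compared method) are assumptions, and the numbers in the example are invented.

```python
def classify_point(auc, train_time, ref_auc, ref_time):
    """Compare one run with the reference method on the same dataset."""
    effectiveness = auc / ref_auc        # x-axis: above 1 means higher AUC than the reference
    efficiency = train_time / ref_time   # y-axis: above 1 means slower than the reference
    better_auc = effectiveness > 1
    faster = efficiency < 1
    if better_auc and faster:
        return "dominates reference"
    if better_auc:
        return "better AUC, slower"
    if faster:
        return "faster, no AUC gain"
    return "dominated by reference"

# Hypothetical runs: (AUC, training time in seconds) against a reference of (0.85, 1.0).
label_a = classify_point(0.90, 50.0, 0.85, 1.0)
label_b = classify_point(0.80, 50.0, 0.85, 1.0)
label_c = classify_point(0.90, 0.5, 0.85, 1.0)
```

    Counting the labels over all 125 (method, dataset) pairs yields the four tallies reported in the text for a given reference method.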

    5.4 Best Imputation Subset for Maximizing AUC Performance Is {BI+MM, BI+DAE}

    In this section, we examine the results from a different perspective, trying to answer the question: What is the minimal-size subset of algorithms to try to achieve close-to-maximum AUC performance? To answer this question, we have implemented a simple greedy algorithm, where we assume the analyst starts with the subset \(\lbrace\) BI+MM \(\rbrace\) as an efficient baseline and adds algorithms to consider. In each iteration, the algorithm that leads to the largest AUC improvement of the subset when added is selected for inclusion. The maximum AUC performance is the sum of the maximum AUC for each dataset when including all imputation methods in the optimization pipeline, averaged across all datasets.
    The results are shown in Figure 6 and quantitatively in Table 11. The x-axis shows the imputation algorithms in order of addition to the subset. For each algorithm, several hyper-parameter combinations are tried and combined with all other feature selection and modeling choices by AutoML. Hence the total number of configurations tried is multiplied by this factor. At each tick, the multiplication factor for the whole set is depicted in the parenthesis next to the name of the algorithm added to the set in that step. For example, BI+MM has no hyper-parameters ( \(1\times\) ), while BI+DAE has 9, so the multiplicative factor of the set \(\lbrace BI+MM, BI+DAE\rbrace\) is \(10\times\) . The y-axis is the average (over all datasets) relative AUC achieved when performance is optimized over all algorithms and their hyper-parameters in the corresponding subset.
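    The greedy procedure described above can be sketched as follows, assuming a table with the best AUC of each imputation method on each dataset. The function name and the AUC values are illustrative only, and ties between candidates are broken by dictionary order, which the article does not specify.

```python
def greedy_subset(auc, start):
    """Return [(method, relative AUC of the subset so far), ...] in order of addition."""
    methods = list(auc)
    n = len(auc[start])

    def mean_best(subset):
        # Per dataset, the AutoML search would keep the best method in the subset.
        return sum(max(auc[m][d] for m in subset) for d in range(n)) / n

    max_auc = mean_best(methods)          # all imputation methods available
    chosen = [start]
    path = [(start, mean_best(chosen) / max_auc)]
    while len(chosen) < len(methods):
        remaining = [m for m in methods if m not in chosen]
        best = max(remaining, key=lambda m: mean_best(chosen + [m]))
        chosen.append(best)
        path.append((best, mean_best(chosen) / max_auc))
    return path

auc = {                                   # made-up AUCs on three datasets
    "BI+MM": [0.80, 0.90, 0.70],
    "BI+DAE": [0.85, 0.88, 0.75],
    "BI+MF": [0.79, 0.91, 0.74],
}
path = greedy_subset(auc, start="BI+MM")
```

    The relative AUC is non-decreasing along the path and reaches 1 once every method has been added, mirroring the shape of the curve in Figure 6.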
    Table 8.
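    Dataset | ID | Samples | Features | #Numerical | #Categorical | Missing % | Minority Class % | #Feat miss>1% | %Missing/Feature | Type
    --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
    adult | 179 | 48,842 | 14 | 6 | 8 | 0.95 | 0.24 | 3 | 4.41 | Binary
    albert | 41,147 | 425,240 | 78 | 78 | 0 | 13.64 | 0.50 | 43 | 24.73 | Binary
    analcatdata_reviewer | 1,008 | 379 | 7 | 0 | 7 | 51.56 | 0.43 | 7 | 51.56 | Binary
    anneal | 989 | 898 | 38 | 16 | 22 | 64.98 | 0.24 | 29 | 85.15 | Binary
    aps_failure | 41,138 | 76,000 | 170 | 170 | 0 | 8.35 | 0.02 | 160 | 8.83 | Binary
    ASP-POTASSCO-classification | 41,705 | 1,294 | 142 | 139 | 3 | 9.94 | 0.02 | 138 | 10.23 | MultiClass
    ASP-POTASSCO-regression | 41,704 | 14,234 | 142 | 138 | 4 | 9.94 | 0.00 | 138 | 10.23 | Regression
    audiology | 999 | 226 | 69 | 0 | 69 | 2.03 | 0.25 | 6 | 23.23 | Binary
    autoHorse | 840 | 205 | 25 | 17 | 8 | 1.11 | 0.40 | 4 | 6.46 | Binary
    braziltourism | 957 | 412 | 8 | 7 | 1 | 2.91 | 0.23 | 2 | 10.68 | Binary
    bridges | 328 | 107 | 11 | 4 | 7 | 6.03 | 0.41 | 7 | 9.35 | Binary
    Census-Income-KDD | 42,750 | 199,523 | 41 | 13 | 28 | 5.08 | 0.06 | 7 | 29.72 | Binary
    cjs | 1,024 | 2,796 | 34 | 32 | 2 | 71.64 | 0.24 | 28 | 86.97 | Binary
    Code_Smells_Data_Class | 43,079 | 86,467 | 66 | 66 | 0 | 49.99 | 0.00 | 62 | 53.20 | Regression
    colic | 27 | 368 | 22 | 7 | 15 | 23.80 | 0.37 | 19 | 27.50 | Binary
    colleges | 42,727 | 7,063 | 47 | 31 | 16 | 31.42 | 0.00 | 30 | 49.19 | Regression
    colleges_aaup | 897 | 1,161 | 15 | 13 | 2 | 1.47 | 0.30 | 6 | 3.68 | Binary
    colleges_usnews | 930 | 1,302 | 33 | 32 | 1 | 18.22 | 0.47 | 25 | 23.96 | Binary
    cylinder-bands | 6,332 | 540 | 39 | 24 | 15 | 4.74 | 0.42 | 23 | 7.93 | Binary
    Domainome | 41,533 | 1,623 | 9,838 | 9,838 | 0 | 82.17 | 0.35 | 9,688 | 83.44 | Binary
    dresses-sales | 23,381 | 500 | 12 | 1 | 11 | 13.92 | 0.42 | 5 | 33.04 | Binary
    echoMonths | 222 | 130 | 9 | 7 | 2 | 8.29 | 0.00 | 6 | 12.31 | Regression
    eucalyptus | 990 | 736 | 19 | 14 | 5 | 3.21 | 0.29 | 6 | 9.95 | Binary
    fishcatch | 232 | 158 | 7 | 7 | 0 | 7.87 | 0.00 | 1 | 55.06 | Regression
    fps-in-video-games | 42,737 | 425,833 | 44 | 33 | 11 | 6.94 | 0.00 | 12 | 25.44 | Regression
    hepatitis | 55 | 155 | 19 | 6 | 13 | 5.67 | 0.21 | 11 | 9.56 | Binary
    house_prices_nominal | 42,563 | 1,460 | 79 | 36 | 43 | 6.04 | 0.00 | 16 | 29.74 | Regression
    hungarian | 231 | 294 | 13 | 12 | 1 | 20.46 | 0.36 | 5 | 52.93 | Binary
    ipums_la_97-small | 993 | 7,019 | 60 | 34 | 26 | 11.42 | 0.04 | 18 | 38.06 | MultiClass
    ipums_la_98-small | 381 | 7,485 | 60 | 34 | 26 | 11.59 | 0.01 | 17 | 40.91 | MultiClass
    ipums_la_99-small | 378 | 8,844 | 60 | 34 | 26 | 9.71 | 0.02 | 18 | 32.36 | MultiClass
    jungle_chess_2pcs_endgame_rat_panther | 41,002 | 5,880 | 46 | 18 | 28 | 1.30 | 0.23 | 6 | 10.00 | MultiClass
    KDD98 | 42,343 | 82,318 | 477 | 358 | 119 | 11.30 | 0.12 | 87 | 61.98 | Binary
    KDDCup09-Upselling | 1,112 | 50,000 | 15,000 | 13,391 | 1,609 | 3.35 | 0.07 | 608 | 82.59 | Binary
    KDDCup09_churn | 42,759 | 50,000 | 230 | 192 | 38 | 69.78 | 0.07 | 205 | 78.28 | Binary
    kdd_coil_1 | 567 | 316 | 11 | 8 | 3 | 1.61 | 0.00 | 3 | 4.85 | Regression
    kdd_el_nino-small | 839 | 782 | 8 | 8 | 0 | 7.45 | 0.35 | 4 | 14.90 | Binary
    kick | 41,162 | 72,983 | 32 | 17 | 15 | 6.39 | 0.12 | 5 | 40.51 | Binary
    lymphoma_2classes | 1,101 | 45 | 4,026 | 4,026 | 0 | 3.28 | 0.49 | 2,116 | 6.25 | Binary
    meta | 566 | 528 | 21 | 18 | 3 | 4.55 | 0.00 | 3 | 31.82 | Regression
    MiceProtein | 40,966 | 1,080 | 81 | 77 | 4 | 1.60 | 0.10 | 8 | 14.69 | MultiClass
    Midwest_Survey_nominal | 42,532 | 2,778 | 27 | 1 | 26 | 1.95 | 0.03 | 5 | 10.51 | MultiClass
    mlr_ranger_rng | 42,458 | 278,863 | 14 | 8 | 6 | 3.56 | 0.00 | 1 | 49.69 | Regression
    mlr_svm_rng | 42,456 | 540,576 | 13 | 7 | 6 | 9.38 | 0.00 | 2 | 60.95 | Regression
    Moneyball | 41,021 | 1,232 | 14 | 11 | 3 | 20.87 | 0.00 | 4 | 73.05 | Regression
    mushroom | 24 | 8,124 | 22 | 0 | 22 | 1.39 | 0.48 | 1 | 30.53 | Binary
    NewFuelCar | 41,506 | 36,203 | 17 | 17 | 0 | 1.46 | 0.00 | 1 | 24.78 | Regression
    okcupid-stem | 42,734 | 50,789 | 19 | 3 | 16 | 15.97 | 0.10 | 12 | 25.28 | MultiClass
    pbc | 524 | 418 | 18 | 17 | 1 | 16.47 | 0.00 | 12 | 24.66 | Regression
    pbcseq | 802 | 1,945 | 17 | 13 | 4 | 3.43 | 0.50 | 6 | 9.71 | Binary
    porto-seguro | 42,206 | 595,212 | 37 | 25 | 12 | 3.84 | 0.04 | 5 | 28.21 | Binary
    primary-tumor | 1,003 | 339 | 17 | 0 | 17 | 3.90 | 0.25 | 2 | 32.74 | Binary
    profb | 470 | 672 | 9 | 5 | 4 | 19.84 | 0.33 | 2 | 89.29 | Binary
    rl | 41,160 | 31,406 | 22 | 22 | 0 | 10.45 | 0.10 | 8 | 28.71 | Binary
    road-safety | 42,803 | 363,243 | 66 | 61 | 5 | 9.10 | 0.05 | 41 | 14.62 | MultiClass
    SAT11-HAND-runtime-regression | 41,980 | 4,440 | 116 | 113 | 3 | 5.27 | 0.00 | 10 | 61.15 | Regression
    schizo | 466 | 340 | 14 | 12 | 2 | 17.52 | 0.48 | 11 | 22.30 | Binary
    sick | 38 | 3,772 | 29 | 7 | 22 | 5.54 | 0.06 | 7 | 22.96 | Binary
    soybean | 1,023 | 683 | 35 | 0 | 35 | 9.78 | 0.13 | 32 | 10.68 | Binary
    speeddating | 40,536 | 8,378 | 122 | 61 | 61 | 2.87 | 0.00 | 109 | 3.17 | Binary
    stress | 42,167 | 199 | 12 | 8 | 4 | 8.29 | 0.20 | 7 | 14.22 | Binary
    us_crime | 315 | 1,994 | 127 | 126 | 1 | 15.48 | 0.00 | 24 | 81.91 | Regression
    vote | 56 | 435 | 16 | 0 | 16 | 5.63 | 0.39 | 16 | 5.63 | Binary
    water-treatment | 940 | 527 | 36 | 36 | 0 | 2.86 | 0.15 | 22 | 4.53 | Binary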
    Table 8. Datasets Used for Missing Value Simulation Experimental Setup
    The table contains the dataset name, id, number of samples, number of features, number of categorical and numeric features, Missingness percentage in the whole dataset, Minority Class %, the number of features with missing values over 1%, the missingness percentage over features with missing values, and, finally, the outcome type of each dataset.
    Table 9.
    Dataset | Missing Feature (target) | #Selected Features
    --- | --- | ---
    aps_failure | cn_006 | 25
    colleges_aaup | Average_salary-full_professors | 5
    colleges_usnews | Out-of-state_tuition | 18
    dresses-sales | V3 | 7
    eucalyptus | PMCno | 6
    hepatitis | ALBUMIN | 6
    hungarian | thalach | 3
    mushroom | stalk-root | 16
    pbcseq | presence_of_asictes | 7
    speeddating | attractive | 24
    Table 9. Summary of the Feature Selection Experiments for MAR Simulation
    On average, a missing feature depends on 12 other features.
    Table 10.
    Table 10. Number of Wins, Average Ranking, Average Metric Score, Average Difference from Best per Dataset and Average Difference from a Baseline Method (MM)
    Table 11.
    Table 11. Table Denotes the Methods That Are Added to the Imputation Set, the % of the Maximum Score for the Specified Metric Reached by Each Set, and the Configuration Complexity Increase for Each Set
    Fig. 6.
    Fig. 6. The percentage of maximum AUC achieved by each subset of imputation methods. Each tick in the x-axis shows the algorithm to add to the previous subset. The multiplier next to the name of an algorithm shows the factor by which the total number of configurations tried is multiplied, due to the different combinations of hyper-parameter values of the imputation algorithms. BI+DAE and BI+MM allow us to recover 99.69% of the maximum AUC while increasing the configuration space by 10 times.
    BI+MM, by itself, accounts for 98.69% of the maximum AUC. When BI+DAE is added to the mix, relative performance reaches 99.69%. The next best algorithm to add is BI+MF; 100% of AUC is reached when invoking all imputation algorithms. In summary, the addition of BI+MF, BI+PPCA, BI+GAIN, and BI+SOFT provides only marginal gains to the set \(\lbrace\) BI+MM, BI+DAE \(\rbrace\) .

    5.5 The Interplay between Feature Selection and Imputation

    Feature selection algorithms try to reduce the number of features that enter the model without sacrificing predictive performance. Feature selection is often the primary task in analysis, while the predictive model may be just a side-benefit. For example, a medical doctor may be more interested in the quantities that determine the risk of disease and may reveal new medical knowledge, rather than the risk model itself. Feature selection leads to more interpretable models that provide intuition into the domain. In fact, the solution to the feature selection problem is directly linked to the causal model that underlies data generation [72]. In other circumstances, it is important to reduce the cost of measuring the features to provide predictions. The cost may be measured in monetary units, the computational cost to compute the features or risk to a patient from medical procedures that measure these features.
    Figure 7 shows the impact of feature selection for each imputation algorithm on the real-world datasets. It shows the drop in AUC performance when feature selection is enforced in the final configuration vs. not enforced (i.e., optimizing over all configurations). For each algorithm, the drop is about two to three AUC points. In other extensive experiments with hundreds of complete (no missing values), small-sample, high-dimensional omics datasets, JADBio has been shown to reduce the number of features by a factor of 4,000 without a noticeable drop in AUC performance [73]. The results provide evidence that feature selection may be more challenging in the presence of missing values.
    Fig. 7.
    Fig. 7. The loss in predictive performance when enforcing feature selection in the pipeline for the real-world data. For all imputation methods, enforcing feature selection leads to a drop in predictive performance (red lines) by less than three AUC points (on average). While feature selection can reduce the required features to measure for an acceptable loss of performance in some applications, it is invalidated by the imputation models that need to measure all features.
    In any case, the problem of including both imputation and a feature selection step in the ML pipeline is that imputation invalidates feature selection, in some sense. Let us explain this statement with an example. Let us assume the pipeline that produces the final model consists of MF imputation, Lasso feature selection, and RF predictive modeling. Let us assume that Lasso selects the features \(\lbrace A, B, C\rbrace\) . If any of these values (say the value of A) is missing on a new sample, then the MF imputation model will impute it using its Random Forest for A, which takes some other subset of features as input. If any of those are also missing, then MF will invoke its Random Forests for each value that is missing, and so on, recursively. Hence, if there are missing values on the test samples, one may need to measure an arbitrarily large feature subset, not just the selected features. The storage required to apply the ML pipeline includes both the RF as well as the MF model, which in turn includes a Random Forest for every feature that may need imputation.
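    The recursive argument can be made concrete by computing the worst-case set of features that may have to be measured, i.e., the transitive closure of the selected features under the "is imputed from" relation. The dependency map below is entirely hypothetical and only extends the \(\lbrace A, B, C\rbrace\) example; it is not derived from any MissForest model.

```python
def features_to_measure(selected, imputer_inputs):
    """Worst case: every feature reachable from the selected ones via imputation inputs."""
    needed = set(selected)
    stack = list(selected)
    while stack:
        feature = stack.pop()
        for dep in imputer_inputs.get(feature, ()):
            if dep not in needed:
                needed.add(dep)
                stack.append(dep)
    return needed

# Hypothetical map: feature -> features used by the model that imputes it.
imputer_inputs = {
    "A": ["D", "E"], "B": ["A"], "C": ["F"],
    "D": ["G"], "E": [], "F": ["A"], "G": [],
}
needed = features_to_measure({"A", "B", "C"}, imputer_inputs)
```

    In this toy example, three selected features expand to seven features that may need to be measured, which illustrates why the savings promised by feature selection can vanish once an imputation model sits in front of it.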

    6 Comparative Evaluation On Datasets with Simulated Missing Values

    This section compares imputation methods on datasets with generated missing values. To that end, we compare the predictive performance of each imputation method under various missingness mechanisms and percentages. Additionally, we study the effect of feature selection as the missingness increases. The figures in this section illustrate the more general MAR case. The MCAR results are included in Appendix C.2 and are qualitatively similar. Finally, results regarding imputation accuracy can be found in Appendix C.7.

    6.1 BI+MF Is the Best Imputation Method in MCAR and MAR Simulated Missing Data

    Figure 8 presents the AUC performance results for MAR data (see Figure 16(b) for MCAR results). The reported value is the difference between the AUC at the specified missingness percentage and the AUC on the complete dataset. As expected, the average predictive performance of every imputation method decreases as the missingness percentage increases. In particular, increasing the missingness from 25% to 50% leads to a sizable performance drop for all imputation methods. The methods based on linear dimensionality reduction, namely BI+PPCA and BI+SOFT, are the most affected by this increase in missing values.
    Fig. 8. MAR data: AUC difference of each imputation method from the complete dataset. BI+MM is the best at 10% missingness. BI+MF and BI+MM are the top methods, tied at 25% missingness. BI+MF has the lowest avg. loss at 50%.
    Fig. 9. The percentage of maximum AUC achieved by each subset of imputation methods. Each tick in the x-axis shows the algorithm to add to the previous subset. The multiplier next to the name of an algorithm shows the factor by which the total number of configurations tried is multiplied, due to the different combinations of hyper-parameter values of the imputation algorithms. The set containing BI+MF and BI+MM recovers 99.43% of the maximum AUC for MAR data. The complexity of the configuration space increases by 3 times when including both MM and MF.
    Fig. 10. The relative efficiency for each imputation method against BI+MM versus the relative effectiveness in terms of AUC. Larger points indicate the mean values for a given algorithm. BI+MM wins 104 of 147 pairs in both efficiency and effectiveness. Only MF dominates MM on average in terms of effectiveness. However, MF is 23,000\(\times\) slower than MM.
    Fig. 11. Panels (a) and (c) illustrate improvement of BI extended imputation methods over the base method. Panels (b) and (d) show the count of datasets where BI extended methods are scoring higher/lower than base methods.
    Fig. 12. Panels (a) and (b) show the ranking of imputation methods for F1-score and accuracy score.
    Fig. 13. Panels (a) and (b) illustrate the maximum performance achieved by each imputation subset. For both F1 and accuracy, the set containing BI+MM and BI+DAE scores over 99% of the maximum performance.
    Fig. 14. Panels (a) and (b) denote the tradeoff in terms of relative effectiveness and relative efficiency between BI+MM and the other imputation methods for F1 and accuracy metrics, respectively.
    Fig. 15. Panels (a) and (b) denote the loss in predictive performance when enforcing feature selection in the pipeline for the real-world data. For all imputation methods, enforcing feature selection leads to a drop in predictive performance by less than 5% accuracy (on average).
    Fig. 16. Panels (a), (c), and (e) denote the tradeoff in terms of effectiveness and efficiency between BI+MM and the other imputation methods. Panels (b), (d), and (f) show the difference from complete data for each imputation method at various missingness levels. BI+MF is the best method for MCAR data. However, BI+MM exhibits good performance at a fraction of the cost.
    Figures 8 and 16(b) illustrate that in both MAR and MCAR data, respectively, MissForest combined with Binary Indicators is, on average, the best-performing method. Additionally, we note that PPCA and SOFT are the two worst imputation methods, especially as the missingness percentage increases. Table 12(a) in Appendix C.2 contains the quantitative results in detail and a detailed discussion on the ranking of the algorithms.
    Table 12. The Average Loss from the Complete Datasets by Each Imputation Method when Missing Data Are MCAR (Left) and MAR (Right)

    6.2 The Best Imputation Subset for Maximizing AUC Performance Is {BI+MM, BI+MF}

    We now identify the minimal-size algorithm subset with close-to-optimal performance for simulated missing data. We again use the simple greedy algorithm introduced in Section 5.4 and apply it to the MCAR and MAR simulated data results. The results for MAR are shown in Figure 9, which is analogous to Figure 6; the quantitative results are in Table 13(b). As shown in the figure, the \(\lbrace BI+MM, BI+MF\rbrace\) subset reaches over 99% of the maximum AUC for MAR data and would be the suggested set of algorithms to run in such problems. The results for MCAR are in Appendix C.2.6 and are qualitatively similar. The results for simulated missing values differ somewhat from the ones on the real datasets: BI+MF scores better than BI+DAE, which now ranks third. Possible reasons are discussed in Section 9.
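As a rough illustration of such a greedy search, the sketch below adds one imputation method at a time, always choosing the one that most increases the average best AUC over datasets. The exact procedure of Section 5.4 is not reproduced here; the scoring rule (mean over datasets of the best AUC within the subset, reported as a percentage of the value for the full set), the function names, and the AUC values are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Dict[str, float]]  # dataset -> method -> best AUC with that method


def subset_score(scores: Scores, subset: List[str]) -> float:
    """Mean over datasets of the best AUC reachable with the given methods."""
    return sum(max(row[m] for m in subset) for row in scores.values()) / len(scores)


def greedy_order(scores: Scores) -> List[Tuple[str, float]]:
    """Add methods one by one; report the % of the full-set score after each step."""
    methods = sorted(next(iter(scores.values())))
    full = subset_score(scores, methods)
    chosen: List[str] = []
    trace: List[Tuple[str, float]] = []
    while len(chosen) < len(methods):
        remaining = [m for m in methods if m not in chosen]
        best = max(remaining, key=lambda m: subset_score(scores, chosen + [m]))
        chosen.append(best)
        trace.append((best, 100.0 * subset_score(scores, chosen) / full))
    return trace


# Made-up AUC values, for illustration only.
toy_scores = {
    "d1": {"BI+MM": 0.80, "BI+MF": 0.82, "BI+DAE": 0.79},
    "d2": {"BI+MM": 0.90, "BI+MF": 0.88, "BI+DAE": 0.91},
    "d3": {"BI+MM": 0.70, "BI+MF": 0.75, "BI+DAE": 0.70},
}

for method, pct in greedy_order(toy_scores):
    print(f"add {method}: {pct:.2f}% of maximum")
```

A practitioner would stop adding methods once the reported percentage crosses a chosen threshold (99% in the text), trading the remaining gain against the growth of the configuration space.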
    Table 13. The Methods That Are Added Sequentially to the Imputation Set, the % of Maximum Score for the Specified Metric Reached by Each Set, and the Configuration Complexity Increase for Each Set

    6.3 BI+MM Provides the Best Tradeoff between Effectiveness and Efficiency

    Figures 10 and 16(a) show the effectiveness vs. efficiency tradeoff of the algorithms. These figures are similar to Figure 5(a) above for the real datasets. We repeat the explanation of the figure: The reference (baseline) algorithm is BI+MM. The x-axis shows the effectiveness ratio, defined as the AUC corresponding to the point divided by the corresponding AUC of BI+MM. Similarly, the y-axis shows the efficiency ratio, defined as the training time of the point divided by the corresponding training time of BI+MM. Hence, points in the top-left/bottom-right quadrant correspond to runs where BI+MM dominates/is dominated by other algorithms on the same datasets in both time and AUC. Notice that the scale of the y-axis is logarithmic. Larger points correspond to the mean value of an imputation method over all datasets. There are five imputation methods to compare against MM on 10 datasets at 3 percentages of missing values, which would naturally result in 150 points. However, MissForest did not run on three of the datasets (image dataset variations) due to their dimensionality; see Section 2.3.2 for details. The resulting plot therefore consists of 147 points.
    For MAR data, BI+MM is never dominated in both metrics, as it is by far the most efficient method. In 104 of 147 cases, it dominates the opposing imputation method in terms of both effectiveness and efficiency. In the remaining 43 cases, it is dominated in effectiveness only. Only BI+MF has, on average, better predictive performance than BI+MM; however, it is 23,000 times slower to train on average. All the other imputation methods are worse than BI+MM on average while also taking more time to train. The results for MCAR data are qualitatively similar (see Figure 16(a) and the discussion in Appendix C.2).
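The two ratios and the dominance categories described above can be computed as in the following sketch. The function name, the category labels, the treatment of exact ties as a tradeoff, and the numbers in the usage lines are assumptions for illustration and are not values from the experiments.

```python
def classify_run(auc: float, seconds: float,
                 auc_mm: float, seconds_mm: float) -> str:
    """Compare one (method, dataset) run against BI+MM on the same dataset."""
    effectiveness = auc / auc_mm        # x-axis: > 1 means better AUC than BI+MM
    efficiency = seconds / seconds_mm   # y-axis: > 1 means slower than BI+MM
    if effectiveness < 1 and efficiency > 1:
        return "BI+MM dominates"        # top-left quadrant
    if effectiveness > 1 and efficiency < 1:
        return "BI+MM is dominated"     # bottom-right quadrant
    return "tradeoff"                   # better in one metric only (or a tie)


# Number of points in the plot: 5 methods x 10 datasets x 3 missingness
# levels, minus the 3 runs that MissForest could not complete.
n_points = 5 * 10 * 3 - 3
print(n_points)

# Made-up runs: (AUC, training seconds) of a method vs. BI+MM at (0.82, 1.0 s).
print(classify_run(0.80, 120.0, 0.82, 1.0))    # worse AUC and slower
print(classify_run(0.85, 23000.0, 0.82, 1.0))  # better AUC but much slower
```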
    Overall, BI+MM is again found to provide arguably the best tradeoff between efficiency and effectiveness. The results on the simulated data further validate the results on the real-world data, verifying that BI+MM is indeed a decent imputation method all around. BI+MM is, on average, on par with more sophisticated methods such as BI+MF, BI+GAIN, and BI+DAE, while being thousands of times faster to train.

    7 Meta-Level Analysis of Real-World Results

    In Machine Learning, it is invariably the case that no single algorithm is best for all datasets; there is no one-size-fits-all algorithm. Hence, one needs to optimize over several choices for the dataset at hand. This school of thought is what gave rise to AutoML systems. The field of Meta-Level Learning [18] studies how to predict the most promising algorithm or algorithms to run on a given dataset based on its characteristics. These characteristics are called meta-level features or meta-features of the dataset and include the sample size, the number of features, the type of features, the percentage of missing values, and others [60].
    In this section, we try to identify meta-features that correlate with the performance of the imputation algorithms. Such correlations could help predict which algorithms to run on a given dataset. They could also shed light on the dataset properties that enable an algorithm to perform better and lead to the design of better algorithms. To this end, we defined and computed the meta-features in Table 4. The selected meta-features can be split into three categories: (1) General meta-features, which report general characteristics of the dataset such as the number of samples or features. (2) Missing value-related meta-features, which provide insight into the dataset’s missingness patterns, such as the percentage of missing values per feature. (3) Cluster-based meta-features. One such metric is the silhouette coefficient, computed with the k-means algorithm with \(k=2, 3, 4\), as proposed in Reference [1]. It shows the tendency of the data to cluster. Another such metric is the number of PCA components that explain \(x\%\) of the variance of the data. It shows whether the data are limited to a lower-dimensional subspace and the extent of cross-correlations between features. General meta-features were extracted using the pymfe package [5]. We implemented the missing value-related and cluster-based meta-feature extraction using sklearn [53] and numpy [23]. To apply clustering or PCA, the data are first imputed with MM.
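A minimal sketch of how the cluster-based meta-features could be computed with sklearn and numpy is given below. It is not the authors' implementation: it assumes purely numeric data (so mean imputation stands in for MM), standardizes the features before clustering and PCA, and uses a 95% variance threshold; the function name and these choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def cluster_meta_features(X, ks=(2, 3, 4), variance=0.95, seed=0):
    """Silhouette coefficients for k-means and the number of PCA components
    needed to reach the requested share of explained variance."""
    X = np.asarray(X, dtype=float)
    feats = {"missing_fraction": float(np.isnan(X).mean())}
    # Impute with the column mean first (numeric features only, by assumption).
    X_imp = SimpleImputer(strategy="mean").fit_transform(X)
    X_std = StandardScaler().fit_transform(X_imp)  # assumed preprocessing step
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_std)
        feats[f"silhouette_k{k}"] = float(silhouette_score(X_std, labels))
    cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
    n_components = int(np.searchsorted(cumulative, variance)) + 1
    feats["n_pca_components"] = min(n_components, cumulative.size)
    return feats


# Synthetic data with roughly 10% of the entries removed completely at random.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
print(cluster_meta_features(X_demo))
```

Categorical features would need mode imputation and an encoding step before k-means and PCA can be applied; that part is omitted here.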
    We then computed the Spearman correlation of these meta-features with the AUC performance of each algorithm relative to the performance of BI+MM as the baseline. A positive (negative) correlation indicates that when the meta-feature increases, the performance of the algorithm increases (decreases) relative to BI+MM. There are five algorithms (excluding BI+MM, which is used as the baseline) and 16 meta-features, leading to 80 correlations over datasets. Only one correlation was found to be significant at the 0.1 significance level (p-value = 0.059). Specifically, the relative AUC performance of BI+PPCA is positively correlated (correlation = 0.383) with the number of categorical variables in a dataset. This would mean that as the number of categorical variables increases, we expect BI+PPCA to perform better relative to BI+MM. However, when correcting the p-values for multiple testing using the FDR control technique of Benjamini-Hochberg [7], the q-value is 0.991, which means that detecting one such correlation is expected even if all meta-features are uncorrelated with the relative performance. Moreover, BI+PPCA does not handle categorical features natively, which further suggests that the result is probably a false positive. In summary, the meta-learning analysis found no statistically significant correlations. Further experiments with more datasets and meta-features need to be conducted.
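The analysis above amounts to 80 Spearman tests followed by a Benjamini-Hochberg correction. The sketch below shows one way to do this with scipy and numpy on synthetic, uncorrelated data; the array shapes mirror the setting in the text (25 datasets, 16 meta-features, 5 algorithms), but the data, variable names, and the hand-written correction function are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Make the adjusted values monotone, starting from the largest p-value.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q


# Synthetic stand-ins: meta-features and AUC relative to BI+MM per dataset.
rng = np.random.default_rng(0)
meta = rng.normal(size=(25, 16))     # 25 datasets x 16 meta-features
rel_auc = rng.normal(size=(25, 5))   # 25 datasets x 5 algorithms

p_values = np.array([
    spearmanr(meta[:, j], rel_auc[:, a]).pvalue
    for a in range(rel_auc.shape[1])
    for j in range(meta.shape[1])
])
q_values = benjamini_hochberg(p_values)
print(len(p_values), "tests; smallest p =", p_values.min(), "smallest q =", q_values.min())
```

Because the synthetic inputs are independent, a few raw p-values below 0.1 are expected by chance among 80 tests, which is the situation the q-value in the text guards against.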

    8 Related Work

    In this section, we discuss related work on missing value imputation and position our contributions. We focus on empirical studies that compare different imputation methods based on the performance of the predictive models built on imputed datasets rather than the original values of complete datasets [13, 29, 56, 80].
    Current literature can be split into two categories: AutoML and ad hoc ML modeling. The first category extends a specific AutoML tool by adding imputation methods, while the latter creates a predictive modeling pipeline that may contain a subset of a modern AutoML tool’s pipeline, such as hyper-parameter optimization, model selection, and pre-processing. AutoML in general is able to optimize the performance over various stages in a pipeline. As we optimize the whole pipeline, we expect the effect of each stage to become less significant, as other stages may compensate. AutoML tools allow us to get more insights on which features are more important for the task (feature selection), optimize the hyper-parameters for each stage of the pipeline (hyper-parameter tuning), and select the best predictive model for each imputation method (model selection). Consequently, imputation methods can be evaluated fairly under this optimization framework.
    As shown in Table 5, the majority of related work uses datasets with either native or simulated missing values. The literature mainly focuses on the binary classification task (included in all previous works). Of the eight previous works, two include binary classification and regression [8, 55], and one includes binary and multi-class data [19]. Reference [30] is the only study that includes all three types of outcomes. The most prominent missingness mechanism is MCAR, found in all works that simulate missing values. Reference [30] is the only work that includes deep learning-based imputation methods. Binary Indicators are very prominent in AutoML tools; however, only Reference [55] has studied their effect when extending imputation methods. Finally, Reference [49] is the only work that includes ensemble models for the prediction phase, while Reference [19] is the only work that includes feature selection as part of the pipeline. As shown in Table 5, none of the related work has included every step mentioned in the table’s columns.
    To summarize the related work, the majority of the literature either uses datasets with native missing values or generates them through a simulation based on various missingness mechanisms and missingness proportions. However, none of the mentioned studies benchmarks imputation methods on both native and generated missing value datasets. The studies on real-world datasets in general conclude that simple imputation methods such as MM are on par with other more complex methods. Research on datasets with simulated missing values concludes that more complex methods can indeed improve predictive performance on average. However, there is no universal best method proposed by any of the aforementioned benchmarks. The literature mainly focuses on the binary classification task. Accuracy and F1-score are the most prominent metrics. In the majority of the studies, a hold-out split is used for the evaluation. Some studies use repeated splits or cross-validation to handle randomness. Specific predictive models, for instance, Gradient Boosted Trees, could benefit from native handling of missing values compared to simple imputation. However, not all classifiers support missing value handling, making imputation still an essential part of the pre-processing step of ML pipelines. In general, hyper-parameter tuning, model selection, and feature selection are given less importance in previous literature. Most works skip one or more of these steps or fail to report information about the specific stage. For example, only one predictive model is tuned, or imputation methods are used with the default parameters specified by the authors of the methods or the package implementations.
    The research closest to ours is Reference [49]. In that paper, Autosklearn was extended to include the data cleaning process, resulting in a tool named AutoClean. Part of the extension was imputation. The study compares mean, median, mode, KNNi [71], and Iterative imputation [77] for continuous features. For categorical features, constant, KNNi, and mode imputation were selected. The study used five binary classification datasets with 891 to 10,500 observations and 9 to 39 features that include missing values at low percentages. AutoClean optimized the pipeline by Bayesian hyper-parameter optimization. In Autosklearn, the predictive model is an ensemble of methods. They evaluated the performance using fivefold cross-validation and the balanced accuracy metric. The study concluded that KNNi is a valuable addition to the simpler imputation methods. However, in most cases, simple imputation methods are selected more frequently than KNNi for both continuous and categorical data. Contrary to the aforementioned work, we included feature selection in our experimental setup. Also, we conducted comparisons on datasets with both native and simulated missing values. In general, our evaluation was conducted on more datasets, with a wider range in terms of samples and features. Finally, we included neural network imputation methods and extended imputation with binary indicators.
    In Reference [19], the TPOT AutoML tool was extended with imputation methods, specifically mean, median, mode, max, MICE [77], and EM [27]. The median and mode were found to be the best imputation methods based on a restricted simulation study on 23 datasets at 7% MCAR missingness. The data were split multiple times (20) to account for randomness. At each split, 25% of the data were used as a hold-out set. Compared to the mentioned work, we simulated missing values with other mechanisms and missing proportions and also used datasets with native missing values in our experiments. Also, we included recent state-of-the-art methods based on NNs such as DAE and GAIN. We also implemented and measured the effect of binary indicators when coupled with MM and complex methods.
    Missing data imputation has also been studied as part of data cleaning systems. Reference [38] compared deletion, mean, median, mode, new-category, and HoloClean [59] on six datasets with native missing values. For the predictive task, seven models were considered: Logistic Regression, KNN, Decision Tree, Random Forest, AdaBoost, Naive Bayes, and XGBoost. For the evaluation, the data were split into a 70% train and a 30% test set, repeated 20 times to account for randomness. They used accuracy and F1-score for evaluation, depending on the dataset’s imbalance. They concluded that simple imputation methods yield performance competitive with more complex methods such as HoloClean. Contrary to the aforementioned work, we included neural networks, the binary indicator method, and feature selection in the experimental setup. We also conducted experiments on more datasets, covering both generated and real-world missing values.
    The benchmark study in Reference [81] was conducted on 13 real-world datasets from OpenML and concluded that mean/mode is comparable to more complex imputation methods such as random, SOFT [44], MF [66], KNNi [71], Hot-Deck [33], and MICE [77]. In the study, 20% of the data was kept as a hold-out set, and the reported measures were AUC and F1-score. Specifically, for the F1-score, MM had the highest average ranking. In contrast, for the AUC score, KNNi was found to be the best-performing method. However, both KNNi and MM are among the three best methods in both metrics. Hyper-parameter tuning was not considered in that study; the imputation and predictive methods used default parameters. In our work, we tune all the steps of the pipeline for a fair comparison. We also included deep learning methods that are the current state-of-the-art for imputation as well as binary indicators.
    The largest benchmark study was conducted in Reference [30] on 69 real-world datasets with simulated missing values. The missingness mechanisms were MCAR, MAR, and MNAR, and missing values were generated at 1%, 10%, 30%, and 50% missingness. They compared MM, KNNi [71], MF [66], a custom DL-based imputation inspired by Reference [9], GAIN [83], and variational autoencoders [32] for the imputation problem. Cross-validation scores are reported; the data were split into five folds for all but the deep learning methods, for which three folds were used due to training costs. For regression data, RMSE is the reported metric. For classification data, the F1-score is reported. They concluded that MissForest is the best imputation method. However, they used a single classifier for the prediction phase. We argue that different imputation methods work better with different classifiers, which should be tuned as well. For example, on the cylinder-band dataset, the GAIN imputation method performs best with Ridge Logistic Regression, whereas the DAE imputation method performs best with the RandomForest classifier on the same dataset. Additionally, missing values for the downstream task were generated for a single randomly sampled feature in the dataset. In contrast, we simulated missing values uniformly across features, which is a harder problem for the imputation methods, as less observed data exist. In the mentioned work, GAIN had a convergence problem in 33% of the cases, resulting in the worst ranking among the mentioned methods. In our work, GAIN does converge, due to different hyper-parameter tuning. Finally, we include Binary Indicators as well as the DAE imputation method, which is the best method on average on real-world data with missing values. The simulated missingness results in our work are in line with the results of the aforementioned work, as MissForest is the best method in both. However, deep learning methods in our work are among the best methods and not the worst, as in the aforementioned work.
    Another study [57] compared the predictive performance on two datasets with imputed and incomplete data. Missing values were simulated on categorical features of the training data: MCAR and MNAR missing values were generated at 10% to 40% missingness. One-third of the data were kept as a hold-out test set, and the accuracy score on the test set is reported. For the imputation of categorical features, they used six imputation models: mode, random, k-NN [71], iterative imputation based on logistic regression, random forest [66], and SVM. For the classification, they used three predictive models: ANNs, decision trees, and random forests. The authors optimized the hyper-parameters for the ANNs only. The imputation models and the other classifiers were not tuned. They did not find any imputation method or classifier to be better than the others; the best choice depends heavily on the nature and proportion of the missing data. However, results indicated that imputation is better than simply creating a new category in the data. In our work, for fair evaluation, we tune both imputation and predictive models. We include binary indicator methods and neural network imputation. Finally, we simulated missingness on both numeric and categorical features.
    In Reference [55], the authors compared mean, median, KNNi [71], Iterative Imputer, Iterative Imputer with Bagging (Multiple Imputation) [77], MIA (the native handling of missing values by Gradient Boosted Trees), and MIA with bagging. All of the previous methods, except MIA, were also extended with binary indicators. The study was conducted on 13 real-world datasets from four databases with native missing values. Nested cross-validation with five outer folds was used to estimate the accuracy score of the downstream task. The predictive models were Gradient Boosted Trees and linear models. They concluded that MIA is a better alternative to imputation. They also found that the indicator method helps improve the performance of the predictive task, which agrees with the results of our work, and that simple imputation using the mean or median is on par with KNNi and iterative imputation with linear models. In our work, we included deep learning imputation in our set of imputation methods. Also, we tuned the imputation methods to fairly evaluate the performance of each method, as tuning is important for the performance of some imputation methods.
    Last, Reference [8] introduced a group of three methods named OptImpute, focusing on optimizing KNNi and iterative imputation based on SVMs and decision trees. They compared the group against five other imputation methods: mean/mode, K-nearest neighbors [71], iterative KNN [84], Bayesian PCA [50], and predictive-mean matching [77]. They compared the introduced methods across 84 datasets with simulated missing values, measuring imputation accuracy. They additionally measured the group’s effect on the performance of downstream learning algorithms on 10 datasets. The missing values are generated by the MCAR mechanism at rates ranging from 10% to 50%. The predictive models used are LASSO and SVR for the regression tasks and SVM and Optimal Trees for the classification tasks. These datasets range in size, having 150 to 5,875 observations and 4 to 16 features. Data were split 50%–50% into train and test sets. The splits were repeated 100 times to account for randomness. Their group of methods improved the predictive performance of the models. Their method scored an average accuracy of 86.1% on the classification data and an average R-squared (R2) of 0.339 on the regression data, compared to 84.4% and 0.315, respectively. However, no neural networks were used, and the methods introduced have not been compared individually. Additionally, tuning was applied only to the group of proposed methods and not to the other imputation methods or the predictive models. We tune all of the imputation methods and models. We also extend methods with binary indicators and include NN-based imputation methods in our test bed, as well as MissForest. We also report multiple metrics (Accuracy, F1, and AUC) for the binary classification task. Finally, we have a wider range of datasets, both with native and simulated missing values.

    8.1 Synopsis of Contributions Relative to the Related Work

    Compared to the related work, we contribute in various ways. Our work can be directly compared to two other works that are conducted within an AutoML tool [19, 49]. Compared to these works, we include more datasets, more missingness mechanisms, neural network methods, and binary indicators in the experimental setup. For the first time, deep learning methods are compared to simple imputation methods in an AutoML predictive setting. One of the deep learning methods (BI+DAE) has the best average performance on real-world data with native missing values. Additionally, for the first time, the effect of the imputation methods on predictive performance is measured on both datasets with generated missing values and datasets with native missing values. Until now, comparisons were conducted in only one of the two settings: half of the papers use real-world datasets with native missing values, while the other half use complete datasets with generated missing values. Contrary to the majority of the literature, we tune both imputation and predictive methods to fairly evaluate them. Only two of the eight related works mention tuning both imputation and predictive modeling methods [38, 49]. We also conducted experiments on more datasets compared to the majority of the literature, while, unlike Reference [30], our simulation setting is applied to all features in the datasets and not only one. Also, only one of the eight aforementioned works includes feature selection as part of the ML pipeline. Finally, meta-learning is used, for the first time, to identify useful data characteristics that could give insights into the choice of a simple vs. a sophisticated imputation method. In general, as shown in Table 5, our testbed is the most complete overall in terms of dataset selection, missingness selection, imputation method selection, and pipeline steps. This allows us to fairly evaluate imputation methods in a state-of-the-art AutoML environment.

    9 Lessons Learned and Contributions

    The main insights that are drawn from our experimental results are the following:
    Including BI in the dataset improves the predictive performance of the machine learning pipeline for most algorithms (see Section 5.1). The inclusion of BIs does increase the dimensionality, and hence the difficulty, of the machine learning task. However, it also encodes which values are missing; this allows a classifier to learn which values to trust. Results indicate that encoding this information is more beneficial than harmful.
    BI+DAE is found to be the single best imputation method on real-world data with native missing values, followed by BI+MM, which is the standard in AutoML tools. As seen in Section 5.2, both methods have the same number of wins across datasets (when comparing only BI-extended methods), with BI+DAE having the higher mean AUC. The worst performance is exhibited by matrix-factorization (linear dimensionality reduction) methods such as PPCA. These methods do scale with the number of features and may be more suitable for high-dimensional, low-sample datasets.
    BI+MM exhibits the best tradeoff between efficiency and effectiveness. As expected (see Sections 5.3 and 6.3 and Appendix C.2.4), BI+MM is the fastest method to train and is also more effective in the majority of the comparisons. MF, due to its iterative nature, is the slowest of all, closely followed by GAIN. GAIN’s main bottleneck is the number of epochs required to train the network. Its authors suggest 10,000 epochs, which is 20 times more than the 500 epochs suggested by the authors of the DAE method.
    Based on the results of Section 5.4, we suggest that practitioners optimize their models over the BI+MM and BI+DAE algorithms. Together, BI+MM and BI+DAE reach over 99% of the maximum AUC on real-world data. Specifically, BI+MM alone reaches 98.68% of the maximum AUC, and adding BI+DAE to the pipeline raises this to 99.69%. However, this comes at the cost of increasing the configuration space by 10\(\times\), as DAE has nine tuning configurations compared to one for BI+MM. Also, to reach 100% of the optimal performance, we have to train 24 times more configurations than by simply using BI+MM.
    BI+MF is the best method on datasets with simulated missing values. As shown in Section 6.1, in both MCAR and MAR simulations, BI+MF is on average the best. In contrast, BI+MF is the third best on real-world data, falling behind BI+DAE. Despite our best efforts to realistically simulate missing values, there may still be differences between real-world missing-data generative mechanisms and our simulations. First, we simulated MCAR and MAR missing values, whereas real-world missing values may be MNAR. Second, the missingness probability for MAR data is determined by a generalized linear model (logistic regression model), whereas real-world missing values may follow non-linear models. The majority of the literature employs similar simulations for comparing imputation algorithms. However, as indicated by this study, results with simulated missingness may not generalize to real-world datasets. New simulation methodologies need to be proposed to this end.
    Increasing missingness leads to a deterioration in predictive performance. As shown in Section 6.1, increasing missingness reduces the AutoML tool’s ability to predict the outcome. Missingness at 10% leads to a 0.024 AUC drop compared to the complete dataset. Similarly, 25% missingness leads to a 0.05 AUC drop, while at 50% we observe an average drop of up to 0.144, as seen in Tables 12(a) and (b).
    The set containing BI+MM and BI+MF reaches 99% of the maximum AUC for simulated data, as shown in Section 6.2. BI+MM alone scores 98.7% of the maximum AUC for MCAR data and 98.99% for MAR data. To surpass 99% of the maximum AUC, the addition of BI+MF is needed. This addition allows the tool to reach 99.62% and 99.43% on MCAR and MAR data, respectively. However, BI+MF has to be tuned, leading to a total 3\(\times\) increase in pipeline complexity.
    A meta-learning methodology to correlate meta-features with performance is presented in Section 7. It could allow scientists to select the appropriate sophisticated methods based on meta-features, saving training time and improving overall performance. In addition, it could provide insight into the design choice of an algorithm that leads to better or worse performance on a given dataset. Unfortunately, no statistically significant results were found. This means that either there are no correlations present with the selected meta-features, or these correlations are not strong enough to be found significant with the given sample size of 25 datasets.
    The study has several limitations. The results and conclusions stem from computational experiments with binary classification tasks within a limited range of feature counts, sample sizes, class imbalance, and missingness percentages. The MNAR missingness mechanism is not included in our experiments. Also, the mechanism for generating MAR data is based on a linear model. Results may differ for non-linear missingness generation and MNAR data. Despite the significant computational effort involved (optimizing over thousands of ML pipelines for each dataset), results stem from only 25 real-world datasets with native missing values and 60 complete datasets where missing values were introduced (10 original datasets times 2 missingness mechanisms (MCAR, MAR) times 3 missingness percentages). This fact limits the statistical power of our statistical tests. While JADBio is an effective AutoML tool, results should also be obtained from other AutoML tools to further generalize the conclusions. Another limitation of our work concerns the comparison of methods on only binary classification data. Even though imputation algorithms are unsupervised learning methods and do not use information from the target variable (in our work), results may vary according to the supervised task. Finally, we selected models based on the AUC score in the training set. Optimizing for another metric, such as accuracy or F1-score, during training may yield different results.

    10 Conclusions

    In this article, we conducted experiments on real-world datasets with native missing values and on datasets with simulated missing values. We compared six imputation methods extended by binary indicators on a state-of-the-art AutoML tool. BI+DAE is the best method on real-world datasets with native missing values. However, BI+MM is comparable to, if not better than, the more sophisticated imputation methods in terms of predictive performance and efficiency on real-world data. Increasing missingness leads to predictive performance deterioration. Additionally, simulated data lead to results that contradict those on real-world datasets. BI+DAE and BI+MM are the best methods on real-world data; however, when simulated data are considered, BI+MF is the best method on average, followed by BI+MM. Finally, meta-learning was employed but could not find any patterns that predict whether a sophisticated imputation method should be used instead of the simple BI+MM to improve the downstream performance.
    The results make us question whether advanced, multivariate imputation algorithms are really necessary for predictive modeling with AutoML. The simple BI+MM imputation is surprisingly effective and computationally efficient when the ML pipeline is properly tuned within an AutoML setting. BI features allow advanced classifiers to learn when to trust a value or not. Multivariate imputation algorithms try to learn the full joint distribution of the dataset, a task that is quite challenging and error-prone with low-sample, imbalanced, or high-dimensional data. It is also a very computationally demanding task. Imputing values for features that are redundant or irrelevant to the final model is a waste of computation. When imputing using multivariate imputation, one needs to store not only the final model (e.g., RF, SVM, or a NN) but also the imputation model to impute test samples. For some imputation models (Deep Neural Networks, or one RF for each feature as in MF) the additional storage may be non-negligible. In addition, the imputation model requires measuring all features and invalidates the efforts of feature selection. Arguably, the research effort that goes into novel and better-performing imputation methods would be more productively spent on novel and better-performing ways to natively handle missing values in our classification and feature selection algorithms.

    Footnotes

    1
    Although some ML algorithms such as KNN and Naive Bayes are robust to missing values, their implementations in popular platforms like sklearn do not currently support the presence of missing values.
    2
    Recent work [46] shows that when the causal graph of the distribution is known there are cases where MNAR data can be imputed.
    6
    Table 9 contains the dataset name, the missing feature, and the number of features selected for the missing feature as the outcome.

    Supplementary Material

    tkdd-2023-03-0117-File002 (tkdd-2023-03-0117-file002.zip)
    Supplementary material
    tkdd-2023-03-0117-File003 (tkdd-2023-03-0117-file003.zip)
    Supplementary material

    Appendices

    A Datasets Appendix

    A.1 Real-World Datasets with Native Missing Values

    This section presents the 25 real-world binary classification datasets with native missing values. See Table 6 for more details.

    A.2 Complete Datasets for Missing Data Simulation

    This section presents the 10 complete binary classification datasets used for the simulated missing value experiments. Table 7 contains the dataset names and their characteristics.

    B Missing Value Simulation Setup Appendix

    B.1 Datasets Selected to Determine the Percentage of Missing Values per Feature.

    Realistic simulation of missing values requires selecting the missingness percentage for each feature; see Section 4.1. We sampled 64 real-world datasets with missing values from the OpenML repository. Table 8 describes the datasets’ characteristics.

    B.2 Determining the Average Number of Features on Which a Missing Feature Depends.

    In this section, we present the quantitative results for the experiments regarding the simulation of the MAR mechanism presented in Section 4. Table 9 presents the dataset name, the randomly selected feature with missing values (target), and the result of the feature selection (# features selected). On average, a missing feature depends on 12 features.
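    As a rough illustration of the kind of MAR simulation discussed here, the sketch below makes the missingness probability of one target feature a logistic function of other, fully observed features. It is a minimal sketch under stated assumptions and not the procedure used in our experiments: the function names, the weights, the use of only two predictor features, and the bisection on the intercept to hit a requested missingness rate are all hypothetical choices made for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mar_probabilities(rows, predictors, weights, target_rate):
    """Per-row missingness probabilities from a logistic model on the predictors.

    The intercept is tuned by bisection (an assumption of this sketch) so that
    the average probability equals target_rate.
    """
    linear = [sum(w * r[k] for k, w in zip(predictors, weights)) for r in rows]
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = (lo + hi) / 2
        rate = sum(sigmoid(mid + z) for z in linear) / len(linear)
        if rate < target_rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2
    return [sigmoid(intercept + z) for z in linear]


def apply_mar(rows, target, predictors, weights, target_rate, seed=0):
    """Return a copy of rows where the target column is masked (None) at random,
    with a probability that depends only on the observed predictor columns."""
    rng = random.Random(seed)
    probs = mar_probabilities(rows, predictors, weights, target_rate)
    return [
        [None if j == target and rng.random() < p else v for j, v in enumerate(r)]
        for r, p in zip(rows, probs)
    ]


# Hypothetical toy data: 500 samples, 4 standard-normal features.
rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(500)]
probs = mar_probabilities(data, predictors=[0, 1], weights=[1.5, -1.0], target_rate=0.25)
masked = apply_mar(data, target=3, predictors=[0, 1], weights=[1.5, -1.0], target_rate=0.25)
```

    Because the probabilities depend only on columns that stay observed, the resulting pattern is MAR by construction; setting all weights to zero would reduce it to MCAR at the requested rate.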

    C Experimental Results Appendix

    C.1 Real-world Results

    C.1.1 BI Improve Performance across All Metrics.
    BI extended methods perform better than their base methods when the AUC score is measured (see Section 5.1). As seen in Figure 11(b) and (d), BI indeed improves the accuracy and F1-score of the downstream task, in the majority of the datasets. Figure 11(a) and (c) illustrate the gain or loss of including BI for each imputation method in the x-axis. BI improves the performance of the downstream task, on average.
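    To make the BI+MM combination concrete, the following minimal sketch imputes each column with its training mean (continuous) or mode (categorical) and appends one binary indicator per column that had missing values in the training data. The function names are hypothetical, and adding indicators only for columns with missing training values is an assumption of this sketch rather than a description of the tool used in our experiments.

```python
from collections import Counter
from statistics import mean


def fit_bi_mm(rows, categorical):
    """Learn fill values on the training rows: mode for categorical columns, mean otherwise."""
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        if j in categorical:
            fills.append(Counter(observed).most_common(1)[0][0])
        else:
            fills.append(mean(observed))
    # Assumption: only columns with missing training values get an indicator.
    indicator_cols = [j for j in range(n_cols) if any(r[j] is None for r in rows)]
    return fills, indicator_cols


def transform_bi_mm(rows, fills, indicator_cols):
    """Fill missing entries and append the binary missingness indicators."""
    out = []
    for r in rows:
        imputed = [fills[j] if v is None else v for j, v in enumerate(r)]
        indicators = [1 if r[j] is None else 0 for j in indicator_cols]
        out.append(imputed + indicators)
    return out


# Hypothetical toy data: column 1 is categorical, None marks a missing value.
train = [
    [1.0, "a", 10.0],
    [None, "b", 20.0],
    [3.0, None, 30.0],
    [5.0, "b", 40.0],
]
fills, indicator_cols = fit_bi_mm(train, categorical={1})
encoded = transform_bi_mm(train, fills, indicator_cols)
```

    The fill values and indicator columns are learned once on the training data and reused unchanged for test samples, so the downstream classifier sees both the imputed value and the fact that it was imputed.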
    C.1.2 BI+DAE Is the Best Method Followed by BI+MM.
    Table 10(a), (b), and (c) show the quantitative results of the real-world experiments. Specifically, they report the number of wins (ties included) for each imputation method, as well as the average difference in AUC from the winning imputation method for each dataset. The tables also report the average AUC, average AUC difference from MM (set as baseline), and average ranking per method. For completeness, we include methods without BI as well. BI+DAE and BI+MM are the two best methods across all metrics, in terms of average AUC and average ranking. BI+MM gets the highest number of wins in all metrics, while BI+DAE closely follows with one win for each AUC and two wins for F1 and accuracy. However, BI+DAE is more consistent and has the highest average ranking and highest score for AUC and F1. Finally, BI+DAE exhibits the highest improvement over MM and the lowest difference from the best method in each dataset, on average.
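    The win counts and average rankings summarized above can be computed along the following lines. This is a minimal sketch: the AUC values are the "-Overall" entries of three datasets from Table 14, restricted to the BI-extended methods, so the output does not reproduce Table 10. The function names and the tie rule (tied methods share the mean of their positions) are assumptions of this example.

```python
def tied_ranks(scores):
    """Rank methods within one dataset (1 = best); tied methods share the mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for name in ordered[i:j + 1]:
            ranks[name] = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        i = j + 1
    return ranks


def summarize(auc):
    """Count wins (ties included) and average the per-dataset ranks of each method."""
    methods = list(next(iter(auc.values())))
    wins = {m: 0 for m in methods}
    rank_sum = {m: 0.0 for m in methods}
    for scores in auc.values():
        best = max(scores.values())
        for m, r in tied_ranks(scores).items():
            rank_sum[m] += r
            wins[m] += int(scores[m] == best)
    mean_rank = {m: rank_sum[m] / len(auc) for m in methods}
    return wins, mean_rank


# "-Overall" AUC values of three datasets, taken from Table 14.
auc = {
    "anneal": {"BI+MM": 0.996, "BI+MF": 0.969, "BI+GAIN": 0.991,
               "BI+SOFT": 0.991, "BI+PPCA": 0.996, "BI+DAE": 0.982},
    "schizo": {"BI+MM": 0.684, "BI+MF": 0.772, "BI+GAIN": 0.751,
               "BI+SOFT": 0.760, "BI+PPCA": 0.543, "BI+DAE": 0.805},
    "cylinder-bands": {"BI+MM": 0.819, "BI+MF": 0.831, "BI+GAIN": 0.816,
                       "BI+SOFT": 0.821, "BI+PPCA": 0.824, "BI+DAE": 0.858},
}
wins, mean_rank = summarize(auc)
```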
    C.1.3 BI+DAE Is the Best Method across All BI Extended Methods.
    As seen in Section 5.2, BI+DAE is the highest-ranked method for the AUC metric. Figure 12(a) and (b) illustrate the average ranking of BI extended methods for the F1 and accuracy metrics, respectively. BI+DAE is the highest-ranked method for the accuracy metric but is the third best for F1. BI+MM is the second best in accuracy and the best in F1. Overall, across different metrics, the relative order varies slightly. Statistically significant results remain the same for AUC and accuracy. No statistically significant results are found for the F1-score.
    C.1.4 BI+MM and BI+DAE Score 99% of the Maximum across All Metrics.
    As shown in Tables 11(a), (b), (c), and Figure 13, BI+MM scores over 98% of the maximum performance for each metric. Adding BI+DAE, which is the next-best method, to the imputation set that already contains BI+MM allows the tool to score over 99.5% of the maximum performance. However, the complexity increases by 10 \(\times\). Finally, to reach 100% of the maximum, all imputation methods need to be included in the imputation set. Including all methods in the pipeline of the tool increases the original complexity by a factor of 24.
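    The subset selection summarized above can be illustrated with a simple greedy loop. The sketch below is one plausible reading and not the exact procedure of Section 5.4: starting from BI+MM, it repeatedly adds the method that most increases the average fraction of the per-dataset maximum AUC, and stops once a target fraction is reached. The function names, the starting set, and the stopping rule are assumptions. The AUC values are the "-Overall" entries of three datasets from Table 14, so the percentages differ from those reported on all 25 datasets; the configuration counts (one for BI+MM, nine for BI+DAE) are those stated in the Discussion.

```python
def coverage(selected, auc):
    """Average over datasets of (best AUC within `selected`) / (best AUC over all methods)."""
    ratios = []
    for scores in auc.values():
        ratios.append(max(scores[m] for m in selected) / max(scores.values()))
    return sum(ratios) / len(ratios)


def greedy_subset(auc, start, target):
    """Grow the method set greedily until the coverage reaches `target`."""
    methods = list(next(iter(auc.values())))
    selected = [start]
    trace = [(start, coverage(selected, auc))]
    while trace[-1][1] < target and len(selected) < len(methods):
        remaining = [m for m in methods if m not in selected]
        best = max(remaining, key=lambda m: coverage(selected + [m], auc))
        selected.append(best)
        trace.append((best, coverage(selected, auc)))
    return trace


# "-Overall" AUC values of three datasets, taken from Table 14.
auc = {
    "anneal": {"BI+MM": 0.996, "BI+MF": 0.969, "BI+GAIN": 0.991,
               "BI+SOFT": 0.991, "BI+PPCA": 0.996, "BI+DAE": 0.982},
    "schizo": {"BI+MM": 0.684, "BI+MF": 0.772, "BI+GAIN": 0.751,
               "BI+SOFT": 0.760, "BI+PPCA": 0.543, "BI+DAE": 0.805},
    "cylinder-bands": {"BI+MM": 0.819, "BI+MF": 0.831, "BI+GAIN": 0.816,
                       "BI+SOFT": 0.821, "BI+PPCA": 0.824, "BI+DAE": 0.858},
}
n_configs = {"BI+MM": 1, "BI+DAE": 9}  # tuning configurations per method

trace = greedy_subset(auc, start="BI+MM", target=0.99)
# Relative size of the configuration space compared to BI+MM alone.
complexity = sum(n_configs[m] for m, _ in trace) / n_configs["BI+MM"]
```

    On this three-dataset subset the loop adds BI+DAE right after BI+MM, and the ten configurations of the resulting set correspond to the 10 \(\times\) increase in complexity mentioned above.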
    C.1.5 BI+MM Exhibits the Best Tradeoff between Effectiveness and Efficiency.
    This section presents the tradeoff between the effectiveness and efficiency of imputation methods against a baseline (BI+MM). For a detailed explanation of the illustration see Section 5.3. Figure 14(a) shows that BI+MM dominates the other imputation methods in both effectiveness and efficiency in 84 of 125 pairs for F1-score. In 41 pairs, BI+MM is dominated in effectiveness. For the accuracy metric, BI+MM dominates the other methods in relative effectiveness and efficiency in 86 of 125 pairs. As seen in Figure 14(b), BI+MM is dominated in only 39 pairs.
    C.1.6 Feature Selection–enforced Pipelines Degrade the Performance.
    As seen in Figure 15(a) and (b), feature selection deteriorates the performance of the pipelines. The average absolute difference between feature selection–enforced pipelines and non-enforced is less than 5% for accuracy. On average, the F1-score is lower by 5 points when enforcing feature selection.

    C.2 Simulation Results

    C.2.1 A Decline in Predictive Performance Is Caused by Increasing Missingness.
    In this section, we investigate the average performance drop in terms of multiple metrics compared to the complete dataset. Specifically, Table 12(a), (c), and (e) present the results for MCAR missingness for the AUC, F1, and accuracy score, respectively. Table 12(b), (d), and (f) present the results for MAR missing data for the AUC, F1, and accuracy score, respectively. In summary, across all metrics and both missingness mechanisms, the performance of the tool deteriorates as missingness increases. For MCAR data, the average AUC drops in absolute terms by up to 0.023 at 10% missingness and by up to 0.05 and 0.144 at 25% and 50%, respectively. When measuring F1, the loss is even bigger. At 10%, the loss is up to 0.05, and at 25% it can reach up to 0.078 average absolute difference, while the average F1 loss at 50% missingness can be up to 0.158. The accuracy score deteriorates comparably to AUC. At 10% the loss can be up to 0.03, at 25% up to 0.05, and at 50% up to 0.118. For MAR data, results are similar to MCAR. Notably, at 50% missingness, performance does not deteriorate as much for MAR as for MCAR. This leads us to conclude that most methods in this comparative evaluation can recover information for high MAR missingness better than for MCAR. This is theoretically sound, as every method except MM uses multiple variables for the imputation.
    C.2.2 BI+MF Is the Best Method for MCAR Data.
    Table 12(a), (c), and (e) and Figure 16(b), (d), and (f) present the results for MCAR missing data at various missingness percentages and multiple metrics (AUC, F1, and Accuracy). Summarizing the results, across all missingness percentages and measured metrics, BI+MF is the best method in terms of average absolute loss to the complete data. The second best method, at 10%, is BI+DAE, while at 25% BI+MM is the second best for F1 and accuracy metrics. The relative order of imputation methods across metrics remains stable until 50% missingness. The order at 50% missingness may vary according to the performance metric. The two worst methods are BI+PPCA and BI+SOFT, while the positions of second-, third-, and fourth-best methods are shared by BI+MM, BI+GAIN, and BI+DAE.
    C.2.3 BI+MF Is the Best Method for MAR Data.
    As seen in Section 6.1, BI+MF is the best method for MAR data. Table 12(b), (d), and (f) provide an overview of the results for the AUC, F1, and accuracy score. The results are robust across all metrics. Figure 17(b) and (d) show that BI+MF has the lowest average loss for 25% and 50% missingness in both the F1 and accuracy metrics. At 10% missingness, BI+MF has comparable performance to BI+MM, which has the lowest loss at that missingness rate. A detailed review of the results for the AUC metric follows.
    Fig. 17. Panels (a) and (c) denote the tradeoff in terms of effectiveness and efficiency between BI+MM and the other imputation methods. Panels (b) and (d) show the difference from complete data for each imputation method at various missingness levels. BI+MF is the best method for MAR data. However, BI+MM exhibits good performance at a fraction of the cost.
    C.2.4 BI+MM Exhibits the Best Efficiency vs. Effectiveness Tradeoff for MCAR Missing data.
    Regarding the MCAR missing data, it is worth noting that BI+MM dominates in over 90 of 147 pairs, for each performance metric. Specifically, for AUC metric BI+MM dominates in 98 pairs, for F1-score in 90 pairs, and for accuracy in 95 pairs. BI+MM is never dominated in efficiency, as seen in Figure 16(a), (c), and (e). This is not surprising, as BI+MM is significantly faster to train than any other imputation method. BI+MM is only dominated by BI+MF across all metrics in average effectiveness. BI+PPCA and BI+SOFT are on average less effective and less efficient than BI+MM.
    C.2.5 BI+MM Exhibits the Best Efficiency vs. Effectiveness Tradeoff for MAR Missing Data.
    In Section 6.3, we presented the efficiency–effectiveness tradeoff for the MAR simulated data when AUC is reported. We extend and confirm our conclusion in this section by measuring and computing the tradeoff for the F1-score and classification accuracy metrics. As seen in Figure 17(a) and (c), BI+MM is never dominated in terms of efficiency, as expected. BI+MM dominates the other methods in effectiveness and efficiency in 94 and 98 pairs for the F1-score and accuracy score, respectively. For F1, it is dominated in terms of relative effectiveness in 53 of 147 pairs, while it is dominated for accuracy in 49 pairs. BI+MF is on average more effective than BI+MM for MAR data. However, it is thousands of times less efficient (up to 90,000 times for big datasets).
    C.2.6 BI+MM and BI+MF Score over 99% of Maximum Performance for MCAR Data.
    The minimal-size subset of algorithms with close-to-optimal performance for MCAR missing data is \(\{\text{BI+MM}, \text{BI+MF}\}\). We used the simple greedy algorithm introduced in Section 5.4. As illustrated in Figure 18(a), this subset achieves over 99% of the maximum achievable AUC for MCAR data. The subset \(\{\text{BI+MM}, \text{BI+MF}\}\) also scores over 99% of the maximum performance score for both F1 and accuracy, as denoted in Figure 18(b) and (c). Detailed quantitative results are presented in Table 13(a), (c), and (e) for the AUC, F1, and accuracy score, respectively.
    Fig. 18. MCAR: The percentage of maximum metric (AUC, ACC, F1) achieved by each set of imputation methods versus the increased complexity of the pipelines. On the x-axis, we denote the added method as well as the additional complexity to the configuration space. The addition of BI+MF to the set increases the complexity by 3 \(\times\) compared to the original complexity. BI+MF and BI+MM allow us to recover over 99% of the maximum performance for MCAR data.
    C.2.7 BI+MM and BI+MF Score over 99% of Maximum Performance for MAR Data.
    The minimal-size subset of algorithms with close-to-optimal performance for MAR missing data is \(\{\text{BI+MM}, \text{BI+MF}\}\), as presented in Section 6.2. In this section, we present results for the F1 and accuracy score. We used the simple greedy algorithm introduced in Section 5.4. The subset \(\{\text{BI+MM}, \text{BI+MF}\}\) also scores over 99% of the maximum performance score for both F1 and accuracy, as denoted in Figure 19(a) and (b). Detailed quantitative results are presented in Table 13(b), (d), and (f) for the AUC, F1, and accuracy score, respectively.
    Fig. 19. MAR: The percentage of maximum metric (ACC, F1) achieved by each set of imputation methods versus the increased complexity of the pipelines. On the x-axis, we denote the added method as well as the additional complexity to the configuration space. The addition of BI+MF to the set increases the complexity by 3 \(\times\) compared to the original complexity. BI+MF and BI+MM allow us to recover over 99% of the maximum metric for MAR data.

    C.3 Evaluation of Imputation Accuracy

    In this section, we present the results of the imputation accuracy experiments. We measure the imputation accuracy for each imputation method, missingness mechanism, and missingness percentage, by training the imputation methods on the default configurations (highlighted in Table 2). For each dataset, we measure the average R2-score between the imputed and the complete values (clamped to the 0–1 range) for the continuous features. For the categorical features, we measure the accuracy score between the imputed and the complete values. We measure the imputation scores in both train and test sets.
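    The two imputation scores described above can be sketched as follows. This is a minimal illustration with hypothetical function names and toy values; scoring only the entries that were masked, and returning 0 for a constant feature, are assumptions of this sketch.

```python
def clamped_r2(true_values, imputed_values):
    """R2-score of imputed vs. complete values for one continuous feature, clamped to [0, 1]."""
    mean_true = sum(true_values) / len(true_values)
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values))
    if ss_tot == 0:
        return 0.0  # assumption: a constant feature has no variance to explain
    r2 = 1.0 - ss_res / ss_tot
    return min(1.0, max(0.0, r2))


def accuracy(true_values, imputed_values):
    """Fraction of exactly recovered values for one categorical feature."""
    hits = sum(t == p for t, p in zip(true_values, imputed_values))
    return hits / len(true_values)


# Hypothetical complete values of the masked entries and three candidate imputations.
true_cont = [1.0, 2.0, 3.0, 4.0]
r2_close = clamped_r2(true_cont, [1.1, 1.9, 3.2, 3.8])   # close to the truth
r2_mean = clamped_r2(true_cont, [2.5, 2.5, 2.5, 2.5])    # constant mean imputation
r2_bad = clamped_r2(true_cont, [4.0, 3.0, 2.0, 1.0])     # negative R2 before clamping
acc = accuracy(["a", "b", "b", "c"], ["a", "b", "c", "c"])
```

    A constant imputation such as the mean explains none of the variance of the masked values, so its clamped score is 0 in this example.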
    In summary, MF has the highest average imputation accuracy for categorical and continuous features. For both MCAR and MAR data, MF is the best method. Additionally, we observe that the differences between MF and the other methods are more prominent in MAR data. The results remain largely the same across the train and test sets. MM is one of the worst-performing methods in terms of imputation accuracy. However, MM performs similarly to MF when measuring the downstream task performance, as seen in Sections 5 and 6. This observation further supports our original hypothesis that higher imputation accuracy does not necessarily lead to better downstream task performance.
    C.3.1 MF Has the Highest Imputation Accuracy for MAR Data.
    For MAR data, MF has the highest average R2 and accuracy score in the train data, as seen in Figure 20(a) and (b). Figure 20(c) and (d) show that MF imputes values more accurately on test MAR data. In general, MF is the most accurate method for all missingness percentages. MM, as expected, has the lowest R2-score as it does not predict any of the variance of the continuous data. One interesting observation is that SOFT imputation fails to generalize on test data. As missingness increases, the imputation methods make worse predictions leading to lower scores.
    Fig. 20. Panel (a) depicts the imputation R2-score for the continuous variables in MAR train data. Panel (b) illustrates the imputation accuracy in MAR train data, for categorical variables. Panel (c) shows the R2-score for continuous variables in MAR test data and panel (d) the accuracy score for categorical data in MAR test data.
    C.3.2 MF Has the Highest Imputation Accuracy for MCAR Data.
    For MCAR data, MF is the most accurate imputation method in the train data, as seen in Figure 21(a) and (b). Figure 21(c) and (d) show that MF has the highest average R2 and accuracy score on test MCAR data, across all missingness percentages. MM, as expected, has the lowest R2-score. SOFT fails to generalize on new unseen data. Finally, as missingness increases, the quality of imputed values deteriorates.
    Fig. 21. Panel (a) depicts the imputation R2-score for the continuous variables in MCAR train data. Panel (b) illustrates the imputation accuracy in MCAR train data for categorical variables. Panel (c) shows the R2-score for continuous variables in MCAR test data and panel (d) shows the accuracy score for categorical data in MCAR test data.

    C.4 Real World: Downstream Task Results

    In this section, we include the quantitative results of the real-world experiments. Table 14 contains the AUC results, Table 15 contains the F1-score results, and Table 16 contains the results for the accuracy metric.
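    Table 14. Real-world Results for the AUC Metric
    | Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | analcatdata_reviewer-FS | 0.585 | 0.585 | 0.5 | 0.597 | 0.561 | 0.595 | 0.602 | 0.602 | 0.558 | 0.558 | 0.585 | 0.585 |
    | analcatdata_reviewer-NOFS | 0.661 | 0.668 | 0.599 | 0.646 | 0.602 | 0.61 | 0.597 | 0.659 | 0.606 | 0.656 | 0.661 | 0.668 |
    | analcatdata_reviewer-Overall | 0.661 | 0.668 | 0.605 | 0.643 | 0.607 | 0.63 | 0.597 | 0.659 | 0.606 | 0.656 | 0.661 | 0.668 |
    | anneal-FS | 0.883 | 0.988 | 0.761 | 0.97 | 0.916 | 0.973 | 0.987 | 0.967 | 0.995 | 0.986 | 0.975 | 0.968 |
    | anneal-NOFS | 0.881 | 0.996 | 0.938 | 0.97 | 0.931 | 0.983 | 0.972 | 0.991 | 0.996 | 0.996 | 0.982 | 0.982 |
    | anneal-Overall | 0.883 | 0.996 | 0.943 | 0.969 | 0.896 | 0.991 | 0.972 | 0.991 | 0.996 | 0.996 | 0.975 | 0.982 |
    | audiology-FS | 0.986 | 0.986 | 0.98 | 0.98 | 0.98 | 0.974 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
    | audiology-NOFS | 0.998 | 0.998 | 0.992 | 0.993 | 0.994 | 0.991 | 0.992 | 0.998 | 0.995 | 0.989 | 0.993 | 0.995 |
    | audiology-Overall | 0.998 | 0.998 | 0.98 | 0.981 | 0.993 | 0.992 | 0.992 | 0.98 | 0.995 | 0.989 | 0.993 | 0.98 |
    | autoHorse-FS | 0.967 | 0.967 | 0.966 | 0.966 | 0.966 | 0.966 | 0.966 | 0.966 | 0.901 | 0.996 | 0.966 | 0.966 |
    | autoHorse-NOFS | 0.99 | 0.989 | 0.983 | 0.988 | 0.982 | 0.988 | 0.981 | 0.99 | 0.989 | 0.993 | 0.976 | 0.976 |
    | autoHorse-Overall | 0.99 | 0.989 | 0.966 | 0.966 | 0.987 | 0.988 | 0.981 | 0.99 | 0.989 | 0.993 | 0.966 | 0.966 |
    | braziltourism-FS | 0.634 | 0.634 | 0.64 | 0.64 | 0.643 | 0.634 | 0.64 | 0.64 | 0.725 | 0.725 | 0.643 | 0.643 |
    | braziltourism-NOFS | 0.616 | 0.727 | 0.716 | 0.725 | 0.721 | 0.709 | 0.709 | 0.725 | 0.669 | 0.668 | 0.731 | 0.715 |
    | braziltourism-Overall | 0.616 | 0.727 | 0.643 | 0.64 | 0.632 | 0.64 | 0.709 | 0.64 | 0.669 | 0.668 | 0.731 | 0.715 |
    | bridges-FS | 0.844 | 0.844 | 0.853 | 0.857 | 0.891 | 0.849 | 0.891 | 0.891 | 0.892 | 0.885 | 0.88 | 0.847 |
    | bridges-NOFS | 0.909 | 0.911 | 0.902 | 0.901 | 0.882 | 0.916 | 0.902 | 0.889 | 0.901 | 0.916 | 0.915 | 0.912 |
    | bridges-Overall | 0.909 | 0.911 | 0.902 | 0.909 | 0.905 | 0.909 | 0.902 | 0.889 | 0.901 | 0.916 | 0.915 | 0.912 |
    | cjs-FS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.987 | 0.997 | 1.0 | 1.0 |
    | cjs-NOFS | 1.0 | 1.0 | 0.994 | 1.0 | 0.985 | 0.998 | 0.996 | 0.991 | 0.987 | 0.99 | 0.999 | 0.999 |
    | cjs-Overall | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.987 | 0.997 | 1.0 | 1.0 |
    | colic-FS | 0.829 | 0.829 | 0.837 | 0.855 | 0.845 | 0.839 | 0.836 | 0.836 | 0.839 | 0.839 | 0.83 | 0.83 |
    | colic-NOFS | 0.839 | 0.838 | 0.853 | 0.863 | 0.845 | 0.872 | 0.842 | 0.865 | 0.848 | 0.858 | 0.852 | 0.856 |
    | colic-Overall | 0.829 | 0.829 | 0.849 | 0.881 | 0.846 | 0.862 | 0.836 | 0.865 | 0.839 | 0.858 | 0.83 | 0.83 |
    | colleges_aaup-FS | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.996 | 0.996 | 0.999 | 0.999 |
    | colleges_aaup-NOFS | 0.999 | 0.999 | 0.997 | 0.999 | 0.997 | 0.997 | 0.998 | 0.997 | 0.998 | 0.998 | 0.998 | 0.997 |
    | colleges_aaup-Overall | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.998 | 0.998 | 0.999 | 0.999 |
    | cylinder-bands-FS | 0.808 | 0.785 | 0.81 | 0.788 | 0.808 | 0.797 | 0.808 | 0.798 | 0.723 | 0.723 | 0.832 | 0.826 |
    | cylinder-bands-NOFS | 0.82 | 0.819 | 0.826 | 0.832 | 0.826 | 0.828 | 0.815 | 0.821 | 0.807 | 0.824 | 0.848 | 0.858 |
    | cylinder-bands-Overall | 0.82 | 0.819 | 0.832 | 0.831 | 0.813 | 0.816 | 0.815 | 0.821 | 0.807 | 0.824 | 0.848 | 0.858 |
    | dresses-sales-FS | 0.562 | 0.562 | 0.564 | 0.561 | 0.56 | 0.552 | 0.565 | 0.565 | 0.5 | 0.549 | 0.562 | 0.562 |
    | dresses-sales-NOFS | 0.619 | 0.605 | 0.6 | 0.597 | 0.569 | 0.583 | 0.597 | 0.601 | 0.545 | 0.539 | 0.631 | 0.62 |
    | dresses-sales-Overall | 0.619 | 0.562 | 0.564 | 0.561 | 0.567 | 0.552 | 0.565 | 0.565 | 0.545 | 0.539 | 0.631 | 0.562 |
    | eucalyptus-FS | 0.833 | 0.833 | 0.821 | 0.82 | 0.823 | 0.835 | 0.842 | 0.817 | 0.75 | 0.816 | 0.807 | 0.834 |
    | eucalyptus-NOFS | 0.777 | 0.777 | 0.778 | 0.778 | 0.778 | 0.777 | 0.779 | 0.844 | 0.82 | 0.824 | 0.849 | 0.855 |
    | eucalyptus-Overall | 0.833 | 0.833 | 0.778 | 0.778 | 0.832 | 0.836 | 0.842 | 0.817 | 0.82 | 0.824 | 0.849 | 0.855 |
    | hepatitis-FS | 0.748 | 0.748 | 0.834 | 0.83 | 0.803 | 0.807 | 0.673 | 0.673 | 0.799 | 0.799 | 0.826 | 0.826 |
    | hepatitis-NOFS | 0.826 | 0.848 | 0.866 | 0.866 | 0.844 | 0.85 | 0.869 | 0.864 | 0.876 | 0.866 | 0.852 | 0.869 |
    | hepatitis-Overall | 0.826 | 0.848 | 0.867 | 0.858 | 0.858 | 0.866 | 0.869 | 0.864 | 0.799 | 0.799 | 0.852 | 0.869 |
    | hungarian-FS | 0.899 | 0.899 | 0.883 | 0.884 | 0.893 | 0.867 | 0.871 | 0.871 | 0.897 | 0.897 | 0.875 | 0.875 |
    | hungarian-NOFS | 0.918 | 0.915 | 0.901 | 0.901 | 0.917 | 0.915 | 0.881 | 0.895 | 0.895 | 0.898 | 0.914 | 0.912 |
    | hungarian-Overall | 0.918 | 0.915 | 0.887 | 0.888 | 0.913 | 0.918 | 0.871 | 0.871 | 0.895 | 0.897 | 0.914 | 0.912 |
    | kdd_el_nino-small-FS | 0.983 | 0.983 | 0.98 | 0.981 | 0.988 | 0.987 | 0.983 | 0.981 | 0.95 | 0.926 | 0.983 | 0.986 |
    | kdd_el_nino-small-NOFS | 0.987 | 0.989 | 0.984 | 0.985 | 0.988 | 0.988 | 0.985 | 0.988 | 0.98 | 0.985 | 0.986 | 0.987 |
    | kdd_el_nino-small-Overall | 0.987 | 0.989 | 0.982 | 0.986 | 0.988 | 0.987 | 0.985 | 0.988 | 0.98 | 0.985 | 0.986 | 0.987 |
    | mushroom-FS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
    | mushroom-NOFS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
    | mushroom-Overall | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
    | pbcseq-FS | 0.849 | 0.849 | 0.851 | 0.852 | 0.84 | 0.849 | 0.836 | 0.831 | 0.849 | 0.849 | 0.846 | 0.843 |
    | pbcseq-NOFS | 0.849 | 0.85 | 0.857 | 0.844 | 0.856 | 0.848 | 0.846 | 0.845 | 0.841 | 0.838 | 0.848 | 0.842 |
    | pbcseq-Overall | 0.849 | 0.85 | 0.851 | 0.85 | 0.851 | 0.849 | 0.846 | 0.845 | 0.841 | 0.838 | 0.848 | 0.842 |
    | primary-tumor-FS | 0.875 | 0.887 | 0.855 | 0.875 | 0.846 | 0.874 | 0.829 | 0.882 | 0.829 | 0.864 | 0.786 | 0.887 |
    | primary-tumor-NOFS | 0.88 | 0.892 | 0.875 | 0.875 | 0.871 | 0.88 | 0.877 | 0.889 | 0.861 | 0.87 | 0.88 | 0.892 |
    | primary-tumor-Overall | 0.88 | 0.892 | 0.867 | 0.892 | 0.886 | 0.87 | 0.877 | 0.889 | 0.861 | 0.87 | 0.88 | 0.892 |
    | profb-FS | 0.642 | 0.642 | 0.642 | 0.642 | 0.642 | 0.642 | 0.642 | 0.642 | 0.631 | 0.631 | 0.642 | 0.642 |
    | profb-NOFS | 0.695 | 0.696 | 0.691 | 0.693 | 0.695 | 0.692 | 0.695 | 0.696 | 0.579 | 0.58 | 0.693 | 0.695 |
    | profb-Overall | 0.695 | 0.696 | 0.692 | 0.694 | 0.692 | 0.686 | 0.695 | 0.696 | 0.631 | 0.631 | 0.693 | 0.695 |
    | schizo-FS | 0.526 | 0.526 | 0.556 | 0.557 | 0.534 | 0.515 | 0.556 | 0.556 | 0.543 | 0.543 | 0.623 | 0.623 |
    | schizo-NOFS | 0.719 | 0.684 | 0.764 | 0.765 | 0.717 | 0.74 | 0.76 | 0.76 | 0.56 | 0.559 | 0.794 | 0.805 |
    | schizo-Overall | 0.719 | 0.684 | 0.781 | 0.772 | 0.736 | 0.751 | 0.76 | 0.76 | 0.543 | 0.543 | 0.794 | 0.805 |
    | sick-FS | 0.984 | 0.984 | 0.992 | 0.993 | 0.993 | 0.992 | 0.977 | 0.969 | 0.991 | 0.991 | 0.992 | 0.992 |
    | sick-NOFS | 0.99 | 0.989 | 0.994 | 0.994 | 0.992 | 0.995 | 0.986 | 0.988 | 0.992 | 0.989 | 0.993 | 0.992 |
    | sick-Overall | 0.99 | 0.989 | 0.994 | 0.994 | 0.992 | 0.993 | 0.986 | 0.988 | 0.992 | 0.989 | 0.993 | 0.992 |
    | soybean-FS | 0.981 | 0.983 | 0.983 | 0.992 | 0.991 | 0.981 | 0.985 | 0.985 | 0.973 | 0.973 | 0.991 | 0.991 |
    | soybean-NOFS | 0.991 | 0.994 | 0.989 | 0.986 | 0.989 | 0.992 | 0.987 | 0.991 | 0.991 | 0.985 | 0.989 | 0.993 |
    | soybean-Overall | 0.981 | 0.994 | 0.985 | 0.991 | 0.99 | 0.991 | 0.987 | 0.991 | 0.991 | 0.985 | 0.989 | 0.993 |
    | stress-FS | 0.916 | 0.916 | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 | 0.933 | 0.933 | 0.932 | 0.932 |
    | stress-NOFS | 0.902 | 0.904 | 0.899 | 0.903 | 0.906 | 0.904 | 0.902 | 0.901 | 0.948 | 0.946 | 0.909 | 0.909 |
    | stress-Overall | 0.916 | 0.916 | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 | 0.932 | 0.933 | 0.933 | 0.909 | 0.932 |
    | vote-FS | 0.983 | 0.985 | 0.992 | 0.995 | 0.986 | 0.99 | 0.991 | 0.991 | 0.989 | 0.991 | 0.978 | 0.986 |
    | vote-NOFS | 0.992 | 0.991 | 0.994 | 0.991 | 0.994 | 0.99 | 0.995 | 0.991 | 0.995 | 0.992 | 0.992 | 0.992 |
    | vote-Overall | 0.992 | 0.991 | 0.992 | 0.991 | 0.995 | 0.992 | 0.991 | 0.991 | 0.989 | 0.991 | 0.992 | 0.992 |
    | water-treatment-FS | 0.916 | 0.988 | 0.986 | 0.987 | 0.958 | 0.988 | 0.943 | 0.943 | 0.5 | 0.5 | 0.962 | 0.979 |
    | water-treatment-NOFS | 0.988 | 0.988 | 0.988 | 0.987 | 0.988 | 0.988 | 0.954 | 0.986 | 0.788 | 0.774 | 0.98 | 0.981 |
    | water-treatment-Overall | 0.988 | 0.988 | 0.987 | 0.987 | 0.988 | 0.988 | 0.954 | 0.986 | 0.788 | 0.774 | 0.98 | 0.981 |
    -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.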
    Table 15.
    DatasetMMBI+MMMFBI+MFGAINBI+GAINSOFTBI+SOFTPPCABI+PPCADAEBI+DAE
    analcatdata_reviewer-FS0.6030.6030.6030.6030.6030.6030.6080.6080.6030.6030.6030.603 
    analcatdata_reviewer-NOFS0.6350.6560.6090.6530.6060.6360.6240.6530.6320.6450.6350.656 
    analcatdata_reviewer-Overall0.6350.6560.6140.6350.6120.6070.6240.6530.6320.6450.6350.656 
    anneal-FS0.9050.9880.880.990.9280.9910.9710.9870.9810.9760.9660.99 
    anneal-NOFS0.9030.990.930.9870.9420.9770.9610.990.9850.9930.9750.983 
    anneal-Overall0.9050.990.9330.9870.9240.990.9610.990.9850.9930.9660.983 
    audiology-FS0.9310.9310.9260.9260.9260.9090.9260.9260.9260.9260.9260.926 
    audiology-NOFS0.9660.9660.9310.9450.9310.9310.9330.9660.9490.9180.9310.966 
    audiology-Overall0.9660.9660.9260.8930.9260.9260.9330.9260.9490.9180.9310.926 
    autoHorse-FS0.9760.9760.9680.9680.9680.9680.9680.9680.8890.9840.9680.968 
    autoHorse-NOFS0.9840.9840.9760.9840.9760.9840.9760.9760.9760.9760.9760.976 
    autoHorse-Overall0.9840.9840.9680.9680.9760.9840.9760.9760.9760.9760.9680.968 
    braziltourism-FS0.8930.8930.8850.8850.8930.8910.8850.8850.8950.8950.8930.893 
    braziltourism-NOFS0.8720.8770.8830.8760.8820.8750.8790.8810.8810.880.8780.876 
    braziltourism-Overall0.8720.8770.8850.8850.8850.8850.8790.8850.8810.880.8780.876 
    bridges-FS0.7780.7780.80.80.8370.8090.8290.8290.8090.8160.8180.783 
    bridges-NOFS0.80.8240.8290.80.7920.830.830.8080.8160.8210.830.8 
    bridges-Overall0.80.8240.8260.8150.830.8080.830.8080.8160.8210.830.8 
    cjs-FS1.01.01.01.00.9990.9971.01.00.9310.9490.9881.0 
    cjs-NOFS1.01.00.9840.9810.9730.9740.9740.9560.9310.9430.9760.981 
    cjs-Overall1.01.01.01.00.9990.9991.01.00.9310.9490.9881.0 
    colic-FS0.9060.9060.8910.8950.8860.8990.8760.8760.8770.8770.8960.896 
    colic-NOFS0.8860.8920.8770.8970.9020.890.8930.9090.8780.8890.8810.873 
    colic-Overall0.9060.9060.8980.9020.9020.8980.8760.9090.8770.8890.8960.896 
    colleges_aaup-FS0.9940.9880.9930.9880.9880.9880.9930.9930.9850.9850.9940.993 
    colleges_aaup-NOFS0.9870.9870.9850.9870.9850.9830.9870.9850.9890.9890.9890.983 
    colleges_aaup-Overall0.9940.9880.9930.9880.9880.9880.9930.9930.9890.9890.9940.993 
    cylinder-bands-FS0.7960.810.80.810.7970.8130.7930.8190.810.810.8320.827 
    cylinder-bands-NOFS0.8160.8050.8270.8260.8140.8080.8060.8020.7990.8250.8440.852 
    cylinder-bands-Overall0.8160.8050.8290.8320.810.80.8060.8020.7990.8250.8440.852 
    dresses-sales-FS0.5920.5920.5920.5920.5950.5920.5920.5920.5920.5920.5920.592 
    dresses-sales-NOFS0.5970.6010.5920.5980.5940.5920.6050.6010.5930.5960.6020.602 
    dresses-sales-Overall0.5970.5920.5920.5920.5920.5920.5920.5920.5930.5960.6020.592 
    eucalyptus-FS0.6720.6720.6440.6440.6640.6770.680.6590.6030.6340.6490.654 
    eucalyptus-NOFS0.6380.6380.6380.6380.6380.6380.640.6780.650.6520.6970.691 
    eucalyptus-Overall0.6720.6720.6380.6380.6640.6760.680.6590.650.6520.6970.691 
    hepatitis-FS0.9120.9120.9020.9320.9320.9170.9120.9120.8960.8960.9320.932 
    hepatitis-NOFS0.910.910.9160.9280.910.9040.9190.9240.9320.9250.9120.913 
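    hepatitis-Overall | 0.91 | 0.91 | 0.931 | 0.912 | 0.913 | 0.917 | 0.919 | 0.924 | 0.896 | 0.896 | 0.912 | 0.913
    hungarian-FS | 0.81 | 0.81 | 0.79 | 0.789 | 0.803 | 0.752 | 0.78 | 0.78 | 0.783 | 0.783 | 0.77 | 0.77
    hungarian-NOFS | 0.826 | 0.824 | 0.804 | 0.804 | 0.814 | 0.826 | 0.817 | 0.807 | 0.8 | 0.81 | 0.81 | 0.814
    hungarian-Overall | 0.826 | 0.824 | 0.79 | 0.79 | 0.806 | 0.821 | 0.78 | 0.78 | 0.8 | 0.783 | 0.81 | 0.814
    kdd_el_nino-small-FS | 0.897 | 0.897 | 0.908 | 0.91 | 0.917 | 0.913 | 0.919 | 0.897 | 0.83 | 0.777 | 0.918 | 0.932
    kdd_el_nino-small-NOFS | 0.925 | 0.928 | 0.919 | 0.915 | 0.924 | 0.927 | 0.9 | 0.923 | 0.898 | 0.916 | 0.922 | 0.921
    kdd_el_nino-small-Overall | 0.925 | 0.928 | 0.911 | 0.918 | 0.919 | 0.918 | 0.9 | 0.923 | 0.898 | 0.916 | 0.922 | 0.921
    mushroom-FS | 1.0 | 1.0 | 0.999 | 1.0 | 1.0 | 1.0 | 0.999 | 0.999 | 0.996 | 0.997 | 1.0 | 1.0
    mushroom-NOFS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.999 | 0.998 | 1.0 | 1.0
    mushroom-Overall | 1.0 | 1.0 | 0.999 | 1.0 | 1.0 | 1.0 | 0.999 | 0.999 | 0.996 | 0.998 | 1.0 | 1.0
    pbcseq-FS | 0.776 | 0.776 | 0.788 | 0.796 | 0.77 | 0.779 | 0.763 | 0.772 | 0.779 | 0.779 | 0.786 | 0.767
    pbcseq-NOFS | 0.775 | 0.779 | 0.787 | 0.779 | 0.785 | 0.792 | 0.779 | 0.78 | 0.773 | 0.774 | 0.785 | 0.775
    pbcseq-Overall | 0.775 | 0.779 | 0.781 | 0.783 | 0.785 | 0.777 | 0.779 | 0.78 | 0.773 | 0.774 | 0.785 | 0.775
    primary-tumor-FS | 0.753 | 0.703 | 0.688 | 0.692 | 0.705 | 0.711 | 0.652 | 0.704 | 0.652 | 0.673 | 0.585 | 0.703
    primary-tumor-NOFS | 0.735 | 0.712 | 0.717 | 0.717 | 0.744 | 0.703 | 0.736 | 0.703 | 0.736 | 0.702 | 0.735 | 0.712
    primary-tumor-Overall | 0.735 | 0.712 | 0.714 | 0.738 | 0.744 | 0.699 | 0.736 | 0.703 | 0.736 | 0.702 | 0.735 | 0.712
    profb-FS | 0.548 | 0.548 | 0.548 | 0.548 | 0.548 | 0.548 | 0.548 | 0.548 | 0.552 | 0.552 | 0.548 | 0.548
    profb-NOFS | 0.573 | 0.573 | 0.562 | 0.576 | 0.562 | 0.571 | 0.573 | 0.573 | 0.514 | 0.51 | 0.572 | 0.573
    profb-Overall | 0.573 | 0.573 | 0.569 | 0.576 | 0.568 | 0.56 | 0.573 | 0.573 | 0.552 | 0.552 | 0.572 | 0.573
    schizo-FS | 0.653 | 0.653 | 0.653 | 0.653 | 0.645 | 0.65 | 0.648 | 0.648 | 0.658 | 0.658 | 0.648 | 0.648
    schizo-NOFS | 0.684 | 0.645 | 0.735 | 0.729 | 0.694 | 0.701 | 0.728 | 0.722 | 0.653 | 0.653 | 0.75 | 0.763
    schizo-Overall | 0.684 | 0.645 | 0.719 | 0.722 | 0.696 | 0.71 | 0.728 | 0.722 | 0.658 | 0.658 | 0.75 | 0.763
    sick-FS | 0.837 | 0.837 | 0.838 | 0.836 | 0.86 | 0.851 | 0.841 | 0.823 | 0.843 | 0.843 | 0.849 | 0.849
    sick-NOFS | 0.839 | 0.839 | 0.847 | 0.849 | 0.842 | 0.851 | 0.855 | 0.851 | 0.843 | 0.839 | 0.845 | 0.848
    sick-Overall | 0.839 | 0.839 | 0.837 | 0.854 | 0.842 | 0.849 | 0.855 | 0.851 | 0.843 | 0.839 | 0.845 | 0.848
    soybean-FS | 0.807 | 0.839 | 0.889 | 0.901 | 0.918 | 0.833 | 0.843 | 0.843 | 0.833 | 0.833 | 0.897 | 0.905
    soybean-NOFS | 0.913 | 0.898 | 0.884 | 0.867 | 0.894 | 0.882 | 0.875 | 0.897 | 0.911 | 0.871 | 0.882 | 0.899
    soybean-Overall | 0.807 | 0.898 | 0.86 | 0.889 | 0.887 | 0.884 | 0.875 | 0.897 | 0.911 | 0.871 | 0.882 | 0.899
    stress-FS | 0.792 | 0.792 | 0.844 | 0.844 | 0.844 | 0.844 | 0.844 | 0.844 | 0.85 | 0.85 | 0.844 | 0.844
    stress-NOFS | 0.75 | 0.75 | 0.769 | 0.756 | 0.756 | 0.75 | 0.756 | 0.75 | 0.878 | 0.837 | 0.75 | 0.75
    stress-Overall | 0.792 | 0.792 | 0.844 | 0.844 | 0.844 | 0.844 | 0.844 | 0.844 | 0.85 | 0.85 | 0.75 | 0.844
    vote-FS | 0.948 | 0.948 | 0.948 | 0.948 | 0.948 | 0.949 | 0.948 | 0.948 | 0.949 | 0.948 | 0.929 | 0.948
    vote-NOFS | 0.943 | 0.954 | 0.948 | 0.949 | 0.949 | 0.953 | 0.959 | 0.955 | 0.959 | 0.954 | 0.948 | 0.959
    vote-Overall | 0.943 | 0.954 | 0.943 | 0.955 | 0.955 | 0.965 | 0.948 | 0.948 | 0.949 | 0.948 | 0.948 | 0.959
    water-treatment-FS | 0.8 | 0.987 | 0.975 | 0.975 | 0.821 | 0.987 | 0.732 | 0.732 | 0.263 | 0.263 | 0.961 | 0.94
    water-treatment-NOFS | 0.987 | 0.987 | 0.987 | 0.975 | 0.987 | 0.987 | 0.785 | 0.95 | 0.571 | 0.5 | 0.951 | 0.918
    water-treatment-Overall | 0.987 | 0.987 | 0.975 | 0.975 | 0.987 | 0.987 | 0.785 | 0.95 | 0.571 | 0.5 | 0.951 | 0.918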
    Table 15. Real-world Results for the F1 Metric
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 16.
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    analcatdata_reviewer-FS | 0.589 | 0.589 | 0.568 | 0.611 | 0.6 | 0.6 | 0.6 | 0.6 | 0.568 | 0.568 | 0.589 | 0.589
    analcatdata_reviewer-NOFS | 0.632 | 0.647 | 0.621 | 0.637 | 0.626 | 0.6 | 0.6 | 0.647 | 0.595 | 0.637 | 0.632 | 0.647
    analcatdata_reviewer-Overall | 0.632 | 0.647 | 0.621 | 0.642 | 0.6 | 0.637 | 0.6 | 0.647 | 0.595 | 0.637 | 0.632 | 0.647
    anneal-FS | 0.849 | 0.982 | 0.795 | 0.984 | 0.884 | 0.987 | 0.955 | 0.98 | 0.971 | 0.964 | 0.947 | 0.984
    anneal-NOFS | 0.844 | 0.984 | 0.895 | 0.98 | 0.906 | 0.964 | 0.94 | 0.984 | 0.978 | 0.989 | 0.962 | 0.973
    anneal-Overall | 0.849 | 0.984 | 0.895 | 0.98 | 0.88 | 0.984 | 0.94 | 0.984 | 0.978 | 0.989 | 0.947 | 0.973
    audiology-FS | 0.965 | 0.965 | 0.965 | 0.965 | 0.965 | 0.956 | 0.965 | 0.965 | 0.965 | 0.965 | 0.965 | 0.965
    audiology-NOFS | 0.982 | 0.982 | 0.965 | 0.973 | 0.965 | 0.965 | 0.965 | 0.982 | 0.973 | 0.956 | 0.965 | 0.982
    audiology-Overall | 0.982 | 0.982 | 0.965 | 0.947 | 0.965 | 0.965 | 0.965 | 0.965 | 0.973 | 0.956 | 0.965 | 0.965
    autoHorse-FS | 0.971 | 0.971 | 0.961 | 0.961 | 0.961 | 0.961 | 0.961 | 0.961 | 0.874 | 0.981 | 0.961 | 0.961
    autoHorse-NOFS | 0.981 | 0.981 | 0.971 | 0.981 | 0.971 | 0.981 | 0.971 | 0.971 | 0.971 | 0.971 | 0.971 | 0.971
    autoHorse-Overall | 0.981 | 0.981 | 0.961 | 0.961 | 0.971 | 0.981 | 0.971 | 0.971 | 0.971 | 0.971 | 0.961 | 0.961
    braziltourism-FS | 0.816 | 0.816 | 0.806 | 0.806 | 0.816 | 0.811 | 0.806 | 0.806 | 0.82 | 0.82 | 0.816 | 0.816
    braziltourism-NOFS | 0.777 | 0.791 | 0.801 | 0.791 | 0.801 | 0.791 | 0.796 | 0.796 | 0.796 | 0.791 | 0.796 | 0.786
    braziltourism-Overall | 0.777 | 0.791 | 0.806 | 0.806 | 0.806 | 0.806 | 0.796 | 0.806 | 0.796 | 0.791 | 0.796 | 0.786
    bridges-FS | 0.796 | 0.796 | 0.833 | 0.815 | 0.87 | 0.833 | 0.87 | 0.87 | 0.833 | 0.833 | 0.852 | 0.815
    bridges-NOFS | 0.833 | 0.833 | 0.87 | 0.833 | 0.815 | 0.852 | 0.833 | 0.815 | 0.833 | 0.87 | 0.833 | 0.833
    bridges-Overall | 0.833 | 0.833 | 0.852 | 0.815 | 0.833 | 0.833 | 0.833 | 0.815 | 0.833 | 0.87 | 0.833 | 0.833
    cjs-FS | 1.0 | 1.0 | 1.0 | 1.0 | 0.999 | 0.999 | 1.0 | 1.0 | 0.967 | 0.976 | 0.994 | 1.0
    cjs-NOFS | 1.0 | 1.0 | 0.992 | 0.991 | 0.987 | 0.988 | 0.987 | 0.979 | 0.967 | 0.973 | 0.989 | 0.991
    cjs-Overall | 1.0 | 1.0 | 1.0 | 1.0 | 0.999 | 0.999 | 1.0 | 1.0 | 0.967 | 0.976 | 0.994 | 1.0
    colic-FS | 0.875 | 0.875 | 0.853 | 0.859 | 0.848 | 0.864 | 0.837 | 0.837 | 0.837 | 0.837 | 0.864 | 0.864
    colic-NOFS | 0.848 | 0.853 | 0.837 | 0.859 | 0.87 | 0.853 | 0.859 | 0.88 | 0.837 | 0.853 | 0.837 | 0.837
    colic-Overall | 0.875 | 0.875 | 0.864 | 0.87 | 0.87 | 0.864 | 0.837 | 0.88 | 0.837 | 0.853 | 0.864 | 0.864
    colleges_aaup-FS | 0.991 | 0.983 | 0.99 | 0.983 | 0.983 | 0.983 | 0.99 | 0.99 | 0.979 | 0.979 | 0.991 | 0.99
    colleges_aaup-NOFS | 0.981 | 0.981 | 0.979 | 0.981 | 0.979 | 0.976 | 0.981 | 0.979 | 0.985 | 0.985 | 0.985 | 0.976
    colleges_aaup-Overall | 0.991 | 0.983 | 0.99 | 0.983 | 0.983 | 0.983 | 0.99 | 0.99 | 0.985 | 0.985 | 0.991 | 0.99
    cylinder-bands-FS | 0.752 | 0.741 | 0.748 | 0.741 | 0.756 | 0.741 | 0.756 | 0.756 | 0.73 | 0.73 | 0.785 | 0.778
    cylinder-bands-NOFS | 0.781 | 0.77 | 0.793 | 0.796 | 0.781 | 0.774 | 0.767 | 0.763 | 0.759 | 0.781 | 0.807 | 0.807
    cylinder-bands-Overall | 0.781 | 0.77 | 0.804 | 0.8 | 0.774 | 0.767 | 0.767 | 0.763 | 0.759 | 0.781 | 0.807 | 0.807
    dresses-sales-FS | 0.632 | 0.632 | 0.632 | 0.632 | 0.632 | 0.632 | 0.632 | 0.632 | 0.58 | 0.596 | 0.632 | 0.632
    dresses-sales-NOFS | 0.644 | 0.648 | 0.632 | 0.624 | 0.628 | 0.608 | 0.644 | 0.648 | 0.624 | 0.612 | 0.644 | 0.648
    dresses-sales-Overall | 0.644 | 0.632 | 0.632 | 0.632 | 0.632 | 0.632 | 0.632 | 0.632 | 0.624 | 0.612 | 0.644 | 0.632
    eucalyptus-FS | 0.791 | 0.791 | 0.774 | 0.774 | 0.772 | 0.783 | 0.783 | 0.769 | 0.712 | 0.761 | 0.75 | 0.774
    eucalyptus-NOFS | 0.769 | 0.769 | 0.769 | 0.769 | 0.769 | 0.769 | 0.758 | 0.785 | 0.791 | 0.791 | 0.785 | 0.788
    eucalyptus-Overall | 0.791 | 0.791 | 0.769 | 0.769 | 0.777 | 0.78 | 0.783 | 0.769 | 0.791 | 0.791 | 0.785 | 0.788
    hepatitis-FS | 0.846 | 0.846 | 0.833 | 0.885 | 0.885 | 0.859 | 0.846 | 0.846 | 0.821 | 0.821 | 0.885 | 0.885
    hepatitis-NOFS | 0.846 | 0.846 | 0.859 | 0.885 | 0.846 | 0.833 | 0.859 | 0.872 | 0.885 | 0.872 | 0.846 | 0.859
    hepatitis-Overall | 0.846 | 0.846 | 0.885 | 0.859 | 0.859 | 0.859 | 0.859 | 0.872 | 0.821 | 0.821 | 0.846 | 0.859
    hungarian-FS | 0.844 | 0.844 | 0.83 | 0.837 | 0.837 | 0.823 | 0.85 | 0.85 | 0.844 | 0.844 | 0.83 | 0.83
    hungarian-NOFS | 0.857 | 0.857 | 0.857 | 0.857 | 0.857 | 0.864 | 0.857 | 0.864 | 0.844 | 0.85 | 0.85 | 0.857
    hungarian-Overall | 0.857 | 0.857 | 0.837 | 0.837 | 0.857 | 0.857 | 0.85 | 0.85 | 0.844 | 0.844 | 0.85 | 0.857
    kdd_el_nino-small-FS | 0.928 | 0.928 | 0.934 | 0.939 | 0.944 | 0.936 | 0.944 | 0.931 | 0.88 | 0.849 | 0.944 | 0.954
    kdd_el_nino-small-NOFS | 0.946 | 0.949 | 0.944 | 0.939 | 0.946 | 0.949 | 0.931 | 0.946 | 0.926 | 0.941 | 0.944 | 0.944
    kdd_el_nino-small-Overall | 0.946 | 0.949 | 0.936 | 0.941 | 0.944 | 0.941 | 0.931 | 0.946 | 0.926 | 0.941 | 0.944 | 0.944
    mushroom-FS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.996 | 0.998 | 1.0 | 1.0
    mushroom-NOFS | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.998 | 1.0 | 1.0
    mushroom-Overall | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.996 | 0.998 | 1.0 | 1.0
    pbcseq-FS | 0.773 | 0.773 | 0.779 | 0.783 | 0.764 | 0.78 | 0.758 | 0.757 | 0.782 | 0.782 | 0.772 | 0.769
    pbcseq-NOFS | 0.776 | 0.772 | 0.777 | 0.776 | 0.782 | 0.777 | 0.776 | 0.776 | 0.766 | 0.764 | 0.782 | 0.775
    pbcseq-Overall | 0.776 | 0.772 | 0.781 | 0.779 | 0.781 | 0.781 | 0.776 | 0.776 | 0.766 | 0.764 | 0.782 | 0.775
    primary-tumor-FS | 0.865 | 0.871 | 0.835 | 0.841 | 0.853 | 0.871 | 0.841 | 0.876 | 0.841 | 0.859 | 0.812 | 0.871
    primary-tumor-NOFS | 0.859 | 0.876 | 0.847 | 0.853 | 0.871 | 0.841 | 0.865 | 0.871 | 0.865 | 0.853 | 0.859 | 0.876
    primary-tumor-Overall | 0.859 | 0.876 | 0.847 | 0.841 | 0.871 | 0.859 | 0.865 | 0.871 | 0.865 | 0.853 | 0.859 | 0.876
    profb-FS | 0.673 | 0.67 | 0.67 | 0.673 | 0.673 | 0.673 | 0.673 | 0.673 | 0.676 | 0.676 | 0.673 | 0.673
    profb-NOFS | 0.705 | 0.705 | 0.708 | 0.702 | 0.711 | 0.711 | 0.705 | 0.705 | 0.667 | 0.67 | 0.705 | 0.705
    profb-Overall | 0.705 | 0.705 | 0.705 | 0.708 | 0.705 | 0.702 | 0.705 | 0.705 | 0.676 | 0.676 | 0.705 | 0.705
    schizo-FS | 0.535 | 0.535 | 0.594 | 0.594 | 0.576 | 0.547 | 0.588 | 0.588 | 0.553 | 0.559 | 0.659 | 0.659
    schizo-NOFS | 0.718 | 0.706 | 0.741 | 0.776 | 0.735 | 0.759 | 0.724 | 0.741 | 0.606 | 0.594 | 0.776 | 0.788
    schizo-Overall | 0.718 | 0.706 | 0.753 | 0.753 | 0.753 | 0.765 | 0.724 | 0.741 | 0.553 | 0.559 | 0.776 | 0.788
    sick-FS | 0.979 | 0.979 | 0.981 | 0.981 | 0.983 | 0.981 | 0.979 | 0.977 | 0.98 | 0.98 | 0.981 | 0.981
    sick-NOFS | 0.979 | 0.979 | 0.982 | 0.981 | 0.98 | 0.98 | 0.981 | 0.981 | 0.98 | 0.98 | 0.98 | 0.981
    sick-Overall | 0.979 | 0.979 | 0.98 | 0.98 | 0.98 | 0.98 | 0.981 | 0.981 | 0.98 | 0.98 | 0.98 | 0.981
    soybean-FS | 0.944 | 0.956 | 0.971 | 0.974 | 0.98 | 0.956 | 0.962 | 0.962 | 0.959 | 0.959 | 0.974 | 0.974
    soybean-NOFS | 0.977 | 0.971 | 0.968 | 0.965 | 0.971 | 0.968 | 0.965 | 0.974 | 0.977 | 0.968 | 0.968 | 0.974
    soybean-Overall | 0.944 | 0.971 | 0.962 | 0.971 | 0.968 | 0.971 | 0.965 | 0.974 | 0.977 | 0.968 | 0.968 | 0.974
    stress-FS | 0.91 | 0.91 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.94 | 0.94 | 0.93 | 0.93
    stress-NOFS | 0.9 | 0.9 | 0.91 | 0.9 | 0.9 | 0.9 | 0.89 | 0.9 | 0.95 | 0.93 | 0.9 | 0.9
    stress-Overall | 0.91 | 0.91 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.94 | 0.94 | 0.9 | 0.93
    vote-FS | 0.959 | 0.959 | 0.959 | 0.959 | 0.959 | 0.959 | 0.959 | 0.959 | 0.959 | 0.959 | 0.945 | 0.959
    vote-NOFS | 0.954 | 0.963 | 0.959 | 0.959 | 0.959 | 0.963 | 0.968 | 0.963 | 0.968 | 0.963 | 0.959 | 0.968
    vote-Overall | 0.954 | 0.963 | 0.954 | 0.963 | 0.963 | 0.972 | 0.959 | 0.959 | 0.959 | 0.959 | 0.959 | 0.968
    water-treatment-FS | 0.943 | 0.996 | 0.992 | 0.992 | 0.947 | 0.996 | 0.92 | 0.92 | 0.848 | 0.848 | 0.989 | 0.981
    water-treatment-NOFS | 0.996 | 0.996 | 0.996 | 0.992 | 0.996 | 0.996 | 0.936 | 0.985 | 0.879 | 0.864 | 0.985 | 0.973
    water-treatment-Overall | 0.996 | 0.996 | 0.992 | 0.992 | 0.996 | 0.996 | 0.936 | 0.985 | 0.879 | 0.864 | 0.985 | 0.973
    Table 16. Real-world Results for the ACC Metric
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
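
    The column labels in the tables above pair each imputer with a "BI+" variant, in which binary indicator variables encoding the missingness pattern are appended to the imputed features. A minimal sketch of what BI+MM could look like follows; the helper name `mean_mode_impute_with_indicators`, the integer coding of categorical columns, and the handling of all-missing columns are assumptions made for illustration, not the implementation used in the experiments.

```python
import numpy as np

def mean_mode_impute_with_indicators(X, categorical_cols=()):
    """Hypothetical helper: Mean/Mode imputation plus binary missingness indicators.

    Numeric columns are filled with the column mean, categorical
    (integer-coded) columns with the column mode. One 0/1 indicator
    column is appended for every feature that has at least one missing value.
    """
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                      # True where a value is missing
    filled = X.copy()
    for j in range(X.shape[1]):
        observed = X[~mask[:, j], j]
        if observed.size == 0:
            fill = 0.0                      # assumption: all-missing column -> 0
        elif j in categorical_cols:
            values, counts = np.unique(observed, return_counts=True)
            fill = values[np.argmax(counts)]  # mode of the observed values
        else:
            fill = observed.mean()          # mean of the observed values
        filled[mask[:, j], j] = fill
    cols_with_missing = mask.any(axis=0)
    indicators = mask[:, cols_with_missing].astype(float)
    return np.hstack([filled, indicators])

# Toy data: column 0 is numeric, column 1 is a 0/1-coded categorical feature.
X = np.array([[1.0, 0.0], [np.nan, 1.0], [3.0, np.nan], [np.nan, 1.0]])
out = mean_mode_impute_with_indicators(X, categorical_cols=(1,))
print(out)
```

    In a real pipeline the fill values would be estimated on the training folds only and then reused on held-out samples; the sketch omits that split for brevity.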

    C.5 MCAR: Downstream Task Results

    This section presents the downstream-task results for data with values missing completely at random (MCAR), at three levels of missingness and for three metrics. Tables 17, 18, and 19 report the AUC at 10%, 25%, and 50% missingness, respectively. Tables 20, 21, and 22 report the F1-score at the same three levels, and Tables 23, 24, and 25 report the classification accuracy (ACC).
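
    For readers who want to reproduce a comparable setup, the following is a minimal sketch of MCAR injection on a complete dataset. It assumes that every cell is removed independently with the same probability; the function name `inject_mcar` and the toy matrix are illustrative and are not taken from the experimental code.

```python
import numpy as np

def inject_mcar(X, missing_rate, seed=0):
    """Hypothetical helper: remove each cell independently with probability missing_rate.

    Under MCAR the chance that a value is missing does not depend on any
    observed or unobserved value, so one Bernoulli draw per cell is enough.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mask = rng.random(X.shape) < missing_rate   # True where a value is removed
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask

# Toy complete dataset: 2000 samples, 10 features.
X = np.arange(20000, dtype=float).reshape(2000, 10)
observed_rates = {}
for rate in (0.10, 0.25, 0.50):
    X_missing, mask = inject_mcar(X, rate)
    observed_rates[rate] = mask.mean()
    print(f"target={rate:.2f} observed={observed_rates[rate]:.3f}")
```

    The realised fraction of missing cells fluctuates around the target rate; a fixed `seed` keeps the mask reproducible across imputers so that they are compared on the same incomplete data.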
    Table 17.
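    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.902 | 0.905 | 0.922 | 0.914 | 0.913 | 0.919 | 0.885 | 0.915 | 0.89 | 0.897 | 0.906 | 0.906
    Australian-NOFS | 0.916 | 0.911 | 0.917 | 0.915 | 0.911 | 0.904 | 0.886 | 0.896 | 0.892 | 0.896 | 0.915 | 0.91
    Australian-Overall | 0.902 | 0.911 | 0.922 | 0.914 | 0.909 | 0.908 | 0.885 | 0.896 | 0.892 | 0.896 | 0.906 | 0.906
    boston-FS | 0.933 | 0.933 | 0.926 | 0.926 | 0.924 | 0.924 | 0.912 | 0.917 | 0.89 | 0.89 | 0.926 | 0.926
    boston-NOFS | 0.931 | 0.928 | 0.94 | 0.935 | 0.927 | 0.931 | 0.933 | 0.908 | 0.913 | 0.921 | 0.928 | 0.926
    boston-Overall | 0.933 | 0.933 | 0.938 | 0.941 | 0.922 | 0.921 | 0.912 | 0.908 | 0.913 | 0.921 | 0.926 | 0.926
    churn-FS | 0.914 | 0.914 | 0.92 | 0.916 | 0.907 | 0.916 | 0.887 | 0.883 | 0.884 | 0.885 | 0.906 | 0.906
    churn-NOFS | 0.909 | 0.912 | 0.914 | 0.916 | 0.91 | 0.916 | 0.904 | 0.91 | 0.886 | 0.886 | 0.906 | 0.909
    churn-Overall | 0.909 | 0.912 | 0.917 | 0.916 | 0.908 | 0.919 | 0.904 | 0.91 | 0.884 | 0.886 | 0.906 | 0.909
    compas-two-years-FS | 0.704 | 0.704 | 0.698 | 0.692 | 0.699 | 0.705 | 0.693 | 0.693 | 0.688 | 0.688 | 0.697 | 0.697
    compas-two-years-NOFS | 0.702 | 0.698 | 0.703 | 0.701 | 0.701 | 0.698 | 0.702 | 0.702 | 0.703 | 0.699 | 0.692 | 0.692
    compas-two-years-Overall | 0.702 | 0.704 | 0.703 | 0.696 | 0.7 | 0.699 | 0.693 | 0.702 | 0.688 | 0.688 | 0.692 | 0.697
    image-FS | 0.87 | 0.845 | 0.883 | 0.875 | 0.85 | 0.85 | 0.845 | 0.849 | 0.877 | 0.878
    image-NOFS | 0.885 | 0.884 | 0.881 | 0.878 | 0.877 | 0.877 | 0.863 | 0.859 | 0.892 | 0.887
    image-Overall | 0.885 | 0.884 | 0.882 | 0.883 | 0.877 | 0.877 | 0.863 | 0.859 | 0.892 | 0.887
    page-blocks-FS | 0.99 | 0.989 | 0.99 | 0.989 | 0.988 | 0.99 | 0.983 | 0.985 | 0.969 | 0.969 | 0.988 | 0.988
    page-blocks-NOFS | 0.988 | 0.988 | 0.99 | 0.99 | 0.989 | 0.988 | 0.983 | 0.982 | 0.973 | 0.974 | 0.987 | 0.987
    page-blocks-Overall | 0.988 | 0.989 | 0.99 | 0.99 | 0.987 | 0.989 | 0.983 | 0.982 | 0.973 | 0.974 | 0.988 | 0.987
    parkinsons-FS | 0.85 | 0.846 | 0.871 | 0.871 | 0.849 | 0.849 | 0.85 | 0.845 | 0.848 | 0.861 | 0.868 | 0.866
    parkinsons-NOFS | 0.896 | 0.896 | 0.935 | 0.923 | 0.896 | 0.895 | 0.898 | 0.892 | 0.899 | 0.907 | 0.919 | 0.914
    parkinsons-Overall | 0.896 | 0.896 | 0.932 | 0.925 | 0.907 | 0.894 | 0.898 | 0.892 | 0.899 | 0.907 | 0.919 | 0.914
    segment-FS | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 0.999 | 0.996 | 0.997 | 0.978 | 0.977 | 0.999 | 0.999
    segment-NOFS | 0.999 | 1.0 | 1.0 | 1.0 | 0.999 | 1.0 | 0.998 | 0.999 | 0.985 | 0.986 | 1.0 | 0.999
    segment-Overall | 0.999 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.998 | 0.999 | 0.985 | 0.986 | 1.0 | 0.999
    stock-FS | 0.99 | 0.99 | 0.993 | 0.993 | 0.984 | 0.987 | 0.975 | 0.975 | 0.954 | 0.954 | 0.99 | 0.99
    stock-NOFS | 0.989 | 0.99 | 0.992 | 0.993 | 0.99 | 0.989 | 0.979 | 0.98 | 0.969 | 0.97 | 0.991 | 0.991
    stock-Overall | 0.99 | 0.99 | 0.993 | 0.993 | 0.986 | 0.986 | 0.979 | 0.98 | 0.969 | 0.97 | 0.99 | 0.99
    zoo-FS | 0.993 | 0.993 | 0.895 | 0.895 | 0.929 | 0.929 | 0.986 | 0.986 | 0.898 | 0.895 | 0.995 | 0.995
    zoo-NOFS | 0.979 | 0.979 | 0.992 | 0.998 | 1.0 | 1.0 | 0.973 | 0.989 | 0.897 | 0.994 | 0.903 | 1.0
    zoo-Overall | 0.979 | 0.993 | 0.994 | 1.0 | 0.929 | 0.989 | 0.986 | 0.989 | 0.897 | 0.994 | 0.903 | 1.0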
    Table 17. MCAR Results at 10% Missingness for the AUC Metric
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 18.
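    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.876 | 0.877 | 0.896 | 0.882 | 0.872 | 0.88 | 0.845 | 0.864 | 0.831 | 0.832 | 0.887 | 0.887
    Australian-NOFS | 0.879 | 0.884 | 0.898 | 0.879 | 0.881 | 0.895 | 0.865 | 0.875 | 0.842 | 0.824 | 0.882 | 0.881
    Australian-Overall | 0.876 | 0.877 | 0.891 | 0.883 | 0.893 | 0.884 | 0.865 | 0.864 | 0.842 | 0.832 | 0.887 | 0.887
    boston-FS | 0.925 | 0.917 | 0.916 | 0.913 | 0.912 | 0.917 | 0.878 | 0.878 | 0.911 | 0.911 | 0.907 | 0.907
    boston-NOFS | 0.919 | 0.906 | 0.927 | 0.917 | 0.916 | 0.907 | 0.888 | 0.887 | 0.919 | 0.909 | 0.906 | 0.903
    boston-Overall | 0.925 | 0.917 | 0.917 | 0.914 | 0.926 | 0.923 | 0.878 | 0.878 | 0.919 | 0.909 | 0.907 | 0.907
    churn-FS | 0.866 | 0.87 | 0.874 | 0.871 | 0.866 | 0.874 | 0.834 | 0.819 | 0.85 | 0.849 | 0.862 | 0.86
    churn-NOFS | 0.864 | 0.87 | 0.867 | 0.865 | 0.868 | 0.869 | 0.847 | 0.851 | 0.856 | 0.859 | 0.866 | 0.866
    churn-Overall | 0.866 | 0.87 | 0.872 | 0.876 | 0.862 | 0.867 | 0.847 | 0.851 | 0.856 | 0.859 | 0.866 | 0.866
    compas-two-years-FS | 0.683 | 0.676 | 0.69 | 0.691 | 0.687 | 0.656 | 0.642 | 0.645 | 0.678 | 0.678 | 0.67 | 0.675
    compas-two-years-NOFS | 0.685 | 0.685 | 0.69 | 0.684 | 0.68 | 0.673 | 0.664 | 0.663 | 0.682 | 0.681 | 0.677 | 0.674
    compas-two-years-Overall | 0.685 | 0.685 | 0.69 | 0.69 | 0.674 | 0.678 | 0.664 | 0.663 | 0.682 | 0.681 | 0.67 | 0.675
    image-FS | 0.801 | 0.801 | 0.839 | 0.845 | 0.817 | 0.817 | 0.851 | 0.845 | 0.878 | 0.878
    image-NOFS | 0.868 | 0.866 | 0.862 | 0.864 | 0.837 | 0.844 | 0.864 | 0.856 | 0.88 | 0.882
    image-Overall | 0.868 | 0.866 | 0.862 | 0.866 | 0.837 | 0.844 | 0.864 | 0.856 | 0.88 | 0.882
    page-blocks-FS | 0.983 | 0.982 | 0.986 | 0.985 | 0.982 | 0.981 | 0.966 | 0.964 | 0.912 | 0.912 | 0.983 | 0.979
    page-blocks-NOFS | 0.982 | 0.983 | 0.983 | 0.985 | 0.983 | 0.983 | 0.966 | 0.968 | 0.961 | 0.966 | 0.984 | 0.982
    page-blocks-Overall | 0.983 | 0.982 | 0.983 | 0.986 | 0.983 | 0.983 | 0.966 | 0.968 | 0.961 | 0.966 | 0.984 | 0.982
    parkinsons-FS | 0.793 | 0.832 | 0.847 | 0.85 | 0.806 | 0.838 | 0.78 | 0.808 | 0.833 | 0.833 | 0.849 | 0.933
    parkinsons-NOFS | 0.893 | 0.904 | 0.915 | 0.919 | 0.891 | 0.899 | 0.86 | 0.873 | 0.837 | 0.887 | 0.918 | 0.93
    parkinsons-Overall | 0.893 | 0.904 | 0.919 | 0.914 | 0.901 | 0.882 | 0.86 | 0.873 | 0.833 | 0.833 | 0.918 | 0.933
    segment-FS | 0.999 | 0.999 | 0.999 | 1.0 | 0.999 | 0.999 | 0.978 | 0.985 | 0.94 | 0.941 | 0.998 | 0.998
    segment-NOFS | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 0.998 | 0.994 | 0.99 | 0.962 | 0.967 | 0.998 | 0.999
    segment-Overall | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 0.999 | 0.994 | 0.99 | 0.962 | 0.967 | 0.998 | 0.999
    stock-FS | 0.976 | 0.976 | 0.988 | 0.99 | 0.979 | 0.98 | 0.937 | 0.937 | 0.93 | 0.93 | 0.983 | 0.981
    stock-NOFS | 0.983 | 0.981 | 0.99 | 0.992 | 0.981 | 0.982 | 0.958 | 0.959 | 0.938 | 0.939 | 0.983 | 0.981
    stock-Overall | 0.983 | 0.981 | 0.991 | 0.99 | 0.977 | 0.98 | 0.958 | 0.959 | 0.938 | 0.939 | 0.983 | 0.981
    zoo-FS | 0.973 | 0.974 | 1.0 | 1.0 | 0.973 | 0.997 | 0.998 | 0.998 | 1.0 | 1.0 | 0.99 | 0.994
    zoo-NOFS | 0.998 | 1.0 | 1.0 | 1.0 | 0.99 | 1.0 | 1.0 | 0.989 | 0.938 | 1.0 | 1.0 | 0.995
    zoo-Overall | 0.973 | 1.0 | 1.0 | 1.0 | 0.998 | 1.0 | 0.998 | 0.989 | 1.0 | 1.0 | 1.0 | 0.995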
    Table 18. MCAR Results at 25% Missingness for the AUC Metric
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 19.
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.864 | 0.864 | 0.845 | 0.855 | 0.846 | 0.856 | 0.806 | 0.825 | 0.856 | 0.827 | 0.865 | 0.868
    Australian-NOFS | 0.843 | 0.859 | 0.851 | 0.858 | 0.852 | 0.836 | 0.792 | 0.821 | 0.853 | 0.838 | 0.86 | 0.866
    Australian-Overall | 0.843 | 0.859 | 0.852 | 0.852 | 0.854 | 0.865 | 0.806 | 0.825 | 0.853 | 0.838 | 0.865 | 0.868
    boston-FS | 0.863 | 0.863 | 0.893 | 0.868 | 0.839 | 0.845 | 0.754 | 0.754 | 0.862 | 0.836 | 0.854 | 0.851
    boston-NOFS | 0.853 | 0.831 | 0.885 | 0.884 | 0.848 | 0.851 | 0.765 | 0.761 | 0.86 | 0.808 | 0.839 | 0.842
    boston-Overall | 0.853 | 0.863 | 0.884 | 0.881 | 0.853 | 0.844 | 0.754 | 0.761 | 0.86 | 0.836 | 0.854 | 0.842
    churn-FS | 0.785 | 0.785 | 0.777 | 0.787 | 0.76 | 0.79 | 0.733 | 0.739 | 0.768 | 0.762 | 0.777 | 0.777
    churn-NOFS | 0.789 | 0.79 | 0.785 | 0.789 | 0.787 | 0.787 | 0.772 | 0.781 | 0.776 | 0.784 | 0.784 | 0.785
    churn-Overall | 0.789 | 0.79 | 0.79 | 0.789 | 0.789 | 0.799 | 0.772 | 0.781 | 0.776 | 0.762 | 0.784 | 0.785
    compas-two-years-FS | 0.626 | 0.626 | 0.647 | 0.639 | 0.635 | 0.642 | 0.594 | 0.604 | 0.632 | 0.632 | 0.633 | 0.633
    compas-two-years-NOFS | 0.648 | 0.646 | 0.642 | 0.639 | 0.646 | 0.639 | 0.611 | 0.614 | 0.638 | 0.639 | 0.642 | 0.643
    compas-two-years-Overall | 0.626 | 0.646 | 0.651 | 0.647 | 0.649 | 0.646 | 0.594 | 0.614 | 0.632 | 0.639 | 0.633 | 0.643
    image-FS | 0.778 | 0.754 | 0.781 | 0.754 | 0.681 | 0.671 | 0.797 | 0.817 | 0.823 | 0.823
    image-NOFS | 0.826 | 0.82 | 0.841 | 0.824 | 0.735 | 0.75 | 0.845 | 0.838 | 0.843 | 0.847
    image-Overall | 0.826 | 0.82 | 0.82 | 0.833 | 0.735 | 0.75 | 0.845 | 0.838 | 0.843 | 0.847
    page-blocks-FS | 0.943 | 0.934 | 0.963 | 0.963 | 0.952 | 0.956 | 0.896 | 0.879 | 0.791 | 0.791 | 0.955 | 0.934
    page-blocks-NOFS | 0.96 | 0.956 | 0.969 | 0.967 | 0.96 | 0.958 | 0.899 | 0.91 | 0.906 | 0.917 | 0.957 | 0.958
    page-blocks-Overall | 0.96 | 0.956 | 0.967 | 0.968 | 0.959 | 0.955 | 0.899 | 0.91 | 0.906 | 0.917 | 0.957 | 0.958
    parkinsons-FS | 0.731 | 0.704 | 0.817 | 0.842 | 0.747 | 0.693 | 0.7 | 0.652 | 0.812 | 0.795 | 0.816 | 0.809
    parkinsons-NOFS | 0.842 | 0.819 | 0.839 | 0.869 | 0.84 | 0.843 | 0.753 | 0.677 | 0.821 | 0.814 | 0.847 | 0.822
    parkinsons-Overall | 0.842 | 0.819 | 0.82 | 0.84 | 0.837 | 0.803 | 0.753 | 0.652 | 0.812 | 0.795 | 0.847 | 0.809
    segment-FS | 0.99 | 0.992 | 0.997 | 0.986 | 0.986 | 0.992 | 0.864 | 0.909 | 0.852 | 0.86 | 0.992 | 0.991
    segment-NOFS | 0.995 | 0.995 | 0.995 | 0.998 | 0.992 | 0.994 | 0.923 | 0.93 | 0.912 | 0.914 | 0.993 | 0.992
    segment-Overall | 0.995 | 0.995 | 0.997 | 0.997 | 0.994 | 0.991 | 0.923 | 0.93 | 0.912 | 0.914 | 0.993 | 0.992
    stock-FS | 0.903 | 0.903 | 0.972 | 0.949 | 0.949 | 0.918 | 0.784 | 0.784 | 0.8 | 0.8 | 0.935 | 0.923
    stock-NOFS | 0.948 | 0.944 | 0.965 | 0.971 | 0.92 | 0.938 | 0.809 | 0.822 | 0.858 | 0.842 | 0.935 | 0.95
    stock-Overall | 0.948 | 0.944 | 0.965 | 0.971 | 0.933 | 0.932 | 0.809 | 0.822 | 0.858 | 0.842 | 0.935 | 0.95
    zoo-FS | 0.965 | 0.918 | 0.925 | 0.81 | 0.86 | 0.901 | 0.921 | 0.821 | 0.833 | 0.818 | 0.894 | 0.699
    zoo-NOFS | 0.97 | 0.848 | 0.938 | 0.946 | 0.902 | 0.902 | 0.841 | 0.844 | 0.821 | 0.819 | 0.884 | 0.829
    zoo-Overall | 0.97 | 0.848 | 0.93 | 0.916 | 0.854 | 0.925 | 0.841 | 0.844 | 0.833 | 0.818 | 0.884 | 0.829
    Table 19. MCAR Results at 50% Missingness for the AUC Metric
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 20.
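    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.902 | 0.905 | 0.922 | 0.914 | 0.913 | 0.919 | 0.885 | 0.915 | 0.89 | 0.897 | 0.906 | 0.906
    Australian-NOFS | 0.916 | 0.911 | 0.917 | 0.915 | 0.911 | 0.904 | 0.886 | 0.896 | 0.892 | 0.896 | 0.915 | 0.91
    Australian-Overall | 0.902 | 0.911 | 0.922 | 0.914 | 0.909 | 0.908 | 0.885 | 0.896 | 0.892 | 0.896 | 0.906 | 0.906
    boston-FS | 0.933 | 0.933 | 0.926 | 0.926 | 0.924 | 0.924 | 0.912 | 0.917 | 0.89 | 0.89 | 0.926 | 0.926
    boston-NOFS | 0.931 | 0.928 | 0.94 | 0.935 | 0.927 | 0.931 | 0.933 | 0.908 | 0.913 | 0.921 | 0.928 | 0.926
    boston-Overall | 0.933 | 0.933 | 0.938 | 0.941 | 0.922 | 0.921 | 0.912 | 0.908 | 0.913 | 0.921 | 0.926 | 0.926
    churn-FS | 0.914 | 0.914 | 0.92 | 0.916 | 0.907 | 0.916 | 0.887 | 0.883 | 0.884 | 0.885 | 0.906 | 0.906
    churn-NOFS | 0.909 | 0.912 | 0.914 | 0.916 | 0.91 | 0.916 | 0.904 | 0.91 | 0.886 | 0.886 | 0.906 | 0.909
    churn-Overall | 0.909 | 0.912 | 0.917 | 0.916 | 0.908 | 0.919 | 0.904 | 0.91 | 0.884 | 0.886 | 0.906 | 0.909
    compas-two-years-FS | 0.704 | 0.704 | 0.698 | 0.692 | 0.699 | 0.705 | 0.693 | 0.693 | 0.688 | 0.688 | 0.697 | 0.697
    compas-two-years-NOFS | 0.702 | 0.698 | 0.703 | 0.701 | 0.701 | 0.698 | 0.702 | 0.702 | 0.703 | 0.699 | 0.692 | 0.692
    compas-two-years-Overall | 0.702 | 0.704 | 0.703 | 0.696 | 0.7 | 0.699 | 0.693 | 0.702 | 0.688 | 0.688 | 0.692 | 0.697
    image-FS | 0.87 | 0.845 | 0.883 | 0.875 | 0.85 | 0.85 | 0.845 | 0.849 | 0.877 | 0.878
    image-NOFS | 0.885 | 0.884 | 0.881 | 0.878 | 0.877 | 0.877 | 0.863 | 0.859 | 0.892 | 0.887
    image-Overall | 0.885 | 0.884 | 0.882 | 0.883 | 0.877 | 0.877 | 0.863 | 0.859 | 0.892 | 0.887
    page-blocks-FS | 0.99 | 0.989 | 0.99 | 0.989 | 0.988 | 0.99 | 0.983 | 0.985 | 0.969 | 0.969 | 0.988 | 0.988
    page-blocks-NOFS | 0.988 | 0.988 | 0.99 | 0.99 | 0.989 | 0.988 | 0.983 | 0.982 | 0.973 | 0.974 | 0.987 | 0.987
    page-blocks-Overall | 0.988 | 0.989 | 0.99 | 0.99 | 0.987 | 0.989 | 0.983 | 0.982 | 0.973 | 0.974 | 0.988 | 0.987
    parkinsons-FS | 0.85 | 0.846 | 0.871 | 0.871 | 0.849 | 0.849 | 0.85 | 0.845 | 0.848 | 0.861 | 0.868 | 0.866
    parkinsons-NOFS | 0.896 | 0.896 | 0.935 | 0.923 | 0.896 | 0.895 | 0.898 | 0.892 | 0.899 | 0.907 | 0.919 | 0.914
    parkinsons-Overall | 0.896 | 0.896 | 0.932 | 0.925 | 0.907 | 0.894 | 0.898 | 0.892 | 0.899 | 0.907 | 0.919 | 0.914
    segment-FS | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 0.999 | 0.996 | 0.997 | 0.978 | 0.977 | 0.999 | 0.999
    segment-NOFS | 0.999 | 1.0 | 1.0 | 1.0 | 0.999 | 1.0 | 0.998 | 0.999 | 0.985 | 0.986 | 1.0 | 0.999
    segment-Overall | 0.999 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.998 | 0.999 | 0.985 | 0.986 | 1.0 | 0.999
    stock-FS | 0.99 | 0.99 | 0.993 | 0.993 | 0.984 | 0.987 | 0.975 | 0.975 | 0.954 | 0.954 | 0.99 | 0.99
    stock-NOFS | 0.989 | 0.99 | 0.992 | 0.993 | 0.99 | 0.989 | 0.979 | 0.98 | 0.969 | 0.97 | 0.991 | 0.991
    stock-Overall | 0.99 | 0.99 | 0.993 | 0.993 | 0.986 | 0.986 | 0.979 | 0.98 | 0.969 | 0.97 | 0.99 | 0.99
    zoo-FS | 0.993 | 0.993 | 0.895 | 0.895 | 0.929 | 0.929 | 0.986 | 0.986 | 0.898 | 0.895 | 0.995 | 0.995
    zoo-NOFS | 0.979 | 0.979 | 0.992 | 0.998 | 1.0 | 1.0 | 0.973 | 0.989 | 0.897 | 0.994 | 0.903 | 1.0
    zoo-Overall | 0.979 | 0.993 | 0.994 | 1.0 | 0.929 | 0.989 | 0.986 | 0.989 | 0.897 | 0.994 | 0.903 | 1.0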
    Table 20. MCAR Results at 10% Missingness for the F1 Metric
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 21.
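    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.876 | 0.877 | 0.896 | 0.882 | 0.872 | 0.88 | 0.845 | 0.864 | 0.831 | 0.832 | 0.887 | 0.887
    Australian-NOFS | 0.879 | 0.884 | 0.898 | 0.879 | 0.881 | 0.895 | 0.865 | 0.875 | 0.842 | 0.824 | 0.882 | 0.881
    Australian-Overall | 0.876 | 0.877 | 0.891 | 0.883 | 0.893 | 0.884 | 0.865 | 0.864 | 0.842 | 0.832 | 0.887 | 0.887
    boston-FS | 0.925 | 0.917 | 0.916 | 0.913 | 0.912 | 0.917 | 0.878 | 0.878 | 0.911 | 0.911 | 0.907 | 0.907
    boston-NOFS | 0.919 | 0.906 | 0.927 | 0.917 | 0.916 | 0.907 | 0.888 | 0.887 | 0.919 | 0.909 | 0.906 | 0.903
    boston-Overall | 0.925 | 0.917 | 0.917 | 0.914 | 0.926 | 0.923 | 0.878 | 0.878 | 0.919 | 0.909 | 0.907 | 0.907
    churn-FS | 0.866 | 0.87 | 0.874 | 0.871 | 0.866 | 0.874 | 0.834 | 0.819 | 0.85 | 0.849 | 0.862 | 0.86
    churn-NOFS | 0.864 | 0.87 | 0.867 | 0.865 | 0.868 | 0.869 | 0.847 | 0.851 | 0.856 | 0.859 | 0.866 | 0.866
    churn-Overall | 0.866 | 0.87 | 0.872 | 0.876 | 0.862 | 0.867 | 0.847 | 0.851 | 0.856 | 0.859 | 0.866 | 0.866
    compas-two-years-FS | 0.683 | 0.676 | 0.69 | 0.691 | 0.687 | 0.656 | 0.642 | 0.645 | 0.678 | 0.678 | 0.67 | 0.675
    compas-two-years-NOFS | 0.685 | 0.685 | 0.69 | 0.684 | 0.68 | 0.673 | 0.664 | 0.663 | 0.682 | 0.681 | 0.677 | 0.674
    compas-two-years-Overall | 0.685 | 0.685 | 0.69 | 0.69 | 0.674 | 0.678 | 0.664 | 0.663 | 0.682 | 0.681 | 0.67 | 0.675
    image-FS | 0.801 | 0.801 | 0.839 | 0.845 | 0.817 | 0.817 | 0.851 | 0.845 | 0.878 | 0.878
    image-NOFS | 0.868 | 0.866 | 0.862 | 0.864 | 0.837 | 0.844 | 0.864 | 0.856 | 0.88 | 0.882
    image-Overall | 0.868 | 0.866 | 0.862 | 0.866 | 0.837 | 0.844 | 0.864 | 0.856 | 0.88 | 0.882
    page-blocks-FS | 0.983 | 0.982 | 0.986 | 0.985 | 0.982 | 0.981 | 0.966 | 0.964 | 0.912 | 0.912 | 0.983 | 0.979
    page-blocks-NOFS | 0.982 | 0.983 | 0.983 | 0.985 | 0.983 | 0.983 | 0.966 | 0.968 | 0.961 | 0.966 | 0.984 | 0.982
    page-blocks-Overall | 0.983 | 0.982 | 0.983 | 0.986 | 0.983 | 0.983 | 0.966 | 0.968 | 0.961 | 0.966 | 0.984 | 0.982
    parkinsons-FS | 0.793 | 0.832 | 0.847 | 0.85 | 0.806 | 0.838 | 0.78 | 0.808 | 0.833 | 0.833 | 0.849 | 0.933
    parkinsons-NOFS | 0.893 | 0.904 | 0.915 | 0.919 | 0.891 | 0.899 | 0.86 | 0.873 | 0.837 | 0.887 | 0.918 | 0.93
    parkinsons-Overall | 0.893 | 0.904 | 0.919 | 0.914 | 0.901 | 0.882 | 0.86 | 0.873 | 0.833 | 0.833 | 0.918 | 0.933
    segment-FS | 0.999 | 0.999 | 0.999 | 1.0 | 0.999 | 0.999 | 0.978 | 0.985 | 0.94 | 0.941 | 0.998 | 0.998
    segment-NOFS | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 0.998 | 0.994 | 0.99 | 0.962 | 0.967 | 0.998 | 0.999
    segment-Overall | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 0.999 | 0.994 | 0.99 | 0.962 | 0.967 | 0.998 | 0.999
    stock-FS | 0.976 | 0.976 | 0.988 | 0.99 | 0.979 | 0.98 | 0.937 | 0.937 | 0.93 | 0.93 | 0.983 | 0.981
    stock-NOFS | 0.983 | 0.981 | 0.99 | 0.992 | 0.981 | 0.982 | 0.958 | 0.959 | 0.938 | 0.939 | 0.983 | 0.981
    stock-Overall | 0.983 | 0.981 | 0.991 | 0.99 | 0.977 | 0.98 | 0.958 | 0.959 | 0.938 | 0.939 | 0.983 | 0.981
    zoo-FS | 0.973 | 0.974 | 1.0 | 1.0 | 0.973 | 0.997 | 0.998 | 0.998 | 1.0 | 1.0 | 0.99 | 0.994
    zoo-NOFS | 0.998 | 1.0 | 1.0 | 1.0 | 0.99 | 1.0 | 1.0 | 0.989 | 0.938 | 1.0 | 1.0 | 0.995
    zoo-Overall | 0.973 | 1.0 | 1.0 | 1.0 | 0.998 | 1.0 | 0.998 | 0.989 | 1.0 | 1.0 | 1.0 | 0.995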
    Table 21. MCAR Results at 25% Missingness for the F1 Metric
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 22.
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.864 | 0.864 | 0.845 | 0.855 | 0.846 | 0.856 | 0.806 | 0.825 | 0.856 | 0.827 | 0.865 | 0.868
    Australian-NOFS | 0.843 | 0.859 | 0.851 | 0.858 | 0.852 | 0.836 | 0.792 | 0.821 | 0.853 | 0.838 | 0.86 | 0.866
    Australian-Overall | 0.843 | 0.859 | 0.852 | 0.852 | 0.854 | 0.865 | 0.806 | 0.825 | 0.853 | 0.838 | 0.865 | 0.868
    boston-FS | 0.863 | 0.863 | 0.893 | 0.868 | 0.839 | 0.845 | 0.754 | 0.754 | 0.862 | 0.836 | 0.854 | 0.851
    boston-NOFS | 0.853 | 0.831 | 0.885 | 0.884 | 0.848 | 0.851 | 0.765 | 0.761 | 0.86 | 0.808 | 0.839 | 0.842
    boston-Overall | 0.853 | 0.863 | 0.884 | 0.881 | 0.853 | 0.844 | 0.754 | 0.761 | 0.86 | 0.836 | 0.854 | 0.842
    churn-FS | 0.785 | 0.785 | 0.777 | 0.787 | 0.76 | 0.79 | 0.733 | 0.739 | 0.768 | 0.762 | 0.777 | 0.777
    churn-NOFS | 0.789 | 0.79 | 0.785 | 0.789 | 0.787 | 0.787 | 0.772 | 0.781 | 0.776 | 0.784 | 0.784 | 0.785
    churn-Overall | 0.789 | 0.79 | 0.79 | 0.789 | 0.789 | 0.799 | 0.772 | 0.781 | 0.776 | 0.762 | 0.784 | 0.785
    compas-two-years-FS | 0.626 | 0.626 | 0.647 | 0.639 | 0.635 | 0.642 | 0.594 | 0.604 | 0.632 | 0.632 | 0.633 | 0.633
    compas-two-years-NOFS | 0.648 | 0.646 | 0.642 | 0.639 | 0.646 | 0.639 | 0.611 | 0.614 | 0.638 | 0.639 | 0.642 | 0.643
    compas-two-years-Overall | 0.626 | 0.646 | 0.651 | 0.647 | 0.649 | 0.646 | 0.594 | 0.614 | 0.632 | 0.639 | 0.633 | 0.643
    image-FS | 0.778 | 0.754 | 0.781 | 0.754 | 0.681 | 0.671 | 0.797 | 0.817 | 0.823 | 0.823
    image-NOFS | 0.826 | 0.82 | 0.841 | 0.824 | 0.735 | 0.75 | 0.845 | 0.838 | 0.843 | 0.847
    image-Overall | 0.826 | 0.82 | 0.82 | 0.833 | 0.735 | 0.75 | 0.845 | 0.838 | 0.843 | 0.847
    page-blocks-FS | 0.943 | 0.934 | 0.963 | 0.963 | 0.952 | 0.956 | 0.896 | 0.879 | 0.791 | 0.791 | 0.955 | 0.934
    page-blocks-NOFS | 0.96 | 0.956 | 0.969 | 0.967 | 0.96 | 0.958 | 0.899 | 0.91 | 0.906 | 0.917 | 0.957 | 0.958
    page-blocks-Overall | 0.96 | 0.956 | 0.967 | 0.968 | 0.959 | 0.955 | 0.899 | 0.91 | 0.906 | 0.917 | 0.957 | 0.958
    parkinsons-FS | 0.731 | 0.704 | 0.817 | 0.842 | 0.747 | 0.693 | 0.7 | 0.652 | 0.812 | 0.795 | 0.816 | 0.809
    parkinsons-NOFS | 0.842 | 0.819 | 0.839 | 0.869 | 0.84 | 0.843 | 0.753 | 0.677 | 0.821 | 0.814 | 0.847 | 0.822
    parkinsons-Overall | 0.842 | 0.819 | 0.82 | 0.84 | 0.837 | 0.803 | 0.753 | 0.652 | 0.812 | 0.795 | 0.847 | 0.809
    segment-FS | 0.99 | 0.992 | 0.997 | 0.986 | 0.986 | 0.992 | 0.864 | 0.909 | 0.852 | 0.86 | 0.992 | 0.991
    segment-NOFS | 0.995 | 0.995 | 0.995 | 0.998 | 0.992 | 0.994 | 0.923 | 0.93 | 0.912 | 0.914 | 0.993 | 0.992
    segment-Overall | 0.995 | 0.995 | 0.997 | 0.997 | 0.994 | 0.991 | 0.923 | 0.93 | 0.912 | 0.914 | 0.993 | 0.992
    stock-FS | 0.903 | 0.903 | 0.972 | 0.949 | 0.949 | 0.918 | 0.784 | 0.784 | 0.8 | 0.8 | 0.935 | 0.923
    stock-NOFS | 0.948 | 0.944 | 0.965 | 0.971 | 0.92 | 0.938 | 0.809 | 0.822 | 0.858 | 0.842 | 0.935 | 0.95
    stock-Overall | 0.948 | 0.944 | 0.965 | 0.971 | 0.933 | 0.932 | 0.809 | 0.822 | 0.858 | 0.842 | 0.935 | 0.95
    zoo-FS | 0.965 | 0.918 | 0.925 | 0.81 | 0.86 | 0.901 | 0.921 | 0.821 | 0.833 | 0.818 | 0.894 | 0.699
    zoo-NOFS | 0.97 | 0.848 | 0.938 | 0.946 | 0.902 | 0.902 | 0.841 | 0.844 | 0.821 | 0.819 | 0.884 | 0.829
    zoo-Overall | 0.97 | 0.848 | 0.93 | 0.916 | 0.854 | 0.925 | 0.841 | 0.844 | 0.833 | 0.818 | 0.884 | 0.829
    Table 22. MCAR Results at 50% Missingness for the F1 Metric
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 23.
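    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian | 0.846 | 0.855 | 0.878 | 0.858 | 0.861 | 0.841 | 0.841 | 0.841 | 0.829 | 0.838 | 0.855 | 0.855
    Australian-FS | 0.846 | 0.849 | 0.875 | 0.867 | 0.864 | 0.864 | 0.841 | 0.861 | 0.835 | 0.843 | 0.855 | 0.855
    Australian-NOFS | 0.858 | 0.855 | 0.87 | 0.864 | 0.861 | 0.849 | 0.843 | 0.841 | 0.829 | 0.838 | 0.858 | 0.849
    boston | 0.881 | 0.881 | 0.893 | 0.897 | 0.874 | 0.862 | 0.858 | 0.862 | 0.874 | 0.881 | 0.881 | 0.874
    boston-FS | 0.881 | 0.881 | 0.874 | 0.874 | 0.881 | 0.881 | 0.858 | 0.862 | 0.862 | 0.862 | 0.881 | 0.874
    boston-NOFS | 0.881 | 0.881 | 0.885 | 0.893 | 0.877 | 0.874 | 0.881 | 0.862 | 0.874 | 0.881 | 0.889 | 0.874
    churn | 0.936 | 0.936 | 0.944 | 0.942 | 0.943 | 0.943 | 0.928 | 0.924 | 0.924 | 0.922 | 0.931 | 0.932
    churn-FS | 0.936 | 0.936 | 0.944 | 0.943 | 0.936 | 0.943 | 0.923 | 0.926 | 0.924 | 0.922 | 0.934 | 0.94
    churn-NOFS | 0.936 | 0.936 | 0.944 | 0.942 | 0.944 | 0.944 | 0.928 | 0.924 | 0.924 | 0.922 | 0.931 | 0.932
    compas-two-years | 0.665 | 0.663 | 0.658 | 0.655 | 0.658 | 0.653 | 0.648 | 0.662 | 0.65 | 0.65 | 0.655 | 0.653
    compas-two-years-FS | 0.663 | 0.663 | 0.656 | 0.655 | 0.653 | 0.662 | 0.648 | 0.648 | 0.65 | 0.65 | 0.653 | 0.653
    compas-two-years-NOFS | 0.665 | 0.659 | 0.657 | 0.654 | 0.663 | 0.651 | 0.662 | 0.662 | 0.665 | 0.657 | 0.655 | 0.651
    image | 0.863 | 0.869 | 0.865 | 0.866 | 0.862 | 0.866 | 0.858 | 0.861 | 0.873 | 0.867
    image-FS | 0.856 | 0.845 | 0.863 | 0.862 | 0.859 | 0.846 | 0.847 | 0.846 | 0.864 | 0.864
    image-NOFS | 0.863 | 0.869 | 0.859 | 0.872 | 0.862 | 0.866 | 0.858 | 0.861 | 0.873 | 0.867
    page-blocks | 0.972 | 0.972 | 0.974 | 0.973 | 0.97 | 0.972 | 0.966 | 0.967 | 0.961 | 0.961 | 0.973 | 0.973
    page-blocks-FS | 0.971 | 0.972 | 0.974 | 0.974 | 0.973 | 0.972 | 0.966 | 0.968 | 0.959 | 0.959 | 0.973 | 0.971
    page-blocks-NOFS | 0.972 | 0.972 | 0.974 | 0.972 | 0.97 | 0.971 | 0.966 | 0.967 | 0.961 | 0.961 | 0.973 | 0.973
    parkinsons | 0.867 | 0.867 | 0.908 | 0.918 | 0.867 | 0.857 | 0.878 | 0.888 | 0.898 | 0.888 | 0.898 | 0.888
    parkinsons-FS | 0.837 | 0.827 | 0.867 | 0.867 | 0.837 | 0.827 | 0.867 | 0.857 | 0.867 | 0.888 | 0.857 | 0.857
    parkinsons-NOFS | 0.867 | 0.867 | 0.929 | 0.898 | 0.867 | 0.857 | 0.878 | 0.888 | 0.898 | 0.888 | 0.898 | 0.888
    segment | 0.996 | 0.997 | 0.996 | 0.998 | 0.997 | 0.996 | 0.993 | 0.995 | 0.958 | 0.957 | 0.997 | 0.997
    segment-FS | 0.996 | 0.994 | 0.998 | 0.998 | 0.993 | 0.995 | 0.992 | 0.992 | 0.947 | 0.951 | 0.994 | 0.993
    segment-NOFS | 0.996 | 0.997 | 0.995 | 0.996 | 0.995 | 0.997 | 0.993 | 0.995 | 0.958 | 0.957 | 0.997 | 0.997
    stock | 0.956 | 0.956 | 0.96 | 0.96 | 0.937 | 0.943 | 0.922 | 0.924 | 0.901 | 0.905 | 0.956 | 0.96
    stock-FS | 0.956 | 0.956 | 0.96 | 0.956 | 0.933 | 0.947 | 0.924 | 0.924 | 0.872 | 0.872 | 0.956 | 0.96
    stock-NOFS | 0.947 | 0.945 | 0.956 | 0.958 | 0.943 | 0.937 | 0.922 | 0.924 | 0.901 | 0.905 | 0.954 | 0.954
    zoo | 0.961 | 0.961 | 0.961 | 1.0 | 0.941 | 0.941 | 0.941 | 0.941 | 0.902 | 0.961 | 0.863 | 1.0
    zoo-FS | 0.961 | 0.961 | 0.941 | 0.941 | 0.941 | 0.941 | 0.941 | 0.941 | 0.922 | 0.902 | 0.98 | 0.98
    zoo-NOFS | 0.961 | 0.922 | 0.961 | 0.98 | 1.0 | 1.0 | 0.98 | 0.941 | 0.902 | 0.961 | 0.863 | 1.0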
    Table 23. MCAR Results at 10% Missingness for the ACC Metric
    -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.
    Table 24.
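    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian | 0.8 | 0.806 | 0.823 | 0.838 | 0.814 | 0.809 | 0.803 | 0.791 | 0.771 | 0.771 | 0.814 | 0.814
    Australian-FS | 0.8 | 0.806 | 0.832 | 0.826 | 0.809 | 0.809 | 0.797 | 0.791 | 0.774 | 0.771 | 0.814 | 0.814
    Australian-NOFS | 0.803 | 0.806 | 0.835 | 0.806 | 0.817 | 0.832 | 0.803 | 0.812 | 0.771 | 0.762 | 0.809 | 0.806
    boston | 0.877 | 0.866 | 0.87 | 0.862 | 0.866 | 0.87 | 0.814 | 0.814 | 0.874 | 0.862 | 0.846 | 0.846
    boston-FS | 0.877 | 0.866 | 0.862 | 0.85 | 0.858 | 0.866 | 0.814 | 0.814 | 0.858 | 0.858 | 0.846 | 0.846
    boston-NOFS | 0.881 | 0.846 | 0.881 | 0.87 | 0.862 | 0.846 | 0.814 | 0.814 | 0.874 | 0.862 | 0.862 | 0.85
    churn | 0.919 | 0.918 | 0.922 | 0.931 | 0.922 | 0.919 | 0.904 | 0.894 | 0.914 | 0.916 | 0.924 | 0.921
    churn-FS | 0.919 | 0.923 | 0.922 | 0.924 | 0.925 | 0.924 | 0.895 | 0.891 | 0.907 | 0.907 | 0.922 | 0.923
    churn-NOFS | 0.924 | 0.918 | 0.931 | 0.925 | 0.925 | 0.916 | 0.904 | 0.894 | 0.914 | 0.916 | 0.924 | 0.921
    compas-two-years | 0.644 | 0.642 | 0.651 | 0.651 | 0.636 | 0.641 | 0.623 | 0.629 | 0.638 | 0.639 | 0.629 | 0.638
    compas-two-years-FS | 0.639 | 0.637 | 0.654 | 0.653 | 0.645 | 0.619 | 0.61 | 0.609 | 0.635 | 0.635 | 0.629 | 0.638
    compas-two-years-NOFS | 0.644 | 0.642 | 0.653 | 0.642 | 0.643 | 0.637 | 0.623 | 0.629 | 0.638 | 0.639 | 0.638 | 0.631
    image | 0.853 | 0.853 | 0.853 | 0.862 | 0.844 | 0.846 | 0.861 | 0.86 | 0.866 | 0.862
    image-FS | 0.815 | 0.815 | 0.849 | 0.849 | 0.845 | 0.845 | 0.85 | 0.852 | 0.862 | 0.862
    image-NOFS | 0.853 | 0.853 | 0.862 | 0.86 | 0.844 | 0.846 | 0.861 | 0.86 | 0.866 | 0.862
    page-blocks | 0.966 | 0.965 | 0.965 | 0.968 | 0.966 | 0.965 | 0.947 | 0.948 | 0.95 | 0.95 | 0.963 | 0.963
    page-blocks-FS | 0.966 | 0.965 | 0.967 | 0.967 | 0.961 | 0.966 | 0.948 | 0.947 | 0.936 | 0.936 | 0.963 | 0.961
    page-blocks-NOFS | 0.965 | 0.963 | 0.967 | 0.967 | 0.967 | 0.965 | 0.947 | 0.948 | 0.95 | 0.95 | 0.963 | 0.963
    parkinsons | 0.878 | 0.878 | 0.898 | 0.908 | 0.878 | 0.857 | 0.898 | 0.867 | 0.837 | 0.837 | 0.867 | 0.888
    parkinsons-FS | 0.827 | 0.837 | 0.878 | 0.878 | 0.806 | 0.857 | 0.857 | 0.827 | 0.837 | 0.837 | 0.857 | 0.888
    parkinsons-NOFS | 0.878 | 0.878 | 0.908 | 0.898 | 0.888 | 0.888 | 0.898 | 0.867 | 0.867 | 0.857 | 0.867 | 0.898
    segment | 0.988 | 0.991 | 0.997 | 0.997 | 0.992 | 0.993 | 0.976 | 0.973 | 0.922 | 0.927 | 0.988 | 0.99
    segment-FS | 0.988 | 0.991 | 0.997 | 0.997 | 0.99 | 0.989 | 0.958 | 0.965 | 0.897 | 0.896 | 0.985 | 0.988
    segment-NOFS | 0.989 | 0.99 | 0.997 | 0.997 | 0.99 | 0.99 | 0.976 | 0.973 | 0.922 | 0.927 | 0.988 | 0.99
    stock | 0.931 | 0.933 | 0.949 | 0.947 | 0.924 | 0.926 | 0.893 | 0.893 | 0.865 | 0.865 | 0.931 | 0.924
    stock-FS | 0.924 | 0.924 | 0.952 | 0.952 | 0.918 | 0.924 | 0.859 | 0.859 | 0.853 | 0.853 | 0.931 | 0.933
    stock-NOFS | 0.931 | 0.933 | 0.947 | 0.952 | 0.92 | 0.931 | 0.893 | 0.893 | 0.865 | 0.865 | 0.931 | 0.924
    zoo | 0.961 | 1.0 | 1.0 | 1.0 | 0.98 | 1.0 | 0.98 | 0.98 | 1.0 | 1.0 | 1.0 | 0.98
    zoo-FS | 0.961 | 0.961 | 1.0 | 1.0 | 0.98 | 0.98 | 0.98 | 0.98 | 1.0 | 1.0 | 0.98 | 0.98
    zoo-NOFS | 0.98 | 1.0 | 1.0 | 1.0 | 0.98 | 1.0 | 1.0 | 0.98 | 0.902 | 1.0 | 1.0 | 0.98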
    Table 24. MCAR Results at 25% Missingness for the ACC Metric
    -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.
    Table 25.
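    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian | 0.803 | 0.806 | 0.826 | 0.817 | 0.78 | 0.812 | 0.73 | 0.777 | 0.8 | 0.768 | 0.8 | 0.817
    Australian-FS | 0.8 | 0.8 | 0.797 | 0.806 | 0.783 | 0.783 | 0.73 | 0.777 | 0.803 | 0.771 | 0.8 | 0.817
    Australian-NOFS | 0.803 | 0.806 | 0.809 | 0.814 | 0.768 | 0.78 | 0.736 | 0.774 | 0.8 | 0.768 | 0.803 | 0.809
    boston | 0.802 | 0.81 | 0.834 | 0.838 | 0.806 | 0.798 | 0.731 | 0.719 | 0.822 | 0.81 | 0.783 | 0.791
    boston-FS | 0.81 | 0.81 | 0.858 | 0.814 | 0.791 | 0.787 | 0.731 | 0.731 | 0.838 | 0.81 | 0.783 | 0.787
    boston-NOFS | 0.802 | 0.787 | 0.85 | 0.834 | 0.818 | 0.818 | 0.711 | 0.719 | 0.822 | 0.779 | 0.779 | 0.791
    churn | 0.897 | 0.887 | 0.898 | 0.895 | 0.892 | 0.889 | 0.872 | 0.872 | 0.888 | 0.89 | 0.893 | 0.887
    churn-FS | 0.892 | 0.892 | 0.899 | 0.894 | 0.887 | 0.89 | 0.863 | 0.866 | 0.89 | 0.89 | 0.89 | 0.89
    churn-NOFS | 0.897 | 0.887 | 0.904 | 0.895 | 0.893 | 0.892 | 0.872 | 0.872 | 0.888 | 0.885 | 0.893 | 0.887
    compas-two-years | 0.593 | 0.612 | 0.615 | 0.617 | 0.614 | 0.609 | 0.581 | 0.589 | 0.606 | 0.605 | 0.603 | 0.606
    compas-two-years-FS | 0.593 | 0.593 | 0.612 | 0.613 | 0.603 | 0.612 | 0.581 | 0.579 | 0.606 | 0.606 | 0.603 | 0.603
    compas-two-years-NOFS | 0.613 | 0.612 | 0.61 | 0.61 | 0.615 | 0.609 | 0.584 | 0.589 | 0.608 | 0.605 | 0.612 | 0.606
    image | 0.846 | 0.838 | 0.837 | 0.854 | 0.807 | 0.809 | 0.855 | 0.854 | 0.85 | 0.853
    image-FS | 0.817 | 0.812 | 0.826 | 0.812 | 0.807 | 0.804 | 0.828 | 0.848 | 0.844 | 0.844
    image-NOFS | 0.846 | 0.838 | 0.851 | 0.843 | 0.807 | 0.809 | 0.855 | 0.854 | 0.85 | 0.853
    page-blocks | 0.942 | 0.941 | 0.953 | 0.953 | 0.945 | 0.944 | 0.911 | 0.92 | 0.933 | 0.936 | 0.941 | 0.946
    page-blocks-FS | 0.94 | 0.945 | 0.955 | 0.954 | 0.948 | 0.941 | 0.91 | 0.912 | 0.923 | 0.923 | 0.942 | 0.942
    page-blocks-NOFS | 0.942 | 0.941 | 0.952 | 0.953 | 0.943 | 0.946 | 0.911 | 0.92 | 0.933 | 0.936 | 0.941 | 0.946
    parkinsons | 0.796 | 0.786 | 0.878 | 0.806 | 0.806 | 0.776 | 0.776 | 0.765 | 0.847 | 0.816 | 0.816 | 0.816
    parkinsons-FS | 0.786 | 0.776 | 0.857 | 0.806 | 0.837 | 0.776 | 0.786 | 0.765 | 0.847 | 0.816 | 0.827 | 0.816
    parkinsons-NOFS | 0.796 | 0.786 | 0.847 | 0.847 | 0.827 | 0.827 | 0.776 | 0.765 | 0.857 | 0.816 | 0.816 | 0.806
    segment | 0.975 | 0.974 | 0.985 | 0.984 | 0.976 | 0.971 | 0.912 | 0.919 | 0.872 | 0.878 | 0.967 | 0.966
    segment-FS | 0.972 | 0.97 | 0.987 | 0.983 | 0.965 | 0.969 | 0.89 | 0.902 | 0.864 | 0.858 | 0.964 | 0.965
    segment-NOFS | 0.975 | 0.974 | 0.986 | 0.987 | 0.97 | 0.973 | 0.912 | 0.919 | 0.872 | 0.878 | 0.967 | 0.966
    stock | 0.88 | 0.88 | 0.916 | 0.92 | 0.848 | 0.859 | 0.731 | 0.745 | 0.766 | 0.787 | 0.851 | 0.874
    stock-FS | 0.823 | 0.823 | 0.916 | 0.886 | 0.865 | 0.842 | 0.718 | 0.722 | 0.722 | 0.722 | 0.851 | 0.844
    stock-NOFS | 0.88 | 0.88 | 0.909 | 0.918 | 0.842 | 0.851 | 0.731 | 0.745 | 0.766 | 0.787 | 0.851 | 0.874
    zoo | 0.961 | 0.882 | 0.902 | 0.882 | 0.784 | 0.882 | 0.824 | 0.863 | 0.784 | 0.784 | 0.863 | 0.863
    zoo-FS | 0.941 | 0.882 | 0.922 | 0.824 | 0.843 | 0.863 | 0.902 | 0.765 | 0.784 | 0.784 | 0.882 | 0.725
    zoo-NOFS | 0.961 | 0.882 | 0.922 | 0.902 | 0.824 | 0.863 | 0.824 | 0.863 | 0.784 | 0.784 | 0.863 | 0.863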
    Table 25. MCAR Results at 50% Missingness for the ACC Metric
    -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.

    C.6 MAR: Downstream Task Results

    This section presents the downstream-task results for data with values missing at random (MAR), at three levels of missingness and for three metrics. Tables 26, 27, and 28 report the AUC at 10%, 25%, and 50% missingness, respectively. Tables 29, 30, and 31 report the F1-score at the same three levels, and Tables 32, 33, and 34 report the classification accuracy (ACC).
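
    Under MAR, the probability that a value is missing depends only on values that remain observed. The sketch below shows one simple way to simulate this and is not necessarily the procedure behind the tables that follow. It assumes a single fully observed driver column whose rank sets the per-row missingness probability; the function name `inject_mar` and its parameters are illustrative.

```python
import numpy as np

def inject_mar(X, missing_rate, driver_col=0, seed=0):
    """Hypothetical helper: MAR missingness driven by one fully observed column.

    Each row gets a missingness probability proportional to the rank of its
    driver value; the driver column itself is never masked. The scaling
    2 * missing_rate * rank keeps the average probability equal to
    missing_rate, which requires missing_rate <= 0.5.
    """
    if not 0.0 <= missing_rate <= 0.5:
        raise ValueError("this sketch supports missing rates in [0, 0.5]")
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Empirical quantile of the driver value in [0, 1]; its mean is exactly 0.5.
    ranks = np.argsort(np.argsort(X[:, driver_col])) / (n - 1)
    p_missing = 2.0 * missing_rate * ranks
    mask = rng.random((n, d)) < p_missing[:, None]
    mask[:, driver_col] = False             # the driver stays fully observed
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask

# Toy complete dataset: 4000 samples, 5 features.
X = np.random.default_rng(42).normal(size=(4000, 5))
X_missing, mask = inject_mar(X, 0.25)
print(f"missing rate outside the driver column: {mask[:, 1:].mean():.3f}")
```

    Rows with large driver values lose more entries than rows with small ones, which is what separates this mask from an MCAR mask with the same overall rate.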
    Table 26.
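    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.899 | 0.907 | 0.893 | 0.895 | 0.905 | 0.907 | 0.88 | 0.886 | 0.873 | 0.873 | 0.895 | 0.911
    Australian-NOFS | 0.921 | 0.912 | 0.91 | 0.915 | 0.901 | 0.918 | 0.912 | 0.904 | 0.886 | 0.89 | 0.919 | 0.899
    Australian-Overall | 0.899 | 0.907 | 0.907 | 0.915 | 0.918 | 0.912 | 0.912 | 0.904 | 0.886 | 0.89 | 0.895 | 0.911
    boston-FS | 0.94 | 0.94 | 0.926 | 0.926 | 0.928 | 0.918 | 0.918 | 0.918 | 0.926 | 0.926 | 0.933 | 0.94
    boston-NOFS | 0.936 | 0.932 | 0.947 | 0.943 | 0.938 | 0.939 | 0.936 | 0.929 | 0.93 | 0.923 | 0.94 | 0.932
    boston-Overall | 0.94 | 0.94 | 0.926 | 0.921 | 0.942 | 0.922 | 0.918 | 0.918 | 0.93 | 0.923 | 0.933 | 0.94
    churn-FS | 0.893 | 0.897 | 0.899 | 0.896 | 0.896 | 0.897 | 0.89 | 0.881 | 0.871 | 0.873 | 0.896 | 0.896
    churn-NOFS | 0.894 | 0.896 | 0.899 | 0.902 | 0.898 | 0.902 | 0.891 | 0.894 | 0.873 | 0.875 | 0.902 | 0.898
    churn-Overall | 0.893 | 0.896 | 0.899 | 0.897 | 0.898 | 0.9 | 0.891 | 0.894 | 0.873 | 0.875 | 0.896 | 0.896
    compas-two-years-FS | 0.704 | 0.704 | 0.694 | 0.71 | 0.707 | 0.702 | 0.687 | 0.689 | 0.705 | 0.702 | 0.706 | 0.693
    compas-two-years-NOFS | 0.707 | 0.705 | 0.711 | 0.706 | 0.703 | 0.7 | 0.699 | 0.703 | 0.703 | 0.697 | 0.709 | 0.702
    compas-two-years-Overall | 0.704 | 0.704 | 0.711 | 0.71 | 0.7 | 0.699 | 0.699 | 0.703 | 0.703 | 0.697 | 0.706 | 0.702
    image-FS | 0.855 | 0.858 | 0.868 | 0.857 | 0.873 | 0.873 | 0.82 | 0.82 | 0.872 | 0.868
    image-NOFS | 0.878 | 0.879 | 0.879 | 0.888 | 0.875 | 0.875 | 0.862 | 0.865 | 0.882 | 0.885
    image-Overall | 0.878 | 0.879 | 0.875 | 0.879 | 0.875 | 0.875 | 0.862 | 0.865 | 0.882 | 0.885
    page-blocks-FS | 0.987 | 0.988 | 0.989 | 0.988 | 0.988 | 0.988 | 0.985 | 0.986 | 0.976 | 0.954 | 0.987 | 0.987
    page-blocks-NOFS | 0.988 | 0.988 | 0.989 | 0.99 | 0.988 | 0.988 | 0.985 | 0.985 | 0.982 | 0.983 | 0.987 | 0.988
    page-blocks-Overall | 0.988 | 0.988 | 0.988 | 0.988 | 0.987 | 0.987 | 0.985 | 0.985 | 0.982 | 0.983 | 0.987 | 0.988
    parkinsons-FS | 0.873 | 0.873 | 0.87 | 0.871 | 0.875 | 0.876 | 0.874 | 0.874 | 0.889 | 0.801 | 0.87 | 0.87
    parkinsons-NOFS | 0.908 | 0.911 | 0.923 | 0.926 | 0.925 | 0.912 | 0.918 | 0.918 | 0.93 | 0.894 | 0.917 | 0.912
    parkinsons-Overall | 0.873 | 0.873 | 0.936 | 0.87 | 0.905 | 0.873 | 0.918 | 0.918 | 0.93 | 0.894 | 0.87 | 0.87
    segment-FS | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 1.0 | 0.999 | 0.998 | 0.96 | 0.954 | 0.998 | 0.998
    segment-NOFS | 0.999 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.999 | 1.0 | 0.968 | 0.97 | 1.0 | 0.999
    segment-Overall | 0.999 | 1.0 | 1.0 | 1.0 | 0.999 | 1.0 | 0.999 | 1.0 | 0.968 | 0.97 | 0.998 | 0.998
    stock-FS | 0.993 | 0.985 | 0.995 | 0.995 | 0.991 | 0.992 | 0.991 | 0.987 | 0.965 | 0.965 | 0.991 | 0.991
    stock-NOFS | 0.993 | 0.994 | 0.995 | 0.995 | 0.993 | 0.995 | 0.991 | 0.989 | 0.98 | 0.981 | 0.994 | 0.994
    stock-Overall | 0.993 | 0.994 | 0.995 | 0.994 | 0.993 | 0.994 | 0.991 | 0.989 | 0.98 | 0.981 | 0.994 | 0.994
    zoo-FS | 1.0 | 1.0 | 1.0 | 1.0 | 0.824 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.987 | 0.993
    zoo-NOFS | 0.992 | 1.0 | 1.0 | 1.0 | 1.0 | 0.997 | 1.0 | 0.997 | 0.99 | 1.0 | 0.983 | 1.0
    zoo-Overall | 0.992 | 1.0 | 1.0 | 1.0 | 0.997 | 1.0 | 1.0 | 0.997 | 0.99 | 1.0 | 0.983 | 0.993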
    Table 26. MAR Results at 10% Missingness for the AUC Metric
    -FS: denotes the result with enforced feature selection, -NOFS: denotes the results without feature selection, -Overall: denotes the results with and without feature selection.
    Table 27. MAR Results at 25% Missingness for the AUC Metric
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.901 | 0.905 | 0.888 | 0.886 | 0.883 | 0.89 | 0.864 | 0.87 | 0.877 | 0.869 | 0.909 | 0.909
    Australian-NOFS | 0.903 | 0.898 | 0.887 | 0.888 | 0.887 | 0.896 | 0.884 | 0.888 | 0.87 | 0.875 | 0.908 | 0.902
    Australian-Overall | 0.903 | 0.905 | 0.886 | 0.884 | 0.88 | 0.898 | 0.864 | 0.87 | 0.877 | 0.869 | 0.909 | 0.909
    boston-FS | 0.907 | 0.907 | 0.903 | 0.901 | 0.904 | 0.908 | 0.841 | 0.844 | 0.887 | 0.887 | 0.891 | 0.891
    boston-NOFS | 0.894 | 0.891 | 0.903 | 0.905 | 0.9 | 0.905 | 0.858 | 0.852 | 0.895 | 0.887 | 0.901 | 0.89
    boston-Overall | 0.894 | 0.891 | 0.901 | 0.91 | 0.914 | 0.9 | 0.858 | 0.852 | 0.895 | 0.887 | 0.901 | 0.89
    churn-FS | 0.841 | 0.841 | 0.847 | 0.841 | 0.844 | 0.839 | 0.804 | 0.811 | 0.808 | 0.816 | 0.837 | 0.837
    churn-NOFS | 0.833 | 0.836 | 0.846 | 0.842 | 0.843 | 0.831 | 0.84 | 0.827 | 0.817 | 0.814 | 0.832 | 0.843
    churn-Overall | 0.833 | 0.836 | 0.845 | 0.846 | 0.847 | 0.841 | 0.84 | 0.827 | 0.808 | 0.814 | 0.837 | 0.843
    compas-two-years-FS | 0.701 | 0.697 | 0.688 | 0.701 | 0.688 | 0.696 | 0.685 | 0.7 | 0.685 | 0.681 | 0.682 | 0.697
    compas-two-years-NOFS | 0.701 | 0.695 | 0.693 | 0.695 | 0.69 | 0.69 | 0.673 | 0.692 | 0.692 | 0.697 | 0.694 | 0.691
    compas-two-years-Overall | 0.701 | 0.695 | 0.695 | 0.693 | 0.69 | 0.685 | 0.673 | 0.7 | 0.685 | 0.681 | 0.682 | 0.691
    image-FS | 0.806 | 0.806 | 0.798 | 0.84 | 0.835 | 0.839 | 0.826 | 0.823 | 0.871 | 0.871
    image-NOFS | 0.871 | 0.875 | 0.871 | 0.865 | 0.86 | 0.862 | 0.869 | 0.86 | 0.878 | 0.884
    image-Overall | 0.871 | 0.875 | 0.879 | 0.869 | 0.86 | 0.862 | 0.869 | 0.86 | 0.878 | 0.884
    page-blocks-FS | 0.976 | 0.979 | 0.982 | 0.984 | 0.977 | 0.978 | 0.95 | 0.951 | 0.932 | 0.94 | 0.979 | 0.98
    page-blocks-NOFS | 0.976 | 0.98 | 0.982 | 0.984 | 0.975 | 0.978 | 0.95 | 0.956 | 0.866 | 0.947 | 0.978 | 0.978
    page-blocks-Overall | 0.976 | 0.979 | 0.984 | 0.982 | 0.979 | 0.976 | 0.95 | 0.956 | 0.932 | 0.947 | 0.978 | 0.978
    parkinsons-FS | 0.889 | 0.876 | 0.872 | 0.87 | 0.889 | 0.864 | 0.803 | 0.803 | 0.844 | 0.829 | 0.849 | 0.881
    parkinsons-NOFS | 0.904 | 0.892 | 0.909 | 0.918 | 0.898 | 0.892 | 0.883 | 0.871 | 0.886 | 0.883 | 0.905 | 0.905
    parkinsons-Overall | 0.904 | 0.892 | 0.872 | 0.876 | 0.891 | 0.875 | 0.883 | 0.871 | 0.886 | 0.883 | 0.905 | 0.881
    segment-FS | 0.994 | 0.997 | 1.0 | 1.0 | 0.998 | 0.997 | 0.974 | 0.961 | 0.941 | 0.963 | 0.998 | 0.999
    segment-NOFS | 0.999 | 0.999 | 1.0 | 1.0 | 0.997 | 0.998 | 0.989 | 0.995 | 0.972 | 0.975 | 0.999 | 0.999
    segment-Overall | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 0.998 | 0.989 | 0.995 | 0.972 | 0.975 | 0.999 | 0.999
    stock-FS | 0.979 | 0.968 | 0.98 | 0.98 | 0.975 | 0.977 | 0.882 | 0.908 | 0.886 | 0.903 | 0.974 | 0.974
    stock-NOFS | 0.979 | 0.978 | 0.981 | 0.986 | 0.973 | 0.977 | 0.887 | 0.933 | 0.931 | 0.942 | 0.976 | 0.975
    stock-Overall | 0.979 | 0.978 | 0.982 | 0.981 | 0.975 | 0.978 | 0.887 | 0.933 | 0.931 | 0.942 | 0.976 | 0.974
    zoo-FS | 0.991 | 0.984 | 1.0 | 0.99 | 0.986 | 0.967 | 0.973 | 0.989 | 0.96 | 0.99 | 0.923 | 0.925
    zoo-NOFS | 0.99 | 0.989 | 0.973 | 1.0 | 0.998 | 1.0 | 0.997 | 0.998 | 0.986 | 0.997 | 0.952 | 0.99
    zoo-Overall | 0.99 | 0.989 | 0.965 | 1.0 | 1.0 | 1.0 | 0.973 | 0.998 | 0.96 | 0.997 | 0.952 | 0.925
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 28. MAR Results at 50% Missingness for the AUC Metric
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.855 | 0.887 | 0.867 | 0.878 | 0.852 | 0.84 | 0.785 | 0.838 | 0.821 | 0.821 | 0.863 | 0.884
    Australian-NOFS | 0.862 | 0.873 | 0.879 | 0.858 | 0.851 | 0.861 | 0.772 | 0.821 | 0.822 | 0.829 | 0.86 | 0.874
    Australian-Overall | 0.855 | 0.887 | 0.882 | 0.861 | 0.843 | 0.849 | 0.785 | 0.838 | 0.822 | 0.821 | 0.863 | 0.884
    boston-FS | 0.867 | 0.897 | 0.884 | 0.884 | 0.868 | 0.877 | 0.814 | 0.786 | 0.869 | 0.856 | 0.853 | 0.863
    boston-NOFS | 0.887 | 0.897 | 0.894 | 0.899 | 0.885 | 0.875 | 0.818 | 0.832 | 0.866 | 0.864 | 0.849 | 0.867
    boston-Overall | 0.867 | 0.897 | 0.895 | 0.893 | 0.871 | 0.875 | 0.818 | 0.832 | 0.866 | 0.856 | 0.853 | 0.863
    churn-FS | 0.758 | 0.773 | 0.764 | 0.774 | 0.754 | 0.747 | 0.704 | 0.711 | 0.765 | 0.776 | 0.766 | 0.757
    churn-NOFS | 0.755 | 0.766 | 0.775 | 0.793 | 0.756 | 0.752 | 0.72 | 0.737 | 0.773 | 0.764 | 0.759 | 0.76
    churn-Overall | 0.755 | 0.766 | 0.775 | 0.775 | 0.761 | 0.762 | 0.72 | 0.737 | 0.765 | 0.764 | 0.766 | 0.757
    compas-two-years-FS | 0.662 | 0.668 | 0.665 | 0.674 | 0.642 | 0.664 | 0.643 | 0.658 | 0.663 | 0.671 | 0.659 | 0.666
    compas-two-years-NOFS | 0.666 | 0.682 | 0.663 | 0.674 | 0.658 | 0.678 | 0.637 | 0.662 | 0.666 | 0.682 | 0.665 | 0.679
    compas-two-years-Overall | 0.666 | 0.682 | 0.668 | 0.676 | 0.647 | 0.678 | 0.643 | 0.662 | 0.666 | 0.682 | 0.665 | 0.679
    image-FS | 0.771 | 0.76 | 0.742 | 0.779 | 0.721 | 0.735 | 0.812 | 0.812 | 0.805 | 0.81
    image-NOFS | 0.824 | 0.824 | 0.836 | 0.824 | 0.731 | 0.749 | 0.841 | 0.828 | 0.857 | 0.852
    image-Overall | 0.824 | 0.824 | 0.828 | 0.83 | 0.731 | 0.749 | 0.841 | 0.828 | 0.857 | 0.852
    page-blocks-FS | 0.963 | 0.964 | 0.952 | 0.959 | 0.963 | 0.962 | 0.89 | 0.902 | 0.878 | 0.87 | 0.961 | 0.957
    page-blocks-NOFS | 0.963 | 0.962 | 0.959 | 0.96 | 0.96 | 0.966 | 0.896 | 0.911 | 0.898 | 0.918 | 0.961 | 0.958
    page-blocks-Overall | 0.963 | 0.964 | 0.958 | 0.961 | 0.959 | 0.964 | 0.896 | 0.911 | 0.898 | 0.918 | 0.961 | 0.957
    parkinsons-FS | 0.809 | 0.855 | 0.861 | 0.88 | 0.839 | 0.813 | 0.8 | 0.801 | 0.865 | 0.884 | 0.803 | 0.85
    parkinsons-NOFS | 0.875 | 0.847 | 0.883 | 0.876 | 0.855 | 0.865 | 0.778 | 0.77 | 0.908 | 0.903 | 0.887 | 0.869
    parkinsons-Overall | 0.875 | 0.847 | 0.876 | 0.87 | 0.855 | 0.84 | 0.8 | 0.77 | 0.908 | 0.903 | 0.887 | 0.869
    segment-FS | 0.995 | 0.993 | 0.991 | 0.991 | 0.995 | 0.991 | 0.877 | 0.909 | 0.877 | 0.895 | 0.991 | 0.993
    segment-NOFS | 0.995 | 0.993 | 0.992 | 0.994 | 0.992 | 0.994 | 0.886 | 0.903 | 0.907 | 0.934 | 0.993 | 0.993
    segment-Overall | 0.995 | 0.993 | 0.994 | 0.997 | 0.985 | 0.995 | 0.886 | 0.903 | 0.907 | 0.934 | 0.991 | 0.993
    stock-FS | 0.962 | 0.935 | 0.966 | 0.973 | 0.957 | 0.95 | 0.876 | 0.869 | 0.713 | 0.803 | 0.954 | 0.939
    stock-NOFS | 0.962 | 0.959 | 0.972 | 0.971 | 0.956 | 0.962 | 0.871 | 0.897 | 0.925 | 0.928 | 0.956 | 0.956
    stock-Overall | 0.962 | 0.959 | 0.969 | 0.975 | 0.953 | 0.954 | 0.871 | 0.897 | 0.925 | 0.928 | 0.954 | 0.956
    zoo-FS | 0.96 | 0.905 | 0.96 | 0.956 | 0.949 | 0.99 | 0.946 | 0.948 | 0.896 | 0.863 | 0.967 | 0.884
    zoo-NOFS | 0.954 | 0.952 | 0.973 | 0.981 | 0.908 | 0.924 | 0.948 | 0.952 | 0.943 | 0.965 | 0.954 | 0.968
    zoo-Overall | 0.954 | 0.952 | 0.954 | 0.957 | 0.916 | 0.979 | 0.948 | 0.952 | 0.943 | 0.965 | 0.954 | 0.968
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 29. MAR Results at 10% Missingness for the F1 Metric
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.899 | 0.907 | 0.893 | 0.895 | 0.905 | 0.907 | 0.88 | 0.886 | 0.873 | 0.873 | 0.895 | 0.911
    Australian-NOFS | 0.921 | 0.912 | 0.91 | 0.915 | 0.901 | 0.918 | 0.912 | 0.904 | 0.886 | 0.89 | 0.919 | 0.899
    Australian-Overall | 0.899 | 0.907 | 0.907 | 0.915 | 0.918 | 0.912 | 0.912 | 0.904 | 0.886 | 0.89 | 0.895 | 0.911
    boston-FS | 0.94 | 0.94 | 0.926 | 0.926 | 0.928 | 0.918 | 0.918 | 0.918 | 0.926 | 0.926 | 0.933 | 0.94
    boston-NOFS | 0.936 | 0.932 | 0.947 | 0.943 | 0.938 | 0.939 | 0.936 | 0.929 | 0.93 | 0.923 | 0.94 | 0.932
    boston-Overall | 0.94 | 0.94 | 0.926 | 0.921 | 0.942 | 0.922 | 0.918 | 0.918 | 0.93 | 0.923 | 0.933 | 0.94
    churn-FS | 0.893 | 0.897 | 0.899 | 0.896 | 0.896 | 0.897 | 0.89 | 0.881 | 0.871 | 0.873 | 0.896 | 0.896
    churn-NOFS | 0.894 | 0.896 | 0.899 | 0.902 | 0.898 | 0.902 | 0.891 | 0.894 | 0.873 | 0.875 | 0.902 | 0.898
    churn-Overall | 0.893 | 0.896 | 0.899 | 0.897 | 0.898 | 0.9 | 0.891 | 0.894 | 0.873 | 0.875 | 0.896 | 0.896
    compas-two-years-FS | 0.704 | 0.704 | 0.694 | 0.71 | 0.707 | 0.702 | 0.687 | 0.689 | 0.705 | 0.702 | 0.706 | 0.693
    compas-two-years-NOFS | 0.707 | 0.705 | 0.711 | 0.706 | 0.703 | 0.7 | 0.699 | 0.703 | 0.703 | 0.697 | 0.709 | 0.702
    compas-two-years-Overall | 0.704 | 0.704 | 0.711 | 0.71 | 0.7 | 0.699 | 0.699 | 0.703 | 0.703 | 0.697 | 0.706 | 0.702
    image-FS | 0.855 | 0.858 | 0.868 | 0.857 | 0.873 | 0.873 | 0.82 | 0.82 | 0.872 | 0.868
    image-NOFS | 0.878 | 0.879 | 0.879 | 0.888 | 0.875 | 0.875 | 0.862 | 0.865 | 0.882 | 0.885
    image-Overall | 0.878 | 0.879 | 0.875 | 0.879 | 0.875 | 0.875 | 0.862 | 0.865 | 0.882 | 0.885
    page-blocks-FS | 0.987 | 0.988 | 0.989 | 0.988 | 0.988 | 0.988 | 0.985 | 0.986 | 0.976 | 0.954 | 0.987 | 0.987
    page-blocks-NOFS | 0.988 | 0.988 | 0.989 | 0.99 | 0.988 | 0.988 | 0.985 | 0.985 | 0.982 | 0.983 | 0.987 | 0.988
    page-blocks-Overall | 0.988 | 0.988 | 0.988 | 0.988 | 0.987 | 0.987 | 0.985 | 0.985 | 0.982 | 0.983 | 0.987 | 0.988
    parkinsons-FS | 0.873 | 0.873 | 0.87 | 0.871 | 0.875 | 0.876 | 0.874 | 0.874 | 0.889 | 0.801 | 0.87 | 0.87
    parkinsons-NOFS | 0.908 | 0.911 | 0.923 | 0.926 | 0.925 | 0.912 | 0.918 | 0.918 | 0.93 | 0.894 | 0.917 | 0.912
    parkinsons-Overall | 0.873 | 0.873 | 0.936 | 0.87 | 0.905 | 0.873 | 0.918 | 0.918 | 0.93 | 0.894 | 0.87 | 0.87
    segment-FS | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 1.0 | 0.999 | 0.998 | 0.96 | 0.954 | 0.998 | 0.998
    segment-NOFS | 0.999 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.999 | 1.0 | 0.968 | 0.97 | 1.0 | 0.999
    segment-Overall | 0.999 | 1.0 | 1.0 | 1.0 | 0.999 | 1.0 | 0.999 | 1.0 | 0.968 | 0.97 | 0.998 | 0.998
    stock-FS | 0.993 | 0.985 | 0.995 | 0.995 | 0.991 | 0.992 | 0.991 | 0.987 | 0.965 | 0.965 | 0.991 | 0.991
    stock-NOFS | 0.993 | 0.994 | 0.995 | 0.995 | 0.993 | 0.995 | 0.991 | 0.989 | 0.98 | 0.981 | 0.994 | 0.994
    stock-Overall | 0.993 | 0.994 | 0.995 | 0.994 | 0.993 | 0.994 | 0.991 | 0.989 | 0.98 | 0.981 | 0.994 | 0.994
    zoo-FS | 1.0 | 1.0 | 1.0 | 1.0 | 0.824 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.987 | 0.993
    zoo-NOFS | 0.992 | 1.0 | 1.0 | 1.0 | 1.0 | 0.997 | 1.0 | 0.997 | 0.99 | 1.0 | 0.983 | 1.0
    zoo-Overall | 0.992 | 1.0 | 1.0 | 1.0 | 0.997 | 1.0 | 1.0 | 0.997 | 0.99 | 1.0 | 0.983 | 0.993
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 30. MAR Results at 25% Missingness for the F1 Metric
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.901 | 0.905 | 0.888 | 0.886 | 0.883 | 0.89 | 0.864 | 0.87 | 0.877 | 0.869 | 0.909 | 0.909
    Australian-NOFS | 0.903 | 0.898 | 0.887 | 0.888 | 0.887 | 0.896 | 0.884 | 0.888 | 0.87 | 0.875 | 0.908 | 0.902
    Australian-Overall | 0.903 | 0.905 | 0.886 | 0.884 | 0.88 | 0.898 | 0.864 | 0.87 | 0.877 | 0.869 | 0.909 | 0.909
    boston-FS | 0.907 | 0.907 | 0.903 | 0.901 | 0.904 | 0.908 | 0.841 | 0.844 | 0.887 | 0.887 | 0.891 | 0.891
    boston-NOFS | 0.894 | 0.891 | 0.903 | 0.905 | 0.9 | 0.905 | 0.858 | 0.852 | 0.895 | 0.887 | 0.901 | 0.89
    boston-Overall | 0.894 | 0.891 | 0.901 | 0.91 | 0.914 | 0.9 | 0.858 | 0.852 | 0.895 | 0.887 | 0.901 | 0.89
    churn-FS | 0.841 | 0.841 | 0.847 | 0.841 | 0.844 | 0.839 | 0.804 | 0.811 | 0.808 | 0.816 | 0.837 | 0.837
    churn-NOFS | 0.833 | 0.836 | 0.846 | 0.842 | 0.843 | 0.831 | 0.84 | 0.827 | 0.817 | 0.814 | 0.832 | 0.843
    churn-Overall | 0.833 | 0.836 | 0.845 | 0.846 | 0.847 | 0.841 | 0.84 | 0.827 | 0.808 | 0.814 | 0.837 | 0.843
    compas-two-years-FS | 0.701 | 0.697 | 0.688 | 0.701 | 0.688 | 0.696 | 0.685 | 0.7 | 0.685 | 0.681 | 0.682 | 0.697
    compas-two-years-NOFS | 0.701 | 0.695 | 0.693 | 0.695 | 0.69 | 0.69 | 0.673 | 0.692 | 0.692 | 0.697 | 0.694 | 0.691
    compas-two-years-Overall | 0.701 | 0.695 | 0.695 | 0.693 | 0.69 | 0.685 | 0.673 | 0.7 | 0.685 | 0.681 | 0.682 | 0.691
    image-FS | 0.806 | 0.806 | 0.798 | 0.84 | 0.835 | 0.839 | 0.826 | 0.823 | 0.871 | 0.871
    image-NOFS | 0.871 | 0.875 | 0.871 | 0.865 | 0.86 | 0.862 | 0.869 | 0.86 | 0.878 | 0.884
    image-Overall | 0.871 | 0.875 | 0.879 | 0.869 | 0.86 | 0.862 | 0.869 | 0.86 | 0.878 | 0.884
    page-blocks-FS | 0.976 | 0.979 | 0.982 | 0.984 | 0.977 | 0.978 | 0.95 | 0.951 | 0.932 | 0.94 | 0.979 | 0.98
    page-blocks-NOFS | 0.976 | 0.98 | 0.982 | 0.984 | 0.975 | 0.978 | 0.95 | 0.956 | 0.866 | 0.947 | 0.978 | 0.978
    page-blocks-Overall | 0.976 | 0.979 | 0.984 | 0.982 | 0.979 | 0.976 | 0.95 | 0.956 | 0.932 | 0.947 | 0.978 | 0.978
    parkinsons-FS | 0.889 | 0.876 | 0.872 | 0.87 | 0.889 | 0.864 | 0.803 | 0.803 | 0.844 | 0.829 | 0.849 | 0.881
    parkinsons-NOFS | 0.904 | 0.892 | 0.909 | 0.918 | 0.898 | 0.892 | 0.883 | 0.871 | 0.886 | 0.883 | 0.905 | 0.905
    parkinsons-Overall | 0.904 | 0.892 | 0.872 | 0.876 | 0.891 | 0.875 | 0.883 | 0.871 | 0.886 | 0.883 | 0.905 | 0.881
    segment-FS | 0.994 | 0.997 | 1.0 | 1.0 | 0.998 | 0.997 | 0.974 | 0.961 | 0.941 | 0.963 | 0.998 | 0.999
    segment-NOFS | 0.999 | 0.999 | 1.0 | 1.0 | 0.997 | 0.998 | 0.989 | 0.995 | 0.972 | 0.975 | 0.999 | 0.999
    segment-Overall | 0.999 | 0.999 | 1.0 | 1.0 | 0.999 | 0.998 | 0.989 | 0.995 | 0.972 | 0.975 | 0.999 | 0.999
    stock-FS | 0.979 | 0.968 | 0.98 | 0.98 | 0.975 | 0.977 | 0.882 | 0.908 | 0.886 | 0.903 | 0.974 | 0.974
    stock-NOFS | 0.979 | 0.978 | 0.981 | 0.986 | 0.973 | 0.977 | 0.887 | 0.933 | 0.931 | 0.942 | 0.976 | 0.975
    stock-Overall | 0.979 | 0.978 | 0.982 | 0.981 | 0.975 | 0.978 | 0.887 | 0.933 | 0.931 | 0.942 | 0.976 | 0.974
    zoo-FS | 0.991 | 0.984 | 1.0 | 0.99 | 0.986 | 0.967 | 0.973 | 0.989 | 0.96 | 0.99 | 0.923 | 0.925
    zoo-NOFS | 0.99 | 0.989 | 0.973 | 1.0 | 0.998 | 1.0 | 0.997 | 0.998 | 0.986 | 0.997 | 0.952 | 0.99
    zoo-Overall | 0.99 | 0.989 | 0.965 | 1.0 | 1.0 | 1.0 | 0.973 | 0.998 | 0.96 | 0.997 | 0.952 | 0.925
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 31. MAR Results at 50% Missingness for the F1 Metric
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian-FS | 0.855 | 0.887 | 0.867 | 0.878 | 0.852 | 0.84 | 0.785 | 0.838 | 0.821 | 0.821 | 0.863 | 0.884
    Australian-NOFS | 0.862 | 0.873 | 0.879 | 0.858 | 0.851 | 0.861 | 0.772 | 0.821 | 0.822 | 0.829 | 0.86 | 0.874
    Australian-Overall | 0.855 | 0.887 | 0.882 | 0.861 | 0.843 | 0.849 | 0.785 | 0.838 | 0.822 | 0.821 | 0.863 | 0.884
    boston-FS | 0.867 | 0.897 | 0.884 | 0.884 | 0.868 | 0.877 | 0.814 | 0.786 | 0.869 | 0.856 | 0.853 | 0.863
    boston-NOFS | 0.887 | 0.897 | 0.894 | 0.899 | 0.885 | 0.875 | 0.818 | 0.832 | 0.866 | 0.864 | 0.849 | 0.867
    boston-Overall | 0.867 | 0.897 | 0.895 | 0.893 | 0.871 | 0.875 | 0.818 | 0.832 | 0.866 | 0.856 | 0.853 | 0.863
    churn-FS | 0.758 | 0.773 | 0.764 | 0.774 | 0.754 | 0.747 | 0.704 | 0.711 | 0.765 | 0.776 | 0.766 | 0.757
    churn-NOFS | 0.755 | 0.766 | 0.775 | 0.793 | 0.756 | 0.752 | 0.72 | 0.737 | 0.773 | 0.764 | 0.759 | 0.76
    churn-Overall | 0.755 | 0.766 | 0.775 | 0.775 | 0.761 | 0.762 | 0.72 | 0.737 | 0.765 | 0.764 | 0.766 | 0.757
    compas-two-years-FS | 0.662 | 0.668 | 0.665 | 0.674 | 0.642 | 0.664 | 0.643 | 0.658 | 0.663 | 0.671 | 0.659 | 0.666
    compas-two-years-NOFS | 0.666 | 0.682 | 0.663 | 0.674 | 0.658 | 0.678 | 0.637 | 0.662 | 0.666 | 0.682 | 0.665 | 0.679
    compas-two-years-Overall | 0.666 | 0.682 | 0.668 | 0.676 | 0.647 | 0.678 | 0.643 | 0.662 | 0.666 | 0.682 | 0.665 | 0.679
    image-FS | 0.771 | 0.76 | 0.742 | 0.779 | 0.721 | 0.735 | 0.812 | 0.812 | 0.805 | 0.81
    image-NOFS | 0.824 | 0.824 | 0.836 | 0.824 | 0.731 | 0.749 | 0.841 | 0.828 | 0.857 | 0.852
    image-Overall | 0.824 | 0.824 | 0.828 | 0.83 | 0.731 | 0.749 | 0.841 | 0.828 | 0.857 | 0.852
    page-blocks-FS | 0.963 | 0.964 | 0.952 | 0.959 | 0.963 | 0.962 | 0.89 | 0.902 | 0.878 | 0.87 | 0.961 | 0.957
    page-blocks-NOFS | 0.963 | 0.962 | 0.959 | 0.96 | 0.96 | 0.966 | 0.896 | 0.911 | 0.898 | 0.918 | 0.961 | 0.958
    page-blocks-Overall | 0.963 | 0.964 | 0.958 | 0.961 | 0.959 | 0.964 | 0.896 | 0.911 | 0.898 | 0.918 | 0.961 | 0.957
    parkinsons-FS | 0.809 | 0.855 | 0.861 | 0.88 | 0.839 | 0.813 | 0.8 | 0.801 | 0.865 | 0.884 | 0.803 | 0.85
    parkinsons-NOFS | 0.875 | 0.847 | 0.883 | 0.876 | 0.855 | 0.865 | 0.778 | 0.77 | 0.908 | 0.903 | 0.887 | 0.869
    parkinsons-Overall | 0.875 | 0.847 | 0.876 | 0.87 | 0.855 | 0.84 | 0.8 | 0.77 | 0.908 | 0.903 | 0.887 | 0.869
    segment-FS | 0.995 | 0.993 | 0.991 | 0.991 | 0.995 | 0.991 | 0.877 | 0.909 | 0.877 | 0.895 | 0.991 | 0.993
    segment-NOFS | 0.995 | 0.993 | 0.992 | 0.994 | 0.992 | 0.994 | 0.886 | 0.903 | 0.907 | 0.934 | 0.993 | 0.993
    segment-Overall | 0.995 | 0.993 | 0.994 | 0.997 | 0.985 | 0.995 | 0.886 | 0.903 | 0.907 | 0.934 | 0.991 | 0.993
    stock-FS | 0.962 | 0.935 | 0.966 | 0.973 | 0.957 | 0.95 | 0.876 | 0.869 | 0.713 | 0.803 | 0.954 | 0.939
    stock-NOFS | 0.962 | 0.959 | 0.972 | 0.971 | 0.956 | 0.962 | 0.871 | 0.897 | 0.925 | 0.928 | 0.956 | 0.956
    stock-Overall | 0.962 | 0.959 | 0.969 | 0.975 | 0.953 | 0.954 | 0.871 | 0.897 | 0.925 | 0.928 | 0.954 | 0.956
    zoo-FS | 0.96 | 0.905 | 0.96 | 0.956 | 0.949 | 0.99 | 0.946 | 0.948 | 0.896 | 0.863 | 0.967 | 0.884
    zoo-NOFS | 0.954 | 0.952 | 0.973 | 0.981 | 0.908 | 0.924 | 0.948 | 0.952 | 0.943 | 0.965 | 0.954 | 0.968
    zoo-Overall | 0.954 | 0.952 | 0.954 | 0.957 | 0.916 | 0.979 | 0.948 | 0.952 | 0.943 | 0.965 | 0.954 | 0.968
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 32. MAR Results at 10% Missingness for the ACC Metric
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian | 0.849 | 0.843 | 0.855 | 0.861 | 0.861 | 0.852 | 0.852 | 0.846 | 0.832 | 0.838 | 0.829 | 0.841
    Australian-FS | 0.849 | 0.843 | 0.838 | 0.841 | 0.852 | 0.849 | 0.829 | 0.835 | 0.82 | 0.82 | 0.829 | 0.841
    Australian-NOFS | 0.852 | 0.852 | 0.852 | 0.858 | 0.829 | 0.858 | 0.852 | 0.846 | 0.832 | 0.838 | 0.855 | 0.829
    boston | 0.893 | 0.893 | 0.877 | 0.858 | 0.889 | 0.858 | 0.854 | 0.854 | 0.874 | 0.858 | 0.87 | 0.881
    boston-FS | 0.893 | 0.893 | 0.87 | 0.881 | 0.877 | 0.858 | 0.854 | 0.854 | 0.877 | 0.877 | 0.87 | 0.881
    boston-NOFS | 0.905 | 0.885 | 0.897 | 0.885 | 0.877 | 0.889 | 0.881 | 0.87 | 0.874 | 0.858 | 0.897 | 0.87
    churn | 0.94 | 0.935 | 0.94 | 0.933 | 0.943 | 0.936 | 0.932 | 0.928 | 0.921 | 0.922 | 0.944 | 0.944
    churn-FS | 0.94 | 0.942 | 0.93 | 0.932 | 0.94 | 0.943 | 0.93 | 0.926 | 0.92 | 0.914 | 0.944 | 0.944
    churn-NOFS | 0.941 | 0.935 | 0.943 | 0.943 | 0.943 | 0.936 | 0.932 | 0.928 | 0.921 | 0.922 | 0.942 | 0.933
    compas-two-years | 0.657 | 0.657 | 0.662 | 0.658 | 0.656 | 0.66 | 0.666 | 0.661 | 0.656 | 0.658 | 0.664 | 0.662
    compas-two-years-FS | 0.657 | 0.657 | 0.654 | 0.659 | 0.658 | 0.653 | 0.649 | 0.652 | 0.651 | 0.649 | 0.664 | 0.644
    compas-two-years-NOFS | 0.66 | 0.665 | 0.664 | 0.663 | 0.652 | 0.661 | 0.666 | 0.661 | 0.656 | 0.658 | 0.666 | 0.662
    image | 0.873 | 0.864 | 0.872 | 0.872 | 0.859 | 0.86 | 0.854 | 0.862 | 0.865 | 0.871
    image-FS | 0.851 | 0.855 | 0.857 | 0.858 | 0.859 | 0.859 | 0.84 | 0.84 | 0.869 | 0.859
    image-NOFS | 0.873 | 0.864 | 0.867 | 0.869 | 0.859 | 0.86 | 0.854 | 0.862 | 0.865 | 0.871
    page-blocks | 0.971 | 0.971 | 0.972 | 0.972 | 0.972 | 0.972 | 0.966 | 0.968 | 0.965 | 0.963 | 0.968 | 0.972
    page-blocks-FS | 0.972 | 0.971 | 0.971 | 0.973 | 0.973 | 0.972 | 0.97 | 0.966 | 0.961 | 0.958 | 0.968 | 0.973
    page-blocks-NOFS | 0.971 | 0.972 | 0.972 | 0.973 | 0.971 | 0.972 | 0.966 | 0.968 | 0.965 | 0.963 | 0.968 | 0.972
    parkinsons | 0.878 | 0.878 | 0.898 | 0.878 | 0.867 | 0.878 | 0.908 | 0.898 | 0.908 | 0.857 | 0.878 | 0.878
    parkinsons-FS | 0.878 | 0.878 | 0.878 | 0.878 | 0.878 | 0.878 | 0.878 | 0.878 | 0.867 | 0.857 | 0.878 | 0.878
    parkinsons-NOFS | 0.898 | 0.867 | 0.888 | 0.898 | 0.898 | 0.888 | 0.908 | 0.898 | 0.908 | 0.857 | 0.898 | 0.898
    segment | 0.993 | 0.994 | 0.996 | 0.997 | 0.994 | 0.994 | 0.994 | 0.996 | 0.932 | 0.932 | 0.994 | 0.994
    segment-FS | 0.994 | 0.994 | 0.997 | 0.996 | 0.995 | 0.994 | 0.991 | 0.989 | 0.923 | 0.916 | 0.994 | 0.994
    segment-NOFS | 0.993 | 0.994 | 0.997 | 0.998 | 0.995 | 0.995 | 0.994 | 0.996 | 0.932 | 0.932 | 0.996 | 0.995
    stock | 0.96 | 0.96 | 0.973 | 0.973 | 0.956 | 0.964 | 0.945 | 0.941 | 0.924 | 0.926 | 0.962 | 0.962
    stock-FS | 0.952 | 0.935 | 0.975 | 0.975 | 0.949 | 0.962 | 0.964 | 0.947 | 0.901 | 0.901 | 0.956 | 0.956
    stock-NOFS | 0.96 | 0.96 | 0.964 | 0.966 | 0.962 | 0.971 | 0.945 | 0.941 | 0.924 | 0.926 | 0.962 | 0.962
    zoo | 0.961 | 1.0 | 1.0 | 1.0 | 0.98 | 1.0 | 1.0 | 0.98 | 0.961 | 1.0 | 0.941 | 0.941
    zoo-FS | 1.0 | 1.0 | 1.0 | 1.0 | 0.843 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.941 | 0.941
    zoo-NOFS | 0.961 | 1.0 | 1.0 | 1.0 | 1.0 | 0.98 | 1.0 | 0.98 | 0.961 | 1.0 | 0.941 | 1.0
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 33. MAR Results at 25% Missingness for the ACC Metric
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian | 0.832 | 0.838 | 0.841 | 0.835 | 0.82 | 0.835 | 0.8 | 0.791 | 0.814 | 0.812 | 0.835 | 0.843
    Australian-FS | 0.829 | 0.838 | 0.841 | 0.835 | 0.826 | 0.826 | 0.8 | 0.791 | 0.814 | 0.812 | 0.835 | 0.843
    Australian-NOFS | 0.832 | 0.82 | 0.835 | 0.823 | 0.812 | 0.812 | 0.809 | 0.814 | 0.809 | 0.812 | 0.832 | 0.826
    boston | 0.862 | 0.858 | 0.85 | 0.87 | 0.885 | 0.858 | 0.814 | 0.818 | 0.854 | 0.834 | 0.85 | 0.834
    boston-FS | 0.862 | 0.862 | 0.85 | 0.866 | 0.866 | 0.854 | 0.802 | 0.802 | 0.834 | 0.834 | 0.842 | 0.842
    boston-NOFS | 0.862 | 0.858 | 0.862 | 0.858 | 0.85 | 0.85 | 0.814 | 0.818 | 0.854 | 0.858 | 0.85 | 0.834
    churn | 0.921 | 0.915 | 0.926 | 0.922 | 0.921 | 0.918 | 0.898 | 0.892 | 0.9 | 0.9 | 0.918 | 0.918
    churn-FS | 0.919 | 0.922 | 0.921 | 0.92 | 0.918 | 0.921 | 0.901 | 0.887 | 0.9 | 0.901 | 0.918 | 0.918
    churn-NOFS | 0.921 | 0.915 | 0.928 | 0.924 | 0.92 | 0.913 | 0.898 | 0.892 | 0.903 | 0.9 | 0.918 | 0.918
    compas-two-years | 0.656 | 0.656 | 0.653 | 0.648 | 0.64 | 0.646 | 0.641 | 0.656 | 0.649 | 0.652 | 0.649 | 0.653
    compas-two-years-FS | 0.656 | 0.649 | 0.653 | 0.656 | 0.639 | 0.649 | 0.656 | 0.656 | 0.649 | 0.652 | 0.649 | 0.647
    compas-two-years-NOFS | 0.659 | 0.656 | 0.649 | 0.659 | 0.641 | 0.654 | 0.641 | 0.654 | 0.647 | 0.65 | 0.652 | 0.653
    image | 0.861 | 0.866 | 0.869 | 0.862 | 0.859 | 0.86 | 0.861 | 0.852 | 0.873 | 0.869
    image-FS | 0.838 | 0.838 | 0.819 | 0.846 | 0.843 | 0.841 | 0.843 | 0.842 | 0.867 | 0.867
    image-NOFS | 0.861 | 0.866 | 0.866 | 0.868 | 0.859 | 0.86 | 0.861 | 0.852 | 0.873 | 0.869
    page-blocks | 0.962 | 0.962 | 0.966 | 0.964 | 0.961 | 0.959 | 0.945 | 0.946 | 0.947 | 0.953 | 0.959 | 0.958
    page-blocks-FS | 0.962 | 0.962 | 0.966 | 0.965 | 0.961 | 0.962 | 0.946 | 0.942 | 0.947 | 0.949 | 0.963 | 0.959
    page-blocks-NOFS | 0.962 | 0.961 | 0.965 | 0.965 | 0.96 | 0.959 | 0.945 | 0.946 | 0.947 | 0.953 | 0.959 | 0.958
    parkinsons | 0.867 | 0.878 | 0.867 | 0.867 | 0.888 | 0.847 | 0.857 | 0.857 | 0.888 | 0.878 | 0.888 | 0.867
    parkinsons-FS | 0.867 | 0.857 | 0.867 | 0.867 | 0.867 | 0.837 | 0.837 | 0.837 | 0.867 | 0.837 | 0.847 | 0.867
    parkinsons-NOFS | 0.867 | 0.878 | 0.888 | 0.898 | 0.888 | 0.888 | 0.857 | 0.857 | 0.888 | 0.878 | 0.888 | 0.888
    segment | 0.991 | 0.995 | 0.997 | 0.997 | 0.992 | 0.99 | 0.973 | 0.979 | 0.941 | 0.938 | 0.994 | 0.993
    segment-FS | 0.99 | 0.991 | 0.997 | 0.997 | 0.991 | 0.99 | 0.958 | 0.945 | 0.897 | 0.939 | 0.993 | 0.993
    segment-NOFS | 0.991 | 0.995 | 0.997 | 0.997 | 0.989 | 0.992 | 0.973 | 0.979 | 0.941 | 0.938 | 0.994 | 0.993
    stock | 0.926 | 0.918 | 0.941 | 0.939 | 0.914 | 0.916 | 0.813 | 0.859 | 0.846 | 0.861 | 0.914 | 0.918
    stock-FS | 0.926 | 0.901 | 0.931 | 0.931 | 0.92 | 0.916 | 0.794 | 0.844 | 0.8 | 0.825 | 0.918 | 0.918
    stock-NOFS | 0.926 | 0.918 | 0.945 | 0.939 | 0.916 | 0.918 | 0.813 | 0.859 | 0.846 | 0.861 | 0.914 | 0.922
    zoo | 0.961 | 0.961 | 0.922 | 1.0 | 1.0 | 1.0 | 0.941 | 0.98 | 0.922 | 0.98 | 0.922 | 0.863
    zoo-FS | 0.961 | 0.941 | 1.0 | 0.961 | 0.961 | 0.941 | 0.941 | 0.961 | 0.922 | 0.941 | 0.843 | 0.863
    zoo-NOFS | 0.961 | 0.961 | 0.922 | 1.0 | 0.98 | 1.0 | 0.98 | 0.98 | 0.98 | 0.98 | 0.922 | 0.98
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.
    Table 34. MAR Results at 50% Missingness for the ACC Metric
    Dataset | MM | BI+MM | MF | BI+MF | GAIN | BI+GAIN | SOFT | BI+SOFT | PPCA | BI+PPCA | DAE | BI+DAE
    Australian | 0.791 | 0.823 | 0.832 | 0.812 | 0.794 | 0.783 | 0.748 | 0.791 | 0.774 | 0.762 | 0.814 | 0.82
    Australian-FS | 0.791 | 0.823 | 0.817 | 0.809 | 0.803 | 0.783 | 0.748 | 0.791 | 0.762 | 0.762 | 0.814 | 0.82
    Australian-NOFS | 0.806 | 0.823 | 0.843 | 0.812 | 0.771 | 0.812 | 0.722 | 0.762 | 0.774 | 0.774 | 0.806 | 0.803
    boston | 0.798 | 0.818 | 0.838 | 0.818 | 0.798 | 0.81 | 0.763 | 0.763 | 0.787 | 0.791 | 0.787 | 0.791
    boston-FS | 0.798 | 0.818 | 0.806 | 0.822 | 0.806 | 0.798 | 0.767 | 0.739 | 0.791 | 0.791 | 0.787 | 0.791
    boston-NOFS | 0.826 | 0.834 | 0.822 | 0.826 | 0.822 | 0.822 | 0.763 | 0.763 | 0.787 | 0.802 | 0.794 | 0.798
    churn | 0.894 | 0.889 | 0.896 | 0.894 | 0.888 | 0.882 | 0.865 | 0.866 | 0.887 | 0.883 | 0.89 | 0.888
    churn-FS | 0.892 | 0.89 | 0.893 | 0.899 | 0.89 | 0.884 | 0.87 | 0.869 | 0.887 | 0.889 | 0.89 | 0.888
    churn-NOFS | 0.894 | 0.889 | 0.892 | 0.891 | 0.888 | 0.882 | 0.865 | 0.866 | 0.888 | 0.883 | 0.894 | 0.883
    compas-two-years | 0.621 | 0.632 | 0.629 | 0.628 | 0.608 | 0.627 | 0.607 | 0.614 | 0.62 | 0.629 | 0.621 | 0.632
    compas-two-years-FS | 0.611 | 0.623 | 0.623 | 0.624 | 0.598 | 0.62 | 0.607 | 0.618 | 0.619 | 0.622 | 0.626 | 0.627
    compas-two-years-NOFS | 0.621 | 0.632 | 0.629 | 0.626 | 0.618 | 0.631 | 0.604 | 0.614 | 0.62 | 0.629 | 0.621 | 0.632
    image | 0.84 | 0.839 | 0.845 | 0.849 | 0.809 | 0.817 | 0.854 | 0.847 | 0.856 | 0.853
    image-FS | 0.813 | 0.815 | 0.811 | 0.817 | 0.804 | 0.804 | 0.846 | 0.846 | 0.831 | 0.837
    image-NOFS | 0.84 | 0.839 | 0.846 | 0.846 | 0.809 | 0.817 | 0.854 | 0.847 | 0.856 | 0.853
    page-blocks | 0.948 | 0.949 | 0.955 | 0.952 | 0.949 | 0.947 | 0.921 | 0.92 | 0.933 | 0.934 | 0.949 | 0.949
    page-blocks-FS | 0.948 | 0.949 | 0.954 | 0.954 | 0.95 | 0.948 | 0.916 | 0.921 | 0.932 | 0.93 | 0.949 | 0.949
    page-blocks-NOFS | 0.948 | 0.95 | 0.953 | 0.953 | 0.95 | 0.949 | 0.921 | 0.92 | 0.933 | 0.934 | 0.949 | 0.949
    parkinsons | 0.837 | 0.857 | 0.857 | 0.857 | 0.847 | 0.837 | 0.827 | 0.847 | 0.878 | 0.867 | 0.867 | 0.857
    parkinsons-FS | 0.796 | 0.857 | 0.816 | 0.867 | 0.847 | 0.837 | 0.827 | 0.816 | 0.847 | 0.837 | 0.837 | 0.837
    parkinsons-NOFS | 0.837 | 0.857 | 0.878 | 0.867 | 0.847 | 0.837 | 0.806 | 0.847 | 0.878 | 0.867 | 0.867 | 0.857
    segment | 0.982 | 0.977 | 0.99 | 0.99 | 0.961 | 0.983 | 0.91 | 0.911 | 0.882 | 0.898 | 0.967 | 0.969
    segment-FS | 0.983 | 0.977 | 0.989 | 0.99 | 0.983 | 0.98 | 0.899 | 0.915 | 0.861 | 0.872 | 0.967 | 0.969
    segment-NOFS | 0.982 | 0.977 | 0.99 | 0.988 | 0.983 | 0.979 | 0.91 | 0.911 | 0.882 | 0.898 | 0.971 | 0.969
    stock | 0.901 | 0.891 | 0.922 | 0.918 | 0.863 | 0.882 | 0.811 | 0.827 | 0.834 | 0.838 | 0.882 | 0.874
    stock-FS | 0.901 | 0.857 | 0.924 | 0.918 | 0.897 | 0.857 | 0.815 | 0.798 | 0.695 | 0.743 | 0.882 | 0.859
    stock-NOFS | 0.901 | 0.891 | 0.924 | 0.924 | 0.876 | 0.884 | 0.811 | 0.827 | 0.834 | 0.838 | 0.882 | 0.874
    zoo | 0.902 | 0.902 | 0.941 | 0.941 | 0.863 | 0.941 | 0.902 | 0.922 | 0.882 | 0.922 | 0.922 | 0.922
    zoo-FS | 0.902 | 0.882 | 0.961 | 0.922 | 0.902 | 0.961 | 0.882 | 0.922 | 0.863 | 0.843 | 0.941 | 0.882
    zoo-NOFS | 0.902 | 0.902 | 0.98 | 0.961 | 0.843 | 0.863 | 0.902 | 0.922 | 0.882 | 0.922 | 0.922 | 0.922
    -FS denotes results with enforced feature selection, -NOFS denotes results without feature selection, and -Overall denotes results with and without feature selection.

    C.7 Imputation Accuracy Results

    In this section, we report the quantitative results of the imputation quality experiments. The section is split into two subsections, one per missingness mechanism. We report results on the train and test sets using the default configuration of each imputation method (for details, see Section C.3).
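    To make the two imputation-quality scores concrete, the sketch below computes an R² score for a numerical feature and an accuracy score for a categorical feature, using only the cells that were masked. It is a minimal illustration that assumes quality is scored on the artificially removed entries against their known ground truth; the helper names and toy values are hypothetical, and the exact scoring used in the experiments may differ in its details (for instance, in how scores are aggregated across features).

```python
from typing import List, Sequence, Tuple


def masked_pairs(truth: Sequence, imputed: Sequence, mask: Sequence[bool]) -> Tuple[List, List]:
    """Keep only the (true, imputed) values at positions where the cell was masked."""
    kept = [(t, i) for t, i, m in zip(truth, imputed, mask) if m]
    return [t for t, _ in kept], [i for _, i in kept]


def r2_score(truth: Sequence[float], imputed: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(truth) / len(truth)
    ss_res = sum((t - i) ** 2 for t, i in zip(truth, imputed))
    ss_tot = sum((t - mean_true) ** 2 for t in truth)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when the true values are constant")
    return 1.0 - ss_res / ss_tot


def accuracy_score(truth: Sequence, imputed: Sequence) -> float:
    """Fraction of masked categorical cells whose imputed value equals the true one."""
    return sum(t == i for t, i in zip(truth, imputed)) / len(truth)


# Numerical feature (hypothetical values): cells 1 and 3 were masked and imputed.
num_true = [1.0, 2.0, 3.0, 4.0]
num_imputed = [1.0, 2.5, 3.0, 3.5]
num_mask = [False, True, False, True]
print(r2_score(*masked_pairs(num_true, num_imputed, num_mask)))  # 0.75

# Categorical feature (hypothetical values): cells 1, 2, and 3 were masked.
cat_true = ["a", "b", "a", "b"]
cat_imputed = ["a", "b", "a", "a"]
cat_mask = [False, True, True, True]
print(accuracy_score(*masked_pairs(cat_true, cat_imputed, cat_mask)))  # 2 of 3 correct
```

    Restricting both scores to the masked cells keeps the observed values, which any method reproduces trivially, from inflating the result.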
    C.7.1 MCAR.
    Tables 35 and 36 present the R² and accuracy scores, respectively, at 10% missingness. Tables 37 and 38 show the R² and accuracy scores at 25% missingness, and Tables 39 and 40 show them at 50% missingness.
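    For reference, MCAR missingness at a given rate can be simulated by hiding each cell independently with the same probability, regardless of any observed or unobserved value. The sketch below is a minimal illustration of that definition; the function names, the seed, and the use of `None` as the missing-value marker are assumptions, and the procedure used to generate the datasets behind these tables may differ (for example, by fixing the exact number of masked cells per feature).

```python
import random
from typing import List, Optional, Sequence


def mcar_mask(n_rows: int, n_cols: int, rate: float, seed: int = 0) -> List[List[bool]]:
    """Return a boolean mask where True marks a cell to hide, drawn i.i.d. with prob. `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    rng = random.Random(seed)  # local generator, so results are reproducible
    return [[rng.random() < rate for _ in range(n_cols)] for _ in range(n_rows)]


def apply_mask(rows: Sequence[Sequence[float]],
               mask: Sequence[Sequence[bool]]) -> List[List[Optional[float]]]:
    """Replace masked cells with None and leave all other cells unchanged."""
    return [[None if m else v for v, m in zip(row, mask_row)]
            for row, mask_row in zip(rows, mask)]


# Toy complete dataset (hypothetical): 2000 rows, 5 numerical features.
rows = [[float(r * 10 + c) for c in range(5)] for r in range(2000)]
mask = mcar_mask(len(rows), 5, rate=0.25, seed=42)
incomplete = apply_mask(rows, mask)

observed_rate = sum(v is None for row in incomplete for v in row) / (len(rows) * 5)
print(round(observed_rate, 3))  # close to, but generally not exactly, 0.25
```

    Because every cell is drawn independently, the realized missing rate only approximates the target rate, and the approximation tightens as the dataset grows.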
    Table 35. Imputation R²-score for MCAR 10% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.073 | 0.0 | 0.171 | 0.08 | 0.0 | 0.0
    Australian-Train | 0.033 | 0.0 | 0.174 | 0.118 | 0.0 | 0.0
    boston-Test | 0.246 | 0.0 | 0.636 | 0.491 | 0.0 | 0.302
    boston-Train | 0.224 | 0.252 | 0.651 | 0.521 | 0.0 | 0.336
    churn-Test | 0.292 | 0.0 | 0.474 | 0.33 | 0.0 | 0.348
    churn-Train | 0.259 | 0.429 | 0.46 | 0.322 | 0.0 | 0.337
    compas-two-years-Test | 0.06 | 0.0 | 0.235 | 0.209 | 0.0 | 0.0
    compas-two-years-Train | 0.058 | 0.139 | 0.316 | 0.225 | 0.0 | 0.0
    image-Test | 0.711 | 0.0 | 0.798 | 0.0 | 0.827
    image-Train | 0.46 | 0.938 | 0.775 | 0.0 | 0.829
    page-blocks-Test | 0.299 | 0.0 | 0.815 | 0.15 | 0.0 | 0.4
    page-blocks-Train | 0.319 | 0.309 | 0.796 | 0.168 | 0.0 | 0.431
    parkinsons-Test | 0.194 | 0.04 | 0.697 | 0.578 | 0.0 | 0.489
    parkinsons-Train | 0.221 | 0.592 | 0.815 | 0.6 | 0.0 | 0.628
    segment-Test | 0.364 | 0.0 | 0.665 | 0.483 | 0.053 | 0.48
    segment-Train | 0.433 | 0.41 | 0.729 | 0.539 | 0.053 | 0.517
    stock-Test | 0.514 | 0.0 | 0.931 | 0.699 | 0.0 | 0.546
    stock-Train | 0.559 | 0.656 | 0.951 | 0.701 | 0.0 | 0.588
    zoo-Test | 0.0 | 0.0 | 0.687 | 0.0 | 0.0 | 0.0
    zoo-Train | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Table 36. Imputation Accuracy Score for MCAR 10% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.653 | 0.594 | 0.745 | 0.682 | 0.659 | 0.36
    Australian-Train | 0.679 | 0.743 | 0.825 | 0.635 | 0.605 | 0.364
    boston-Test | 0.955 | 0.818 | 1.0 | 1.0 | 1.0 | 0.5
    boston-Train | 0.773 | 0.364 | 0.955 | 0.955 | 0.955 | 0.682
    churn-Test | 0.77 | 0.642 | 0.795 | 0.742 | 0.689 | 0.708
    churn-Train | 0.746 | 0.747 | 0.769 | 0.716 | 0.673 | 0.715
    compas-two-years-Test | 0.858 | 0.586 | 0.955 | 0.766 | 0.682 | 0.835
    compas-two-years-Train | 0.832 | 0.94 | 0.953 | 0.75 | 0.668 | 0.821
    zoo-Test | 0.837 | 0.572 | 0.861 | 0.806 | 0.689 | 0.73
    zoo-Train | 0.774 | 0.874 | 0.909 | 0.863 | 0.734 | 0.743
    Table 37. Imputation R²-score for MCAR 25% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.043 | 0.0 | 0.125 | 0.097 | 0.0 | 0.01
    Australian-Train | 0.034 | 0.0 | 0.101 | 0.069 | 0.0 | 0.001
    boston-Test | 0.14 | 0.0 | 0.585 | 0.413 | 0.0 | 0.201
    boston-Train | 0.047 | 0.159 | 0.604 | 0.382 | 0.0 | 0.175
    churn-Test | 0.237 | 0.0 | 0.378 | 0.211 | 0.0 | 0.283
    churn-Train | 0.243 | 0.376 | 0.391 | 0.213 | 0.0 | 0.286
    compas-two-years-Test | 0.003 | 0.0 | 0.152 | 0.129 | 0.0 | 0.0
    compas-two-years-Train | 0.007 | 0.031 | 0.162 | 0.153 | 0.0 | 0.0
    image-Test | 0.256 | 0.0 | 0.712 | 0.0 | 0.757
    image-Train | 0.181 | 0.88 | 0.644 | 0.0 | 0.765
    page-blocks-Test | 0.183 | 0.0 | 0.614 | 0.228 | 0.0 | 0.279
    page-blocks-Train | 0.227 | 0.292 | 0.768 | 0.236 | 0.0 | 0.399
    parkinsons-Test | 0.01 | 0.005 | 0.723 | 0.561 | 0.0 | 0.466
    parkinsons-Train | 0.067 | 0.54 | 0.764 | 0.353 | 0.0 | 0.497
    segment-Test | 0.256 | 0.0 | 0.696 | 0.452 | 0.053 | 0.473
    segment-Train | 0.243 | 0.412 | 0.677 | 0.446 | 0.053 | 0.46
    stock-Test | 0.423 | 0.0 | 0.902 | 0.562 | 0.0 | 0.504
    stock-Train | 0.443 | 0.496 | 0.901 | 0.554 | 0.0 | 0.516
    zoo-Test | 0.059 | 0.0 | 0.0 | 0.107 | 0.0 | 0.0
    zoo-Train | 0.0 | 0.0 | 0.421 | 0.407 | 0.0 | 0.0
    Table 38. Imputation Accuracy Score for MCAR 25% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.637 | 0.593 | 0.754 | 0.632 | 0.624 | 0.622
    Australian-Train | 0.646 | 0.635 | 0.745 | 0.623 | 0.616 | 0.592
    boston-Test | 0.857 | 0.814 | 0.9 | 0.9 | 0.9 | 0.457
    boston-Train | 0.869 | 0.443 | 0.951 | 0.951 | 0.951 | 0.541
    churn-Test | 0.719 | 0.617 | 0.764 | 0.71 | 0.703 | 0.679
    churn-Train | 0.721 | 0.704 | 0.776 | 0.712 | 0.706 | 0.678
    compas-two-years-Test | 0.831 | 0.55 | 0.925 | 0.683 | 0.682 | 0.8
    compas-two-years-Train | 0.831 | 0.758 | 0.922 | 0.687 | 0.686 | 0.814
    zoo-Test | 0.824 | 0.555 | 0.897 | 0.795 | 0.661 | 0.564
    zoo-Train | 0.759 | 0.799 | 0.893 | 0.842 | 0.741 | 0.582
    Table 39. Imputation R²-score for MCAR 50% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.02 | 0.0 | 0.046 | 0.039 | 0.0 | 0.013
    Australian-Train | 0.0 | 0.004 | 0.032 | 0.027 | 0.0 | 0.01
    boston-Test | 0.079 | 0.0 | 0.403 | 0.213 | 0.0 | 0.192
    boston-Train | 0.089 | 0.047 | 0.385 | 0.176 | 0.0 | 0.186
    churn-Test | 0.059 | 0.0 | 0.25 | 0.077 | 0.0 | 0.02
    churn-Train | 0.057 | 0.191 | 0.243 | 0.075 | 0.0 | 0.019
    compas-two-years-Test | 0.002 | 0.0 | 0.083 | 0.052 | 0.0 | 0.042
    compas-two-years-Train | 0.008 | 0.059 | 0.075 | 0.051 | 0.0 | 0.04
    image-Test | 0.009 | 0.0 | 0.455 | 0.0 | 0.574
    image-Train | 0.011 | 0.24 | 0.393 | 0.0 | 0.573
    page-blocks-Test | 0.086 | 0.0 | 0.411 | 0.171 | 0.0 | 0.154
    page-blocks-Train | 0.174 | 0.113 | 0.605 | 0.154 | 0.0 | 0.21
    parkinsons-Test | 0.0 | 0.0 | 0.566 | 0.305 | 0.0 | 0.323
    parkinsons-Train | 0.003 | 0.228 | 0.595 | 0.167 | 0.0 | 0.407
    segment-Test | 0.189 | 0.0 | 0.621 | 0.22 | 0.053 | 0.301
    segment-Train | 0.187 | 0.188 | 0.631 | 0.21 | 0.053 | 0.309
    stock-Test | 0.194 | 0.0 | 0.714 | 0.29 | 0.0 | 0.38
    stock-Train | 0.164 | 0.224 | 0.751 | 0.289 | 0.0 | 0.401
    zoo-Test | 0.0 | 0.089 | 0.0 | 0.0 | 0.0 | 0.0
    zoo-Train | 0.0 | 0.0 | 0.0 | 0.002 | 0.0 | 0.0
    Table 40. Imputation Accuracy Score for MCAR 50% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.622 | 0.546 | 0.723 | 0.639 | 0.639 | 0.599
    Australian-Train | 0.617 | 0.553 | 0.71 | 0.629 | 0.629 | 0.624
    boston-Test | 0.892 | 0.677 | 0.931 | 0.931 | 0.931 | 0.462
    boston-Train | 0.905 | 0.448 | 0.931 | 0.931 | 0.931 | 0.44
    churn-Test | 0.601 | 0.599 | 0.749 | 0.711 | 0.711 | 0.603
    churn-Train | 0.596 | 0.638 | 0.742 | 0.707 | 0.707 | 0.601
    compas-two-years-Test | 0.658 | 0.501 | 0.853 | 0.681 | 0.681 | 0.782
    compas-two-years-Train | 0.646 | 0.562 | 0.845 | 0.679 | 0.679 | 0.777
    zoo-Test | 0.692 | 0.511 | 0.77 | 0.696 | 0.689 | 0.567
    zoo-Train | 0.699 | 0.612 | 0.797 | 0.687 | 0.685 | 0.522
    C.7.2 MAR.
    Tables 41 and 42 present the R² and accuracy scores, respectively, at 10% missingness. Tables 43 and 44 show the R² and accuracy scores at 25% missingness, and Tables 45 and 46 show them at 50% missingness.
    Table 41. Imputation R²-score for MAR 10% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.004 | 0.0 | 0.131 | 0.064 | 0.0 | 0.0
    Australian-Train | 0.072 | 0.0 | 0.246 | 0.142 | 0.0 | 0.0
    boston-Test | 0.051 | 0.0 | 0.554 | 0.445 | 0.0 | 0.171
    boston-Train | 0.035 | 0.199 | 0.717 | 0.382 | 0.0 | 0.278
    churn-Test | 0.226 | 0.0 | 0.493 | 0.324 | 0.0 | 0.344
    churn-Train | 0.226 | 0.415 | 0.47 | 0.321 | 0.0 | 0.336
    compas-two-years-Test | 0.0 | 0.0 | 0.219 | 0.193 | 0.0 | 0.0
    compas-two-years-Train | 0.0 | 0.142 | 0.247 | 0.207 | 0.0 | 0.0
    image-Test | 0.312 | 0.0 | 0.751 | 0.0 | 0.816
    image-Train | 0.03 | 0.943 | 0.706 | 0.0 | 0.816
    page-blocks-Test | 0.031 | 0.0 | 0.698 | 0.133 | 0.0 | 0.212
    page-blocks-Train | 0.084 | 0.286 | 0.684 | 0.134 | 0.0 | 0.22
    parkinsons-Test | 0.158 | 0.005 | 0.726 | 0.632 | 0.0 | 0.536
    parkinsons-Train | 0.065 | 0.633 | 0.724 | 0.491 | 0.0 | 0.564
    segment-Test | 0.281 | 0.0 | 0.638 | 0.476 | 0.053 | 0.418
    segment-Train | 0.268 | 0.437 | 0.722 | 0.447 | 0.053 | 0.395
    stock-Test | 0.476 | 0.0 | 0.925 | 0.668 | 0.0 | 0.479
    stock-Train | 0.452 | 0.573 | 0.927 | 0.608 | 0.0 | 0.481
    zoo-Test | 0.0 | 0.476 | 0.929 | 0.381 | 0.0 | 0.0
    zoo-Train | 0.0 | 0.0 | 0.48 | 0.681 | 0.0 | 0.0
    Table 42. Imputation Accuracy Score for MAR 10% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.61 | 0.625 | 0.786 | 0.703 | 0.65 | 0.54
    Australian-Train | 0.612 | 0.695 | 0.76 | 0.622 | 0.594 | 0.518
    boston-Test | 1.0 | 0.941 | 0.941 | 1.0 | 1.0 | 0.471
    boston-Train | 0.941 | 0.529 | 0.941 | 0.941 | 0.941 | 0.353
    churn-Test | 0.741 | 0.641 | 0.791 | 0.756 | 0.719 | 0.695
    churn-Train | 0.755 | 0.778 | 0.809 | 0.758 | 0.727 | 0.731
    compas-two-years-Test | 0.854 | 0.608 | 0.955 | 0.73 | 0.704 | 0.808
    compas-two-years-Train | 0.815 | 0.914 | 0.941 | 0.724 | 0.707 | 0.798
    zoo-Test | 0.789 | 0.568 | 0.888 | 0.801 | 0.722 | 0.471
    zoo-Train | 0.823 | 0.805 | 0.929 | 0.823 | 0.72 | 0.477
    Table 43. Imputation R²-score for MAR 25% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.007 | 0.0 | 0.123 | 0.053 | 0.0 | 0.01
    Australian-Train | 0.007 | 0.0 | 0.142 | 0.076 | 0.0 | 0.002
    boston-Test | 0.041 | 0.0 | 0.641 | 0.288 | 0.0 | 0.152
    boston-Train | 0.01 | 0.057 | 0.54 | 0.301 | 0.0 | 0.168
    churn-Test | 0.118 | 0.0 | 0.405 | 0.19 | 0.0 | 0.289
    churn-Train | 0.115 | 0.38 | 0.411 | 0.19 | 0.0 | 0.296
    compas-two-years-Test | 0.0 | 0.0 | 0.169 | 0.121 | 0.0 | 0.0
    compas-two-years-Train | 0.0 | 0.164 | 0.157 | 0.102 | 0.0 | 0.0
    image-Test | 0.0 | 0.0 | 0.621 | 0.0 | 0.732
    image-Train | 0.0 | 0.867 | 0.54 | 0.0 | 0.73
    page-blocks-Test | 0.026 | 0.0 | 0.628 | 0.174 | 0.0 | 0.159
    page-blocks-Train | 0.013 | 0.11 | 0.678 | 0.162 | 0.0 | 0.16
    parkinsons-Test | 0.016 | 0.0 | 0.66 | 0.536 | 0.0 | 0.501
    parkinsons-Train | 0.005 | 0.513 | 0.671 | 0.301 | 0.0 | 0.482
    segment-Test | 0.0 | 0.0 | 0.686 | 0.346 | 0.053 | 0.309
    segment-Train | 0.0 | 0.298 | 0.675 | 0.344 | 0.053 | 0.313
    stock-Test | 0.028 | 0.0 | 0.809 | 0.346 | 0.0 | 0.366
    stock-Train | 0.021 | 0.326 | 0.841 | 0.346 | 0.0 | 0.388
    zoo-Test | 0.0 | 0.0 | 0.358 | 0.564 | 0.0 | 0.003
    zoo-Train | 0.0 | 0.0 | 0.261 | 0.123 | 0.0 | 0.0
    Table 44. Imputation Accuracy Score for MAR 25% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.675 | 0.574 | 0.756 | 0.665 | 0.66 | 0.558
    Australian-Train | 0.624 | 0.646 | 0.733 | 0.642 | 0.64 | 0.522
    boston-Test | 0.889 | 0.889 | 0.921 | 0.921 | 0.921 | 0.46
    boston-Train | 0.79 | 0.548 | 0.919 | 0.919 | 0.919 | 0.468
    churn-Test | 0.752 | 0.615 | 0.781 | 0.719 | 0.704 | 0.684
    churn-Train | 0.746 | 0.696 | 0.774 | 0.708 | 0.699 | 0.676
    compas-two-years-Test | 0.701 | 0.542 | 0.904 | 0.663 | 0.66 | 0.795
    compas-two-years-Train | 0.695 | 0.778 | 0.903 | 0.655 | 0.652 | 0.807
    zoo-Test | 0.766 | 0.512 | 0.889 | 0.759 | 0.678 | 0.561
    zoo-Train | 0.755 | 0.763 | 0.909 | 0.754 | 0.685 | 0.546
    Table 45. Imputation R²-score for MAR 50% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.0 | 0.0 | 0.029 | 0.027 | 0.0 | 0.003
    Australian-Train | 0.0 | 0.015 | 0.034 | 0.036 | 0.0 | 0.001
    boston-Test | 0.0 | 0.0 | 0.468 | 0.093 | 0.0 | 0.104
    boston-Train | 0.0 | 0.021 | 0.484 | 0.096 | 0.0 | 0.162
    churn-Test | 0.019 | 0.0 | 0.26 | 0.064 | 0.0 | 0.051
    churn-Train | 0.017 | 0.14 | 0.258 | 0.064 | 0.0 | 0.048
    compas-two-years-Test | 0.0 | 0.0 | 0.115 | 0.072 | 0.0 | 0.027
    compas-two-years-Train | 0.0 | 0.029 | 0.104 | 0.067 | 0.0 | 0.032
    image-Test | 0.0 | 0.0 | 0.364 | 0.0 | 0.515
    image-Train | 0.0 | 0.176 | 0.292 | 0.0 | 0.514
    page-blocks-Test | 0.0 | 0.0 | 0.437 | 0.076 | 0.0 | 0.116
    page-blocks-Train | 0.0 | 0.103 | 0.449 | 0.079 | 0.0 | 0.11
    parkinsons-Test | 0.018 | 0.0 | 0.6 | 0.228 | 0.0 | 0.304
    parkinsons-Train | 0.017 | 0.19 | 0.608 | 0.186 | 0.0 | 0.342
    segment-Test | 0.0 | 0.0 | 0.59 | 0.144 | 0.053 | 0.098
    segment-Train | 0.0 | 0.069 | 0.593 | 0.135 | 0.053 | 0.097
    stock-Test | 0.0 | 0.0 | 0.751 | 0.159 | 0.0 | 0.118
    stock-Train | 0.0 | 0.113 | 0.725 | 0.136 | 0.0 | 0.098
    zoo-Test | 0.0 | 0.032 | 0.191 | 0.22 | 0.0 | 0.102
    zoo-Train | 0.0 | 0.0 | 0.127 | 0.0 | 0.0 | 0.0
    Table 46. Imputation Accuracy Score for MAR 50% Missingness
    Dataset | GAIN | SOFT | MF | DAE | MM | PPCA
    Australian-Test | 0.605 | 0.541 | 0.724 | 0.649 | 0.649 | 0.568
    Australian-Train | 0.621 | 0.532 | 0.72 | 0.625 | 0.625 | 0.582
    boston-Test | 0.923 | 0.65 | 0.93 | 0.937 | 0.937 | 0.469
    boston-Train | 0.928 | 0.456 | 0.944 | 0.944 | 0.944 | 0.368
    churn-Test | 0.725 | 0.589 | 0.745 | 0.697 | 0.697 | 0.595
    churn-Train | 0.732 | 0.645 | 0.753 | 0.704 | 0.704 | 0.576
    compas-two-years-Test | 0.634 | 0.504 | 0.853 | 0.691 | 0.69 | 0.67
    compas-two-years-Train | 0.629 | 0.581 | 0.842 | 0.69 | 0.689 | 0.671
    zoo-Test | 0.6 | 0.535 | 0.764 | 0.625 | 0.625 | 0.599
    zoo-Train | 0.599 | 0.585 | 0.846 | 0.688 | 0.691 | 0.613

    References

    [1]
    2023. (unpublished).
    [2]
    Deepak Adhikari, Wei Jiang, Jinyu Zhan, Zhiyuan He, Danda B. Rawat, Uwe Aickelin, and Hadi A. Khorshidi. 2022. A comprehensive survey on imputation of missing data in internet of things. ACM Comput. Surv. 55, 7, Article 133 (Dec.2022), 38 pages. DOI:
    [3]
    Ahmed Alaa and Mihaela van der Schaar. 2018. AutoPrognosis: Automated clinical prognostic modeling via bayesian optimization with structured kernel learning. In Proceedings of the 35th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 139–148. https://proceedings.mlr.press/v80/alaa18b.html
    [4]
    Mustafa Alabadla, Fatimah Sidi, Iskandar Ishak, Hamidah Ibrahim, Lilly Suriani Affendey, Zafienas Che Ani, Marzanah A. Jabar, Umar Ali Bukar, Navin Kumar Devaraj, Ahmad Sobri Muda, Anas Tharek, Noritah Omar, and M. Izham Mohd Jaya. 2022. Systematic review of using machine learning in imputing missing values. IEEE Access 10 (2022), 44483–44502. DOI:
    [5]
    Edesio Alcobaça, Felipe Siqueira, Adriano Rivolli, Luís P. F. Garcia, Jefferson T. Oliva, and André C. P. L. F. de Carvalho. 2020. MFE: Towards reproducible meta-feature extraction. J. Mach. Learn. Res. 21, 111 (2020), 1–5. http://jmlr.org/papers/v21/19-348.html
    [6]
    Rebecca R. Andridge and Roderick J. A. Little. 2010. A review of hot deck imputation for survey non-response. Int. Stat. Rev. 78, 1 (Apr.2010), 40–64.
    [7]
    Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc.: Ser. B (Methodol.) 57, 1 (1995), 289–300. DOI:
    [8]
    Dimitris Bertsimas, Colin Pawlowski, and Ying Daisy Zhuo. 2018. From predictive methods to missing data imputation: An optimization approach. J. Mach. Learn. Res. 18, 196 (2018), 1–39.
    [9]
    Felix Biessmann, Tammo Rukat, Phillipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, and David Salinas. 2019. DataWig: Missing value imputation for tables. J. Mach. Learn. Res. 20, 175 (2019), 1–6.
    [10]
    Christopher Bishop. 1998. Bayesian PCA. In Advances in Neural Information Processing Systems, M. Kearns, S. Solla, and D. Cohn (Eds.), Vol. 11. MIT Press.
    [11]
    Ramiro Daniel Camino, Christian A. Hammerschmidt, and Radu State. 2019. Improving missing data imputation with deep generative models. arXiv:1902.10666. Retrieved from http://arxiv.org/abs/1902.10666
    [12]
    James R. Carpenter and Melanie Smuk. 2021. Missing data: A statistical framework for practice. Biometr. J. 63, 5 (2021), 915–947. DOI:
    [13]
    Tlamelo Emmanuel, Thabiso Maupong, Dimane Mpoeleng, Thabo Semong, Banyatsang Mphago, and Oteng Tabona. 2021. A survey on missing data in machine learning. J. Big Data 8, 1 (27 Oct.2021), 140. DOI:
    [14]
    Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. 2020. AutoGluon-Tabular: Robust and accurate AutoML for structured data. DOI:. Retrieved from https://arxiv.org/abs/2003.06505
    [15]
    Shahla Faisal and Gerhard Tutz. 2021. Multiple imputation using nearest neighbor methods. Inf. Sci. 570 (2021), 500–516. DOI:
    [16]
    Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. Auto-Sklearn 2.0: Hands-free AutoML via meta-learning.
    [17]
    Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, and Stephan Mandt. 2020. GP-VAE: Deep probabilistic time series imputation. arxiv:1907.04155 [stat.ML]. Retrieved from https://arxiv.org/abs/1907.04155
    [18]
    João Gama and Pavel Brazdil. 1995. Characterization of classification algorithms.189–200. DOI:
    [19]
    Unai Garciarena, Roberto Santana, and Alexander Mendiburu. 2017. Evolving imputation strategies for missing data in classification problems with TPOT. arXiv:1706.01120. Retrieved from http://arxiv.org/abs/1706.01120
    [20]
    Pieter Gijsbers and Joaquin Vanschoren. 2020. GAMA: A general automated machine learning assistant. arXiv:2007.04911. Retrieved from https://arxiv.org/abs/2007.04911
    [21]
    Lovedeep Gondara and Ke Wang. 2018. MIDA: Multiple Imputation Using Denoising Autoencoders. 260–272. DOI:
    [22]
    H2O.ai. 2022. DriverlessAI. Retrieved from https://www.h2o.ai/products/h2o-driverless-ai/
    [23]
    Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature 585, 7825 (Sept.2020), 357–362. DOI:
    [24]
    Md. Kamrul Hasan, Md. Ashraful Alam, Shidhartho Roy, Aishwariya Dutta, Md. Tasnim Jawad, and Sunanda Das. 2021. Missing value imputation affects the performance of machine learning: A review and analysis of the literature (2010–2021). Inf. Med. Unlock. 27 (2021), 100799. DOI:
    [25]
    Harshad Hegde, Neel Shimpi, Aloksagar Panny, Ingrid Glurich, Pamela Christie, and Amit Acharya. 2019. MICE vs PPCA: Missing data imputation in healthcare. Inf. Med. Unlock. 17 (2019), 100275. DOI:
    [26]
    Steffen Herbold. 2020. Autorank: A Python package for automated ranking of classifiers. J.of Open Source Softw. 5, 48 (2020), 2173. DOI:
    [27]
    James Honaker, Gary King, and Matthew Blackwell. 2011. Amelia II: A program for missing data. J. Stat. Softw. 45, 7 (2011), 1–47. DOI:
    [28]
    Md Hamidul Huque, John B. Carlin, Julie A. Simpson, and Katherine J. Lee. 2018. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Med. Res. Methodol. 18, 1 (12 Dec.2018), 168. DOI:
    [29]
    Anil Jadhav, Dhanya Pramod, and Krishnan Ramanathan. 2019. Comparison of performance of data imputation methods for numeric dataset. Appl. Artif. Intell. 33 (072019), 1–21. DOI:
    [30]
    Sebastian Jäger, Arndt Allhorn, and Felix Bießmann. 2021. A benchmark for data imputation methods. Front. Big Data 4 (2021). DOI:
    [31]
    Jintao Ke, Shuaichao Zhang, Hai Yang, and Xiqun (Michael) Chen. 2019. PCA-based missing information imputation for real-time crash likelihood prediction under imbalanced data. Transportmetr. A: Transp. Sci. 15, 2 (2019), 872–895. DOI:
    [32]
    Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. DOI:. Retrieved from https://arxiv.org/abs/1312.6114
    [33]
    Alexander Kowarik and Matthias Templ. 2016. Imputation with the R package VIM. J. Stat. Softw. 74, 7 (2016), 1–16. DOI:
    [34]
    Gayaneh Kyureghian, Oral Capps, and Rodolfo M. Nayga. 2011. A Missing Variable Imputation Methodology with an Empirical Application. 313–337. DOI:
    [35]
    Ranjit Lall and Thomas Robinson. 2022. The MIDAS touch: Accurate and scalable missing-data imputation with deep learning. Polit. Anal. 30, 2 (2022), 179–196. DOI:
    [36]
    Trang T. Le, Weixuan Fu, and Jason H. Moore. 2020. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36, 1 (2020), 250–256.
    [37]
    Dan Li, Jitender Deogun, William Spaulding, and Bill Shuart. 2004. Towards missing data imputation: A study of fuzzy k-means clustering method. In Rough Sets and Current Trends in Computing, Shusaku Tsumoto, Roman Słowiński, Jan Komorowski, and Jerzy W. Grzymała-Busse (Eds.). Springer, Berlin, 573–579.
    [38]
    Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2019. CleanML: A benchmark for joint data cleaning and machine learning [experiments and analysis]. arXiv:1904.09483. Retrieved from http://arxiv.org/abs/1904.09483
    [39]
    Yuebiao Li, Zhiheng Li, and Li Li. 2014. Missing traffic data: Comparison of imputation methods. Intell. Transp. Syst. IET 8 (022014), 51–57. DOI:
    [40]
    R. J. A. Little and D. B. Rubin. 2002. Statistical Analysis with Missing Data. Wiley. 2002027006
    [41]
    Haw-minn Lu, Giancarlo Perrone, and José Unpingco. 2020. Multiple imputation with denoising autoencoder using metamorphic truth and imputation feedback. arXiv:2002.08338. Retrieved from https://arxiv.org/abs/2002.08338
    [42]
    Ms. R. Malarvizhi. 2012. KNN classifier performs better than k-means clustering in missing value imputation. IOSR J. Comput. Eng. 6 (2012), 12–15.
    [43]
    Behrooz Mamandipoor, Mahshid Majd, Monica Moz, and Venet Osmani. 2019. Blood lactate concentration prediction in critical care patients: Handling missing values. arXiv:1910.01473. Retrieved from http://arxiv.org/abs/1910.01473
    [44]
    Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. 2010. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 80 (2010), 2287–2322. http://jmlr.org/papers/v11/mazumder10a.html
    [45]
    John McCoy, Steve Kroon, and Lidia Auret. 2018. Variational autoencoders for missing data imputation with application to a simulated milling circuit. IFAC-PapersOnLine 51 (012018), 141–146. DOI:
    [46]
    Karthika Mohan and Judea Pearl. 2021. Graphical models for processing missing data. J. Am. Stat. Assoc. 116, 534 (2021), 1023–1037. DOI:
    [47]
    Carol M. Musil, Camille B. Warner, Piyanee Klainin Yobas, and Susan L. Jones. 2002. A comparison of imputation techniques for handling missing data. West. J. Nurs. Res. 24, 7 (2002), 815–829. DOI:
    [48]
    Boris Muzellec, Julie Josse, Claire Boyer, and Marco Cuturi. 2020. Missing data imputation using optimal transport. arxiv:2002.03860 [stat.ML]. Retrieved from https://arxiv.org/abs/2002.03860
    [49]
    Felix Neutatz, Binger Chen, Yazan Alkhatib, Jingwen Ye, and Ziawasch Abedjan. 2022. Data cleaning and AutoML: Would an optimizer choose to clean? Datenb.-Spektr. 22, 2 (01 Jul2022), 121–130. DOI:
    [50]
    Shigeyuki Oba, Masa-aki Sato, and Shin Ishii. 2003. Variational bayes method for mixture of principal component analyzers. Syst. Comput. Jpn. 34, 11 (2003), 55–66. DOI:
    [51]
    Tomasz Orczyk and Piotr Porwik. 2013. Influence of missing data imputation method on the classification accuracy of the medical data. J. Med. Inf. Technol. 22 (2013).
    [52]
    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035.
    [53]
    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12 (2011), 2825–2830.
    [54]
    Ricardo Cardoso Pereira, Miriam Seoane Santos, Pedro Pereira Rodrigues, and Pedro Henriques Abreu. 2020. Reviewing autoencoders for missing data imputation: Technical trends, applications and outcomes. J. Artif. Intell. Res. 69 (2020), 1255–1285.
    [55]
    Alexandre Perez-Lebel, Gael Varoquaux, Marine Le Morvan, Julie Josse, and Jean-Baptiste Poline. 2022. Benchmarking missing-values approaches for predictive models on health databases. GigaScience 11 (042022). DOI:
    [56]
    Ben Omega Petrazzini, Hugo Naya, Fernando Lopez-Bello, Gustavo Vazquez, and Lucía Spangenberg. 2021. Evaluation of different approaches for missing data imputation on features associated to genomic data. BioData Mining 14, 1 (03 Sep.2021), 44. DOI:
    [57]
    Jason Poulos and Rafael Valle. 2018. Missing data imputation for supervised learning. Appl. Artif. Intell. 32, 2 (2018), 186–196. DOI:
    [58]
    Li Qu, Li Li, Yi Zhang, and Jianming Hu. 2009. PPCA-based missing data imputation for traffic flow volume: A systematical approach. IEEE Trans. Intell. Transport. Syst. 10, 3 (2009), 512–522. DOI:
    [59]
    Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10, 11 (Aug.2017), 1190–1201. DOI:
    [60]
    Adriano Rivolli, Luís P. F. Garcia, Carlos Soares, Joaquin Vanschoren, and André C. P. L. F. de Carvalho. 2022. Meta-features for meta-learning. Knowl.-Bas. Syst. 240 (2022), 108101. DOI:
    [61]
    Breeshey Roskams-Hieter, Jude Wells, and Sara Wade. 2022. Leveraging variational autoencoders for multiple data imputation. arxiv:2209.15321 [stat.ML]. Retrieved from https://arxiv.org/abs/2209.15321
    [62]
    Donald B. Rubin. 1976. Inference and missing data. Biometrika 63, 3 (1976), 581–592.
    [63]
    Seunghyoung Ryu, Minsoo Kim, and Hongseok Kim. 2020. Denoising autoencoder-based missing value imputation for smart meters. IEEE Access PP (022020), 1–1. DOI:
    [64]
    Reza Shahbazian and Irina Trubitsyna. 2022. DEGAIN: Generative-adversarial-network-based missing data imputation. Information 13, 12 (2022). DOI:
    [65]
    Wolfram Stacklies, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. 2007. pcaMethods a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23, 9 (032007), 1164–1167.
    [66]
    Daniel J. Stekhoven and Peter Bühlmann. 2011. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 1 (102011), 112–118. DOI:
    [67]
    Jonathan A. C. Sterne, Ian R. White, John B. Carlin, Michael Spratt, Patrick Royston, Michael G. Kenward, Angela M. Wood, and James R. Carpenter. 2009. Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. Br. Med. J. 338 (2009). DOI:
    [68]
    Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2012. Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. arXiv:1208.3719. Retrieved from http://arxiv.org/abs/1208.3719
    [69]
    Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc.: Ser. B (Methodol.) 58, 1 (1996), 267–288. DOI:
    [70]
    Michael E. Tipping and Christopher M. Bishop. 1999. Probabilistic principal component analysis. J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 61, 3 (1999), 611–622. http://www.jstor.org/stable/2680726
    [71]
    Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B. Altman. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 6 (062001), 520–525. DOI:
    [72]
    Ioannis Tsamardinos and Constantin F. Aliferis. 2003. Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics(Proceedings of Machine Learning Research, Vol. R4), Christopher M. Bishop and Brendan J. Frey (Eds.). PMLR, 300–307.
    [73]
    Ioannis Tsamardinos, Paulos Charonyktakis, Georgios Papoutsoglou, Giorgos Borboudakis, Kleanthi Lakiotaki, Jean Claude Zenklusen, Hartmut Juhl, Ekaterini Chatzaki, and Vincenzo Lagani. 2022. Just add data: Automated predictive modeling for knowledge discovery and feature selection. npj Precis. Oncol. 6, 1 (16 Jun2022), 38. DOI:
    [74]
    Ioannis Tsamardinos, Elissavet Greasidou, Michalis Tsagris, and Giorgos Borboudakis. 2017. Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. arXiv:1708.07180. Retrieved from http://arxiv.org/abs/1708.07180
    [75]
    I. Tsamardinos, V. Lagani, and D. Pappas. 2012. Discovering multiple, equivalent biomarker signatures. In Proceedings of the 7th Conference of the Hellenic Society for Computational Biology and Bioinformatics (HSCBB ’12). Heraklion.
    [76]
    S. van Buuren and C. G. M. Groothuis-Oudshoorn. 1999. Flexible Multivariate Imputation by MICE. Vol. (PG/VGZ/99.054). TNO Prevention and Health, Leiden.
    [77]
    Stef van Buuren and Karin Groothuis-Oudshoorn. 2011. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45, 3 (2011), 1–67. DOI:
    [78]
    J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. 2013. OpenML : Networked science in machine learning. SIGKDD Explor. 15, 2 (2013), 49–60. DOI:
    [79]
    Akbar K. Waljee, Ashin Mukherjee, Amit G. Singal, Yiwei Zhang, Jeffrey Warren, Ulysses Balis, Jorge Marrero, Ji Zhu, and Peter D. R. Higgins. 2013. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3, 8 (2013). DOI:
    [80]
    Akbar K. Waljee, Ashin Mukherjee, Amit G. Singal, Yiwei Zhang, Jeffrey Warren, Ulysses Balis, Jorge Marrero, Ji Zhu, and Peter D. R. Higgins. 2013. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3, 8 (2013). DOI:
    [81]
    Katarzyna Woźnica and Przemysław Biecek. 2020. Does imputation matter? Benchmark for predictive models. arxiv:2007.02837 [stat.ML]. Retrieved from https://arixv.org/abs/2007.02837
    [82]
    Richard Wu, Aoqian Zhang, Ihab Ilyas, and Theodoros Rekatsinas. 2020. Attention-based learning for missing data imputation in HoloClean. In Proceedings of Machine Learning and Systems, I. Dhillon, D. Papailiopoulos, and V. Sze (Eds.), Vol. 2. 307–325.
    [83]
    Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2018. GAIN: Missing data imputation using generative adversarial nets. arXiv :1806.02920. Retrieved from https://arxiv.org/abs/1806.02920
    [84]
    Shichao Zhang. 2012. Nearest neighbor selection for iteratively kNN imputation. J. Syst. Softw. 85, 11 (2012), 2541–2552. DOI:
    [85]
    Xinmeng Zhang, Chao Yan, Cheng Gao, Bradley A. Malin, and You Chen. 2020. Predicting missing values in medical data via xgboost regression. J. Healthc. Inf. Res. 4, 4 (01 Dec.2020), 383–394. DOI:

      Published In

      ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 6
      July 2024
      760 pages
      ISSN: 1556-4681
      EISSN: 1556-472X
      DOI: 10.1145/3613684
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 12 April 2024
      Online AM: 16 February 2024
      Accepted: 19 January 2024
      Revised: 26 November 2023
      Received: 28 March 2023
      Published in TKDD Volume 18, Issue 6

      Author Tags

      1. Missing values
      2. imputation
      3. automl
      4. machine learning
      5. optimization

      Funding Sources

      • Hellenic Foundation for Research and Innovation
      • Faculty members and Researchers
