Do We Really Need Imputation in AutoML Predictive Modeling?

Authors:

George Paterakis,

Stefanos Fafalios,

Paulos Charonyktakis,

Vassilis Christophides, and

Ioannis Tsamardinos

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 6

Article No.: 147, Pages 1 - 64

https://doi.org/10.1145/3643643

Published: 12 April 2024

Abstract

Numerous real-world datasets contain missing values, whereas most Machine Learning (ML) algorithms assume complete datasets. For this reason, several imputation algorithms have been proposed to predict and fill in the missing values. Given the advances in predictive modeling algorithms tuned in an Automated Machine Learning (AutoML) setting, a question that naturally arises is to what extent sophisticated imputation algorithms (e.g., Neural Network based) are really needed, or whether we can obtain decent performance using simple methods like Mean/Mode (MM). In this article, we experimentally compare six state-of-the-art representatives of different imputation algorithmic families from an AutoML predictive modeling perspective, including a feature selection step and combined algorithm and hyper-parameter selection. We used a commercial AutoML tool for our experiments, in which we included the selected imputation methods. Experiments were run on 25 binary classification real-world incomplete datasets and on 10 binary classification complete datasets in which synthetic missing values are introduced according to different missingness mechanisms, at varying missing frequencies. The main conclusion drawn from our experiments is that the best method on average is the Denoise AutoEncoder on real-world datasets and MissForest on simulated datasets, followed closely by MM. In addition, binary indicator variables encoding missingness patterns actually improve predictive performance, on average. Last, although there are cases where Neural-Network-based imputation significantly improves predictive performance, this comes at a great computational cost and requires measuring all feature values to impute new samples.

1 Introduction

Real-world data often contain missing values, stemming from faulty sensors, non-responders in questionnaires, incomplete data entry, or other reasons. For example, in the OpenML portal, as of March 2022, 364 of the 3,487 active datasets contain missing values. Unfortunately, most Machine Learning (ML) algorithms demand complete datasets on which to operate.¹ To address this problem, a plethora of imputation algorithms, ranging from simple to very advanced, have been developed to predict the missing values and allow the remaining algorithms in the analysis pipeline to complete.

The problem of imputation has been under study for decades [28, 47, 62]. Initially, it was studied in the context of estimating the coefficients of linear models, call it estimation perspective. In contrast, we study imputation from a predictive modeling perspective where the goal is to create an accurate model to predict a specific outcome of interest (target variable) in new samples. There are important differences in approaching the subject, under these two perspectives. Under the estimation perspective, (a) some methods would impute the missing values in the training data but would not create an imputation model that is able to impute test data [15, 77]. Hence, these methods cannot be applied to predictive modeling. In addition, (b) standard guidelines [67] suggest using the outcome in imputing feature values, e.g., to differentiate imputation values in cases vs. controls. This technique is not applicable in predictive modeling where the outcome is unknown in test samples. Finally, (c) a useful metric of imputation efficacy under the estimation perspective is the imputation accuracy [29, 34], i.e., the accuracy of predicting the missing values. Imputation accuracy is important for estimation purposes but may not be indicative of the impact of imputation on predictive performance.

Under the predictive modeling perspective, several interesting questions arise as follows:

—

Are advanced predictive modeling algorithms in need of imputation beyond the simple Mean/Mode (MM) technique? A non-linear algorithm could potentially learn a rule of the sort “if a feature value equals its mean (i.e., it is missing), then do not use it but instead rely on other observed features values for prediction.” Hence, it is questionable whether imputation would provide an advantage to such an algorithm.

—

Is the need for sophisticated imputation further reduced in an Automated Machine Learning (AutoML) context, where the most appropriate combination of algorithm and hyper-parameter values is selected (combined algorithm and hyper-parameter selection (CASH) optimization) [68]?

—

Do Binary Indicator (BI) variables (1 if the value of a feature is missing and 0 otherwise) encoding the missingness patterns provide additional information to a classifier to learn a predictive model?

—

How does the feature selection step interact with imputation? Feature selection aims to reduce the number of features that enter the model without sacrificing predictive performance and leads to more interpretable models by providing insights regarding the underlying data generation. It remains open how the benefits of feature selection are impacted when we impute the missing values.

—

What is the tradeoff between the computational overhead of imputation and the improvement in predictive performance? Imputation algorithms impute all the missing values, independently of whether they contribute to the predictions of the model. In other words, imputation is unsupervised and not guided by the outcome to predict. Hence, they potentially perform a significant amount of unnecessary computations.

—

If imputation algorithms indeed improve performance, then are there any characteristics of the datasets (called meta-features) that allow us to predict the value of imputation prior to their analysis and decide whether imputation is worth the computational overhead?

To the best of our knowledge, this is the first empirical study that answers all the above research questions via an experimental evaluation over 25 binary classification real-world datasets, as well as 10 complete datasets in which synthetic missing values are introduced according to different missingness mechanisms, at varying missing frequencies. The MM imputation is used as a baseline and is compared against state-of-the-art representatives of different imputation algorithmic families: the Discriminative family, represented by MissForest [66], and the Generative family, represented by SoftImpute [44] and probabilistic principal component analysis (PPCA) [70], which exploit matrix factorization, and by Generative Adversarial Imputation Nets (GAIN) [83] and Denoise AutoEncoder (DAE) [21], which are based on Neural Networks. The imputation algorithms are integrated into the Just Add Data Bio (JADBio) AutoML platform [73], which performs CASH and includes a feature selection step.

In summary, the results show that the single best-performing algorithm is DAE for the real datasets and MissForest for the simulated datasets. For five of the six imputation algorithms studied, the inclusion of BI variables is beneficial, on average. MM, when BI variables are included and CASH is taking place, is a close competitor and places as the second-best algorithm. Advanced imputation methods do offer a significant advantage but only in a few datasets. In contrast, they require the measurements of all feature values to impute new samples, which in some way invalidates the feature selection step and leads to models of high dimensionality. In addition, they require orders of magnitude more computational time. Meta-level analysis has indicated that only one feature is correlated with the relative performance of the algorithms; unfortunately, the correlation is not statistically significant when corrected for multiple testing. More datasets and new meta-features are needed to extract patterns of when sophisticated imputation should be used over the simple MM.

Overall, in an AutoML setting where optimization is taking place and BI variables are included, MM is a reasonable option; other algorithms should be used only if feature selection is not required and computational time is of little importance relative to improving predictive performance.

The article is organized as follows. Section 2 introduces missing data mechanisms and a taxonomy of imputation families. In Section 3, we present the experimental environment, the selected datasets for evaluation, and the metrics and hyper-parameters tuned. Section 4 describes the missing data generation procedure. The experimental results for real-world data with missing data and simulated missing data are presented in Sections 5 and 6, respectively. In Section 7, we discuss the results of the meta-level analysis on real-world datasets. Related work is discussed in Section 8, followed by the contributions and lessons learned in Section 9. Finally, Section 10 presents the conclusions and limitations of the study. The detailed information about the datasets, missing value simulation setup, and experimental results are provided in Appendices A, B, and C, respectively.

2 Background and Context

2.1 Missingness Mechanisms

The concept of a missing mechanism [62] formalizes the generation process of missing data. In this respect, the BI are modeled as random variables and assigned a distribution. There are three types of underlying mechanisms that generate missing data, namely, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For formal definitions of these mechanisms, readers are referred to Reference [40]. Intuitively, MCAR implies that the probability of a value missing is independent of the actual value, the other observed quantities, and any latent variables. MAR implies that the missingness only depends on the observed data (so it can be predicted). MNAR refers to the case that the missing values are related to both the observed and unobserved variables, including the missing value itself. When missingness is MNAR, it is in principle and in general not possible to impute the missing values in a way that follows the unknown underlying data distribution.²

An illustrative example is given in Figure 1, which is adapted from Reference [46]. The missingness mechanisms can be described using a causal graph. Let us assume A and B are observed random variables and O a latent variable. Each variable is depicted as a node of the graph. Assume that A and B have a direct connection to O, which is the variable (node) of interest. The \(R_o\) node is a mask variable that denotes the missingness inserted into O, which causes \(O^*\). \(O^*\) is a surrogate of O but with missing values inserted in the positions specified by \(R_o\). As seen in Figure 1(a), MCAR missing values do not depend on any of the variables A, B, O. In contrast, missingness depends on B for the MAR mechanism, and on O itself for the MNAR mechanism, as seen in Figure 1(b) and (c).
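The three mechanisms can be made concrete with a small simulation. The sketch below is only illustrative and rests on several assumptions that are not taken from the article: the toy data (O generated from A and B plus noise), the rule that drops the largest values of the driving variable, and the helper names `top_k_mask`, `simulate_masks`, and `apply_mask` are all hypothetical. The article's own simulation setup is described in Section 4 and Appendix B.

```python
import random

def top_k_mask(driver, rate):
    """Mark as missing the entries whose driver value is among the largest `rate` fraction."""
    k = round(rate * len(driver))
    if k == 0:
        return [False] * len(driver)
    threshold = sorted(driver, reverse=True)[k - 1]
    return [v >= threshold for v in driver]

def simulate_masks(b, o, rate=0.3, seed=0):
    """Return one mask R_o (True = missing) per mechanism for the variable O."""
    rng = random.Random(seed)
    return {
        # MCAR: every entry is dropped with the same probability, ignoring A, B, and O.
        "MCAR": [rng.random() < rate for _ in o],
        # MAR: missingness in O is driven by the observed variable B.
        "MAR": top_k_mask(b, rate),
        # MNAR: missingness in O is driven by the value of O itself.
        "MNAR": top_k_mask(o, rate),
    }

def apply_mask(o, mask):
    """Build O* from O by replacing masked positions with None."""
    return [None if m else v for v, m in zip(o, mask)]

# Toy data (an assumption for illustration): O depends on A and B plus noise.
rng = random.Random(42)
a = [rng.gauss(0, 1) for _ in range(1000)]
b = [rng.gauss(0, 1) for _ in range(1000)]
o = [x + y + rng.gauss(0, 0.5) for x, y in zip(a, b)]

masks = simulate_masks(b, o)
o_star = {name: apply_mask(o, m) for name, m in masks.items()}
```

Under the MNAR mask of this sketch, every surviving value of O is smaller than every dropped one, so the observed part of \(O^*\) is a biased sample of O. This is the situation in which imputation cannot, in general, recover the underlying distribution.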

Fig. 1.

2.2 An Imputation Family Taxonomy

There are numerous imputation algorithms and approaches in the literature, and we do not attempt a full review. Readers are encouraged to explore comprehensive surveys available in the field for a more in-depth understanding [2, 4, 12, 13, 24]. Imputation approaches can be partitioned into various distinct families/groups of methods. A taxonomy is attempted in Figure 2. First, imputed values can be decided based on only the feature with the missing value (Univariate imputation) or several features (multivariate imputation). The former methods include Mean/Median/Max imputation for continuous data and Mode imputation for categorical data. Multivariate methods can be partitioned into Iterative and Distance-Based, also known as Hot-Deck methods [6]. Distance-based methods employ a distance or a similarity metric for samples to find neighbors or cluster them. A commonly used algorithm in this category is the K-nearest neighbors imputation (KNNi) [71], which imputes values based on the neighbors of the sample with missing values. K-means-based methods cluster the samples before imputation [37].

Fig. 2.

Iterative methods start with a simple initial guess (e.g., using MM imputation) and, in each iteration, try to improve the imputed values. We further split iterative methods into Discriminative and Generative. Discriminative methods build a predictive model per feature with missing values, given the other features in the dataset. This model is used to predict the missing values of the corresponding feature, in each iteration. The Discriminative family can either utilize a (generalized) linear model or a non-linear model. Linear discriminative methods include Multivariate Imputation by Chained Equations (MICE) [76]. Non-linear discriminative methods include the MissForest algorithm [66], which employs Random Forests, and Datawig [9], which can impute continuous, categorical, and text data by employing different loss functions according to the missing features’ datatype.

Generative methods try to model the joint distribution of the data and use the generative model to impute values. They can be split into two categories: methods that employ matrix factorization and methods that use neural networks. The matrix-factorization family includes low-rank matrix decomposition methods: First, missing values are imputed with an initial guess, and the matrix is decomposed (factorized) and used to predict the missing values. Imputation is improved in each cycle via expectation-maximization steps. Examples of this family include PPCA [70], SVDImpute [71], bPCA [10], and SoftImpute [44]. Such algorithms scale better w.r.t. the number of features than MICE or MissForest, which train a different model for each feature with missing values in each iteration. Recently, neural networks have also been tried as generative models. These algorithms are essentially non-linear alternatives to matrix factorization. These methods start with an initial guess and then train a neural network that learns the joint distribution. This family includes methods based on AutoEncoders (AE), such as DAE [21, 35, 41] and Variational Autoencoder (VAE) [17, 45, 61]. It also includes methods based on generative adversarial networks, such as GAIN [64, 83]. Finally, HoloClean, a data cleaning tool, implements an attention-based neural network for imputation, named Aimnet [82]. A detailed comparison of imputation methods is given in Section 8. In Section 2.4, we explain the rationale for our choice to include a subset of the aforementioned imputation methods in our empirical study.

2.3 Description of the Selected Imputation Methods

In this section, we present the main characteristics of the imputation methods given in Table 1 that we included in our testbed. In the analysis of their computational complexity, n denotes the number of samples, m the number of features, \(\#comp\) the number of principal components, \(\#sing\) the number of singular values, and \(\#trees\) the number of trees.

Table 1.

Algorithm	Model Family	Base Model	Learning Procedure	Categorical Handling	Approx. Complexity
MM	Univariate	—	Non-iterative	Native	O( \(n \cdot m\) )
SOFT	Generative	SVD	Iterative	One-hot-encoding	O( \(k \cdot n \cdot m \cdot \#sing\) )
PPCA	Generative	PCA	Iterative	One-hot-encoding	O( \(k \cdot n \cdot m \cdot \#comp\) )
MF	Discriminative	RF	Iterative	Native	O( \(k \cdot m^2 \cdot n \cdot \log (n) \cdot \#\text{trees}\) )
GAIN	Generative	GAN	Iterative	Ordinal-Encoding	—
DAE	Generative	AE	Iterative	One-hot-encoding	—

Table 1. Comparison of the Imputation Methods and Their Characteristics

Abbreviations: n is the number of samples, m is the number of features, k is the number of iterations, \(\#trees\) stands for the number of trees in the forest (a hyper-parameter), \(\#comp\) stands for the number of principal components employed in the matrix factorization, and \(\#sing\) for the number of singular values of the SVD.

2.3.1 Mean/Mode.

MM is the most common imputation method in AutoML tools and is included as the baseline methodology. It is an instance of the univariate imputation family. In the MM algorithm, missing values are imputed with the mean in the training data of the corresponding feature if it is continuous and the mode (most frequent value) if it is discrete. MM is the most computationally efficient method as it needs only \(O(n \cdot m)\) to impute the whole dataset. A variation of MM imputation is mentioned in medical literature [51] where missing values of a sample are imputed based on the mean/mode of the class to which it belongs. However, in the case of predictive modeling, this approach becomes problematic as the class of a sample is unknown during inference, as discussed in Section 1.
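A minimal sketch of MM imputation follows. It assumes rows are plain Python lists with `None` marking a missing value, and the function names `fit_mean_mode` and `transform` are hypothetical rather than part of JADBio or any other tool. The point it illustrates is that the fill values are learned on the training data only and then reused unchanged on new samples.

```python
from collections import Counter
from statistics import mean

def fit_mean_mode(train_rows, categorical):
    """Learn one fill value per column from the training data only.

    `categorical` is the set of column indices treated as discrete.
    """
    fill = []
    for j in range(len(train_rows[0])):
        observed = [row[j] for row in train_rows if row[j] is not None]
        if j in categorical:
            # Mode: the most frequent observed value of a discrete feature.
            fill.append(Counter(observed).most_common(1)[0][0])
        else:
            # Mean of the observed values of a continuous feature.
            fill.append(mean(observed))
    return fill

def transform(rows, fill):
    """Replace every missing entry with the fill value of its column."""
    return [[fill[j] if v is None else v for j, v in enumerate(row)] for row in rows]

train = [[1.0, "a"], [3.0, None], [None, "a"], [5.0, "b"]]
test = [[None, None], [2.0, "b"]]

fill = fit_mean_mode(train, categorical={1})
imputed_test = transform(test, fill)
```

Both passes touch each cell once, which is consistent with the \(O(n \cdot m)\) cost stated above.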

2.3.2 MissForest.

MissForest (MF) is a discriminative iterative method based on Random Forests [66]. First, the missing values are imputed by Mean/Mode. Subsequently, for each feature with missing values serving as the outcome, the algorithm trains a random forest on the rest of the features and uses it to predict the outcome’s missing values. After imputing all missing values, the algorithm uses the (now) complete dataset to warm-start the new iteration until a stopping criterion is met. MF is one of the slowest methods as it requires building a forest per feature for a number of iterations. The approximate worst case is \(O(k \cdot m^2 \cdot n \cdot \log (n) \cdot \#\text{trees})\) . MF encounters scalability issues in datasets with more than 50 features. In addition, MF needs to store a forest for each feature, which creates model-storing issues. To avoid such issues, in our experiments we limit the maximum allowed depth of the Random Forests.

2.3.3 Probabilistic PCA.

PPCA is a statistical iterative method [39]. In each iteration, a principal component analysis (PCA) is performed, which is improved in the next step using maximum likelihood estimation [58] and assuming a multivariate Gaussian distribution of the data. To impute a new sample, the optimal set of principal components found from training is used to identify the missing values that maximize the joint probability of the sample. Categorical features are one-hot-encoded before applying PPCA and then are inverse transformed after PPCA returns the imputed data. PPCA is one of the fastest methods, scaling linearly with the number of samples, the number of features, and the number of principal components computed. The approximate complexity for PPCA is \(O(k\cdot n\cdot m\cdot \#\text{comp})\).

2.3.4 SoftImpute.

SoftImpute (SOFT) is a statistical iterative method [44]. It starts by initializing the missing values with the mean. Then, it solves the optimization problem on the completed matrix using a soft-thresholded SVD and repeats this step until a stopping criterion is met. Categorical features are one-hot-encoded before applying SOFT and then are inverse transformed after SOFT returns the imputed data. SOFT, like PPCA, is very fast, utilizing an EM approach. The approximate time complexity is \(O(k\cdot n \cdot m \cdot \#\text{sing})\).
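The iteration described above can be sketched with NumPy as follows. This is a simplified illustration, not the implementation used in the experiments: it computes a full SVD in every iteration rather than only the leading singular values, and the shrinkage parameter `lam`, the tolerance `tol`, and the function name `soft_impute` are assumptions chosen for the example.

```python
import numpy as np

def soft_impute(x, lam=0.5, max_iter=200, tol=1e-5):
    """Fill the NaN entries of x by iterating a soft-thresholded SVD."""
    observed = ~np.isnan(x)
    # Initial guess: column means of the observed entries.
    filled = np.where(observed, x, np.nanmean(x, axis=0))
    for _ in range(max_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-thresholding: shrink every singular value by lam, floor at zero.
        s_shrunk = np.maximum(s - lam, 0.0)
        low_rank = (u * s_shrunk) @ vt
        # Keep observed entries fixed; refresh only the missing ones.
        new_filled = np.where(observed, x, low_rank)
        change = np.linalg.norm(new_filled - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = new_filled
        if change < tol:  # stopping criterion: relative change is negligible
            break
    return filled

# Toy rank-1 matrix with roughly 20% of the entries removed at random.
rng = np.random.default_rng(0)
full = rng.normal(size=(30, 1)) @ rng.normal(size=(1, 8))
mask = rng.random(full.shape) < 0.2
x = full.copy()
x[mask] = np.nan

imputed = soft_impute(x, lam=0.1)
```

Each iteration costs one SVD of the completed matrix, which is where the dependence on \(\#\text{sing}\) in the complexity estimate comes from when only the leading singular values are computed.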

2.3.5 Denoise Autoencoder.

DAE is a deep learning algorithm based on autoencoders [21]. The Denoise Autoencoder is based on an overcomplete implementation and a dropout layer. DAE projects the input data to a higher-dimensional subspace where the missing data are recovered by the decoder. The categorical data are one-hot-encoded before DAE is applied. Then the one-hot-encoded data are transformed back to the original representation. The complexity of the DAE is mostly measured by the number of epochs needed for the algorithm to impute the dataset accurately and the hidden layers’ size and depth. For further information see Section 3.8.

2.3.6 Generative Adversarial Imputation Nets.

GAIN is an adaptation of the GAN framework [83]. A generator is used to impute missing data based on the observed data. The discriminator tries to determine which data are observed and which are imputed. The goal of the generator is to provide an accurate imputation whereas the goal of the discriminator is to distinguish between the observed and missing data. The two neural networks are trained in an adversarial process. Categorical data are turned into ordinal features and normalized between 0 and 1. After applying the GANs we revert the categorical data to the original representation by doing the inverse procedure. The complexity of GAIN is mostly bottlenecked by the number of iterations needed to train the GANs. See Section 3.8 for more details.
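The categorical pre- and post-processing step can be illustrated as below. This is a hedged sketch only: the alphabetical category order, the rounding of a generated value to the nearest category in the inverse step, and the names `fit_categories`, `encode`, and `decode` are assumptions, since the text does not specify these details.

```python
def fit_categories(values):
    """Collect the observed categories of one feature in a fixed (sorted) order."""
    return sorted({v for v in values if v is not None})

def encode(value, cats):
    """Map a category to its ordinal index, normalized to the interval [0, 1]."""
    if value is None:
        return None
    return cats.index(value) / max(len(cats) - 1, 1)

def decode(x, cats):
    """Inverse procedure: rescale, round to the nearest index, and clip to a valid category."""
    idx = round(x * max(len(cats) - 1, 1))
    idx = min(max(idx, 0), len(cats) - 1)
    return cats[idx]

column = ["low", "mid", None, "high", "low"]
cats = fit_categories(column)                # ['high', 'low', 'mid']
encoded = [encode(v, cats) for v in column]  # [0.5, 1.0, None, 0.0, 0.5]
```

A value produced by the generator, such as 0.8, would then be mapped back with `decode(0.8, cats)`. Note that an ordinal encoding imposes an order on categories that may not exist in the data, unlike the one-hot encoding used for the other methods.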

2.3.7 Binary Indicators.

BI is not an imputation method but a feature construction method. Specifically, for each feature \(F_j\) with missing values, we construct a new feature, call it \(I_j\), whose kth entry \(I_{jk}\) indicates whether the value \(F_{jk}\) at the kth sample of feature j is missing or not. The idea is to encode in \(I_j\) the missingness pattern. BIs may help the classifier and allow it to learn whether to trust value \(F_{jk}\) for prediction. BI can complement any imputation method. BI’s complexity is \(O(n \cdot m)\); however, we should note that it increases the complexity of subsequent stages of the ML pipeline by increasing the dataset’s dimensionality by a maximum factor of 2. We note that imputation methods extended with BI do not use the BIs to impute the missing data; the BIs are only appended to the imputed dataset.
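As a minimal sketch of this construction, the code below appends one 0/1 column per feature that has missing values in the training data. The list-of-lists representation with `None` for missing entries and the names `indicator_columns` and `add_indicators` are assumptions for illustration.

```python
def indicator_columns(train_rows):
    """Indices j of the features F_j that have at least one missing value in training."""
    n_features = len(train_rows[0])
    return [j for j in range(n_features) if any(row[j] is None for row in train_rows)]

def add_indicators(imputed_rows, original_rows, cols):
    """Append I_jk (1 if F_jk was missing, else 0) to each already-imputed row."""
    out = []
    for imputed, original in zip(imputed_rows, original_rows):
        flags = [1 if original[j] is None else 0 for j in cols]
        out.append(list(imputed) + flags)
    return out

original = [[1.0, None, 7.0], [None, 2.0, 8.0], [3.0, 4.0, 9.0]]
imputed = [[1.0, 3.0, 7.0], [2.0, 2.0, 8.0], [3.0, 4.0, 9.0]]  # e.g., after MM imputation

cols = indicator_columns(original)
extended = add_indicators(imputed, original, cols)
```

The indicators are computed from the original data and concatenated after imputation, so the imputation step itself never sees them. At most one column is added per original feature, which gives the maximum factor of 2 in dimensionality.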

2.4 Rationale of the Selection of Algorithms

MM imputation is selected as a baseline, being one of the most commonly used methods. MissForest is selected over MICE as a representative of multivariate iterative imputation methods, based on the results on imputation accuracy presented in Reference [79]. Encoding missingness information using BI is also experimentally evaluated, as it performed better than other methods in Reference [43]. Distance-based methods are excluded for two reasons. First, they need to memorize the full dataset to produce imputations, as they do not learn a model. Second, according to other empirical studies, k-means and KNN imputation are outperformed by MissForest [30, 42, 66].

As representatives of matrix completion-based methods, we chose PPCA, which has shown the best performance according to previous empirical studies [25, 31, 58]. SoftImpute was also selected based on the experimental results presented in Reference [85]. GAIN and DAE were also included as representatives of neural-network-based methods, as they excel in several studies [11, 54]. The former uses a generative adversarial network to learn the probability distribution from which to impute, and the latter uses an autoencoder. VAE and Aimnet were not included because of their inferior or comparable results relative to GAIN and MM, respectively [11, 38]. Finally, Datawig [9] is not included as, according to the results reported in Reference [9], it is outperformed by MissForest for both continuous and categorical data, while it incurs the high cost of fitting one neural network per feature with missing values.

2.5 How Is Imputation Treated in AutoML Platforms

AutoML platforms employ imputation methods, as well as modeling algorithms that directly treat missing values as a separate category. The current versions of JADBio (version v1.4) and AutoSklearn [16] employ MM by default, while DataRobot³ may also include BI variables. AutoSklearn allows the user to specify additional imputation methods to optimize over as part of the pipeline. TPOT employs median imputation for all missing features [36]. Auger.AI⁴ does iterative regression or mean imputation for numerical features depending on the dataset size and creates a new category for the categorical features. BigML⁵ by default does not impute the missing values; they are handled internally by its predictive models, which are based on trees only. Autoprognosis optimizes the ML pipeline over a variety of missing data imputation algorithms. Specifically, it employs MICE, MissForest, Bootstrapped Expectation-Maximization imputation, Soft-Impute, and MM [3]. DriverlessAI by H2O creates a new value to express missingness when the XGBoost, LightGBM, and RuleFit algorithms are used. For generalized linear models, it performs MM imputation, while for TensorFlow models missing values are treated as outliers [22]. GAMA [20] does not impute missing values by default. Autogluon [14] uses median imputation for continuous features and introduces a new “Unknown” category for categorical features.

3 Experimental Setup

We now present the design choices for the experimental setup and the comparative evaluation.

3.1 Datasets

Incomplete Real Datasets. There are currently 364 datasets with missing values in the OpenML repository [78]; we restricted our selection to binary classification datasets. We selected 25 such datasets in an effort to cover a range of dataset characteristics. The datasets contain both continuous and discrete features. The number of features ranges from 7 to 69, the sample size ranges from 155 to 31,406, the prevalence of the minority class ranges from 0.06 to 0.48, the number of features with at least 1% missing values ranges from 1 to 32, and, finally, the percentage of missing values ranges from 1.11% to 71.64%. Table 6 presents the characteristics of the datasets, along with their OpenML id.

Table 2.

Algorithm	Hyper-parameter	Value
Mean/Mode	—	—
MissForest	n-trees	250
	maxDepth	20, 30
	maxLeafNodes	30
SoftImpute	variance-explained	50%, 70%, 90%
PPCA	variance-explained	50%, 70%, 90%
DAE	dropout	0.25, 0.4, 0.5
	batch-size	64
	\(\theta\)	5, 7, 10
	epochs	500
GAIN	alpha	0.1, 1, 10
	hint-rate	0.5, 0.9
	batch-size	64
	epochs	10,000

Table 2. The Set of Values Tried for Each Hyper-Parameter Tuned

For each algorithm, all combinations of values for its hyper-parameters shown were tried and combined with all other choices for feature selection and modeling by JADBio. There are 48 combinations of imputation algorithms and hyper-parameter values. The default hyper-parameter values are underlined.

Table 3.

Base Imp.Method	p-value	q-value
MF	0.036	0.182
PPCA	0.064	0.182
GAIN	0.091	0.182
MM	0.148	0.222
SOFT	0.31	0.375
DAE	0.552	0.552

Table 3. p-values of the Matched t-test and the q-values after FDR Correction (Sorted)

Only MF has p-value < 0.05. GAIN and PPCA have p-value < 0.1. Setting the q-value threshold to 0.25 leads to accepting the hypothesis that BIs are beneficial for four algorithms (MF, GAIN, PPCA, MM), expecting 25% (one of four) of these discoveries to be false on average.
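The q-values in Table 3 can be approximately reproduced from the listed p-values. The sketch below assumes the Benjamini-Hochberg procedure; the table caption only says "FDR correction", so the choice of procedure is an assumption, and the function name `benjamini_hochberg` is illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

p = {"MF": 0.036, "PPCA": 0.064, "GAIN": 0.091, "MM": 0.148, "SOFT": 0.31, "DAE": 0.552}
q = dict(zip(p, benjamini_hochberg(list(p.values()))))

# Methods for which the BI hypothesis is accepted at a q-value threshold of 0.25.
accepted = [name for name, value in q.items() if value <= 0.25]
```

With the rounded p-values of the table, this yields q-values of 0.182 for MF, PPCA, and GAIN, 0.222 for MM, and 0.552 for DAE. For SOFT it gives 0.372 rather than the tabulated 0.375, a difference that is presumably due to the rounding of the p-value 0.31.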

Table 4.

Name	Category	Description
inst_to_attr	General	Samples to features ratio
Minority Class %	General	% of minority class
nr_attr	General	Number of features
nr_inst	General	Number of samples
n_num	General	Number of numerical features
n_cat	General	Number of categorical features
% NA	Missing	% of missing values in data
% samples /w NA	Missing	% of samples with missing values
% features /w NA	Missing	% of features with missing values
% NA/Feat. /w (NA 1+%)	Missing	Mean % of missing values per feature with more than 1% missing
# components 50%	Clustering	Number of components that explain 50% of data variance
# components 70%	Clustering	Number of components that explain 70% of data variance
# components 90%	Clustering	Number of components that explain 90% of data variance
Slh(k=2)	Clustering	Mean Silhouette Coefficient of all samples when using 2 clusters
Slh (k=3)	Clustering	Mean Silhouette Coefficient of all samples when using 3 clusters
Slh (k=4)	Clustering	Mean Silhouette Coefficient of all samples when using 4 clusters

Table 4. Meta-features Used in the Meta-level Analysis

The first column contains the name of the meta-feature, the second column denotes the category of the meta-feature, and the third column provides a brief explanation of the meta-feature.

Table 5.

Study	#Datasets	Mechanism	% Missing values	#Imp. methods	NNs	BI	System	FS	Tuning	#Models	Eval	Metric	Meta
[38]	6B	Nat	7–84%	3N, 2C, 2M.	No	No	Adhoc	No	Imp+Pred	7	R-TT (70-30)	ACC-F1	No
[81]	13B	Nat	0.6–33.6%	2N,1C,4M	No	No	Adhoc	No	No	5	TT (80-20)	AUC-F1	No
[57]	2B	MC,MR	0–40% \(^{**}\)	7C	No	No	Adhoc	No	No	3	TT(66.6-33.3)	ACC	No
[8]	5B,5R	MC-MN	10–50%	8M	No	No	Adhoc	No	Imp	4	R-TT(50-50)	ACC-R2	No
[30]	31B, 21R, 17M	MC,MR,MN	1–50% \(^{*}\)	6M	Yes	No	Adhoc	No	Imp	2	CV(3-5)	RMSE-F1	No
[55]	10B, 3R	Nat	—	7M	No	Yes	Adhoc	No	Pred	3	NCV	ACC	No
[19]	23B,M	MC	7%	6M	No	No	AutoML	Yes	—	10	R-TT(75-25)	ACC	No
[49]	5B	Nat	—	4N, 2C, 1M	No	No	AutoML	—	Imp+Pred	Ensemble	CV(5)	B-ACC	No
Ours	35B	N,MC,MR	1–72%	6M	Yes	Yes	AutoML	Yes	Imp+Pred	4	TT(50-50)	ACC-F1-AUC	Yes

Table 5. An Overview of Related Work on Predictive Modeling

Most benchmarks use either datasets with simulated missing values or datasets with native missing values, but not both. Abbreviations: the # symbol means number, and “—” denotes that the paper does not mention any details about the topic. In the #Datasets column, B, M, and R denote binary, multiclass, and regression datasets, respectively. In the Mechanism column, Nat, MC, MR, and MN denote Native, MCAR, MAR, and MNAR. In the #Imp. methods column, N, C, and M denote numerical, categorical, and mixed imputation methods, respectively. The NNs column denotes whether Neural Network imputation methods were included. BI means that methods were extended with BIs, FS means feature selection was included in the pipeline, and Tuning represents whether the study tuned the imputation methods (Imp), the predictive models (Pred), or both (Imp+Pred). The Eval column presents the evaluation methodology: R denotes repeated; TT denotes a train-test split, with the percentages of the train and test sets in parentheses; CV denotes cross-validation, with the number of folds in parentheses; and NCV denotes nested cross-validation. The Metric column denotes the metric used for the evaluation: ACC is classification accuracy, B-ACC is balanced accuracy, F1 is F1-score, RMSE is the root mean squared error, and AUC is the Area under the ROC curve. Meta presents whether a study has conducted a meta-level analysis. \(^{*}\) One of the features was made missing. \(^{**}\) Missing values were only generated on the train data.

Table 6.

Dataset	ID	Samples	Features	#Numerical	#Categorical	Missing %	Imbalance ratio	#Feat with miss>1%	%Missing/Feature
analcatdata_reviewer	1,008	379	7	0	7	51.56	0.43	7	51.56
audiology	999	226	69	0	69	2.03	0.25	6	23.23
anneal	989	898	38	16	22	64.98	0.24	29	85.15
autoHorse	840	205	25	17	8	1.11	0.40	4	6.46
braziltourism	957	412	8	7	1	2.91	0.23	2	10.68
bridges	328	107	11	4	7	6.03	0.41	7	9.35
cjs	1,024	2,796	34	32	2	71.64	0.24	28	86.97
colic	27	368	22	7	15	23.80	0.37	19	27.50
colleges_aaup	897	1,161	15	13	2	1.47	0.30	6	3.68
cylinder-bands	6,332	540	39	24	15	4.74	0.42	23	7.93
dresses-sales	23,381	500	12	1	11	13.92	0.42	5	33.04
eucalyptus	990	736	19	14	5	3.21	0.29	6	9.95
hepatitis	55	155	19	6	13	5.67	0.21	11	9.56
hungarian	231	294	13	12	1	20.46	0.36	5	52.93
kdd_el_nino-small	839	782	8	8	0	7.45	0.35	4	14.90
mushroom	24	8,124	22	0	22	1.39	0.48	1	30.53
pbcseq	802	1,945	17	13	4	3.43	0.50	6	9.71
primary-tumor	1,003	339	17	0	17	3.90	0.25	2	32.74
profb	470	672	9	5	4	19.84	0.33	2	89.29
schizo	466	340	14	12	2	17.52	0.48	11	22.30
sick	38	3,772	29	7	22	5.54	0.06	7	22.96
soybean	1,023	683	35	0	35	9.78	0.13	32	10.68
stress	42,167	199	12	8	4	8.29	0.20	7	14.22
vote	56	435	16	0	16	5.63	0.39	16	5.63
water-treatment	940	527	36	36	0	2.86	0.15	22	4.53

Table 6. Binary Classification Real-World Datasets Used in the Comparative Evaluation

The table reports the dataset name, OpenML ID, number of samples, number of features, number of numerical and categorical features, percentage of missing values in the whole dataset, imbalance ratio, number of features with more than 1% missing values, and percentage of missing values per feature (%Missing/Feature).

Complete Datasets: We selected 10 complete datasets from OpenML, where we introduce and simulate missingness. The number of features ranges from 9 to 135, the sample size ranges from 101 to 5,473, and the prevalence of the minority class ranges from 10% to 49%. Table 7 contains these values for each dataset, along with their OpenML id.

Table 7.

Dataset	ID	#Samples	#Features	#Numerical	#Categorical	Minority Class %
Australian	40,981	690	14	8	6	0.44
boston	853	506	13	12	1	0.41
churn	40,701	5,000	20	16	4	0.14
compas-two-years	42,193	5,278	13	7	6	0.47
image	40,592	2,000	135	135	0	0.21
page-blocks	1,021	5,473	10	10	0	0.1
parkinsons	1,488	195	22	22	0	0.25
segment	958	2,310	19	19	0	0.14
stock	841	950	9	9	0	0.49
zoo	965	101	16	1	15	0.41

Table 7. Binary Complete Datasets in Which We Inject Missing Values

The table reports the dataset name, ID, the number of samples, number of features, the imbalance ratio, and the number of numerical and categorical variables.

3.2 Evaluation Task and Metric

We note that the evaluation concerns only binary classification. The main metric of predictive performance is the Area Under the ROC Curve (AUC). To save space and make interpretation easier, classification accuracy and F1-score results are reported separately in Appendix C. Each dataset is split into a 50% training set and a 50% hold-out test set used only for performance evaluation. Our experiments were conducted only once, due to the computational complexity of the experimental procedure (see Section 3.6). We applied statistical tests to compensate for the lack of repeated experiments, which supports drawing reliable conclusions from the experimental results.

3.3 AutoML Environment

To experiment with different imputation algorithms when CASH optimization is taking place, we employed the JADBio AutoML platform [73]. JADBio is a commercial product (a version of JADBio with basic functionality is freely available) but was offered to us for research purposes. JADBio includes feature selection as part of the ML pipeline and, thus, it can be used to study the effect of feature selection on imputation.

A quick description of JADBio’s architecture now follows. For each dataset to analyze, an internal knowledge base system, called Algorithm and Hyper-Parameter Space selection (AHPS) in Reference [73], selects the feature construction, preprocessing, feature selection, and modeling algorithms to try, along with a set of values for their hyper-parameters. The AHPS also selects the configuration evaluation protocol, e.g., 10-fold cross-validation, repeated cross-validation, or hold-out to estimate the performance of each configuration and select the winning one. The knowledge in AHPS is engineered by experienced analysts but also induced by meta-level learning algorithms.

The choices of the AHPS are based on the meta-features of the dataset (e.g., sample size, number of features), as well as the user preferences. For example, an algorithm that does not scale to the number of samples in the current dataset will not be selected by AHPS. The choice of the evaluation protocol also depends on the meta-features: for a typical-sized dataset, JADBio may run a 10-fold cross-validation; for a large balanced dataset, a hold-out; and for a small-sample or an imbalanced dataset, a repeated cross-validation protocol.

Subsequently, JADBio executes all configurations, effectively performing a grid search for CASH optimization. However, JADBio includes pruning heuristics that may drop a configuration in the early folds of cross-validation if it is not deemed promising, departing from a pure grid search strategy [74]. Once all configurations have executed, the final model is built on all available data using the winning configuration.

The final performance of the model produced with the winning configuration is the cross-validated AUC adjusted for the bias incurred due to multiple tries (called “winner’s curse” in statistics). This adjustment is conceptually equivalent to adjusting p-values in multiple hypothesis testing. JADBio uses the BBC-CV algorithm for the performance estimate adjustment [74]. In Reference [73], experiments on 360 omics datasets of small sample size show that this estimation protocol returns slightly conservative out-of-sample AUC performances of the returned model. Nevertheless, for the purposes of this article, JADBio’s performance estimation was not used; instead, the performances on the 50% held-out set are reported.

Regarding the settings of JADBio employed in this set of experiments, we note the following. One of the user preferences indirectly controls the execution time and the number of configurations to try. It has the settings Preliminary, Typical, and Extensive, with Extensive trying more configurations and performing a more thorough optimization. All subsequent experiments were run using the Preliminary setting to make the computational requirements manageable. The number of configurations may vary between datasets depending on their meta-features, but in our experiments, it ranges from 900 to over 1,000. The training protocol of JADBio depends on the sample size, the class imbalance, and other factors. For typical-size datasets, JADBio uses a repeated 10-fold cross-validation with the number of repeats ranging from 1 to 20. A heuristic procedure stops repetitions of cross-validation if no progress is detected. Overall, JADBio uses estimation protocols that execute each configuration between 10 and 200 times per dataset to choose the winning configuration and produce a model.

JADBio optimizes over the following set of algorithms. For feature selection, JADBio uses the Lasso [69] and a variant of the SES algorithm [75] with an upper bound on the number of conditional independence tests to perform. For classification, it optimizes over Ridge Logistic Regression, Decision Tree, Random Forests, and Support Vector Machines with polynomial, linear, and radial basis kernels.

To evaluate imputation algorithms, we embedded them into the JADBio configurations as the second step, after the standardization of continuous features and before feature selection, using the API provided. It is important to note that configurations are cross-validated as an atom, and hence, learning to impute is based only on the training data. This is necessary to avoid overestimating the performances of configurations and, correspondingly, of the imputation methods. Each imputation method returns an imputation model that is used to impute the test data before modeling is applied. It is worth noting that, even if the feature selection step selects a small subset S of features, when some values of S are missing in the test set, the imputation model may impute them based on other features. Hence, even if the predictive model requires just the features in S, the predictive pipeline may require more features. Specifically, all multivariate algorithms selected in the article require all features to impute. Hence, the predictive pipeline always requires all features when these algorithms are employed, even with feature selection.
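The requirement that imputation be learned from the training data only can be illustrated with a minimal sketch. It uses scikit-learn rather than JADBio, whose API is not shown in this article; the helper name `fit_and_score`, the choice of logistic regression, and the toy data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit every pipeline step on the training data only, then score the
    held-out data with AUC. (Hypothetical helper, not JADBio code.)"""
    pipeline = make_pipeline(
        StandardScaler(),  # ignores NaNs when fitting, keeps them on transform
        SimpleImputer(strategy="mean", add_indicator=True),  # mean + binary indicators
        LogisticRegression(max_iter=1000),
    )
    pipeline.fit(X_train, y_train)
    scores = pipeline.predict_proba(X_test)[:, 1]
    return pipeline, roc_auc_score(y_test, scores)


# Toy data (made up): feature 0 carries the signal, features 1 and 2 have gaps.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X[::5, 1] = np.nan
X[::7, 2] = np.nan

# 50% training, 50% hold-out, mirroring the split used in the evaluation.
pipeline, auc = fit_and_score(X[:100], y[:100], X[100:], y[100:])
```

Because the imputer sits inside the pipeline, the means used to fill the hold-out rows are those of the training rows, so no information from the test set leaks into the imputation model.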

3.4 Imputation Algorithms Implementations

We used JADBio version 1.4.0 for our experiments. The MM and BI methods were already implemented by the development team of the tool. For PPCA and SoftImpute, we relied on third-party implementations in R from the ‘pcaMethods’ package version 1.64.0 [65] and the ‘softImpute’ package version 1.4.1, respectively. We implemented MissForest in Python 3.8.4 using the IterativeImputer and RandomForest models from scikit-learn 1.0.1 [53]. PyTorch version 1.7.1 [52] was used for the implementation of GAIN and DAE. We adapted the DAE implementation found at https://github.com/Harry24k/MIDA-pytorch to closely follow the description of DAE by the original authors in Reference [21]. We employed the GAIN implementation from https://github.com/dhanajitb/GAIN-Pytorch.
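A MissForest-style imputer built from scikit-learn's `IterativeImputer` and a random forest can be sketched as follows. This is not the code used in the experiments: the factory name `make_missforest_like_imputer`, the depth and leaf limits, the number of iterations, and the restriction to numerical features are all assumptions for illustration; only the use of 250 trees comes from the article.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer


def make_missforest_like_imputer(n_trees=250, max_depth=10, max_leaf_nodes=None, seed=0):
    """Iterative imputation with a random forest as the per-feature regressor.
    max_depth and max_leaf_nodes are illustrative values, not the tuned ones."""
    forest = RandomForestRegressor(
        n_estimators=n_trees,
        max_depth=max_depth,
        max_leaf_nodes=max_leaf_nodes,
        random_state=seed,
    )
    return IterativeImputer(estimator=forest, max_iter=5, random_state=seed)


# Toy numerical data (made up); feature 3 is correlated with feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=80)
X_missing = X.copy()
X_missing[::6, 3] = np.nan
X_missing[::9, 1] = np.nan
X_train, X_test = X_missing[:40], X_missing[40:]

# Fewer trees than in the article, only to keep the example fast.
imputer = make_missforest_like_imputer(n_trees=20, max_depth=5, max_leaf_nodes=16)
imputer.fit(X_train)                         # learn the imputation model on training data
X_test_imputed = imputer.transform(X_test)   # reuse it on unseen data
```

Limiting tree depth and leaf count keeps the stored forests small, which is the memory concern raised in the article; the price is a coarser per-feature regressor.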

3.5 Machine Specifications

The predictive performance experiments of the article were conducted on a Fedora-powered VM using an 8-core AMD Threadripper 3970x at 3.7 GHz with 12 GB RAM. The neural networks were trained using CPUs. The execution time results reported were measured on an eight-core AMD Ryzen-3600x at 4.6 GHz with 16 GB RAM and Windows 11 OS.

3.6 Computational Resources Employed

The experiments reported in this article consumed more than 41 days of CPU time to train more than 80,000 configurations.

3.7 Availability of Code

The code is available in the GitHub repository https://github.com/mensxmachina/Imputation_in_AutoML. The repository contains scripts for the plots, the datasets, the meta-level analysis, and the basic implementation of each imputation algorithm.

3.8 Exploring the Hyper-parameter Space of Imputation Algorithms

In the experiments, 24 hyper-parameter (hp) value sets were tried for the imputation algorithms: MM (1 hp set), MissForest (2 hp sets), SoftImpute (3 hp sets), PPCA (3 hp sets), DAE (9 hp sets), and GAIN (6 hp sets). The values tried for each hyper-parameter are shown in Table 2. These choices were based on the defaults and suggestions of each algorithm’s authors. These 24 hp sets were coupled with all other choices of JADBio, multiplying the number of configurations normally tried by 24. In subsequent experiments, each of these 24 hp sets is run on the original dataset, as well as on the dataset with the BI features included, leading to 48 different combinations.

MM has no parameters and therefore does not need tuning. For MissForest, we train RF models with 250 trees, which offers higher imputation accuracy according to Reference [66]. However, we restrict the maximum depth of the trees and the maximum number of leaf nodes, because storing the trained model caused memory issues (see Section 2.3.2). SoftImpute and PPCA require selecting the number of principal components to use as a hyper-parameter. The majority of papers in the literature fail to report how these methods were tuned, which led us to develop the following heuristic: We select as many components as are required to explain \(x\%\) of the data variance. The values of x are shown in Table 2 as the values of “variance-explained.”

The default hyper-parameters are used for DAE, with the exception of the dropout layer and the hidden layers’ dimensions. The range of the dropout layer is based on Reference [63], while the theta value is tuned within a neighborhood of the authors’ suggested default value. In the current implementation, we have three hidden layers for the encoder and the decoder. For each successive layer in the encoder, \(\theta\) hidden layer nodes are added, and the hyperbolic tangent is used as the activation function, as it produces better results for small and medium-sized datasets [21]. The model is trained using Stochastic Gradient Descent with an adaptive learning rate with a time decay factor of 0.99 and Nesterov’s accelerated gradient.

The GAIN architecture consists of three hidden layers for both the discriminator and the generator, with the Rectified Linear Unit as the activation function. For GAIN, we tune two hyper-parameters: alpha and hint rate. These hyper-parameters are considered the most important for GAIN. Alpha balances the loss between the discriminator and the generator, while the hint rate is responsible for the training of the discriminator. Both DAE and GAIN are trained for the number of epochs suggested by their authors and use the sigmoid activation function for the output layer.
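The variance-explained heuristic for choosing the number of components can be sketched as below. The article does not state how the variance is computed on incomplete data, so filling missing entries with column means before the decomposition is an assumption of this sketch, as is the function name `n_components_for_variance`.

```python
import numpy as np


def n_components_for_variance(X, variance_explained):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `variance_explained` (a fraction in (0, 1])."""
    X = np.asarray(X, dtype=float)
    # Assumption: missing entries are mean-filled before computing the spectrum.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(np.isnan(X), col_means, X)
    X_centered = X_filled - X_filled.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratios = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratios)
    # First index whose cumulative ratio reaches the threshold (small tolerance
    # guards against floating-point round-off), converted to a count.
    k = int(np.searchsorted(cumulative, variance_explained - 1e-12)) + 1
    return min(k, cumulative.size)


# Toy matrix (made up): the two columns explain 90% and 10% of the variance.
X_toy = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
k_80 = n_components_for_variance(X_toy, 0.80)
k_95 = n_components_for_variance(X_toy, 0.95)
```

In the toy matrix one component already explains 90% of the variance, so a threshold of 80% keeps one component while 95% requires both.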

4 Simulating Missing Data

To experiment with varying percentages of missing values, as well as different missingness mechanisms, we simulated the presence of missing values in the complete datasets presented in Appendix A.2.

4.1 Simulating Missing Completely at Random Data

Under MCAR, values are missing with a given probability (percentage) independently of any other factors, such as the value itself or the values of other features. To simulate missing values at realistic missingness percentages, we sampled 64 real-world datasets from the OpenML repository with varying characteristics (see Section B.1). We then computed the 25%, 50%, and 75% quantiles of the missingness percentages. Features with less than 1% of missing values were excluded from the calculation, as their missing values probably stem from typos or other non-systematic sources. The quantile values turn out to be about 10%, 25%, and 50% of missingness. These quantile values are then used to vary the missingness percentages in both the MCAR and MAR simulation experiments. We then introduced missing values with the given percentages in the 10 complete datasets described in Section 3.1. Even though it is trivial to introduce MCAR missing values, for consistency we employed the code available in Reference [48], which is also used for the MAR simulations below. To simulate the MCAR mechanism, the software discards values uniformly at random from the dataset at the specified missingness percentage.
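A minimal sketch of MCAR injection is given below. It is not the software of Reference [48]; the function name `inject_mcar` and the choice to remove an exact number of cells (rather than flipping an independent coin per cell) are assumptions of this illustration.

```python
import numpy as np


def inject_mcar(X, missing_fraction, seed=0):
    """Return a copy of X in which a fraction of the cells, chosen uniformly
    at random over the whole matrix, is replaced by NaN."""
    rng = np.random.default_rng(seed)
    X_missing = np.array(X, dtype=float, copy=True)
    n_cells = X_missing.size
    n_missing = int(round(missing_fraction * n_cells))
    # Sample cell positions without replacement so the fraction is exact.
    flat_idx = rng.choice(n_cells, size=n_missing, replace=False)
    X_missing.flat[flat_idx] = np.nan
    return X_missing


# The three missingness levels used in the simulations, on a made-up matrix.
X_complete = np.ones((100, 10))
fractions = {f: np.isnan(inject_mcar(X_complete, f)).mean() for f in (0.10, 0.25, 0.50)}
```

Since the positions do not depend on any feature value, the resulting mask is independent of the data, which is the defining property of MCAR.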

4.2 Simulating Missing at Random Data

Under MAR, values are missing with a probability (percentage) that depends (is conditional) on other observed features, i.e., \(P(I_j = 1|F_{k_1}, \ldots , F_{k_m})\). To realistically simulate data under MAR, one needs to decide (a) the number of features upon which the probability depends, (b) the functional form of the conditional probability function, and (c) the set \(\lbrace F_{k_1}, \ldots , F_{k_m}\rbrace\). To answer (a), we needed a realistic estimate of the number m of features in the conditional probability. To that end, in each of the 10 real-world binary datasets of Table 9, we randomly selected one feature with missing values as the target feature and then performed predictive modeling using JADBio, including feature selection.⁶ These experimental results suggest that, on average, a feature with missing values depends on 12 features, so we set \(m=12\). Subsequently, for each \(F_j\) we randomly selected a set of m other features with uniform probability. Finally, for the functional form of P, we used a logistic regression model: \(P(I_j = 1|F_{k_1}=f_1, \ldots , F_{k_m}=f_m) = \frac{1}{1+e^{-\langle w, f\rangle }}\), where w is a vector of coefficients randomly drawn from a Gaussian distribution, f is the vector of values of the features \(F_{k_l}\), and \(\langle \cdot , \cdot \rangle\) denotes the inner product. For the simulation, the software of Reference [48] was also used. The software allows the simulation of MAR missing data, as described above, with prespecified missingness percentages. The same percentages as in the MCAR case were used.
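The logistic MAR mechanism can be sketched as follows. This is not the software of Reference [48]. In particular, the formula above has no intercept, and the article does not say how the prespecified missingness percentage is reached; the intercept found by bisection here is an assumption of this sketch, as are the function names and the use of standard-normal coefficients.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mar_missing_mask(X, target, m, missing_fraction, seed=0):
    """Boolean mask of rows where feature `target` becomes missing, with
    probability given by a logistic model on m other, randomly chosen features."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    candidates = np.delete(np.arange(n_features), target)
    predictors = rng.choice(candidates, size=min(m, candidates.size), replace=False)
    w = rng.standard_normal(predictors.size)   # random coefficients
    linear = X[:, predictors] @ w              # inner product <w, f> per row
    # Assumption: shift the logits by an intercept so that the expected
    # missingness matches `missing_fraction`; found by bisection.
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sigmoid(linear + mid).mean() < missing_fraction:
            lo = mid
        else:
            hi = mid
    prob = sigmoid(linear + (lo + hi) / 2.0)
    return rng.random(n_samples) < prob, predictors


# Made-up standardized data with 15 features; feature 0 is the one made missing.
X_demo = np.random.default_rng(1).normal(size=(2000, 15))
mask, predictors = mar_missing_mask(X_demo, target=0, m=12, missing_fraction=0.25, seed=1)
X_mar = X_demo.copy()
X_mar[mask, 0] = np.nan
```

The mask depends only on the values of the other features, never on the value being removed, which is what distinguishes MAR from MNAR.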

5 Comparative Evaluation On Real-World Datasets with Missing Values

The 25 real-world datasets with missing values were analyzed with JADBio, optimizing over configurations that include the imputation algorithms selected and their hyper-parameter values.

5.1 Binary Indicators Improve the Predictive Performance

First, we partition the results achieved when optimizing over any single imputation algorithm. Specifically, for each imputation algorithm, on a given dataset, the best AUC was selected over all configurations that include the specific algorithm. We will refer to this best AUC simply as the AUC of a given imputation algorithm in all subsequently reported results. Figure 3(a) shows the difference in AUC performance when Binary Indicators are used versus when they are excluded. As we can see, MM and GAIN have the largest average increase, by 0.0074 AUC and 0.0056 AUC, respectively. MF when extended with BIs shows an average AUC increase of 0.0046, while PPCA shows an increase of 0.0038 AUC. The lowest average improvement is achieved by SOFT, which improves by 0.00022 AUC. Contrary to the above observations, DAE is the only method that does not benefit from the addition of BIs, showing a negligible decrease of 0.0004 AUC when BIs are included. Figure 3(b) offers a complementary view: it illustrates the number of datasets, per imputation method, where the inclusion of BIs is beneficial to downstream performance. We observe that, for every imputation method, including BIs is beneficial in most datasets. SOFT improves in 19 of 25 datasets, and GAIN, MM, and PPCA in 17 datasets each. Finally, including BIs leads to improvements for DAE and MF in 16 and 15 datasets, respectively.

Fig. 3.

To determine the statistical significance of the results, we performed a paired t-test for each algorithm, with the null hypothesis H0 being that BI+base has worse performance than the base method. The resulting p-values were converted to q-values with the Benjamini/Hochberg [7] method to control for multiple testing. Table 3 shows the results. Using a q-value threshold of 0.25, there are four statistically significant results, leading us to accept the alternative hypotheses that MF, GAIN, PPCA, and MM improve their performance when BIs are present. At the level of \(q=0.25=\frac{1}{4}\), this implies that, in the worst case, we expect one of these four discoveries to be false. While the inclusion of BIs may, in the worst case, double the dimensionality of the dataset, based on the above results, we would recommend their inclusion when the above imputation algorithms are employed.
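The conversion of p-values to Benjamini-Hochberg q-values can be sketched as below. The function name `benjamini_hochberg` and the example p-values are made up for illustration; they are not the values of Table 3.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # p_(i) * n / i for the i-th smallest p-value.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: each q-value is the minimum over all larger ranks.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(q_sorted, 1.0)
    return q


# Hypothetical p-values for four comparisons.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
n_discoveries = int(np.sum(q_values <= 0.25))
expected_false = 0.25 * n_discoveries  # upper bound on expected false discoveries
```

With a threshold of 0.25 and four discoveries, the expected number of false discoveries is at most \(0.25 \times 4 = 1\), which is the reading given in the text.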

5.2 BI+DAE Is the Best Imputation Method in Real-world Data

Figure 4 shows the average ranking achieved by each algorithm using the Autorank tool [26] (a lower ranking is better). To avoid clutter, and based on the results of Section 5.1, we only show results when BIs are included. The horizontal black bars in the graph connect methods whose ranks are not statistically significantly different, according to a non-parametric Friedman test and a post hoc Nemenyi test.

Fig. 4.

Results show that BI+DAE is the highest ranking algorithm with an average rank of 2.84, followed by BI+MM with a 2.94 average ranking, BI+MF with 3.5, and BI+GAIN with 3.56, although their rank difference is not statistically significant at the 0.05 level. The two lowest-ranked methods are BI+SOFT and BI+PPCA, with 3.74 and 4.42 average rankings, respectively. BI+DAE’s rank is statistically significantly lower compared to BI+PPCA.
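Average ranks of this kind can be computed as in the sketch below. It uses SciPy's `rankdata` directly rather than Autorank, it omits the Friedman and Nemenyi tests, and the AUC matrix is made up; the function name `average_ranks` is likewise an assumption.

```python
import numpy as np
from scipy.stats import rankdata


def average_ranks(auc):
    """auc has shape (n_datasets, n_methods). Within each dataset the method
    with the highest AUC gets rank 1; ties share the average of their ranks."""
    auc = np.asarray(auc, dtype=float)
    ranks = np.vstack([rankdata(-row) for row in auc])  # negate: higher AUC, lower rank
    return ranks.mean(axis=0)


# Made-up AUCs for three datasets (rows) and three methods (columns).
auc_toy = np.array([
    [0.90, 0.80, 0.70],
    [0.60, 0.90, 0.80],
    [0.90, 0.90, 0.50],
])
mean_ranks = average_ranks(auc_toy)
```

Ranking within each dataset before averaging makes the comparison insensitive to how hard each dataset is, which is why rank-based tests are commonly used for comparisons across datasets.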

5.3 BI+MM Is the Best Method When Considering the Efficiency–Effectiveness Tradeoff

We now study the tradeoff between predictive effectiveness and computational efficiency of the algorithms. In Figure 5(a) we use BI+MM as the baseline. A point (execution run) corresponds to AutoML predictive modeling on a dataset with a given imputation algorithm. This results in \(5 \times 25 = 125\) points. The x-axis shows the effectiveness ratio, defined as the ratio of the AUC corresponding to the point divided by the corresponding performance of BI+MM. Similarly, the y-axis shows the efficiency ratio, defined as the training time of the point divided by the corresponding time of BI+MM. Hence, points in the first/fourth quadrant (top-left/bottom-right) correspond to runs where BI+MM dominates/is-dominated by other algorithms on the same datasets in both time and AUC. Notice that the scale of the y-axis is logarithmic. Larger points correspond to the mean value of an imputation method over all datasets.
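The two ratios and the resulting dominance categories can be made concrete with a short sketch. The function name `classify_point`, the category labels, and the treatment of ties (a ratio of exactly 1 counts as neither better nor faster) are assumptions of this illustration, and the numbers in the example are made up.

```python
def classify_point(auc, train_time, baseline_auc, baseline_time):
    """Return (effectiveness ratio, efficiency ratio, category) of one run
    relative to the baseline run on the same dataset."""
    effectiveness = auc / baseline_auc        # > 1: compared method predicts better
    efficiency = train_time / baseline_time   # > 1: compared method trains slower
    better = effectiveness > 1.0
    faster = efficiency < 1.0
    if better and faster:
        category = "baseline_dominated"
    elif better:
        category = "better_but_slower"
    elif faster:
        category = "worse_but_faster"
    else:
        category = "baseline_dominates"
    return effectiveness, efficiency, category


# Hypothetical run: lower AUC than the baseline and 100 times slower to train.
ratio_auc, ratio_time, label = classify_point(0.80, 100.0, 0.85, 1.0)
```

Counting the category labels over all runs yields tallies of the kind reported for the 125 points of Figure 5.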

Fig. 5.

In total, BI+MM is inferior in terms of predictive performance in 42 cases (in 16 of the 25 datasets) and, unsurprisingly, is never dominated in terms of both AUC performance and efficiency at the same time. The other algorithms are orders of magnitude slower than BI+MM. However, in 83 of 125 points, BI+MM is both more efficient and more effective than the compared method. Only BI+DAE achieves, on average, a higher AUC than BI+MM. All the other imputation methods are, on average, slower to train and worse in terms of predictive performance.

Figure 5(b) shows the same results with BI+DAE as the baseline. In contrast to BI+MM above, BI+DAE dominates the other imputation methods in both predictive performance and training time in only 35 of 125 combinations. In 41 of 125 cases, it provides better predictive performance but at a higher computational cost. In 15 points, BI+DAE is faster but has lower predictive performance than the compared imputation method. Finally, in 34 cases it is dominated in both metrics. In conclusion, if a single imputation algorithm is to be used, BI+MM arguably provides the best tradeoff between computational time and predictive performance.

5.4 Best Imputation Subset for Maximizing AUC Performance Is {BI+MM, BI+DAE}

In this section, we examine the results from a different perspective, trying to answer the question: What is the minimal-size subset of algorithms to try to achieve close-to-maximum AUC performance? To answer this question, we have implemented a simple greedy algorithm, where we assume the analyst starts with the subset \(\lbrace\) BI+MM \(\rbrace\) as an efficient baseline and adds algorithms to consider. In each iteration, the algorithm that leads to the largest AUC improvement of the subset when added is selected for inclusion. The maximum AUC performance is the maximum AUC achieved on each dataset when all imputation methods are included in the optimization pipeline, averaged across all datasets.
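The greedy selection can be sketched as below. The AUC matrix is made up, the function name `greedy_subset` is hypothetical, and computing the relative AUC as a ratio of dataset averages is one reading of the description above rather than a confirmed detail of the implementation.

```python
import numpy as np


def greedy_subset(auc, names, start):
    """auc has shape (n_datasets, n_methods) and holds the best AUC of each
    method on each dataset. Returns [(method added, relative AUC of subset)]."""
    auc = np.asarray(auc, dtype=float)
    max_auc = auc.max(axis=1).mean()  # all methods available, averaged over datasets
    chosen = [names.index(start)]
    history = [(start, auc[:, chosen].max(axis=1).mean() / max_auc)]
    remaining = [j for j in range(len(names)) if j not in chosen]
    while remaining:
        # Average AUC of the subset if each remaining method were added.
        candidates = [auc[:, chosen + [j]].max(axis=1).mean() for j in remaining]
        best_pos = int(np.argmax(candidates))
        best = remaining.pop(best_pos)
        chosen.append(best)
        history.append((names[best], candidates[best_pos] / max_auc))
    return history


# Made-up AUCs: rows are datasets, columns are methods.
names = ["BI+MM", "BI+DAE", "BI+MF"]
auc_toy = np.array([
    [0.80, 0.90, 0.70],
    [0.90, 0.85, 0.95],
    [0.70, 0.70, 0.60],
])
history = greedy_subset(auc_toy, names, start="BI+MM")
```

In this toy example the baseline alone reaches about 94% of the maximum, adding the second method raises it to about 98%, and the full set reaches 100% by construction.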

The results are shown in Figure 6 and quantitatively in Table 11. The x-axis shows the imputation algorithms in order of addition to the subset. For each algorithm, several hyper-parameter combinations are tried and combined with all other feature selection and modeling choices by AutoML. Hence, the total number of configurations tried is multiplied by this factor. At each tick, the multiplication factor for the whole set is depicted in the parenthesis next to the name of the algorithm added to the set in that step. For example, BI+MM has no hyper-parameters ( \(1\times\) ), while BI+DAE has 9 hyper-parameter sets, so the multiplicative factor of the set \(\lbrace BI+MM, BI+DAE\rbrace\) is \(10\times\) . The y-axis is the average (over all datasets) relative AUC achieved when performance is optimized over all algorithms and their hyper-parameters in the corresponding subset.

Table 8.

Dataset	ID	Samples	Features	#Numerical	#Categorical	Missing %	Minority Class %	#Feat miss>1%	%Missing/Feature	Type
adult	179	48,842	14	6	8	0.95	0.24	3	4.41	Binary
albert	41,147	425,240	78	78	0	13.64	0.50	43	24.73	Binary
analcatdata_reviewer	1,008	379	7	0	7	51.56	0.43	7	51.56	Binary
anneal	989	898	38	16	22	64.98	0.24	29	85.15	Binary
aps_failure	41,138	76,000	170	170	0	8.35	0.02	160	8.83	Binary
ASP-POTASSCO-classification	41,705	1,294	142	139	3	9.94	0.02	138	10.23	MultiClass
ASP-POTASSCO-regression	41,704	14,234	142	138	4	9.94	0.00	138	10.23	Regression
audiology	999	226	69	0	69	2.03	0.25	6	23.23	Binary
autoHorse	840	205	25	17	8	1.11	0.40	4	6.46	Binary
braziltourism	957	412	8	7	1	2.91	0.23	2	10.68	Binary
bridges	328	107	11	4	7	6.03	0.41	7	9.35	Binary
Census-Income-KDD	42,750	199,523	41	13	28	5.08	0.06	7	29.72	Binary
cjs	1,024	2,796	34	32	2	71.64	0.24	28	86.97	Binary
Code_Smells_Data_Class	43,079	86,467	66	66	0	49.99	0.00	62	53.20	Regression
colic	27	368	22	7	15	23.80	0.37	19	27.50	Binary
colleges	42,727	7,063	47	31	16	31.42	0.00	30	49.19	Regression
colleges_aaup	897	1,161	15	13	2	1.47	0.30	6	3.68	Binary
colleges_usnews	930	1,302	33	32	1	18.22	0.47	25	23.96	Binary
cylinder-bands	6,332	540	39	24	15	4.74	0.42	23	7.93	Binary
Domainome	41,533	1,623	9,838	9,838	0	82.17	0.35	9,688	83.44	Binary
dresses-sales	23,381	500	12	1	11	13.92	0.42	5	33.04	Binary
echoMonths	222	130	9	7	2	8.29	0.00	6	12.31	Regression
eucalyptus	990	736	19	14	5	3.21	0.29	6	9.95	Binary
fishcatch	232	158	7	7	0	7.87	0.00	1	55.06	Regression
fps-in-video-games	42,737	425,833	44	33	11	6.94	0.00	12	25.44	Regression
hepatitis	55	155	19	6	13	5.67	0.21	11	9.56	Binary
house_prices_nominal	42,563	1,460	79	36	43	6.04	0.00	16	29.74	Regression
hungarian	231	294	13	12	1	20.46	0.36	5	52.93	Binary
ipums_la_97-small	993	7,019	60	34	26	11.42	0.04	18	38.06	MultiClass
ipums_la_98-small	381	7,485	60	34	26	11.59	0.01	17	40.91	MultiClass
ipums_la_99-small	378	8,844	60	34	26	9.71	0.02	18	32.36	MultiClass
jungle_chess_2pcs_endgame_rat_panther	41,002	5,880	46	18	28	1.30	0.23	6	10.00	MultiClass
KDD98	42,343	82,318	477	358	119	11.30	0.12	87	61.98	Binary
KDDCup09-Upselling	1,112	50,000	15,000	13,391	1,609	3.35	0.07	608	82.59	Binary
KDDCup09_churn	42,759	50,000	230	192	38	69.78	0.07	205	78.28	Binary
kdd_coil_1	567	316	11	8	3	1.61	0.00	3	4.85	Regression
kdd_el_nino-small	839	782	8	8	0	7.45	0.35	4	14.90	Binary
kick	41,162	72,983	32	17	15	6.39	0.12	5	40.51	Binary
lymphoma_2classes	1,101	45	4,026	4,026	0	3.28	0.49	2,116	6.25	Binary
meta	566	528	21	18	3	4.55	0.00	3	31.82	Regression
MiceProtein	40,966	1,080	81	77	4	1.60	0.10	8	14.69	MultiClass
Midwest_Survey_nominal	42,532	2,778	27	1	26	1.95	0.03	5	10.51	MultiClass
mlr_ranger_rng	42,458	278,863	14	8	6	3.56	0.00	1	49.69	Regression
mlr_svm_rng	42,456	540,576	13	7	6	9.38	0.00	2	60.95	Regression
Moneyball	41,021	1,232	14	11	3	20.87	0.00	4	73.05	Regression
mushroom	24	8,124	22	0	22	1.39	0.48	1	30.53	Binary
NewFuelCar	41,506	36,203	17	17	0	1.46	0.00	1	24.78	Regression
okcupid-stem	42,734	50,789	19	3	16	15.97	0.10	12	25.28	MultiClass
pbc	524	418	18	17	1	16.47	0.00	12	24.66	Regression
pbcseq	802	1,945	17	13	4	3.43	0.50	6	9.71	Binary
porto-seguro	42,206	595,212	37	25	12	3.84	0.04	5	28.21	Binary
primary-tumor	1,003	339	17	0	17	3.90	0.25	2	32.74	Binary
profb	470	672	9	5	4	19.84	0.33	2	89.29	Binary
rl	41,160	31,406	22	22	0	10.45	0.10	8	28.71	Binary
road-safety	42,803	363,243	66	61	5	9.10	0.05	41	14.62	MultiClass
SAT11-HAND-runtime-regression	41,980	4,440	116	113	3	5.27	0.00	10	61.15	Regression
schizo	466	340	14	12	2	17.52	0.48	11	22.30	Binary
sick	38	3,772	29	7	22	5.54	0.06	7	22.96	Binary
soybean	1,023	683	35	0	35	9.78	0.13	32	10.68	Binary
speeddating	40,536	8,378	122	61	61	2.87	0.00	109	3.17	Binary
stress	42,167	199	12	8	4	8.29	0.20	7	14.22	Binary
us_crime	315	1,994	127	126	1	15.48	0.00	24	81.91	Regression
vote	56	435	16	0	16	5.63	0.39	16	5.63	Binary
water-treatment	940	527	36	36	0	2.86	0.15	22	4.53	Binary

Table 8. Datasets Used for Missing Value Simulation Experimental Setup

The table contains the dataset name, id, number of samples, number of features, number of categorical and numeric features, Missingness percentage in the whole dataset, Minority Class %, the number of features with missing values over 1%, the missingness percentage over features with missing values, and, finally, the outcome type of each dataset.

Table 9.

Dataset	Missing Feature (target)	#Selected Features
aps_failure	cn_006	25
colleges_aaup	Average_salary-full_professors	5
colleges_usnews	Out-of-state_tuition	18
dresses-sales	V3	7
eucalyptus	PMCno	6
hepatitis	ALBUMIN	6
hungarian	thalach	3
mushroom	stalk-root	16
pbcseq	presence_of_asictes	7
speeddating	attractive	24

Table 9. Summary of the Feature Selection Experiments for MAR Simulation

On average, a missing feature depends on 12 other features.

Table 10.

Table 11.

Fig. 6.

BI+MM, by itself, accounts for 98.69% of the maximum AUC. When BI+DAE is added to the mix, relative performance reaches 99.69%. The next best algorithm to add is BI+MF; 100% of AUC is reached when invoking all imputation algorithms. In summary, the addition of BI+MF, BI+PPCA, BI+GAIN, and BI+SOFT provides only marginal gains over the set \(\lbrace\) BI+MM, BI+DAE \(\rbrace\) .

5.5 The Interplay between Feature Selection and Imputation

Feature selection algorithms try to reduce the number of features that enter the model without sacrificing predictive performance. Feature selection is often the primary task in analysis, while the predictive model may be just a side-benefit. For example, a medical doctor may be more interested in the quantities that determine the risk of disease and may reveal new medical knowledge, rather than the risk model itself. Feature selection leads to more interpretable models that provide intuition into the domain. In fact, the solution to the feature selection problem is directly linked to the causal model that underlies data generation [72]. In other circumstances, it is important to reduce the cost of measuring the features to provide predictions. The cost may be measured in monetary units, the computational cost to compute the features or risk to a patient from medical procedures that measure these features.

Figure 7 shows the impact of feature selection for each imputation algorithm on the real datasets. It shows the drop in AUC performance when feature selection is enforced vs. not enforced (i.e., optimizing over all configurations) in the final configuration. For each algorithm, the drop is about two to three AUC points. In other extensive experiments with hundreds of complete (no missing values), small-sample, high-dimensional omics datasets, JADBio has been shown to reduce the number of features by a factor of 4,000 without a noticeable drop in AUC performance [73]. The results provide evidence that feature selection may be more challenging in the presence of missing values.

Fig. 7.

In any case, the problem of including both imputation and a feature selection step in the ML pipeline is that imputation invalidates feature selection, in some sense. Let us explain this statement with an example. Let us assume the pipeline that produces the final model consists of MF imputation, Lasso feature selection, and RF predictive modeling. Let us assume that Lasso selects the features \(\lbrace A, B, C\rbrace\) . If any of these values (say the value of A) is missing on a new sample, then the MF imputation model will impute it using a Random Forest for A that relies on some other subset of features. If any of those are also missing, then MF will invoke its Random Forests for each value that is missing, and so on, recursively. Hence, if there are missing values on the test samples, one may need to measure an arbitrarily large feature subset, not just the selected features. The storage required to apply the ML pipeline includes both the RF predictive model and the MF model, which in turn includes a Random Forest for every feature that may need imputation.
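The recursive growth of the required feature set can be illustrated with a small sketch. The dependency map `imputer_inputs`, the feature names beyond A, B, and C, and the function name `features_involved` are hypothetical; the sketch abstracts each per-feature imputation model as a list of the features it reads.

```python
from collections import deque


def features_involved(selected, imputer_inputs, missing):
    """Features the pipeline has to consult for one sample: the selected ones,
    plus the inputs of the imputation model of every missing feature reached."""
    involved = set(selected)
    queue = deque(f for f in selected if f in missing)
    while queue:
        feature = queue.popleft()
        for parent in imputer_inputs.get(feature, ()):
            if parent not in involved:
                involved.add(parent)
                if parent in missing:   # its value must itself be imputed
                    queue.append(parent)
    return involved


selected = ["A", "B", "C"]                                       # chosen by feature selection
imputer_inputs = {"A": ["D", "E"], "D": ["F"], "E": ["G"]}       # hypothetical dependencies
complete_sample = features_involved(selected, imputer_inputs, missing=set())
incomplete_sample = features_involved(selected, imputer_inputs, missing={"A", "D"})
```

For a sample with no missing values only the three selected features are needed, whereas a sample missing A and D already involves six features, twice the size of the selected set.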

6 Comparative Evaluation On Datasets with Simulated Missing Values

This section focuses on comparing imputation methods on datasets with generated missing values. To that end, we compare the predictive performance of each imputation method under various missingness mechanisms and percentages. Additionally, we study the effect of feature selection as the missingness increases. The figures in this section illustrate the more general MAR case. The MCAR results are included in Appendix C.2 and are qualitatively similar. Finally, results regarding imputation accuracy can be found in Appendix C.7.

6.1 BI+MF Is the Best Imputation Method in MCAR and MAR Simulated Missing Data

Figure 8 presents the AUC performance results (see Figure 16(b) for MCAR results). The AUC performance shown is the difference between the AUC at the specified missingness percentage and the AUC on the complete dataset. First, we note that, as the missingness percentage increases, the average predictive performance of every imputation method decreases, as expected. As we can see, increasing the missingness from 25% to 50% leads to a sizable performance drop for all imputation methods. Specifically, the methods based on linear dimensionality reduction, namely BI+PPCA and BI+SOFT, are the most affected by this increase in missing values.

Fig. 8.

Fig. 9.

Fig. 10.

Fig. 11.

Fig. 12.

Fig. 13.

Fig. 14.

Fig. 15.

Fig. 16.

Figures 8 and 16(b) illustrate that in both MAR and MCAR data, respectively, MissForest combined with Binary Indicators is, on average, the best-performing method. Additionally, we note that PPCA and SOFT are the two worst imputation methods, especially as the missingness percentage increases. Table 12(a) in Appendix C.2 contains the quantitative results in detail and a detailed discussion on the ranking of the algorithms.

Table 12.

6.2 The Best Imputation Subset for Maximizing AUC Performance Is {BI+MM, BI+MF}

We now identify the minimal-size algorithm subset with close-to-optimal performance for simulated missing data. We again use the simple greedy algorithm introduced in Section 5.4 and apply it to the MCAR and MAR simulated data results. The results for MAR are in Figure 9, which is similar to Figure 6. The quantitative results are shown in Table 13(b). As shown in the figure, the \(\lbrace BI+MM, BI+MF\rbrace\) subset can score over 99% of the total max AUC for MAR data and would be the suggested set of algorithms to run in such problems. The results for MCAR are in Appendix C.2.6. They are qualitatively similar. The results for simulated missing values are somewhat different from those on the real datasets: BI+MF scores better than BI+DAE, which now ranks third. Possible reasons are discussed in Section 9.

Table 13.

6.3 BI+MM Provides the Best Tradeoff between Effectiveness and Efficiency

Figures 10 and 16(a) show the effectiveness vs. efficiency tradeoff of the algorithms. The aforementioned figures are similar to Figure 5(a) above for the real datasets. We repeat the explanation of the figure: The reference (baseline) algorithm is BI+MM. The x-axis shows the effectiveness ratio, defined as the ratio of the AUC corresponding to the point divided by the corresponding performance of BI+MM. Similarly, the y-axis shows the efficiency ratio, defined as the training time of the point divided by the corresponding time of BI+MM. Hence, points in the first/fourth quadrant (top-left/bottom-right) correspond to runs where BI+MM dominates/is-dominated-by other algorithms on the same datasets in both time and AUC. Notice that the scale of the y-axis is logarithmic. Larger points correspond to the mean value of an imputation method over all datasets. There are five imputation methods to compare against MM for 10 datasets over 3 percentages of missing values. This would naturally result in 150 points. However, MissForest did not run in three of the datasets (image dataset variations) due to its dimensionality; see Section 2.3.2 for details. The resulting plot therefore consists of 147 points.
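The two ratios and the dominance regions described above can be sketched as follows. This is a minimal illustration, not the code used to produce the figures; the function name `dominance_regions`, the label strings, and the treatment of exact ties (assigned to "tradeoff") are assumptions.

```python
import numpy as np

def dominance_regions(auc, train_time, auc_base, time_base):
    """Compare runs of one imputation method against the BI+MM baseline.

    auc, train_time     : AUC and training time of the compared method per run
    auc_base, time_base : AUC and training time of the baseline on the same runs
    """
    auc = np.asarray(auc, dtype=float)
    train_time = np.asarray(train_time, dtype=float)
    effectiveness = auc / auc_base       # x-axis: > 1 means higher AUC than baseline
    efficiency = train_time / time_base  # y-axis: > 1 means slower than baseline
    labels = np.full(auc.shape, "tradeoff", dtype=object)
    # Top-left region: lower AUC and slower, so the baseline dominates.
    labels[(effectiveness < 1) & (efficiency > 1)] = "baseline_dominates"
    # Bottom-right region: higher AUC and faster, so the baseline is dominated.
    labels[(effectiveness > 1) & (efficiency < 1)] = "baseline_dominated"
    return effectiveness, efficiency, labels

# Toy values (not real results): three runs against a baseline with AUC 0.85, time 1.0.
eff, cost, region = dominance_regions([0.80, 0.90, 0.90], [50.0, 2000.0, 0.5], 0.85, 1.0)
```

Because the y-axis of the figures is logarithmic, `np.log10(cost)` would be the quantity to plot in a reproduction.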

For MAR data, BI+MM is never dominated in both metrics, as it is by far the most efficient method. In 104 of 147 cases, it dominates the opposing imputation method in terms of both effectiveness and efficiency. In the remaining 43 cases, however, it is outperformed in effectiveness. Only BI+MF has on average better predictive performance than BI+MM. However, it is 23,000 times slower to train on average. All the other imputation methods are worse than BI+MM on average while also taking more time to train. The results for MCAR data are qualitatively similar (see Figure 16(a) and discussion in Appendix C.2).

In total, BI+MM is again found to provide arguably the best tradeoff between efficiency and effectiveness. The results on the simulated data further validate the results on the real-world data, verifying that BI+MM is indeed a decent imputation method all around. BI+MM, on average, is on par with more sophisticated methods such as BI+MF, BI+GAIN, and BI+DAE, while being thousands of times faster to train.

7 Meta-Level Analysis of Real-World Results

In Machine Learning, it is invariably the case that no single algorithm is best for all datasets; there is no one-size-fits-all algorithm. Hence, one needs to optimize over several choices for the dataset at hand. This school of thought is what gave rise to AutoML systems. The field of Meta-Level Learning [18] studies how to predict the most promising algorithm or algorithms to run on a given dataset based on its characteristics. These characteristics are called meta-level features or meta-features of the dataset and include the sample size, the number of features, the type of features, the percentage of missing values, and others [60].

In this section, we try to identify meta-features that correlate with the performance of the imputation algorithms. Such correlations could help predict which algorithms to run on a given dataset. They could also shed light on the dataset properties that enable an algorithm to perform better and lead to the design of better algorithms. Hence, we defined and computed the meta-features in Table 4. The selected meta-features can be split into three categories: (1) General meta-features, which report general characteristics of the dataset such as the number of samples or features. (2) Missing value-related meta-features, which provide insight into the dataset’s missing patterns, such as the missing value percentage of features. (3) Cluster-based meta-features. One such type of metric is the silhouette coefficient, computed with the k-means algorithm with \(k=2, 3, 4\), as was proposed in Reference [1]. It shows the tendency of the data to cluster. Another type of such metric is the number of PCA components that explain \(x\%\) of the variance in the data. It shows whether the data are limited to a lower-dimensional subspace and the extent of cross-correlations between features. General meta-features were extracted using the pymfe package [5]. We implemented the missing and clustering-based meta-feature extraction using sklearn [53] and numpy [23]. To apply clustering or PCA, the data are first imputed with MM.
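The cluster-based meta-features described above can be approximated with sklearn and numpy as in the following sketch. It assumes a purely numeric feature matrix with `np.nan` marking missing entries; the standardization step, the 95% variance threshold, the function name `cluster_meta_features`, and the dictionary keys are illustrative choices, not details given in the article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_meta_features(X, ks=(2, 3, 4), variance=0.95, seed=0):
    """Missingness and cluster-based meta-features of a numeric dataset (sketch)."""
    X = np.asarray(X, dtype=float)
    feats = {"missing_fraction": float(np.isnan(X).mean())}
    # Impute with the column mean first (the "MM" step for numeric features).
    X_imp = SimpleImputer(strategy="mean").fit_transform(X)
    # Assumption: standardize so that no single feature dominates distances.
    X_std = StandardScaler().fit_transform(X_imp)
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_std)
        feats[f"silhouette_k{k}"] = float(silhouette_score(X_std, labels))
    # Smallest number of principal components whose cumulative explained
    # variance ratio reaches the requested threshold.
    cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
    n_comp = int(np.searchsorted(cumulative, variance)) + 1
    feats["n_pca_components"] = min(n_comp, cumulative.size)
    return feats

# Toy data: 60 samples, 5 features, roughly 10% of entries removed at random.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan
meta = cluster_meta_features(X_toy)
```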

We then correlated (Spearman correlation) these meta-features with the AUC performance of an algorithm relative to the performance of BI+MM as the baseline. A positive (negative) correlation indicates that when the meta-feature increases, the performance of the algorithm increases (decreases), relative to BI+MM. There are five algorithms (except BI+MM, which is used as a baseline) and 16 meta-features, leading to 80 correlations over datasets. Only one correlation was found to be significant at the 0.1 significance level (p-value = 0.059). Specifically, the relative AUC performance of BI+PPCA is positively correlated (correlation = 0.383) with the number of categorical variables in a dataset. This means that as the number of categorical variables increases we expect BI+PPCA to perform better relative to BI+MM. However, when correcting the p-values for multiple testing using the FDR control technique of Benjamini-Hochberg [7], we see that the q-value is 0.991, which means that detecting one such correlation is expected even if all meta-features are uncorrelated with the relative performance. BI+PPCA does not handle categorical features natively, which further makes us believe that the result is probably a false positive. In short, no statistically significant correlations could be found using the meta-learning analysis. Further experiments containing more datasets and meta-features need to be conducted.
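The correlation analysis and the multiple-testing correction can be sketched as below. The sketch assumes one row per dataset, uses `scipy.stats.spearmanr` for the correlations, and implements the Benjamini-Hochberg adjustment directly in numpy; the function names and the array layout are hypothetical, and the random inputs are placeholders rather than the article's data.

```python
import numpy as np
from scipy.stats import spearmanr

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values) for a list of p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(k) * m / k for the k-th smallest p-value.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, scanning from the largest p-value downwards.
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

def correlate_meta_features(meta, rel_auc):
    """Spearman correlation of every meta-feature with every method's relative AUC.

    meta    : (n_datasets, n_meta_features)
    rel_auc : (n_datasets, n_methods), AUC relative to the baseline
    Returns rows of (meta index, method index, rho, p-value, q-value).
    """
    rows = []
    for i in range(meta.shape[1]):
        for j in range(rel_auc.shape[1]):
            rho, p = spearmanr(meta[:, i], rel_auc[:, j])
            rows.append((i, j, rho, p))
    q = benjamini_hochberg([r[3] for r in rows])
    return [(i, j, rho, p, qv) for (i, j, rho, p), qv in zip(rows, q)]

# Placeholder inputs with the same shape as the analysis: 25 datasets,
# 16 meta-features, 5 compared methods, giving 16 * 5 = 80 tests.
rng = np.random.default_rng(0)
results = correlate_meta_features(rng.normal(size=(25, 16)), rng.normal(size=(25, 5)))
```

With 80 tests, a single raw p-value just below 0.1 is unsurprising under the null hypothesis, which is what the q-value makes explicit.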

8 Related Work

In this section, we discuss related work on missing values imputation and position our contributions. We focus on empirical studies that compare different imputation methods based on the performance of the predictive models built on imputed datasets rather than the original values of complete datasets [13, 29, 56, 80].

Current literature can be split into two categories: AutoML and Adhoc ML modeling. The first category extends a specific AutoML tool by adding imputation methods, while the latter creates a predictive modeling pipeline that may contain a subset of a modern AutoML tool’s pipeline, such as hyper-parameter optimization, model selection, and pre-processing. AutoML in general is able to optimize the performance over various stages in a pipeline. As we optimize the whole pipeline, we expect the effect of each stage to become less significant, as other stages may compensate. AutoML tools allow us to get more insights on which features are more important for the task (feature selection), optimize the hyper-parameters for each stage of the pipeline (hyper-parameter tuning), and select the best predictive model for each imputation method (model selection). Consequently, imputation methods can be evaluated fairly under this optimization framework.

As shown in Table 5, the majority of related work uses either datasets with native missing values or datasets with simulated missing values. The literature mainly focuses on the binary classification task (included in all previous works). Of the eight previous works, two include binary classification and regression [8, 55], and one includes binary and multi-class data [19]. Reference [30] is the only study that includes all three types of outcomes. The most prominent missingness mechanism is MCAR, found in all works that simulate missing values. Reference [30] is the only work that includes deep learning–based imputation methods. Binary Indicators are very prominent in AutoML tools; however, only Reference [55] has studied their effect when extending imputation methods. Finally, Reference [49] is the only work that included ensemble models for the prediction phase, while Reference [19] is the only work that includes feature selection as part of the pipeline. As shown in Table 5, none of the related work has included every step mentioned in the table’s columns.

Summarizing the related work section, the majority of the literature uses datasets with native missing values or generates them through a simulation based on various missingness mechanisms and missingness proportions. However, none of the mentioned studies benchmarks imputation methods on both native and generated missing value datasets. The studies on real-world datasets in general conclude that simple imputation methods such as MM are on par with other more complex methods. Research on datasets with simulated missing values concludes that more complex methods can indeed improve predictive performance on average. However, there is no universal best method proposed by any of the aforementioned benchmarks. The literature mainly focuses on the binary classification task. Accuracy and F1-score are the most prominent metrics in the literature. In the majority of the studies, the hold-out split is used for the evaluation. Some studies use repeated splits or cross-validation to handle randomness. Specific predictive models, for instance Gradient Boosted Trees, could benefit from native handling of missing values compared to simple imputation. However, not all classifiers support missing value handling, making imputation still an essential part of the pre-processing step of ML pipelines. In general, hyper-parameter tuning, model selection, and feature selection are given less importance in previous literature. Most works skip one or more of these steps or fail to mention information about the specific stage. For example, only one predictive model is tuned, or imputation methods are used with default parameters specified by the authors of the methods or the package implementations.

The research closest to ours is Reference [49]. In that paper, Autosklearn was extended to include the data cleaning process, resulting in a tool named AutoClean. Part of the extension was imputation. The study compares mean, median, mode, KNN [71], and Iterative imputation [77] for continuous features. For categorical features, constant, KNNi, and mode imputation were selected. The study used five binary classification datasets with 891 to 10,500 observations and 9 to 39 features that include missing values at low percentages. AutoClean optimized the pipeline by Bayesian hyper-parameter optimization. In Autosklearn, the predictive model is an ensemble of methods. They evaluated the performance using fivefold cross-validation and the balanced accuracy metric. The study concluded that KNNi is a valuable addition to the simpler imputation methods. However, in most cases, simple imputation methods are selected more frequently than KNNi for both continuous and categorical data. Contrary to the aforementioned literature, we included feature selection in our experimental setup. Also, we conducted comparisons on both datasets with native and simulated missing values. In general, our evaluation was conducted on more datasets, with a higher range in terms of samples and features. Finally, we included neural network imputation methods and extended imputation with binary indicators.

In Reference [19], the TPOT AutoML tool was extended with imputation methods, specifically mean, median, mode, max, MICE [77], and EM [27]. The median and mode were found to be the best imputation methods based on a restricted simulation study on 23 datasets at 7% MCAR missingness. The data were split multiple times (20) to account for randomness. At each split, 25% of the data were used as a hold-out set. Compared to the mentioned work, we simulated missing values with other mechanisms and missing proportions as well as used datasets with native missing values in our experiments. Also, we included recent state-of-the-art methods based on NNs such as DAE and GAIN. We also implemented and measured the effect of binary indicators when coupled with MM and complex methods.

Similarly, missing data imputation has also been researched as part of data cleaning systems. Reference [38] compared deletion, mean, median, mode, new-category, and HoloClean [59] on six datasets with native missing values. For the predictive task, seven models were considered: Logistic Regression, KNN, Decision Tree, Random Forest, AdaBoost, Naive Bayes, and XGBoost. For the evaluation, the data were split into a 70% train and 30% test set, repeated 20 times to account for randomness. They used accuracy and F1-score for evaluation according to the dataset’s imbalance. They concluded that simple imputation methods yield competitive performance to more complex methods such as HoloClean. Contrary to the aforementioned work, we included neural networks, the binary indicator method, and feature selection in the experimental setup. We also conducted experiments on more datasets, covering both generated and real-world missing values.

The benchmark study [81] was conducted on 13 real-world datasets from OpenML and concluded that mean/mode is comparable to more complex imputation methods such as random, SOFT [44], MF [66], KNNi [71], Hot-Deck [33], and MICE [77]. In the study, 20% of the data was kept as a hold-out and the reported measures were AUC and F1-score. Specifically, when measuring the F1-score, MM had the highest average ranking. In contrast, for the AUC score, KNNi was found to be the best-performing method. However, both KNNi and MM are among the three best methods in both metrics. Hyper-parameter tuning was not considered in that study; the imputation and predictive methods used default parameters. In our work, we tune all the steps of the pipeline for a fair comparison. We also included deep learning methods that are the current state-of-the-art for imputation as well as binary indicators.

The largest benchmark study was conducted in Reference [30] on 69 real-world datasets with simulated missing values. The missing mechanisms were MCAR, MAR, and MNAR. The generation of missing values was set at 1%, 10%, 30%, and 50% missingness. They compared MM, KNNi [71], MF [66], custom DL-based imputation inspired by Reference [9], GAIN [83], and variational autoencoders [32] for the imputation problem. Cross-validation scores are reported, with the data split into five folds for all but the deep learning methods. For deep learning methods, three folds were used for the split due to training costs. For regression data, RMSE is the reported metric. For classification data, the F1-score was reported. They concluded that MissForest is the best imputation method. However, they used a single classifier for the prediction phase. We argue that different imputation methods work better with different classifiers, which should be tuned as well. For example, on the cylinder-band dataset, the GAIN imputation method performs best with the Ridge Logistic Regression, whereas the DAE imputation method performs best with the RandomForest classifier on the same dataset. Additionally, missing values for the downstream task were generated for a randomly sampled feature in the dataset. In contrast, we simulated missing values uniformly across all features, which is a harder problem to solve for the imputation methods as less observed data exist. In the mentioned work, GAIN had a convergence problem in 33% of the cases, resulting in the worst ranking among the mentioned methods. In our work, GAIN does indeed converge due to different hyperparameter tuning. Finally, we include Binary Indicators as well as the DAE imputation method, which is the best method on average in real-world data with missing values. The simulated missingness results in our work are consistent with the results of the aforementioned work, as MissForest is the best method in both works. However, deep learning methods in our work are among the best methods and not the worst as in the literature mentioned.

Another study [57] compared the predictive performance on two datasets with imputed and incomplete data. Missing values were simulated on categorical features on the train data. They generated MCAR and MNAR missing values from 10% to 40% missingness in categorical features. One-third of the data were kept as a hold-out test set. The accuracy score on the test set is reported. For the imputation of categorical features, they used six imputation models: mode, random, k-NN [71], iterative imputation based on logistic regression, random forest [66], and SVM. For the classification, they used three predictive models: ANNs, decision trees, and random forests. The authors optimized the hyper-parameters for the ANNs only. The imputation models and the other classifiers were not tuned. They did not conclude that any imputation method or classifier is better than the others; the best choice heavily depends on the nature and proportion of the missing data. However, results indicated that imputation is better than simply creating a new category in the data. In our work, for fair evaluation, we tune both imputation and predictive models. We include binary indicator methods and neural network imputation. Finally, we simulated missingness on both numeric and categorical features.

In Reference [55], the authors compared mean, median, KNNi [71], Iterative Imputer, Iterative Imputer with Bagging (Multiple Imputation) [77], MIA (the native handling of missing values by Gradient Boosted Trees), and MIA with bagging. All of the previous methods, except MIA, were also extended with binary indicators. The study was conducted on 13 real-world datasets from four databases with native missing values. Nested cross-validation with five outer folds is used for estimating the accuracy score of the downstream task. The predictive models were set to Gradient-boosted trees and linear models. They concluded that MIA is a better alternative to imputation. Also, the indicator method helps improve the performance of the predictive task, which is consistent with the results of our work. They conclude that simple imputation using mean or median is on par with KNNi and iterative imputation with linear models. In our work, we included deep learning imputation in our set of imputation methods. Also, we tuned the imputation methods to fairly evaluate the performance of each method, as tuning is important in the performance of some imputation methods.

Last, Reference [8] introduced a group of three methods named OptImpute, focusing on optimizing KNNi and iterative imputation based on SVMs and decision trees. They compared the group against five other imputation methods: mean/mode, K-nearest neighbors [71], iterative known [84], Bayesian PCA [50], and predictive-mean matching [77]. They compared the introduced methods across 84 datasets with simulated missing values, measuring imputation accuracy. They additionally measured the group’s effect on the performance of downstream learning algorithms on 10 datasets. The missing values are generated by the MCAR mechanism with a range from 10% to 50%. The models used for regression tasks are LASSO and SVR, while for the classification tasks they are SVM and Optimal Trees. These datasets range in size, having 150 to 5,875 observations and 4 to 16 features. Data were split 50%–50% into train and test sets. The splits were repeated 100 times to account for randomness. Their group of methods improved the predictive performance of the models. Their method scored 86.1% average accuracy and average R-Squared (R2) of 0.339 compared to 84.4% and 0.315 R2 for the classification and regression data, respectively. However, no neural networks were used and the methods introduced have not been compared individually. Additionally, tuning was applied only to the group of proposed methods and not to the other imputation methods that were used or to the predictive models. We tune all of the imputation methods and models. We also extend methods with binary indicators and include NN-based imputation methods in our test bed, as well as MissForest. We also report multiple metrics (Accuracy, F1, and AUC) for the binary classification task. Finally, we have a wider range of datasets, both with native and simulated missing values.

8.1 Synopsis of Contributions Relative to the Related Work

Compared to the related work, we contribute in various ways. Our work can be directly compared to two other works that are conducted in an AutoML tool [19, 49]. Compared to the mentioned works, we include more datasets, more missingness mechanisms, neural network methods, and binary indicators in the experimental setup. For the first time, deep learning methods are compared to simple imputation methods in an AutoML predictive setting. One of the deep learning methods (BI+DAE) has the best average performance on real-world data with native missing values. Additionally, for the first time, the effect of the imputation methods on predictive performance is measured on datasets with generated missing values and native missing values. Until now, comparisons were conducted on only one of the two settings; specifically, half of the papers use real-world datasets with missing values in them, while the other half use complete datasets with generated missing values. Contrary to the majority of the literature, we tune both imputation and predictive methods to fairly evaluate them. Only two of the eight related works mention tuning both imputation and predictive modeling methods [38, 49]. We also conducted experiments on more datasets compared to the majority of the literature, while unlike [30], our simulation setting is applied to all features in the datasets and not only one. Also, only one of the eight aforementioned works includes feature selection as part of the ML pipeline. Finally, meta-learning, for the first time, is used to identify useful data characteristics that could give insights into the choice of a simple vs. a sophisticated imputation method. In general, as shown in Table 5, our testbed is the most complete overall in terms of dataset selection, missingness selection, imputation method selection, and pipeline steps. This allows us to fairly evaluate imputation methods in a state-of-the-art AutoML environment.

9 Lessons Learned and Contributions

The main insights that are drawn from our experimental results are the following:

—

Including BI in the dataset improves the predictive performance of the machine learning pipeline for most algorithms (see Section 5.1). The inclusion of BIs does increase the dimensionality (and difficulty) of the machine learning task. However, it does encode the information about which values are missing; this allows a classifier to learn which values to trust or not. Results indicate that encoding this information turns out to be more beneficial than harmful.

—

BI+DAE is found to be the single best imputation method in real-world data with native missing values followed by BI+MM, which is the standard in AutoML tools. As seen in Section 5.2, both methods have the same number of wins (when comparing only BI extended methods) across datasets with BI+DAE having higher mean AUC. The worst performance is exhibited by matrix-factorization (linear dimensionality reduction) methods such as PPCA. These methods do scale with the number of features and may be more suitable for high-dimensional, low-sample datasets.

—

BI+MM exhibits the best tradeoff between efficiency and effectiveness. As expected (see Sections 5.3 and 6.3 and Appendix C.2.4), BI+MM is the fastest method to train and also is more effective in the majority of the comparisons. MF, due to its iterative nature, is the slowest among all, closely followed by GAIN. GAIN’s main bottleneck is the number of epochs required to train the network. The authors’ suggestion was 10,000 epochs, which is 20 times more than the 500 epochs suggested by the authors of the DAE method.

—

Based on the results of Section 5.4, we suggest that practitioners optimize their models over the BI+MM and BI+DAE algorithms. BI+MM and BI+DAE score over 99% of the maximum AUC in real-world data as shown in Section 5.4. Specifically, BI+MM scores 98.68% of the maximum AUC. Adding BI+DAE to the pipeline leads to 99.69% of the maximum AUC. However, this comes at the cost of increasing the configuration space by 10\(\times\), as DAE has nine tuning configurations compared to one for BI+MM. Also, to reach 100% of the optimal performance, we have to train 24 times more configurations than by simply using BI+MM.

—

BI+MF is the best method in datasets with simulated missing values. As shown in Section 6.1, in both MCAR and MAR simulations, BI+MF is on average the best. In contrast, BI+MF is the third best with real-world data, falling behind BI+DAE. Despite our best efforts to realistically simulate missing values, there may still be differences between real-world missing-data generative mechanisms and our simulations. First, we simulated MCAR and MAR missing values. Real-world missing values may be MNAR. Second, the missingness probability for MAR data is determined by a generalized linear model (logistic regression model). Real-world missing values may follow non-linear models. The majority of the literature employs similar simulations for comparing imputation algorithms. However, as indicated by this study, results with simulated missingness may not generalize to real-world datasets. New simulation methodologies need to be proposed to this end.

—

An increase in missingness leads to a deterioration in predictive performance. As shown in Section 6.1, increasing missingness causes a drop in the AutoML tool’s capability of predicting the outcome. Missingness at 10% leads to a 0.024 AUC drop compared to the complete dataset. Similarly, 25% missingness leads to a 0.05 AUC drop, while at 50% we observe an average drop of up to 0.144, as seen in Tables 12(a) and (b).

—

The set containing BI+MM and BI+MF reaches 99% of the maximum AUC for simulated data, as shown in Section 6.2. BI+MM scores 98.7% of the maximum AUC for MCAR data and 98.99% for MAR data. To surpass 99% of the maximum AUC, the addition of BI+MF is needed. This addition allows the tool to reach 99.62% and 99.43% on MCAR and MAR data, respectively. However, BI+MF has to be tuned, leading to a total 3\(\times\) increase in pipeline complexity.

—

A meta-learning methodology to correlate meta-features with performance is presented in Section 7. It could allow scientists to select the appropriate sophisticated methods based on meta-features, saving training time and improving overall performance. In addition, it could provide insight into the design choice of an algorithm that leads to better or worse performance on a given dataset. Unfortunately, no statistically significant results were found. This means that either there are no correlations present with the selected meta-features, or these correlations are not strong enough to be found significant with the given sample size of 25 datasets.

There are, of course, several limitations of the study that we would like to point out. The results and conclusions stem from computational experiments with binary classification tasks within a limited range of number of features, sample size, class imbalance, and missingness percentage. The MNAR missingness pattern is not included in our experiments. Also, the mechanism for generating MAR data is based on a linear model. Results may differ for non-linear missingness generation and MNAR data. Despite the significant computational effort involved (optimizing over thousands of ML pipelines for each dataset), results stem from only 25 real-world datasets with native missing values and 60 complete datasets where missing values were introduced (10 original datasets times 2 missingness mechanisms (MCAR, MAR) times 3 missingness percentages). This fact limits the statistical power of our statistical tests. While JADBio is an effective AutoML tool, results should also be obtained from other AutoML tools to further generalize the conclusions. Another limitation of our work concerns the comparison of methods on only binary classification data. Even though imputation algorithms are unsupervised learning methods and do not use information from the target variable (in our work), results may vary according to the supervised task. Finally, we selected models based on the AUC score in the training set. Optimizing for another metric, such as accuracy or F1-score, during training may yield different results.
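To make the two simulated mechanisms concrete, the following sketch generates a missingness mask for a single feature under MCAR and under a logistic-model MAR mechanism. It is an illustration under stated assumptions, not the simulation procedure of Section 4: the standardization of the predictors, the random choice of weights, the bisection used to calibrate the intercept to a target missingness rate, and the names `mcar_mask` and `mar_mask` are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mcar_mask(n_rows, rate, rng):
    """MCAR: every entry is missing with the same probability `rate`."""
    return rng.random(n_rows) < rate

def mar_mask(X_obs, rate, rng, weights=None):
    """MAR: missingness probability follows a logistic model of observed features.

    X_obs   : (n_rows, n_predictors) fully observed, non-constant predictors
    rate    : target average missingness of the masked feature
    weights : logistic-model coefficients (drawn at random if omitted)
    """
    X_obs = np.asarray(X_obs, dtype=float)
    Z = (X_obs - X_obs.mean(axis=0)) / X_obs.std(axis=0)
    if weights is None:
        weights = rng.normal(size=Z.shape[1])
    score = Z @ np.asarray(weights, dtype=float)
    # Bisection on the intercept so that the mean probability matches `rate`.
    lo, hi = -30.0, 30.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if sigmoid(score + mid).mean() < rate:
            lo = mid
        else:
            hi = mid
    prob = sigmoid(score + 0.5 * (lo + hi))
    return rng.random(X_obs.shape[0]) < prob

# Toy check: 25% missingness, with MAR driven only by the first predictor.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5000, 3))
mask_mcar = mcar_mask(5000, 0.25, rng)
mask_mar = mar_mask(X_toy, 0.25, rng, weights=[1.0, 0.0, 0.0])
```

Under MAR, rows with a large value of the driving predictor are masked more often, whereas the MCAR mask is independent of the data. A non-linear or MNAR mechanism would replace the linear `score` with a different function, possibly of the masked feature itself.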

10 Conclusions

In this article, we conducted experiments on real-world datasets with native missing values and simulated missing values. We compared six imputation methods extended by binary indicators on a state-of-the-art AutoML tool. BI+DAE is the best method on real-world datasets with native missing values. However, BI+MM is comparable to, if not better than, the more sophisticated imputation methods in terms of predictive performance and efficiency on real-world data. Increasing missingness leads to predictive performance deterioration. Additionally, simulation data lead to contradicting results compared to real-world datasets. BI+DAE and BI+MM are the best methods on real-world data; however, when simulated data are considered BI+MF is the best method on average followed by BI+MM. Finally, meta-learning was employed but could not successfully find any patterns to predict whether a sophisticated imputation method can be used instead of the simple BI+MM to improve the downstream performance.

The results make us question whether advanced, multivariate imputation algorithms are really necessary for predictive modeling with AutoML. The simple BI+MM imputation is surprisingly effective and computationally efficient when the ML pipeline is properly tuned within an AutoML setting. BI features allow advanced classifiers to learn when to trust a value or not. Multivariate imputation algorithms try to learn the full joint distribution of the dataset, a task that is quite challenging with low sample, imbalanced, or high-dimensional data and prone to error. It is also a very computationally demanding task. Imputing values for features that are redundant or irrelevant to the final model is a waste of computations. When imputing using multivariate imputation, one needs to store not only the final model (e.g., RF, SVM, or a NN) but also the imputation model to impute test samples. For some imputation models (Deep Neural Networks, or one RF for each feature as in MF) the additional storage may be non-negligible. In addition, the imputation model requires measuring all features and invalidates the efforts of feature selection. Arguably, the research effort that goes into novel and better-performing imputation methods would be more productively spent on novel and better-performing ways to natively handle missing values in our classification and feature selection algorithms.
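For readers who want to try the BI+MM baseline outside the AutoML tool used in this study, a minimal sketch with sklearn is given below. It is not the implementation used in our experiments; it assumes numeric and (numerically encoded) categorical columns are known in advance, and the helper name `make_bi_mm` is hypothetical. `SimpleImputer` with `add_indicator=True` appends one binary indicator column per feature that had missing values during fitting.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

def make_bi_mm(numeric_cols, categorical_cols):
    """Mean/Mode imputation extended with binary missingness indicators."""
    return ColumnTransformer([
        # Mean imputation plus indicators for numeric features.
        ("num", SimpleImputer(strategy="mean", add_indicator=True), numeric_cols),
        # Mode imputation plus indicators for categorical features.
        ("cat", SimpleImputer(strategy="most_frequent", add_indicator=True), categorical_cols),
    ])

# Toy data: column 0 is numeric, column 1 is a numerically encoded category.
X_toy = np.array([[1.0, 0.0],
                  [np.nan, 1.0],
                  [3.0, np.nan],
                  [5.0, 1.0]])
X_bi_mm = make_bi_mm([0], [1]).fit_transform(X_toy)
# Output columns: imputed numeric, numeric indicator, imputed category, category indicator.
```

Because the imputer is fitted on training data only, it can be placed as the first step of an sklearn `Pipeline` so that the means, modes, and indicator columns are learned inside each cross-validation fold.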

Footnotes

¹ Although some ML algorithms such as KNN and Naive Bayes are robust to missing values, their implementations in popular platforms like sklearn do not currently support the presence of missing values.

² Recent work [46] shows that when the causal graph of the distribution is known there are cases where MNAR data can be imputed.

³ https://www.datarobot.com/

⁴ https://auger.ai/

⁵ https://bigml.com/

⁶ Table 9 contains the dataset name, the missing feature, and the number of features selected for the missing feature as the outcome.

Supplementary Material

tkdd-2023-03-0117-File002 (tkdd-2023-03-0117-file002.zip)

tkdd-2023-03-0117-File003 (tkdd-2023-03-0117-file003.zip)

Appendices

A Datasets Appendix

A.1 Real-World Datasets with Native Missing Values

This section presents the 25 real-world binary classification datasets with native missing values. See Table 6 for more details.

A.2 Complete Datasets for Missing Data Simulation

This section presents the 10 complete binary classification datasets used for the simulated missing value experiments. Table 7 contains the dataset names and their characteristics.

B Missing Value Simulation Setup Appendix

B.1 Datasets Selected to Determine the Percentage of Missing Values per Feature.

Realistic simulation of missing values requires selecting the missingness percentage for each feature; see Section 4.1. We sampled 64 real-world datasets with missing values from the OpenML repository. Table 8 describes the datasets’ characteristics.

B.2 Determining the Average Number of Features on Which a Missing Feature Depends.

In this section, we present the quantitative results of the experiments on simulating the MAR mechanism presented in Section 4. Table 9 presents the dataset name, the randomly selected feature with missing values (target), and the result of the feature selection (# features selected). On average, a missing feature depends on 12 features.

C Experimental Results Appendix

C.1 Real-world Results

C.1.1 BI Improve Performance across All Metrics.

BI-extended methods perform better than their base methods when the AUC score is measured (see Section 5.1). As seen in Figure 11(b) and (d), BI also improves the accuracy and F1-score of the downstream task in the majority of the datasets. Figure 11(a) and (c) illustrate the gain or loss from including BI for each imputation method on the x-axis. On average, BI improves the performance of the downstream task.

C.1.2 BI+DAE Is the Best Method Followed by BI+MM.

Table 10(a), (b), and (c) show the quantitative results of the real-world experiments. Specifically, they report the number of wins (ties included) for each imputation method, as well as the average difference in AUC from the winning imputation method for each dataset. The tables also report the average AUC, the average AUC difference from MM (set as baseline), and the average ranking per method. For completeness, we include methods without BI as well. BI+DAE and BI+MM are the two best methods across all metrics, in terms of average AUC and average ranking. BI+MM gets the highest number of wins in all metrics, while BI+DAE closely follows with one win for each AUC and two wins for F1 and accuracy. However, BI+DAE is more consistent and has the highest average ranking and highest score for AUC and F1. Finally, BI+DAE exhibits the highest improvement over MM and the lowest difference from the best method in each dataset, on average.
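The summary statistics described above (wins with ties included, average ranking, and average difference from the MM baseline) can be sketched as follows. This is an illustrative computation, not the authors' code. The AUC values are copied from the "-Overall" rows of Table 14 for three datasets and four methods only, so the outputs do not reproduce Table 10. The helper names and the choice to give tied scores the mean of their rank positions are assumptions.

```python
METHODS = ["MM", "BI+MM", "BI+MF", "BI+DAE"]
# AUC values copied from the "-Overall" rows of Table 14 (illustrative subset).
AUC = {
    "anneal": [0.883, 0.996, 0.969, 0.982],
    "colic": [0.829, 0.829, 0.881, 0.830],
    "schizo": [0.719, 0.684, 0.772, 0.805],
}


def average_ranks(scores):
    """Rank 1 = highest score; tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def summarize(table, methods, baseline="MM"):
    """Return wins (ties included), average rank, and average gain over the baseline."""
    base = methods.index(baseline)
    wins = {m: 0 for m in methods}
    rank_sum = {m: 0.0 for m in methods}
    gain_sum = {m: 0.0 for m in methods}
    for scores in table.values():
        best = max(scores)
        for m, s, r in zip(methods, scores, average_ranks(scores)):
            wins[m] += int(s == best)  # every method tied for best gets a win
            rank_sum[m] += r
            gain_sum[m] += s - scores[base]
    n = len(table)
    avg_rank = {m: rank_sum[m] / n for m in methods}
    avg_gain = {m: gain_sum[m] / n for m in methods}
    return wins, avg_rank, avg_gain


wins, avg_rank, avg_gain = summarize(AUC, METHODS)
```

On this small subset each BI-extended method wins exactly one dataset, which shows why the number of wins alone can be a noisy criterion and why average ranking and average difference are reported alongside it.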

C.1.3 BI+DAE Is the Best Method across All BI Extended Methods.

As seen in Section 5.2, BI+DAE is the highest-ranked method for the AUC metric. Figure 12(a) and (b) illustrate the average ranking of BI-extended methods for the F1 and accuracy metrics, respectively. BI+DAE is the highest-ranked method for the accuracy metric but only the third best for F1. BI+MM is the second best in accuracy and the best in F1. Overall, the relative order varies slightly across metrics. Statistically significant results remain the same for AUC and accuracy. No statistically significant results are found for the F1-score.

C.1.4 BI+MM and BI+DAE Score 99% of the Maximum across All Metrics.

As shown in Tables 11(a), (b), (c), and Figure 13, BI+MM scores over 98% of the maximum performance for each metric. Adding BI+DAE, the next-best method, to an imputation set that already contains BI+MM allows the tool to score over 99.5% of the maximum performance. However, the complexity increases by 10\(\times\). Finally, to reach 100% of the maximum, all imputation methods need to be included in the imputation set. Including all methods in the pipeline of the tool increases the original complexity by a factor of 24.
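The idea of growing an imputation set until it reaches a target fraction of the maximum performance can be sketched with a simple greedy forward selection. The exact procedure is defined in Section 5.4 and may differ. In this sketch, the score of a set is assumed to be the mean, over datasets, of the best score achieved by any method in the set, and `set_score`, `greedy_subset`, `target`, and `seed` are hypothetical names. The AUC values are copied from the "-Overall" rows of Table 14 for three datasets and four methods only, so the selected subsets are illustrative and do not reproduce Table 11.

```python
METHODS = ["MM", "BI+MM", "BI+MF", "BI+DAE"]
# AUC values copied from the "-Overall" rows of Table 14 (illustrative subset).
AUC = {
    "anneal": {"MM": 0.883, "BI+MM": 0.996, "BI+MF": 0.969, "BI+DAE": 0.982},
    "colic": {"MM": 0.829, "BI+MM": 0.829, "BI+MF": 0.881, "BI+DAE": 0.830},
    "schizo": {"MM": 0.719, "BI+MM": 0.684, "BI+MF": 0.772, "BI+DAE": 0.805},
}


def set_score(table, chosen):
    """Mean over datasets of the best score achieved by any chosen method."""
    return sum(max(row[m] for m in chosen) for row in table.values()) / len(table)


def greedy_subset(table, methods, target=0.99, seed=()):
    """Add one method at a time until the set reaches `target` of the maximum."""
    chosen = list(seed)
    best_possible = set_score(table, methods)
    while not chosen or set_score(table, chosen) / best_possible < target:
        remaining = [m for m in methods if m not in chosen]
        # Pick the method whose addition yields the highest set score.
        chosen.append(max(remaining, key=lambda m: set_score(table, chosen + [m])))
    return chosen


from_scratch = greedy_subset(AUC, METHODS)
seeded = greedy_subset(AUC, METHODS, seed=["BI+MM"])
```

Each method added to the set multiplies the number of pipelines the AutoML tool must train, which is the complexity cost that the percentages above are weighed against.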

C.1.5 BI+MM Exhibits the Best Tradeoff between Effectiveness and Efficiency.

This section presents the tradeoff between the effectiveness and efficiency of imputation methods against a baseline (BI+MM). For a detailed explanation of the illustration see Section 5.3. Figure 14(a) shows that BI+MM dominates the other imputation methods in both effectiveness and efficiency in 84 of 125 pairs for F1-score. In 41 pairs, BI+MM is dominated in effectiveness. For the accuracy metric, BI+MM dominates the other methods in relative effectiveness and efficiency in 86 of 125 pairs. As seen in Figure 14(b), BI+MM is dominated in only 39 pairs.
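The pair counts above can be read as a classification of every (dataset, method) pair relative to the baseline. The sketch below is one possible way to do that bookkeeping. The precise definition is given in Section 5.3 and may differ. Here a tie in score or time is assumed to count in favor of the baseline. The AUC values are copied from the "-Overall" rows of Table 14, while the runtimes in seconds are purely hypothetical placeholders, since no timings appear in this appendix.

```python
# (AUC, seconds) per dataset. AUC from Table 14 "-Overall" rows;
# the seconds are hypothetical placeholders, not measured values.
BASELINE = {"anneal": (0.996, 1.0), "colic": (0.829, 1.0), "schizo": (0.684, 1.0)}
OTHERS = {
    "BI+MF": {"anneal": (0.969, 120.0), "colic": (0.881, 120.0), "schizo": (0.772, 120.0)},
    "BI+DAE": {"anneal": (0.982, 300.0), "colic": (0.830, 300.0), "schizo": (0.805, 300.0)},
}


def classify_pairs(baseline, others):
    """Count (dataset, method) pairs by how the baseline compares to the method."""
    counts = {
        "dominates": 0,                    # baseline at least as effective and as fast
        "dominated_in_effectiveness": 0,   # other method scores higher but is not faster
        "dominated_in_efficiency": 0,      # other method is faster but scores no higher
        "dominated_in_both": 0,            # other method scores higher and is faster
    }
    for results in others.values():
        for dataset, (score, seconds) in results.items():
            base_score, base_seconds = baseline[dataset]
            effective = base_score >= score
            efficient = base_seconds <= seconds
            if effective and efficient:
                counts["dominates"] += 1
            elif efficient:
                counts["dominated_in_effectiveness"] += 1
            elif effective:
                counts["dominated_in_efficiency"] += 1
            else:
                counts["dominated_in_both"] += 1
    return counts


counts = classify_pairs(BASELINE, OTHERS)
```

With 25 datasets and five competing BI-extended methods, the same bookkeeping yields the 125 pairs reported in this section.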

C.1.6 Feature Selection–enforced Pipelines Degrade the Performance.

As seen in Figure 15(a) and (b), feature selection deteriorates the performance of the pipelines. The average absolute difference between feature selection–enforced pipelines and non-enforced is less than 5% for accuracy. On average, the F1-score is lower by 5 points when enforcing feature selection.

C.2 Simulation Results

C.2.1 A Decline in Predictive Performance Is Caused by Increasing Missingness.

In this section, we investigate the average performance drop relative to the complete dataset in terms of multiple metrics. Specifically, Table 12(a), (c), and (e) report the results for MCAR missingness for the AUC, F1, and accuracy scores, respectively. Table 12(b), (d), and (f) present the results for MAR missing data for the AUC, F1, and accuracy scores, respectively. In summary, across all metrics and both missingness mechanisms, the performance of the tool deteriorates as missingness increases. For MCAR data, average AUC drops in absolute terms by up to 0.023 at 10% missingness, 0.05 at 25%, and 0.144 at 50%. For F1, the loss is even larger: up to 0.05 at 10%, up to 0.078 at 25%, and up to 0.158 at 50% missingness. The accuracy score deteriorates comparably to AUC: the loss can be up to 0.03 at 10%, up to 0.05 at 25%, and up to 0.118 at 50%. For MAR data, results are similar to MCAR. Notably, at 50% missingness the performance does not deteriorate as much as for MCAR. This leads us to conclude that most methods in this comparative evaluation can recover information better under high MAR missingness than under MCAR. This is theoretically sound, as every method except MM uses multiple variables for the imputation.

C.2.2 BI+MF Is the Best Method for MCAR Data.

Table 12(a), (c), and (e) and Figure 16(b), (d), and (f) present the results for MCAR missing data at various missingness percentages and multiple metrics (AUC, F1, and Accuracy). Summarizing the results, across all missingness percentages and measured metrics, BI+MF is the best method in terms of average absolute loss to the complete data. The second best method, at 10%, is BI+DAE, while at 25% BI+MM is the second best for F1 and accuracy metrics. The relative order of imputation methods across metrics remains stable until 50% missingness. The order at 50% missingness may vary according to the performance metric. The two worst methods are BI+PPCA and BI+SOFT, while the positions of second-, third-, and fourth-best methods are shared by BI+MM, BI+GAIN, and BI+DAE.

C.2.3 BI+MF Is the Best Method for MAR Data.

As seen in Section 6.1, BI+MF is the best method for MAR data. Table 12(b), (d), and (f) provide an overview of the results for AUC, F1, and accuracy score. The results are robust across all metrics. Figure 17(b) and (d) show that BI+MF has the lowest average loss for 25% and 50% missingness in both the F1 and accuracy metrics. At 10% missingness, BI+MF has comparable performance to BI+MM, which has the lowest loss at that missingness rate. A detailed review of the results for the AUC metric follows.

Fig. 17.

C.2.4 BI+MM Exhibits the Best Efficiency vs. Effectiveness Tradeoff for MCAR Missing data.

Regarding the MCAR missing data, it is worth noting that BI+MM dominates in over 90 of 147 pairs, for each performance metric. Specifically, for AUC metric BI+MM dominates in 98 pairs, for F1-score in 90 pairs, and for accuracy in 95 pairs. BI+MM is never dominated in efficiency, as seen in Figure 16(a), (c), and (e). This is not surprising, as BI+MM is significantly faster to train than any other imputation method. BI+MM is only dominated by BI+MF across all metrics in average effectiveness. BI+PPCA and BI+SOFT are on average less effective and less efficient than BI+MM.

C.2.5 BI+MM Exhibits the Best Efficiency vs. Effectiveness Tradeoff for MAR Missing Data.

In Section 6.3, we presented the efficiency–effectiveness tradeoff for the MAR simulated data when AUC is reported. We extend and confirm our conclusion in this section by measuring and computing the tradeoff for F1-score and classification accuracy metrics. As seen in Figure 17(a) and (d), BI+MM is never dominated in terms of efficiency, as expected. BI+MM dominates the other methods in effectiveness and efficiency in 94 and 98 pairs for the F1-score and accuracy score, respectively. For F1, it is dominated in terms of relative effectiveness in 53 of 147 pairs, while it is dominated for accuracy in 49 pairs. BI+MF is on average more effective than BI+MM for MAR data. However, it is thousands of times less efficient (up to 90,000 times for big datasets).

C.2.6 BI+MM and BI+MF Score over 99% of Maximum Performance for MCAR Data.

The minimal-size subset of algorithms with close-to-optimal performance for MCAR missing data is \(\{\text{BI+MM}, \text{BI+MF}\}\). We used the simple greedy algorithm introduced in Section 5.4. As illustrated in Figure 18(a), this subset achieves over 99% of the maximum achievable AUC for MCAR data. The subset containing \(\{\text{BI+MM}, \text{BI+MF}\}\) also scores over 99% of the maximum performance score for both F1 and accuracy, as denoted in Figure 18(b) and (c). Detailed quantitative results are presented in Table 13(a), (c), and (e) for the AUC, F1, and accuracy score, respectively.

Fig. 18.

C.2.7 BI+MM and BI+MF Score over 99% of Maximum Performance for MAR Data.

The minimal-size subset of algorithms with close-to-optimal performance for MAR missing data is \(\{\text{BI+MM}, \text{BI+MF}\}\), as presented in Section 6.2. In this section, we present results for F1 and accuracy score. We used the simple greedy algorithm introduced in Section 5.4. The subset containing \(\{\text{BI+MM}, \text{BI+MF}\}\) also scores over 99% of the maximum performance score for both F1 and accuracy, as denoted in Figure 19(a) and (b). Detailed quantitative results are presented in Table 13(b), (d), and (f) for the AUC, F1, and accuracy score, respectively.

Fig. 19.

C.3 Evaluation of Imputation Accuracy

In this section, we present the results of the imputation accuracy experiments. We measure the imputation accuracy for each imputation method, missingness mechanism, and missingness percentage by training the imputation methods with their default configurations (highlighted in Table 2). For each dataset, we measure the average R2-score between the imputed and the complete values (clamped to the 0–1 range) for the continuous features. For the categorical features, we measure the accuracy score between the imputed and the complete values. We measure the imputation scores in both train and test sets.
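The two imputation scores described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments. It assumes the scores are computed per feature on the entries that were masked, uses the standard definition R2 = 1 - SS_res / SS_tot, and the function names and toy values are hypothetical.

```python
def clamped_r2(true_values, imputed_values):
    """R2 between complete and imputed values, clamped to the 0-1 range."""
    mean_true = sum(true_values) / len(true_values)
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values))
    if ss_tot == 0:
        # Constant ground truth: treat a perfect match as 1, anything else as 0.
        return 1.0 if ss_res == 0 else 0.0
    return min(1.0, max(0.0, 1.0 - ss_res / ss_tot))


def categorical_accuracy(true_values, imputed_values):
    """Fraction of masked categorical entries that were imputed exactly."""
    hits = sum(t == p for t, p in zip(true_values, imputed_values))
    return hits / len(true_values)


# Toy values (hypothetical) for the masked entries of one continuous feature.
truth = [2.0, 4.0, 6.0]
r2_close = clamped_r2(truth, [2.5, 4.0, 5.5])   # imputation that tracks the truth
r2_mean = clamped_r2(truth, [4.0, 4.0, 4.0])    # constant fill with the mean
r2_far = clamped_r2(truth, [10.0, 10.0, 10.0])  # negative R2 before clamping
acc = categorical_accuracy(["a", "b", "b", "a"], ["a", "b", "a", "a"])
```

A constant fill such as the feature mean explains none of the variance of the masked values, so its R2 is at most zero before clamping, which is consistent with MM obtaining the lowest R2-score in the results that follow.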

In summary, MF has the highest average imputation accuracy for both categorical and continuous features, and it is the best method for both MCAR and MAR data. Additionally, we observe that the differences between MF and the other methods are more prominent in MAR data. The results remain largely the same across the train and test sets. MM is one of the worst-performing methods in terms of imputation accuracy. However, MM performs similarly to MF on the downstream task, as seen in Sections 5 and 6. This observation further supports our original hypothesis that higher imputation accuracy does not necessarily lead to better downstream task performance.

C.3.1 MF Has the Highest Imputation Accuracy for MAR Data.

For MAR data, MF has the highest average R2 and accuracy score in the train data, as seen in Figure 20(a) and (b). Figure 20(c) and (d) show that MF imputes values more accurately on test MAR data. In general, MF is the most accurate method for all missingness percentages. MM, as expected, has the lowest R2-score as it does not predict any of the variance of the continuous data. One interesting observation is that SOFT imputation fails to generalize on test data. As missingness increases, the imputation methods make worse predictions leading to lower scores.

Fig. 20.

C.3.2 MF Has the Highest Imputation Accuracy for MCAR Data.

For MCAR data, MF is the most accurate imputation method in the train data, as seen in Figure 21(a) and (b). Figure 21(c) and (d) show that MF has the highest average R2 and accuracy score on test MCAR data, across all missingness percentages. MM, as expected, has the lowest R2-score. SOFT fails to generalize on new unseen data. Finally, as missingness increases, the quality of imputed values deteriorates.

Fig. 21.

C.4 Real World: Downstream Task Results

In this section, we include the quantitative results of the real-world experiments. Table 14 contains the AUC results, Table 15 contains the F1-score results, and Table 16 contains the accuracy results.

Table 14.

Dataset	MM	BI+MM	MF	BI+MF	GAIN	BI+GAIN	SOFT	BI+SOFT	PPCA	BI+PPCA	DAE	BI+DAE
analcatdata_reviewer-FS	0.585	0.585	0.5	0.597	0.561	0.595	0.602	0.602	0.558	0.558	0.585	0.585
analcatdata_reviewer-NOFS	0.661	0.668	0.599	0.646	0.602	0.61	0.597	0.659	0.606	0.656	0.661	0.668
analcatdata_reviewer-Overall	0.661	0.668	0.605	0.643	0.607	0.63	0.597	0.659	0.606	0.656	0.661	0.668
anneal-FS	0.883	0.988	0.761	0.97	0.916	0.973	0.987	0.967	0.995	0.986	0.975	0.968
anneal-NOFS	0.881	0.996	0.938	0.97	0.931	0.983	0.972	0.991	0.996	0.996	0.982	0.982
anneal-Overall	0.883	0.996	0.943	0.969	0.896	0.991	0.972	0.991	0.996	0.996	0.975	0.982
audiology-FS	0.986	0.986	0.98	0.98	0.98	0.974	0.98	0.98	0.98	0.98	0.98	0.98
audiology-NOFS	0.998	0.998	0.992	0.993	0.994	0.991	0.992	0.998	0.995	0.989	0.993	0.995
audiology-Overall	0.998	0.998	0.98	0.981	0.993	0.992	0.992	0.98	0.995	0.989	0.993	0.98
autoHorse-FS	0.967	0.967	0.966	0.966	0.966	0.966	0.966	0.966	0.901	0.996	0.966	0.966
autoHorse-NOFS	0.99	0.989	0.983	0.988	0.982	0.988	0.981	0.99	0.989	0.993	0.976	0.976
autoHorse-Overall	0.99	0.989	0.966	0.966	0.987	0.988	0.981	0.99	0.989	0.993	0.966	0.966
braziltourism-FS	0.634	0.634	0.64	0.64	0.643	0.634	0.64	0.64	0.725	0.725	0.643	0.643
braziltourism-NOFS	0.616	0.727	0.716	0.725	0.721	0.709	0.709	0.725	0.669	0.668	0.731	0.715
braziltourism-Overall	0.616	0.727	0.643	0.64	0.632	0.64	0.709	0.64	0.669	0.668	0.731	0.715
bridges-FS	0.844	0.844	0.853	0.857	0.891	0.849	0.891	0.891	0.892	0.885	0.88	0.847
bridges-NOFS	0.909	0.911	0.902	0.901	0.882	0.916	0.902	0.889	0.901	0.916	0.915	0.912
bridges-Overall	0.909	0.911	0.902	0.909	0.905	0.909	0.902	0.889	0.901	0.916	0.915	0.912
cjs-FS	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	0.987	0.997	1.0	1.0
cjs-NOFS	1.0	1.0	0.994	1.0	0.985	0.998	0.996	0.991	0.987	0.99	0.999	0.999
cjs-Overall	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	0.987	0.997	1.0	1.0
colic-FS	0.829	0.829	0.837	0.855	0.845	0.839	0.836	0.836	0.839	0.839	0.83	0.83
colic-NOFS	0.839	0.838	0.853	0.863	0.845	0.872	0.842	0.865	0.848	0.858	0.852	0.856
colic-Overall	0.829	0.829	0.849	0.881	0.846	0.862	0.836	0.865	0.839	0.858	0.83	0.83
colleges_aaup-FS	0.999	0.999	0.999	0.999	0.999	0.999	0.999	0.999	0.996	0.996	0.999	0.999
colleges_aaup-NOFS	0.999	0.999	0.997	0.999	0.997	0.997	0.998	0.997	0.998	0.998	0.998	0.997
colleges_aaup-Overall	0.999	0.999	0.999	0.999	0.999	0.999	0.999	0.999	0.998	0.998	0.999	0.999
cylinder-bands-FS	0.808	0.785	0.81	0.788	0.808	0.797	0.808	0.798	0.723	0.723	0.832	0.826
cylinder-bands-NOFS	0.82	0.819	0.826	0.832	0.826	0.828	0.815	0.821	0.807	0.824	0.848	0.858
cylinder-bands-Overall	0.82	0.819	0.832	0.831	0.813	0.816	0.815	0.821	0.807	0.824	0.848	0.858
dresses-sales-FS	0.562	0.562	0.564	0.561	0.56	0.552	0.565	0.565	0.5	0.549	0.562	0.562
dresses-sales-NOFS	0.619	0.605	0.6	0.597	0.569	0.583	0.597	0.601	0.545	0.539	0.631	0.62
dresses-sales-Overall	0.619	0.562	0.564	0.561	0.567	0.552	0.565	0.565	0.545	0.539	0.631	0.562
eucalyptus-FS	0.833	0.833	0.821	0.82	0.823	0.835	0.842	0.817	0.75	0.816	0.807	0.834
eucalyptus-NOFS	0.777	0.777	0.778	0.778	0.778	0.777	0.779	0.844	0.82	0.824	0.849	0.855
eucalyptus-Overall	0.833	0.833	0.778	0.778	0.832	0.836	0.842	0.817	0.82	0.824	0.849	0.855
hepatitis-FS	0.748	0.748	0.834	0.83	0.803	0.807	0.673	0.673	0.799	0.799	0.826	0.826
hepatitis-NOFS	0.826	0.848	0.866	0.866	0.844	0.85	0.869	0.864	0.876	0.866	0.852	0.869
hepatitis-Overall	0.826	0.848	0.867	0.858	0.858	0.866	0.869	0.864	0.799	0.799	0.852	0.869
hungarian-FS	0.899	0.899	0.883	0.884	0.893	0.867	0.871	0.871	0.897	0.897	0.875	0.875
hungarian-NOFS	0.918	0.915	0.901	0.901	0.917	0.915	0.881	0.895	0.895	0.898	0.914	0.912
hungarian-Overall	0.918	0.915	0.887	0.888	0.913	0.918	0.871	0.871	0.895	0.897	0.914	0.912
kdd_el_nino-small-FS	0.983	0.983	0.98	0.981	0.988	0.987	0.983	0.981	0.95	0.926	0.983	0.986
kdd_el_nino-small-NOFS	0.987	0.989	0.984	0.985	0.988	0.988	0.985	0.988	0.98	0.985	0.986	0.987
kdd_el_nino-small-Overall	0.987	0.989	0.982	0.986	0.988	0.987	0.985	0.988	0.98	0.985	0.986	0.987
mushroom-FS	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
mushroom-NOFS	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
mushroom-Overall	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
pbcseq-FS	0.849	0.849	0.851	0.852	0.84	0.849	0.836	0.831	0.849	0.849	0.846	0.843
pbcseq-NOFS	0.849	0.85	0.857	0.844	0.856	0.848	0.846	0.845	0.841	0.838	0.848	0.842
pbcseq-Overall	0.849	0.85	0.851	0.85	0.851	0.849	0.846	0.845	0.841	0.838	0.848	0.842
primary-tumor-FS	0.875	0.887	0.855	0.875	0.846	0.874	0.829	0.882	0.829	0.864	0.786	0.887
primary-tumor-NOFS	0.88	0.892	0.875	0.875	0.871	0.88	0.877	0.889	0.861	0.87	0.88	0.892
primary-tumor-Overall	0.88	0.892	0.867	0.892	0.886	0.87	0.877	0.889	0.861	0.87	0.88	0.892
profb-FS	0.642	0.642	0.642	0.642	0.642	0.642	0.642	0.642	0.631	0.631	0.642	0.642
profb-NOFS	0.695	0.696	0.691	0.693	0.695	0.692	0.695	0.696	0.579	0.58	0.693	0.695
profb-Overall	0.695	0.696	0.692	0.694	0.692	0.686	0.695	0.696	0.631	0.631	0.693	0.695
schizo-FS	0.526	0.526	0.556	0.557	0.534	0.515	0.556	0.556	0.543	0.543	0.623	0.623
schizo-NOFS	0.719	0.684	0.764	0.765	0.717	0.74	0.76	0.76	0.56	0.559	0.794	0.805
schizo-Overall	0.719	0.684	0.781	0.772	0.736	0.751	0.76	0.76	0.543	0.543	0.794	0.805
sick-FS	0.984	0.984	0.992	0.993	0.993	0.992	0.977	0.969	0.991	0.991	0.992	0.992
sick-NOFS	0.99	0.989	0.994	0.994	0.992	0.995	0.986	0.988	0.992	0.989	0.993	0.992
sick-Overall	0.99	0.989	0.994	0.994	0.992	0.993	0.986	0.988	0.992	0.989	0.993	0.992
soybean-FS	0.981	0.983	0.983	0.992	0.991	0.981	0.985	0.985	0.973	0.973	0.991	0.991
soybean-NOFS	0.991	0.994	0.989	0.986	0.989	0.992	0.987	0.991	0.991	0.985	0.989	0.993
soybean-Overall	0.981	0.994	0.985	0.991	0.99	0.991	0.987	0.991	0.991	0.985	0.989	0.993
stress-FS	0.916	0.916	0.932	0.932	0.932	0.932	0.932	0.932	0.933	0.933	0.932	0.932
stress-NOFS	0.902	0.904	0.899	0.903	0.906	0.904	0.902	0.901	0.948	0.946	0.909	0.909
stress-Overall	0.916	0.916	0.932	0.932	0.932	0.932	0.932	0.932	0.933	0.933	0.909	0.932
vote-FS	0.983	0.985	0.992	0.995	0.986	0.99	0.991	0.991	0.989	0.991	0.978	0.986
vote-NOFS	0.992	0.991	0.994	0.991	0.994	0.99	0.995	0.991	0.995	0.992	0.992	0.992
vote-Overall	0.992	0.991	0.992	0.991	0.995	0.992	0.991	0.991	0.989	0.991	0.992	0.992
water-treatment-FS	0.916	0.988	0.986	0.987	0.958	0.988	0.943	0.943	0.5	0.5	0.962	0.979
water-treatment-NOFS	0.988	0.988	0.988	0.987	0.988	0.988	0.954	0.986	0.788	0.774	0.98	0.981
water-treatment-Overall	0.988	0.988	0.987	0.987	0.988	0.988	0.954	0.986	0.788	0.774	0.98	0.981

Table 14. Real-world Results for the AUC Metric