1 Introduction
Machine learning (ML) has revolutionized diverse fields like natural language processing [7], computer vision [36], robotics [24], material design [22], and fraud detection [35]. However, unlocking this potential hinges on optimal hyperparameter tuning for ML algorithms [38]. This crucial yet time-consuming task is addressed by hyperparameter optimization (HPO) techniques [3, 28]. Among HPO algorithms, Bayesian Optimization (BO) stands out for its efficiency [23, 37]: it converges on near-optimal hyperparameters in fewer iterations than alternatives such as random search and evolutionary algorithms [3, 34, 39].
AutoML, a subfield of ML, encompasses a range of objectives, primarily focusing on maximizing predictive performance on datasets [14]. AutoML tools achieve this by constructing pipelines that include various processing steps such as pre-processing, imputation, feature selection, and modeling. Each stage utilizes specific algorithms, and AutoML optimizes the hyperparameters of each algorithm within the pipeline. A configuration refers to the combination of algorithms selected for each stage and their corresponding hyperparameter values [32]. A key objective of AutoML is to identify the optimal configuration for a given problem based on predictive performance. This task requires extensive optimization across a complex search space with many factors influencing the outcome, and is referred to as the combined algorithm selection and hyperparameter optimization (CASH) problem [30]. Additionally, fitting a configuration on medium to large datasets can be time-consuming, taking minutes to hours. To address this challenge, AutoML tools incorporate optimization algorithms that achieve high performance with fewer function evaluations, thus reducing analysis time [10, 30]. Another critical aspect of AutoML tools is providing unbiased performance estimates. Various methods exist to address this challenge [29, 31, 33]. However, the majority of AutoML tools rely on simple cross-validation scores, which are prone to estimation bias [32].
Bayesian Optimization (BO) is a state-of-the-art optimization method that utilizes a probabilistic model to provide predictions and uncertainty estimates [23]. BO starts with an initial sampling phase in which a few random configurations are evaluated to gather initial data. These configurations are assessed by running the objective function, which can involve tasks like fitting a machine learning model on a dataset and returning a performance metric (e.g., the cross-validated AUC score). The probabilistic model aims to predict the average outcome (mean) of the objective function and to quantify the associated uncertainty (standard deviation). During each iteration, the probabilistic model predicts the mean value and uncertainty for configurations that have not been evaluated yet. The next configuration to evaluate is chosen by an acquisition function, with Expected Improvement (EI) being the most common choice [16, 23]. This function balances exploration and exploitation by favoring configurations with high predicted scores, high uncertainty, or both. By utilizing the acquisition function, BO avoids getting stuck in local optima and converges to near-optimal solutions given sufficient iterations. Finally, to identify the configuration that maximizes the acquisition function at each step, either random sampling or an evolutionary algorithm can be used to generate a set of candidate configurations. Once a configuration is evaluated, the probabilistic model is retrained with the new data point.
This paper introduces Conditional Local Bayesian Optimization (CLBO), a novel Bayesian optimization procedure designed to address the challenges posed by conditional search spaces in AutoML. CLBO tackles these challenges by splitting the complex search space into smaller, more manageable sub-spaces. The search space is naturally partitioned through the conditional variables; in this paper, the conditional variable is the selection of the ML algorithm. Each sub-space is then optimized by a localized, semi-independent Bayesian optimization procedure we refer to as a responder. A central controller identifies the most promising configuration by selecting the one with the highest acquisition value among all configurations proposed by the responders. CLBO constructs multiple Random Forest (RF) models for each responder, with each RF trained to predict the performance of an individual fold. CLBO also adopts a progressive, cross-validation-like approach: in early iterations, the algorithm optimizes a simpler objective function using a subset of folds, and as the optimization progresses, more budget is allocated and a higher number of folds is utilized. Finally, we include a method to correct bias in the performance estimates produced within Bayesian optimization: CLBO uses Bootstrap Bias Corrected Cross-Validation (BBC-CV) to report unbiased performance estimates [33]. Unlike Nested Cross-Validation (NCV), BBC-CV eliminates the need for constructing additional models.
We validated CLBO’s overall effectiveness through a large-scale evaluation. We used 35 classification datasets from the OpenML CC-18 benchmark [6] to compare CLBO against several leading optimization frameworks designed for conditional hyperparameter spaces: SMAC [20], Hyperopt [4], Optuna [1], and Mango [26]. Additionally, we included Random Search [3] as a baseline. Our experiments revealed that CLBO achieved the best average performance across all datasets, secured the highest number of wins (16) among all methods, and performed significantly better than both Mango and Random Search. Finally, we compared the performance estimates obtained from CLBO against the held-out performance and the reported cross-validation (CV) performance. The results demonstrated that CLBO delivers estimates that are very close to unbiased, accurately reflecting the true predictive performance.
2 Related Work
While this paper does not provide a comprehensive review of the broader black-box optimization literature, well-established surveys and tutorials exist for readers interested in a more extensive overview [5, 11, 27, 37]. Our specific focus lies on applying black-box optimization within the context of AutoML, which introduces unique complexities compared to the general black-box optimization domain. The AutoML search space presents a significant challenge for most existing optimization algorithms. This space is characterized by several complexities: high dimensionality, the presence of conditional hyperparameters (where a parameter’s value depends on another parameter), and the need to tune both continuous and discrete variables. Such complexity renders many prominent methods incompatible with AutoML optimization [28].
While a significant portion of black-box optimization research focuses on single-objective, non-conditional search spaces, the optimization of conditional hyperparameter spaces remains relatively unexplored. Early work proposed using independent Bayesian optimization procedures based on Gaussian Processes (GPs) to address conditional spaces [3]. Subsequent advancements explored alternative approaches to handle conditionality. For instance, SMAC introduced Random Forests as surrogate models [20]. Similarly, Tree-structured Parzen Estimators (TPE) have been adopted as surrogate models in both Hyperopt and Optuna [1, 4]. Building upon the work of [3], Mango leverages the acquisition function to establish dependencies between independent BO instances [26]. Large-scale benchmarks designed for evaluating BO methods in AutoML remain scarce; however, in the context of AutoML tasks, Random Forests have been shown to outperform TPE [30].
Most AutoML tools rely heavily on Bayesian optimization algorithms for pipeline optimization. An overview of the optimization methods across various open-source AutoML tools follows. Auto-sklearn and Auto-WEKA 2.0 use SMAC for optimization [10, 17]. TPOT utilizes a genetic algorithm [25]. AutoPrognosis 2.0 utilizes the Optuna framework, with Hyperband as an alternative [1, 15, 19]. H2O currently employs random search [18]. Auto-PyTorch relies on Bayesian optimization with Hyperband, a variant of SMAC [9, 40]. AutoGluon relies on ensemble learning without performing hyperparameter optimization [8].
3 Our Method
The central piece of CLBO is the controller-responder architecture. By partitioning the search space into semi-independent sub-spaces, we are able to create multiple local Bayesian optimization procedures that "communicate" via the controller through the acquisition function. This approach was first proposed within the Mango framework [26].
Surrogate model: The central part of every BO algorithm is the probabilistic model, also known as the surrogate model, which learns the characteristics of the objective function. Gaussian Processes (GPs), Random Forests (RFs), and Tree-structured Parzen Estimators (TPE) are all prominent examples of surrogate models. We selected RFs as our surrogate model due to their sample efficiency compared to GPs. One of the novelties of our work is the use of a group of RFs as surrogate models: unlike traditional approaches that employ a single surrogate model, CLBO constructs an RF specifically for predicting the performance of each fold, rather than the fold average. This enables CLBO to generate more fine-grained estimates of the predicted score and the uncertainty for each configuration.
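A minimal sketch of such a per-fold surrogate, assuming configurations are already encoded as numeric vectors; the class name, default settings, and the aggregation of fold-wise predictions into a mean and standard deviation are illustrative choices, not CLBO's exact implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class PerFoldSurrogate:
    """One Random Forest per CV fold; each RF predicts that fold's score."""

    def __init__(self, n_folds, n_estimators=100):
        self.models = [RandomForestRegressor(n_estimators=n_estimators)
                       for _ in range(n_folds)]

    def fit(self, configs, fold_scores):
        # configs: (n_configs, n_features); fold_scores: (n_configs, n_folds)
        for j, model in enumerate(self.models):
            model.fit(configs, fold_scores[:, j])
        return self

    def predict(self, candidates):
        # Aggregate fold-wise predictions into a mean and an uncertainty estimate.
        per_fold = np.column_stack([m.predict(candidates) for m in self.models])
        return per_fold.mean(axis=1), per_fold.std(axis=1)
```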
Configuration Sampler: The Bayesian optimization algorithm relies on an initial data acquisition procedure to train the surrogate model for the first time. Common methods for generating these initial random configurations include Sobol sequences, Latin Hypercube sampling, and random uniform sampling [2, 21]. Our work leverages Sobol sequences due to their ability to explore the search space more evenly than uniform random search [3]. Sobol sequences are also used in each iteration to generate a set of 900 candidate configurations over which the Expected Improvement (EI) acquisition function is maximized. To further refine the search, a local search method is implemented: a Gaussian sampler generates 100 additional configurations around each sub-space's current best evaluated configuration.
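A rough sketch of this candidate-generation step for a purely continuous sub-space; categorical handling, the Gaussian scale of 0.1, and the clipping to bounds are assumptions made here for illustration:

```python
import numpy as np
from scipy.stats import qmc

def candidate_pool(bounds, best_config, rng, n_sobol=900, n_local=100, scale=0.1):
    """Sobol batch covering the sub-space plus a Gaussian cloud around the best."""
    bounds = np.asarray(bounds, dtype=float)
    low, high = bounds[:, 0], bounds[:, 1]
    sobol = qmc.Sobol(d=len(low), seed=int(rng.integers(2**31)))
    global_cands = qmc.scale(sobol.random(n_sobol), low, high)
    local_cands = rng.normal(best_config, scale * (high - low), size=(n_local, len(low)))
    local_cands = np.clip(local_cands, low, high)
    return np.vstack([global_cands, local_cands])

# Example usage (hypothetical bounds for two hyperparameters):
# rng = np.random.default_rng(0)
# pool = candidate_pool([(1e-4, 1.0), (2, 30)], best_config=[0.1, 10], rng=rng)
```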
Optimization budget allocation: We propose a novel progressive budget allocation algorithm that dynamically allocates more resources (iterations) to later folds. This is a cross-validation-flavored procedure: in early iterations we train models on a limited subset of folds, and estimates from more folds are included as the optimization progresses. At the start of each fold, the configurations selected on the previous folds are run on the new fold. We introduce a linear weighting scheme to determine the budget allocation for each fold. Let k denote the total number of folds and n the total number of iterations allocated (excluding the initial random configurations). First, a normalization constant is computed as norm = k(k + 1)/2. For each fold i, the budget C[i] is then computed as C[i] = (n * i)/norm. This weighting ensures a gradual increase in the number of iterations allocated to each subsequent fold.
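For instance, a direct transcription of this weighting (the rounding to whole iterations is an assumption made here):

```python
def fold_budgets(n_iterations, n_folds):
    """Linear budget weighting: fold i receives roughly n_iterations * i / norm
    iterations, so later folds get progressively more of the budget."""
    norm = n_folds * (n_folds + 1) / 2
    return [round(n_iterations * i / norm) for i in range(1, n_folds + 1)]

# Example: 100 BO iterations spread over 5 folds -> [7, 13, 20, 27, 33]
print(fold_budgets(100, 5))
```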
4 Experimental Setup
In this section, we introduce the search space, the datasets selected, the objective to optimize, and the optimization methods.
4.1 Search Space
This section describes the approach for jointly optimizing the hyperparameters of a variety of classification algorithms. The selected classifiers encompass a range of complexities, including basic models, ensemble methods, and boosting algorithms. We also tune a variety of hyperparameters with wide value ranges. Table 1 gives an overview of the classifiers, the hyperparameters, and the hyperparameter ranges that the optimization algorithms tune.
4.2 Datasets
We employed the curated OpenML CC-18 benchmark [6], which provides a comprehensive suite of multi-class and binary classification problems. A crucial aspect of OpenML CC-18’s dataset selection was prioritizing datasets on which simple models (e.g., Decision Trees) typically achieve lower performance. This characteristic ensures that CASH optimization can improve performance over these baselines. For the evaluation, we randomly selected 35 datasets. Sample sizes range from 540 to 6430, the number of features from 5 to 857, and the datasets include binary and multi-class targets with up to 11 classes. The number of continuous features ranges from 0 to 856, while the number of categorical features ranges from 1 to 61.
4.3 Evaluation Protocol - Optimization Metric
The data are partitioned into 80% training and 20% testing sets (for estimating performance bias). We repeat this process 5 times to produce disjoint test sets, aligning with the principles of Nested Cross-Validation (NCV). The training set is used by the optimization algorithms to maximize the 5-fold CV (inner-fold) performance, with the Area Under the ROC Curve (AUC) as the optimized metric. Our proposed method deviates slightly from this approach: while the other algorithms directly optimize the inner-fold CV performance throughout the optimization process, our method incorporates a dynamic optimization procedure that delays this focus until the final iterations (see Section 3).
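As a reference point, a minimal sketch of the objective the baseline optimizers maximize, assuming binary labels and scikit-learn-style estimators; multi-class targets would additionally need roc_auc_score(..., multi_class="ovr"), and the five disjoint 20% test sets would correspond to an outer stratified 5-fold split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def inner_cv_auc(make_model, X_train, y_train, n_folds=5, seed=0):
    """Mean 5-fold cross-validated AUC of a configuration on the training split."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for tr, va in cv.split(X_train, y_train):
        model = make_model().fit(X_train[tr], y_train[tr])
        proba = model.predict_proba(X_train[va])[:, 1]
        scores.append(roc_auc_score(y_train[va], proba))
    return float(np.mean(scores))
```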
4.4 Optimization algorithms
Due to space constraints, we refrain from providing a comprehensive explanation of the optimization algorithms employed in this work. However, we present an overview of the five algorithms included in our test-bed.
Random Search: Random search is the simplest hyperparameter optimization method. It generates a set of configurations by uniformly sampling values from the defined hyperparameter ranges. It is used as the baseline method.
SMAC: The first Bayesian optimization algorithm to utilize Random Forests as surrogate models. We select the "Hyperparameter Optimization" version from the SMAC3 package [20]. SMAC uses a Sobol sequence both to sample the initial random configurations and to propose the next most promising configurations, coupled with a local search around the current best configuration [13].
Mango: Instead of using a tree-based model to handle conditionality, Mango adopts a local BO for each sub-space. To sample the next best configuration, it selects the highest acquisition value across all local BO instances. Unlike CLBO, Mango relies on sparse Gaussian processes as its surrogate model [26].
HyperOpt: By using Tree-structured Parzen Estimators (TPE) for both surrogate modeling and configuration sampling, HyperOpt is able to handle the conditional search space. We include the default TPE implementation provided by the HyperOpt package [4].
Optuna: Utilizes TPE for both surrogate modeling and configuration sampling, same as HyperOpt. We select the default implementation from the Optuna package [1].
6 Bias Correction Experiments
This section reports the results of using BBC-CV to remove the bias in the final performance estimates. Figure 2 shows, for each dataset, the deviation of the BBC-CV estimate (y-axis) and the CV estimate (x-axis) from the holdout performance. Each point represents a dataset, and the plot is divided into four quadrants for visualization. Each quadrant indicates whether the estimates are optimistic or conservative, along with the number of datasets in that category (out of 35). Points on the diagonal white line correspond to datasets where the CV estimate matches the BBC-CV estimate. Points to the right of the diagonal indicate that CV is more optimistic than BBC-CV, which is the case for most datasets.
In Figure 2, the CV estimate is optimistically biased in 22 datasets (green and red quadrants), while BBC-CV corrects this bias, providing conservative estimates in 30 datasets (grey and green quadrants). Only 5 datasets show optimistic estimates by BBC-CV (red quadrant). CV never reports conservative estimates when BBC-CV reports optimistic ones (blue quadrant). BBC-CV has an average absolute bias of 0.0026 AUC, compared to 0.0036 AUC for 5-fold CV.
We conducted a Wilcoxon signed-rank test to compare BBC-CV and Holdout performance. The null hypothesis (BBC-CV performance equals Holdout) was rejected with p-value = 0.000096, supporting the alternative that BBC-CV is conservative. A second test compared CV and Holdout performance. The null hypothesis (CV performance equals Holdout) was rejected with p-value = 0.0096, supporting the alternative that CV is optimistic. These results indicate BBC-CV provides more accurate performance estimates.
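A sketch of how such a comparison could be run with SciPy, assuming per-dataset AUC estimates as inputs and one-sided alternatives matching the directions reported above; the exact test settings used in the paper are not specified here:

```python
from scipy.stats import wilcoxon

def bias_direction_tests(bbc_estimates, cv_estimates, holdout_performance):
    """Paired Wilcoxon signed-rank tests over per-dataset AUC estimates."""
    # H1: BBC-CV estimates tend to be lower than holdout (conservative).
    _, p_bbc = wilcoxon(bbc_estimates, holdout_performance, alternative="less")
    # H1: CV estimates tend to be higher than holdout (optimistic).
    _, p_cv = wilcoxon(cv_estimates, holdout_performance, alternative="greater")
    return p_bbc, p_cv
```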