Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Sampling in CMA-ES: Low Numbers of Low Discrepancy Points

Jacob de Nobel\orcidlink0000-0003-1169-1962, Diederick Vermetten\orcidlink0000-0003-3040-7162, Thomas H.W. Bäck\orcidlink0000-0001-6768-1478, Anna V. Kononova\orcidlink0000-0002-4138-7024
LIACS, Leiden University, Leiden, Netherlands
{j.p.de.nobel, d.l.vermetten, t.h.w.baeck, a.kononova}@liacs.leidenuniv.nl
Abstract

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is one of the most successful examples of a derandomized evolution strategy. However, it still relies on randomly sampling offspring, which can be done via a uniform distribution and subsequently transforming into the required Gaussian. Previous work has shown that replacing this uniform sampling with a low-discrepancy sampler, such as Halton or Sobol sequences, can improve performance over a wide set of problems. We show that iterating through small, fixed sets of low-discrepancy points can still perform better than the default uniform distribution. Moreover, using only 128 points throughout the search is sufficient to closely approximate the empirical performance of using the complete pseudorandom sequence up to dimensionality 40 on the BBOB benchmark. For lower dimensionalities (below 10), we find that using as little as 32 unique low discrepancy points performs similar or better than uniform sampling. In 2D, for which we have highly optimized low discrepancy samples available, we demonstrate that using these points yields the highest empirical performance and requires only 16 samples to improve over uniform sampling. Overall, we establish a clear relation between the L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT discrepancy of the used point set and the empirical performance of the CMA-ES.

1 Introduction

Optimization techniques play a crucial role in various scientific and engineering applications. Exact methods systematically explore the parameter space but often suffer from inefficiency due to their exhaustive nature. For example, it has been shown that randomized search is superior to grid search for hyperparameter tuning [Bergstra and Bengio, 2012]. This is because, especially in higher dimensions, a randomized process will provide better coverage of sample points in the domain than an exhaustive search, given a limited evaluation budget. While samples generated uniformly at random provide an improvement in exploring the search space, such samples can still be quite suboptimal in covering the domain. Given a limited number of samples, uniform samples can be distributed very unevenly [Halton, 1960]. This notion of evenly spreading points across a given domain motivates the research into Low-Discrepancy Sequences. These are sequences of pseudo-randomly generated points that are designed to minimize the gaps and clusters that often occur in uniform random sampling, providing more uniform coverage of the search space. Specifically, discrepancy measures are designed to measure how regularly a given point set is distributed in a given space [Clément et al., 2023b]. While early work with low discrepancy point sets focuses on Monte Carlo integration [Halton, 1960, Sobol’, 1967], they have subsequently been used in various domains, such as computer vision [Paulin et al., 2022] and financial modeling [Galanti and Jung, 1997]. Low Discrepancy point sets have been used in the optimization domain to set up the Design of Experiments (DoE) within a constrained budget [Santner et al., 2003]. One application of particular interest in our context is one-shot optimization, where low-discrepancy sequences have been shown to outperform more traditional uniform sampling [Bousquet et al., 2017]. Moreover, random search using quasi-random points has been shown to outperform traditional random search [Niederreiter, 1992]. In metaheuristics, randomized search is employed by many different algorithms, such as Evolution Strategies (ES) [Beyer, 2001]. In ES, quasi-random point sets have been used as an alternative sampling strategy to pure random sampling [Teytaud and Gelly, 2007], specifically for the CMA-ES [Hansen and Ostermeier, 2001]. Simplified, the procedure of the CMA-ES can be divided into two steps, which are repeated until convergence:

  • Sample λ𝜆\lambdaitalic_λ points from a multivariate normal distribution 𝒩(𝐦,𝐂)𝒩𝐦𝐂\mathcal{N}(\mathbf{m},\mathbf{C})caligraphic_N ( bold_m , bold_C ).

  • Adjust the parameters of the multivariate normal distribution to move towards the μ𝜇\muitalic_μ points with the highest fitness.

Modifying the sampling step to use points from a pseudo-random sequence [Teytaud, 2015] demonstrated increased performance and stability on benchmark functions. Moreover, this furthers derandomization, which aims to achieve self-adaptation without any independent stochastic variation of the strategy parameters [Bäck et al., 2023]. This paper aims to extend this work by focusing on the number of samples drawn from a (pseudo) random sampling strategy specifically for the CMA-ES. Specifically, we investigate if repeatedly reusing small subsets of pseudorandom sequences in a deterministic manner can be an effective sampling strategy for the CMA-ES. We provide an analysis for several well-known low discrepancy point sets to investigate the relation between discrepancy and empirical performance on the BBOB benchmark functions.

2 Preliminaries

2.1 Low-discrepancy sequences

The discrepancy of a set of points quantifies how regularly they are spaced in the domain. One of the most common discrepancy measures of a point set P[0,1]d𝑃superscript01𝑑P\subseteq[0,1]^{d}italic_P ⊆ [ 0 , 1 ] start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT is the Lsubscript𝐿L_{\infty}italic_L start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT star discrepancy [Clément et al., 2023a], which is defined as follows:

d(P)=supq[0,1]d||P[0,q)||P|λ(q)|subscriptsuperscript𝑑𝑃subscriptsup𝑞superscript01𝑑𝑃0𝑞𝑃𝜆𝑞d^{*}_{\infty}(P)=\textit{sup}_{q\in[0,1]^{d}}\left|\frac{|P\cap[0,q)|}{|P|}-% \lambda(q)\right|italic_d start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ( italic_P ) = sup start_POSTSUBSCRIPT italic_q ∈ [ 0 , 1 ] start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | divide start_ARG | italic_P ∩ [ 0 , italic_q ) | end_ARG start_ARG | italic_P | end_ARG - italic_λ ( italic_q ) | (1)

Here, λ(q)𝜆𝑞\lambda(q)italic_λ ( italic_q ) is the Lebesgu measure of the box [0,q)0𝑞[0,q)[ 0 , italic_q ) and d(P)subscriptsuperscript𝑑𝑃d^{*}_{\infty}(P)italic_d start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT ( italic_P ) measures the worst absolute difference between λ(q)𝜆𝑞\lambda(q)italic_λ ( italic_q ) of a d-dimensional box anchored at the origin and the proportion of points that fall inside this box. Note that this measure should be minimized to evenly space points in the domain. Since the Lsubscript𝐿L_{\infty}italic_L start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT star discrepancy can be computationally expensive [Clément et al., 2023b], we can also consider the L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT star discrepancy, which is defined as follows [Zhou et al., 2013]:

d2(P)=([0,1]d||P[0,q)||P|λ(q)|𝑑q)1/2subscriptsuperscript𝑑2𝑃superscriptsubscriptsuperscript01𝑑𝑃0𝑞𝑃𝜆𝑞differential-d𝑞12d^{*}_{2}(P)=\left(\int_{[0,1]^{d}}\left|\frac{|P\cap[0,q)|}{|P|}-\lambda(q)% \right|\,dq\right)^{1/2}italic_d start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_P ) = ( ∫ start_POSTSUBSCRIPT [ 0 , 1 ] start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | divide start_ARG | italic_P ∩ [ 0 , italic_q ) | end_ARG start_ARG | italic_P | end_ARG - italic_λ ( italic_q ) | italic_d italic_q ) start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT (2)

Several pseudo-random sequences have lower star discrepancies than corresponding uniform sequences. These methods include Latin Hypercube Sampling [Loh, 1996], Jittered sampling [Pausinger and Steinerberger, 2016], and Hammersly sequences [Peart, 1982]. Our work considers the Halton [Halton, 1960] and Sobol [Sobol’, 1967] sequences.

2.2 Derandomization and CMA-ES

While Evolution Strategies [Beyer, 2001] depend on a random process to sample candidate solutions, the internal parameter update has been derandomized in state-of-the-art implementations of the algorithm. Derandomization ensures self-adaptation happens without any independent stochastic variation of the strategy parameters [Ostermeier et al., 1994]. Effectively, this means the update of the strategy parameters is decoupled from the sampling of candidate solutions, moving away from the notion that ‘good solutions have good parameters’ of traditional ES. This allows modern ES, such as the CMA-ES, to be more robust and learn good strategy parameters while using relatively small population sizes [Hansen and Ostermeier, 2001]. Within the CMA-ES, the sampling procedure is the only remaining source of stochasticity. At every generation, λ𝜆\lambdaitalic_λ individuals are sampled from the d𝑑ditalic_d-dimensional Gaussian distribution 𝒩(𝐦,σ2𝐂)𝒩𝐦superscript𝜎2𝐂\mathcal{N}(\mathbf{m},\sigma^{2}\mathbf{C})caligraphic_N ( bold_m , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_C ). Where 𝐦𝐦\mathbf{m}bold_m is the mean of the sampling distribution, σ𝜎\sigmaitalic_σ is the global step size, and 𝐂𝐂\mathbf{C}bold_C is the covariance matrix. In practice, 𝐂𝐂\mathbf{C}bold_C is spectrally decomposed into two matrices, 𝐁𝐁\mathbf{B}bold_B and 𝐃𝐃\mathbf{D}bold_D, representing the eigenvectors and inverse square root of the eigenvalues of 𝐂𝐂\mathbf{C}bold_C, respectively. This allows the sampling of points in the CMA-ES to happen in a three-stage process:

  1. 1.

    𝐳𝐤𝒩(𝟎,𝟏)similar-tosubscript𝐳𝐤𝒩01\mathbf{z_{k}}\sim\mathcal{N}(\mathbf{0},\mathbf{1})bold_z start_POSTSUBSCRIPT bold_k end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , bold_1 )

  2. 2.

    𝐲𝐤=𝐁𝐃𝐳𝐤𝒩(𝟎,𝐂)\mathbf{y_{k}}=\mathbf{B}\mathbf{D}\mathbf{z_{k}}\quad\quad\sim\mathcal{N}(% \mathbf{0},\mathbf{C})bold_y start_POSTSUBSCRIPT bold_k end_POSTSUBSCRIPT = bold_BDz start_POSTSUBSCRIPT bold_k end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , bold_C )

  3. 3.

    𝐱𝐤=𝐦+σ𝐲𝐤𝒩(𝐦,σ2𝐂)\mathbf{x_{k}}=\mathbf{m}+\sigma\mathbf{y_{k}}\quad\sim\mathcal{N}(\mathbf{m},% \sigma^{2}\mathbf{C})bold_x start_POSTSUBSCRIPT bold_k end_POSTSUBSCRIPT = bold_m + italic_σ bold_y start_POSTSUBSCRIPT bold_k end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_m , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_C )

Given this decomposition, the first step of the sampling process can be practically achieved by sampling from a uniform distribution 𝐮𝒰(0,1)dsimilar-to𝐮𝒰superscript01𝑑\mathbf{u}\sim\mathcal{U}(0,1)^{d}bold_u ∼ caligraphic_U ( 0 , 1 ) start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, and transforming each coordinate 𝐮isubscript𝐮𝑖\mathbf{u}_{i}bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT of the sample by:

ϕ1(𝐮i)=2erf1(2𝐮i1)superscriptitalic-ϕ1subscript𝐮𝑖2superscripterf12subscript𝐮𝑖1\phi^{-1}(\mathbf{u}_{i})=\sqrt{2}\text{erf}^{-1}(2\mathbf{u}_{i}-1)italic_ϕ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = square-root start_ARG 2 end_ARG erf start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( 2 bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 ) (3)

which is the inverse of the cumulative density function for a standard Gaussian distribution.

Since the sampling procedure in CMA-ES can be seen as sampling in [0,1]dsuperscript01𝑑[0,1]^{d}[ 0 , 1 ] start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, replacing the uniform sampling with a low-discrepancy sequence is a natural step. Previous work has shown that scrambled Halton sequences can improve performance over the standard sampling procedure on most considered benchmark problems [Teytaud and Gelly, 2007, Teytaud, 2015].

3 Methods

3.1 Point set generation

For our experiments, we modify how the CMA-ES samples from a normal distribution by exchanging the uniform number generator with a selection of points from a fixed point set stored in a cache. The cache is randomly permuted once, at the beginning of an optimization run, and cycled through in steps of size λ𝜆\lambdaitalic_λ. The permutation is performed to break the bias that might be present in the ordering of the point set.

Refer to caption
Figure 1: (Average) log10(d2)subscript10subscriptsuperscript𝑑2\log_{10}(d^{*}_{2})roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( italic_d start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) star discrepancy for the generated fixed-size point sets across all dimensionalities. Colors are (min-max) normalized on a per-dimensionality basis; darker colors indicate a worse (higher) d2superscriptsubscript𝑑2d_{2}^{*}italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT value.

We consider four methods for generating point sets: Sobol, Halton, uniform, and ‘optimized’. Halton sequences are known to have unwanted correlations in higher dimensional spaces, and as such, we employ a scrambling method to prevent this [Braaten and Weller, 1979]. For the Sobol sequences, its balance properties require that the number of points generated is equal to a power of 2, so we always round up our number of points sampled to the closest power of 2. Finally, our ‘optimized’ method for generating low-discrepancy point sets is split into two parts based on search space dimensionality. When d=2𝑑2d=2italic_d = 2, we use optimized Fibonacci sets [Clément et al., 2023a], considered among the best low-discrepancy point sets available. However, these are only available for 2D since generating optimized low-discrepancy point sets is a very hard computational problem. We use the improved Threshold Accepting subset selection heuristic from [Clément et al., 2024] for higher dimensionalities, using Sobol sequences as the base sets.

We create point sets of sizes k{16,32,64,128,256}𝑘163264128256k\in\{16,32,64,128,256\}italic_k ∈ { 16 , 32 , 64 , 128 , 256 } for each generation mechanism and calculate their L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT star discrepancy111We use L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT instead of Lsubscript𝐿L_{\infty}italic_L start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT star discrepancy for computational reasons.222For the uniform points sets, we calculate the average over 100 samples.. Figure 1 shows how the discrepancy changes with increasing dimensionality and size of the point sets. We observe that, as expected, the standard uniform sampling has a noticeably higher discrepancy than both Halton and Sobol sequences, with the optimized method usually having the lowest discrepancy. Note that for the optimized method, the relative difference in discrepancy is largest when d=2𝑑2d=2italic_d = 2 due to the different generation mechanism used for this dimensionality.

Refer to caption
Figure 2: Average area under the EAF curve for each sampling method on the BBOB benchmark, grouped by dimension. Colors are (min-max) normalized on a per-dimensionality basis; darker colors indicate a worse (lower) EAF value.
Refer to caption
Figure 3: Empirical Attainment Function aggregated over all 24 BBOB functions for dimensionality 2. The methods are shown in color for each sampling strategy, with cache size indicating the number of points in the cache and \infty indicating no caching; every sample is unique. From left to right, the subfigures show results using a uniform sampler, a Sobol sequence, a scrambled Halton sequence, and the optimized point sets. Note that in the three rightmost figures, the default CMA-ES sampling strategy, UNIFORM-\infty, is included for comparison.

3.2 Experimental setup

To gauge the impact of our derandomization, we run the Modular CMA-ES [de Nobel et al., 2021] with each of the generated point sets on the single-objective, noiseless BBOB suite [Hansen et al., 2009]. This suite consists of 24 functions, of which we use 100100100100 different instances (generated by transformations such as rotation and translation) in dimensionalities d{2,5,10,20,40}𝑑25102040d\in\{2,5,10,20,40\}italic_d ∈ { 2 , 5 , 10 , 20 , 40 }. We set our evaluation budget at B=d10 000𝐵𝑑10000B=d\cdot 10\,000italic_B = italic_d ⋅ 10 000. By using only a single run on each instance, we only need to generate a single point set for each (d,N)𝑑𝑁(d,N)( italic_d , italic_N )-pair, while the fact that we use 100100100100 instances ensures a large enough sample size for each function. Additionally, we benchmark the default sampling mechanism, UNIFORM-\infty, and sampling from arbitrarily long scrambled Halton and Sobol sequences, denoted by the \infty suffix in our results. No caching is applied in these cases, and each generated sample in the optimization process is unique.

To measure performance, we consider the attainment-based cumulative distribution function [López-Ibáñez et al., 2024], with bounds 102superscript10210^{2}10 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and 108superscript10810^{-8}10 start_POSTSUPERSCRIPT - 8 end_POSTSUPERSCRIPT for the log-scaled precision values.

Reproducibility

All used point sets, the full experimental code, and data are in our reproducibility repository [de Nobel et al., 2024].

4 Results

Figure 2 shows each sampling method’s normalized area under the EAF curve. Without caching, we observe that both low-discrepancy sequences outperform the uniform sampler, and overall, the scrambled Halton shows the highest empirical performance. Note that the differences tend to decrease with dimensionality. When comparing the results within each fixed point set size, the uniform sampler often performs notably worse than any of the low-discrepancy sets. This effect is especially noticeable for smaller set sizes (lower values of k𝑘kitalic_k) but remains observable for all cache sizes. Interestingly, the optimized point sets only seem to bring performance benefits in low dimensionalities and using smaller cache sizes. For higher dimensionalities, they often perform similarly to uniform sampling. Overall, increasing the size of the cached point set increases performance, and using the complete sequence (\infty) yields the highest performance for each method. Interestingly, point sets of only 64 to 128 samples are often enough to get very close in performance to using the complete sequence as a sampling strategy. Moreover, we find that such sequences still outperform the default sampling strategy of the CMA-ES, i.e., UNIFORM-\infty, especially in lower dimensions. In 2D2𝐷2D2 italic_D, we can observe that only using 16 optimized samples can outperform the default sampling strategy.

Refer to caption
Figure 4: Average area under the Empirical Attainment Function over all BBOB functions, grouped by dimension vs. the L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT star discrepancy, normalized by dimensionality and the number of points. Lines indicate a linear (least-squares) model for each dimensionality.

When looking at the performance of the algorithm variants over time, as visualized in Figure 3, we notice that the differences between the different sampling methods occur rather early in the search. While the used set does not have a noticeable impact on the initialization, soon after, the small sets with low discrepancy are seen to stagnate, resulting in the much worse anytime performance observed in Figure 2. By comparing the different sampler to the dashed line of the default uniform sequence, we note that all sets of size 32 and higher outperform this baseline at the end of the optimization. Additionally, we should note that the optimized sets perform best among the smallest sample sizes, but the Halton and Sobol sequences benefit more from increasing their size. Plots for other dimensionalities are available in the appendix. Finally, we show the relation between discrepancy and performance in Figure 4, where we observe a clear correlation between these two measures. This supports our previous observation that low discrepancy point sets are beneficial to the performance of the CMA-ES.

4.1 Multiples of λ𝜆\lambdaitalic_λ

Since the CMA-ES only requires λ𝜆\lambdaitalic_λ points at every iteration, a natural assumption would be that only λ𝜆\lambdaitalic_λ unique points would be required for an effective sampling strategy, given that these points cover the domain sufficiently. In our previous experiments, we have used the default setting for λ𝜆\lambdaitalic_λ, i.e., 4+3+ln(d)43ln𝑑4+\lfloor 3+\text{ln}(d)\rfloor4 + ⌊ 3 + ln ( italic_d ) ⌋, which is strictly less than 16, the smallest tested cache size k𝑘kitalic_k. This causes the sampler to cycle through the point set, causing variance between the samples used in each generation. Here, we investigate this effect using either a population size λ𝜆\lambdaitalic_λ of 15 or 16. In the latter case, λ𝜆\lambdaitalic_λ is a multiple of the cache size, and for k=16𝑘16k=16italic_k = 16, this ensures that every generation uses exactly the same samples. For k<16𝑘16k<16italic_k < 16, this causes a cyclic pattern, where every k16𝑘16\frac{k}{16}divide start_ARG italic_k end_ARG start_ARG 16 end_ARG-th generation has the same samples. In figure 5, the empirical performance of this experiment is visualized. The figure shows that when using the default sampling strategy, i.e., UNIFORM-\infty, using a λ=16𝜆16\lambda=16italic_λ = 16 is better than using λ=15𝜆15\lambda=15italic_λ = 15. This is reversed for a cache size k=16𝑘16k=16italic_k = 16, where there is no variance between the samples in subsequent generations. For larger k𝑘kitalic_k, this performance of λ=15𝜆15\lambda=15italic_λ = 15 becomes closer to λ=16𝜆16\lambda=16italic_λ = 16. This indicates there must be variation between generations, not only between the samples within each generation, which aligns with [Teytaud and Gelly, 2007]. Intuitively, this makes sense, as the strategy parameters are being averaged over multiple generations, making diversity between subsequent populations important. Notably, though, using only 4-8 unique populations during the entire optimization procedure already yields higher performance than the UNIFORM-\infty sampling strategy (for d=2𝑑2d=2italic_d = 2 and d=5𝑑5d=5italic_d = 5).

Refer to caption
Figure 5: Empirical Attainment Function aggregated over all 24 BBOB functions for dimensionality 2 (left) and 5 (right), zoomed to final fraction reached. The default sampling strategy of the CMA-ES, UNIFORM-\infty, is shown in comparison to using a cached sampling strategy, which uses the ‘OPT’ samples for a cache size k{16,32,64,128}𝑘163264128k\in\{16,32,64,128\}italic_k ∈ { 16 , 32 , 64 , 128 }. The solid lines represent λ=15𝜆15\lambda=15italic_λ = 15 and the dashed lines λ=16𝜆16\lambda=16italic_λ = 16.

5 Conclusions and Future Work

In this paper, we have shown that replacing the uniform sampling with low discrepancy samples in the offspring generation of the CMA-ES can yield clear benefits in performance on commonly used benchmarks. While this matches previous observations on the advantage of low-discrepancy sequences [Teytaud, 2015], our experiments on repeatedly (re-)using small sets of points show that we don’t necessarily need to rely on generators when sampling with the CMA-ES. While we note that, in general, sampling from an arbitrarily long low discrepancy sequence is better than using a fixed point set, such sets remain competitive when using only 128 points. In fact, on the two-dimensional problems, where our Lsubscript𝐿L_{\infty}italic_L start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT-optimized point sets can be considered state-of-the-art [Clément et al., 2023a], a cached set consisting of only 16 samples is enough to outperform the default sampling strategy of the CMA-ES.

However, our results in higher dimensionalities show that while there is a clear correlation between the discrepancy of a point set and the performance of the CMA-ES using it, this correlation is not perfect. While our optimized sets generated by improved Threshold Accptance [Clément et al., 2024] have lower discrepancy than the corresponding Halton and Sobol sets, their anytime performance is slightly worse, indicating that there might be other aspects of the point sets we should take into account.

We have shown that not only does the discrepancy of samples within a single generation impact performance, but the diversity between subsequent generations also has an impact. Future work might focus on diving deeper into the relationship between point set size, discrepancy, diversity, and performance.

Acknowledgements

We want to thank François Clément and Carola Doerr for providing us with the optimized low-discrepancy point sets used in this paper.

REFERENCES

  • Bergstra and Bengio, 2012 Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of machine learning research, 13(2).
  • Beyer, 2001 Beyer, H. (2001). The theory of evolution strategies. Natural computing series. Springer.
  • Bousquet et al., 2017 Bousquet, O., Gelly, S., Kurach, K., Teytaud, O., and Vincent, D. (2017). Critical hyper-parameters: No random, no cry. arXiv preprint arXiv:1706.03200.
  • Braaten and Weller, 1979 Braaten, E. and Weller, G. (1979). An improved low-discrepancy sequence for multidimensional quasi-monte carlo integration. Journal of Computational Physics, 33(2):249–258.
  • Bäck et al., 2023 Bäck, T. H. W., Kononova, A. V., van Stein, B., Wang, H., Antonov, K. A., Kalkreuth, R. T., de Nobel, J., Vermetten, D., de Winter, R., and Ye, F. (2023). Evolutionary Algorithms for Parameter Optimization—Thirty Years Later. Evolutionary Computation, 31(2):81–122.
  • Clément et al., 2023a Clément, F., Doerr, C., Klamroth, K., and Paquete, L. (2023a). Constructing optimal lsubscript𝑙l_{\infty}italic_l start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT star discrepancy sets. arXiv preprint arXiv:2311.17463.
  • Clément et al., 2024 Clément, F., Doerr, C., and Paquete, L. (2024). Heuristic approaches to obtain low-discrepancy point sets via subset selection. Journal of Complexity, 83:101852.
  • Clément et al., 2023b Clément, F., Vermetten, D., De Nobel, J., Jesus, A. D., Paquete, L., and Doerr, C. (2023b). Computing star discrepancies with numerical black-box optimization algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1330–1338.
  • de Nobel et al., 2024 de Nobel, J., Vermetten, D., Kononova, A. V., and Bäck, T. (2024). Reproducibility files and additional figures.
  • de Nobel et al., 2021 de Nobel, J., Vermetten, D., Wang, H., Doerr, C., and Bäck, T. (2021). Tuning as a means of assessing the benefits of new ideas in interplay with existing algorithmic modules. In Krawiec, K., editor, GECCO ’21: Genetic and Evolutionary Computation Conference, Companion Volume, Lille, France, July 10-14, 2021, pages 1375–1384. ACM.
  • Galanti and Jung, 1997 Galanti, S. and Jung, A. (1997). Low-discrepancy sequences: Monte carlo simulation of option prices. The Journal of Derivatives, 5(1):63–83.
  • Halton, 1960 Halton, J. H. (1960). On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik, 2:84–90.
  • Hansen et al., 2009 Hansen, N., Finck, S., Ros, R., and Auger, A. (2009). Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. Research Report RR-6829, INRIA.
  • Hansen and Ostermeier, 2001 Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195.
  • Loh, 1996 Loh, W.-L. (1996). On latin hypercube sampling. The annals of statistics, 24(5):2058–2080.
  • López-Ibáñez et al., 2024 López-Ibáñez, M., Vermetten, D., Dréo, J., and Doerr, C. (2024). Using the empirical attainment function for analyzing single-objective black-box optimization algorithms. CoRR, abs/2404.02031.
  • Niederreiter, 1992 Niederreiter, H. (1992). Random number generation and quasi-Monte Carlo methods. SIAM.
  • Ostermeier et al., 1994 Ostermeier, A., Gawelczyk, A., and Hansen, N. (1994). A derandomized approach to self-adaptation of evolution strategies. Evolutionary Computation, 2(4):369–380.
  • Paulin et al., 2022 Paulin, L., Bonneel, N., Coeurjoly, D., Iehl, J.-C., Keller, A., and Ostromoukhov, V. (2022). MatBuilder: Mastering sampling uniformity over projections. ACM Transactions on Graphics (proceedings of SIGGRAPH).
  • Pausinger and Steinerberger, 2016 Pausinger, F. and Steinerberger, S. (2016). On the discrepancy of jittered sampling. Journal of Complexity, 33:199–216.
  • Peart, 1982 Peart, P. (1982). The dispersion of the hammersley sequence in the unit square. Monatshefte für Mathematik, 94(3):249–261.
  • Santner et al., 2003 Santner, T., Williams, B., and Notz, W. (2003). The Design and Analysis of Computer Experiments. Springer Series in Statistics, Springer.
  • Sobol’, 1967 Sobol’, I. M. (1967). On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 7(4):784–802.
  • Teytaud, 2015 Teytaud, O. (2015). Quasi-random numbers improve the CMA-ES on the BBOB testbed. In Bonnevay, S., Legrand, P., Monmarché, N., Lutton, E., and Schoenauer, M., editors, Artificial Evolution - 12th International Conference, Evolution Artificielle, EA 2015, Lyon, France, October 26-28, 2015. Revised Selected Papers, volume 9554 of Lecture Notes in Computer Science, pages 58–70. Springer.
  • Teytaud and Gelly, 2007 Teytaud, O. and Gelly, S. (2007). DCMA: yet another derandomization in covariance-matrix-adaptation. In Lipson, H., editor, Genetic and Evolutionary Computation Conference, GECCO 2007, Proceedings, London, England, UK, July 7-11, 2007, pages 955–963. ACM.
  • Zhou et al., 2013 Zhou, Y.-D., Fang, K.-T., and Ning, J.-H. (2013). Mixture discrepancy for quasi-random point sets. Journal of Complexity, 29(3-4):283–301.
(a) Dimensionality 5
Refer to caption
(b) Dimensionality 10
Refer to caption
(c) Dimensionality 20
Refer to caption
(d) Dimensionality 40
Refer to caption
Figure 6: Empirical Attainment Function aggregated over all 24 BBOB functions for dimensionalities 5, 10, 20, and 40 (from top to bottom). The methods are shown in color for each sampling strategy, with cache size indicating the number of points in the cache and \infty indicating no caching; every sample is unique. From left to right, the subfigures show results using a uniform sampler, a Sobol sequence, a scrambled Halton sequence, and the optimized point sets. Note that in the three rightmost figures, the default CMA-ES sampling strategy, UNIFORM-\infty, is included for comparison.