Design Editing for Offline Model-based Optimization

Ye Yuan^{1, 2}, Youyuan Zhang¹*, Can (Sam) Chen^{1, 2}, Haolun Wu^{1, 2}, Zixuan (Melody) Li^{1, 2}, Jianmo Li¹, James J. Clark¹, Xue Liu¹
¹ McGill University, ² Mila - Quebec AI Institute
ye.yuan3@mail.mcgill.ca, youyuan.zhang@mail.mcgill.ca,
can.chen@mila.quebec, haolun.wu@mail.mcgill.ca,
zixuan.li3@mail.mcgill.ca, jianmo.li@mail.mcgill.ca,
james.clark1@mcgill.ca, xueliu@cs.mcgill.ca
* Equal contribution with random order.
Corresponding author.

Abstract

Offline model-based optimization (MBO) aims to maximize a black-box objective function using only an offline dataset of designs and scores. These tasks span various domains, such as robotics, material design, and protein and molecular engineering. A prevalent approach involves training a conditional generative model on existing designs and their associated scores, followed by the generation of new designs conditioned on higher target scores. However, these newly generated designs often underperform due to the lack of high-scoring training data. To address this challenge, we introduce a novel method, Design Editing for Offline Model-based Optimization (DEMO), which consists of two phases. In the first phase, termed pseudo-target distribution generation, we apply gradient ascent on the offline dataset using a trained surrogate model, producing a synthetic dataset where the predicted scores serve as new labels. A conditional diffusion model is subsequently trained on this synthetic dataset to capture a pseudo-target distribution, which enhances the accuracy of the conditional diffusion model in generating higher-scoring designs. Nevertheless, the pseudo-target distribution is susceptible to noise stemming from inaccuracies in the surrogate model, consequently predisposing the conditional diffusion model to generate suboptimal designs. We hence propose the second phase, existing design editing, to directly incorporate the high-scoring features from the offline dataset into design generation. In this phase, top designs from the offline dataset are edited by introducing noise and are subsequently refined using the conditional diffusion model to produce high-scoring designs. Overall, high-scoring designs begin by inheriting high-scoring features in the second phase and are further refined with the more accurate conditional diffusion model from the first phase.
Empirical evaluations on $7$ offline MBO tasks show that DEMO outperforms various baseline methods, achieving the highest mean rank of $1.7$ and median rank of $1$.

1 Introduction

In numerous fields, a primary goal is to innovate and design new objects with specific desired traits [1]. This encompasses areas like robotics, material design, protein and molecular engineering [2, 3, 4, 5]. Conventionally, these objectives are pursued by iteratively testing a black-box objective function that maps a design to its property score. However, such testing can be expensive, time-consuming, or even hazardous [3, 4, 5, 6, 7]. Thus, it is more feasible to utilize an existing offline dataset of designs and their scores to find optimal solutions, without additional real-world testing [1]. This problem is known as offline model-based optimization (MBO). The aim of MBO is to identify a design that optimizes the black-box objective function using only the offline dataset.

A common strategy in MBO involves training a conditional generative model on the available offline dataset to capture the conditional probability distribution $p(\bm{x}|y)$ , where $\bm{x}$ denotes designs and $y$ represents property scores. The model then generates new designs conditioned on higher target scores. Essentially, conditional generative models are designed to establish a one-to-many relationship, mapping property scores to all possible designs. This becomes particularly challenging when the black-box objective function operates over a high-dimensional space. Fortunately, previous research has demonstrated that generative techniques can be effective in solving offline MBO tasks. For instance, the CbAS method utilizes a variational autoencoder [8], while MIN applies a generative adversarial network (GAN) [9, 10]. DDOM extends these techniques by integrating a classifier-free conditional diffusion model to enhance generative capabilities [11].

Nonetheless, one important yet unexplored problem with these generative model-based methods is their reliance on merely training on the offline dataset. This training approach results in models that effectively mimic the distribution of the offline dataset they are trained on but fail to capture the information of designs with higher scores. Therefore, while these models learn to replicate the distribution of existing designs, they struggle to consistently produce new designs that significantly outperform those in the offline dataset.

To address this challenge, we introduce an innovative and effective approach, Design Editing for Offline Model-based Optimization (DEMO). DEMO is structured into two primary phases: pseudo-target distribution generation and existing design editing. In the phase of pseudo-target distribution generation, to address the scarcity of high-scoring training data, we first augment new data by utilizing the offline dataset, which may contain pairs of superconductor materials and critical temperatures for example. To achieve this, a surrogate model, represented as $f_{\bm{\theta}}(\cdot)$ , is trained on the offline dataset $\mathcal{D}$ , and gradient ascent is applied to existing designs with respect to the surrogate model, creating a synthetic dataset $\mathcal{D}^{\prime}$ with predicted scores as new labels. As illustrated in Figure 1 (a), the surrogate model fits the offline data $p_{1}$ to $p_{5}$ , generating new data points $p_{a}$ and $p_{b}$ through gradient ascent. Subsequently, a classifier-free conditional diffusion model is trained on $\mathcal{D}^{\prime}$ to learn the conditional probability distribution of these synthetic designs along with their predicted scores. This diffusion model characterizes a pseudo-target distribution, which has improved accuracy in generating higher-scoring designs.

Figure 1: Illustration of DEMO: A conditional diffusion model, acting as the pseudo-target distribution, is trained on a synthetic dataset produced through a surrogate model. New designs are generated by modifying top existing designs using the diffusion model, under the guidance of target scores.

However, as shown in Figure 1 (a), the surrogate model may not accurately capture the black-box objective function, resulting in the pseudo-target distribution possibly containing noisy information stemming from the surrogate model. Thus, generating directly from the pseudo-target distribution could lead to some suboptimal designs, which have high predicted scores but low ground-truth scores, necessitating the second phase of DEMO. This phase, termed existing design editing, directly incorporates the high-scoring features from the offline dataset to provide more guidance to the design generation process. Specifically, we edit top designs from the offline dataset by introducing random noise to them and employing the conditional diffusion model from the first phase to remove the noise, guided by higher target scores. As illustrated in Figure 1 (a), after injecting noise, the distribution of top designs in the offline dataset (represented by the purple contour) has more overlap with the pseudo-target distribution (represented by the orange contour). By progressively removing the noise, we gradually project these existing top designs to the manifold of higher-scoring designs, as demonstrated in Figure 1 (b). In essence, DEMO produces new designs which first inherit high-scoring features from existing designs and then refine them by a more accurate conditional diffusion model.

In summary, this paper makes three principal contributions:

• We introduce a novel method, Design Editing for Offline Model-based Optimization (DEMO). DEMO operates in two main phases: the first, pseudo-target distribution generation, involves employing a surrogate model to create a synthetic dataset and training a conditional diffusion model on this synthetic dataset to serve as the pseudo-target distribution.

• The second phase, existing design editing, introduces random noise to existing top designs and uses the trained conditional diffusion model to refine them, resulting in designs which not only inherit high-scoring features from existing top designs but also achieve higher scores by leveraging information from the pseudo-target distribution.

• Extensive experiments demonstrate DEMO effectively and reliably generates new designs, yielding state-of-the-art results across $7$ offline MBO tasks, with the mean rank of $1.7$ and the median rank of $1$ among $16$ methods.

2 Preliminary

2.1 Offline Model-based Optimization

Offline model-based optimization (MBO) addresses a range of optimization challenges with the aim of maximizing a black-box objective function based on an offline dataset. Mathematically, we define the valid design space as $\mathcal{X}=\mathbb{R}^{d}$ , with $d$ representing the dimension of the design. Offline MBO is formulated as:

\bm{x}^{*}=\arg\max_{\bm{x}\in\mathcal{X}}f(\bm{x}),

(1)

where $f(\cdot)$ is the black-box objective function, and $\bm{x}\in\mathcal{X}$ is a potential design. For the optimization process, we utilize an offline dataset $\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{N}$ , with $\bm{x}_{i}$ representing an existing design, such as a superconductor material, and $y_{i}$ representing the associated property score, such as the critical temperature. Usually, this optimization process outputs $K$ candidates for optimal designs, where $K$ is a small budget to test the black-box objective function. The offline MBO problem also finds applications in other areas, like robot design, as well as protein and molecule engineering.

2.2 Classifier-free Conditional Diffusion Models

Diffusion models stand out in the family of generative models due to their unique approach involving forward diffusion and backward denoising processes. The essence of diffusion models is to gradually add noise to a sample, followed by training a neural network to reverse this noise addition, thus recovering the original data distribution. In this work, we follow the formulation of diffusion models with continuous time [12, 13]. Here, $\bm{x}_{t}$ is a random variable denoting the state of a data point at time $t\in\left[0,T\right]$ . The diffusion process is defined by a stochastic differential equation (SDE):

\differential\bm{x}=\bm{f}(\bm{x},t)\differential t+g(t)\differential\bm{w},

(2)

where $\bm{f}(\cdot,t)$ is the drift coefficient of $\bm{x}_{t}$ , $g(\cdot)$ is the diffusion coefficient of $\bm{x}_{t}$ , and $\bm{w}$ is a standard Wiener process. The backward denoising process is given by the reverse time SDE:

\differential\bm{x}=\left[\bm{f}(\bm{x},t)-g(t)^{2}\nabla_{\bm{x}}\log p_{t}(\bm{x})\right]\differential t+g(t)\,\differential\bar{\bm{w}},

(3)

where $\differential t$ represents a negative infinitesimal step in time, and $\bar{\bm{w}}$ is a reverse time Wiener process. The gradient of the log probability, $\nabla_{\bm{x}}\log p_{t}(\bm{x})$, is approximated by a neural network $s_{\bm{\phi}}(\bm{x}_{t},t)$ with score-matching objectives [14, 15].

Beyond basic diffusion models, our focus is to train a conditional diffusion model that learns the conditional probability distribution of designs based on their associated property scores. To incorporate conditions into diffusion models, Ho et al. [16] express the score function as a combination of conditional and unconditional components, an approach known as classifier-free diffusion models. Specifically, a single neural network, $s_{\bm{\phi}}(\bm{x}_{t},t,y)$, is trained to handle both components by utilizing $y$ as the condition or leaving it empty for the unconditional case. Formally, the guided score, denoted $\tilde{s}_{\bm{\phi}}$, combines the two as follows:

\tilde{s}_{\bm{\phi}}(\bm{x}_{t},t,y)=(1+\omega)s_{\bm{\phi}}(\bm{x}_{t},t,y)-\omega s_{\bm{\phi}}(\bm{x}_{t},t),

(4)

where $\omega$ is a parameter that adjusts the influence of the conditions. A higher value of $\omega$ ensures that the generation process adheres more closely to the specified conditions, while a lower $\omega$ value allows greater flexibility in the outputs.
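The combination in Eq. (4) can be illustrated with a minimal sketch. The two input vectors below are hypothetical stand-ins for the outputs of the trained network with and without the condition $y$; the function name `guided_score` is ours and is not part of any library.

```python
from typing import List, Sequence


def guided_score(cond: Sequence[float], uncond: Sequence[float], omega: float) -> List[float]:
    """Combine conditional and unconditional score estimates as in Eq. (4)."""
    if len(cond) != len(uncond):
        raise ValueError("score vectors must have the same dimension")
    return [(1.0 + omega) * c - omega * u for c, u in zip(cond, uncond)]


# Hypothetical network outputs for one 2-dimensional design at some time t.
s_cond = [1.0, 2.0]    # stands in for s_phi(x_t, t, y)
s_uncond = [0.5, 1.0]  # stands in for s_phi(x_t, t)

# omega = 0 reduces to the purely conditional score.
no_guidance = guided_score(s_cond, s_uncond, omega=0.0)
# omega = 1 moves further along the direction (cond - uncond).
with_guidance = guided_score(s_cond, s_uncond, omega=1.0)
```

Setting $\omega=0$ recovers the conditional score unchanged, which is a quick way to check an implementation of the guidance rule.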

3 Related Works

3.1 Offline Model-based Optimization

Recent offline model-based optimization (MBO) techniques broadly fall into two categories: (i) those that employ gradient-based optimizations and (ii) those that create new designs via generative models. Gradient-based methods often employ regularization techniques that enhance either the surrogate model [17, 18, 19] or the design itself [20, 21], thus improving the model’s robustness and generalization capacity. It’s worth noting that while some approaches also involve synthesizing new data with pseudo labels [22, 23], they aim to identify useful information from these synthetic data to correct the surrogate model’s inaccuracies. The second category encompasses methods that learn to replicate the distribution of existing designs and include approaches such as MIN [9], CbAS [8], Auto CbAS [24], and DDOM [11]. These methods are known for their ability to generate innovative designs by sampling from learned distributions. DEMO distinguishes itself by training a conditional diffusion model that learns a pseudo-target distribution and incorporating features from existing top designs, which facilitates effectively and consistently generating new superior candidates.

3.2 Diffusion-Based Editing

Diffusion models have shown remarkable success in various generation tasks across multiple modalities, especially for their ability to control the generation process based on given conditions. For instance, recent advancements have utilized diffusion models for zero-shot, test-time editing in the domains of text-based image and video generation. SDEdit [25] employs an editing strategy to balance realism and faithfulness in image generation. To improve the reconstruction quality, methodologies such as DDIM Inversion [26], Null-text Inversion [27] and Negative-prompt Inversion [28] concentrate on deterministic mappings from source latents to initial noise, conditioned on source text. Building on these, CycleDiffusion [29] and Direct Inversion [30] leverage source latents from each inversion step and further improve the faithfulness of the target image to the source image. Following the image editing technique, several video editing methods [31, 32, 33, 34, 35, 36] adopt image diffusion models and enforce temporal consistency across frames, offering practical and efficient solutions for video editing. Inspired by the success of these editing techniques in the field of computer vision, we edit existing top designs towards a pseudo-target distribution in the context of the offline MBO problem, enhancing both the effectiveness and reliability of generating new designs.

4 Methodology

In this section, we elaborate on the details of our proposed Design Editing for Offline Model-based Optimization (DEMO), including two phases. We introduce the first phase, named pseudo-target distribution generation, in section 4.1. This phase trains a conditional diffusion model, serving as the pseudo-target distribution, on a synthetic dataset created by performing gradient ascent with respect to a surrogate model trained on the offline dataset. While the first phase achieves a more accurate conditional diffusion model capable of generating designs with higher scores than a model trained solely on the offline dataset, it is susceptible to noise caused by inaccuracies in the surrogate model. This motivates the second phase, termed Existing Design Editing, described in section 4.2, which explicitly incorporates high-scoring features from existing top designs. Intuitively, one can make an analogy of our method to writing code for a new research project. In coding for research, the initial step often involves sourcing and adapting useful existing code from previous projects, tailoring it to new requirements through modifications and enhancements. In a similar fashion, DEMO generates new designs by initially inheriting high-scoring features from top existing designs (akin to reusing existing code) and subsequently refining them through a more accurate conditional diffusion model (akin to modifying code for a new purpose). Algorithm 1 illustrates the complete process of DEMO.

4.1 Pseudo-target Distribution Generation

Due to the scarcity of high-scoring training data, conditional generative models trained only on the offline dataset often fail to consistently produce new designs that substantially surpass the existing ones. One promising yet underutilized approach to address this issue is to first generate a synthetic dataset by applying gradient ascent on existing designs using a trained surrogate model. Conditional generative models trained on this synthetic dataset capture a pseudo-target distribution and are therefore more adept at creating designs with higher scores.

Creation of Synthetic Dataset. Initially, a deep neural network (DNN), denoted as $f_{\bm{\theta}}(\cdot)$ with parameters $\bm{\theta}$, is trained on the offline dataset $\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{N}$, where $\bm{x}_{i}$ and $y_{i}$ denote a design and its associated score, respectively. The parameters $\bm{\theta}$ are optimized as:

\bm{\theta}^{*}=\arg\min_{\bm{\theta}}\frac{1}{N}\sum_{i=1}^{N}\left(f_{\bm{\theta}}(\bm{x}_{i})-y_{i}\right)^{2}.

(5)

The solution $f_{\bm{\theta}^{*}}(\cdot)$ obtained from Eq. (5) serves as a surrogate for the unknown black-box objective function $f(\cdot)$ in Eq. (1). New data are then generated by performing gradient ascent on the existing designs with respect to the learned surrogate model $f_{\bm{\theta}^{*}}(\cdot)$ . For a design $\bm{x}_{i}$ in $\mathcal{D}$ , we update it as:

\bm{x}_{i,t}=\bm{x}_{i,t-1}+\eta\nabla_{\bm{x}}f_{\bm{\theta}^{*}}(\bm{x})\Big|_{\bm{x}=\bm{x}_{i,t-1}},\quad\text{for }t\in\{1,\cdots,T\},

(6)

where $T$ is the total number of iterations, and $\eta$ is the step size for the gradient ascent update. The initial point $\bm{x}_{i,0}$ is the same as $\bm{x}_{i}$, and $\bm{x}_{i,T}$ acquired at step $T$ is a synthetic design with an enhanced predicted score. By using each design in the offline dataset $\mathcal{D}$ in turn as the initial point, a synthetic dataset $\mathcal{D}^{\prime}$ of the same size as $\mathcal{D}$ is created, with predicted scores as labels. This process is outlined from line $2$ to line $8$ in Algorithm 1.
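The construction of the synthetic dataset can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the surrogate here is a hypothetical concave quadratic with an analytic gradient, standing in for the trained DNN $f_{\bm{\theta}^{*}}(\cdot)$, and all function names are ours. The step size and iteration count reuse the continuous-task values reported in the implementation details.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]

# Hypothetical stand-in for the trained surrogate: a concave quadratic
# with its maximum at CENTER. A real surrogate would be a neural network.
CENTER = [1.0, -2.0]


def surrogate(x: Sequence[float]) -> float:
    return -sum((xi - ci) ** 2 for xi, ci in zip(x, CENTER))


def surrogate_grad(x: Sequence[float]) -> Vector:
    # Analytic gradient of the quadratic above; autodiff would replace this.
    return [-2.0 * (xi - ci) for xi, ci in zip(x, CENTER)]


def gradient_ascent(x0: Sequence[float],
                    grad_fn: Callable[[Sequence[float]], Vector],
                    eta: float,
                    num_steps: int) -> Vector:
    """Apply the update of Eq. (6) for num_steps iterations."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad_fn(x)  # gradient evaluated at the previous iterate
        x = [xi + eta * gi for xi, gi in zip(x, g)]
    return x


def build_synthetic_dataset(designs: Sequence[Sequence[float]],
                            eta: float = 1e-3,
                            num_steps: int = 100) -> List[Tuple[Vector, float]]:
    """Pair every ascended design with its predicted (not ground-truth) score."""
    synthetic = []
    for x0 in designs:
        x_T = gradient_ascent(x0, surrogate_grad, eta, num_steps)
        synthetic.append((x_T, surrogate(x_T)))
    return synthetic


offline_designs = [[0.0, 0.0], [2.0, 1.0], [-1.0, -3.0]]
synthetic_dataset = build_synthetic_dataset(offline_designs)
```

Note that the labels in `synthetic_dataset` come from the surrogate itself, so any surrogate error is carried into the labels; this is the noise that the second phase is meant to counteract.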

Training of Conditional Diffusion Model. We employ a classifier-free conditional diffusion model [16] to learn the conditional probability distribution of synthetic designs and their predicted scores in $\mathcal{D}^{\prime}$ , which captures a pseudo-target distribution. Following the approach in DDOM [11], we use the Variance Preserving (VP) stochastic differential equation (SDE) for the forward diffusion process, as specified in [12]:

\differential\bm{x}=-\frac{\beta(t)}{2}\bm{x}\differential t+\sqrt{\beta(t)}% \differential\bm{w},

(7)

where $\beta(t)$ is a continuous time function for $t\in[0,1]$ . The forward process in DDPM [37] is proved to be a discretization of Eq. (7) [12]. To integrate conditions in the backward denoising process, we need to train a DNN $s_{\bm{\phi}}(\bm{x}_{t},t,y)$ with parameters $\bm{\phi}$ , conditioned on the time $t$ and the score $y$ associated with the unperturbed design $\bm{x}_{0}$ corresponding to $\bm{x}_{t}$ . The parameters $\bm{\phi}$ are optimized as:

\bm{\phi}^{*}=\arg\min_{\bm{\phi}}\mathbb{E}_{t}\left[\lambda(t)\mathbb{E}_{\bm{x}_{0},y}\left[\mathbb{E}_{\bm{x}_{t}|\bm{x}_{0}}\left[\|s_{\bm{\phi}}(\bm{x}_{t},t,y)-\nabla_{\bm{x}}\log p_{t}(\bm{x}_{t}|\bm{x}_{0})\|^{2}\right]\right]\right],

(8)

where $\lambda(t)$ is a positive weighting function depending on time. Since we train on the synthetic dataset $\mathcal{D}^{\prime}$ , the model optimized according to Eq. (8) more accurately represents the gradient of the logarithm of a pseudo-target distribution. This distribution essentially reflects the marginal probability distribution of designs that have enhanced predicted scores. With the optimized model $s_{\bm{\phi}^{*}}(\bm{x}_{t},t,y)$ , we thereby improve the accuracy in generating new high-scoring designs by simulating the backward denoising process. This part is described in Line $9$ of Algorithm 1.
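To make the regression target in Eq. (8) concrete, the sketch below computes the perturbation kernel of the VP SDE in Eq. (7) and the resulting score target for a single design. It assumes a linear schedule $\beta(t)=\beta_{\min}+t(\beta_{\max}-\beta_{\min})$ with $\beta_{\min}=0.1$ and $\beta_{\max}=20$; the text does not state the schedule, so these values are assumptions for illustration only, as are the function names.

```python
import math
import random
from typing import List, Sequence, Tuple

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear schedule, not taken from the paper


def integrated_beta(t: float) -> float:
    """Integral of beta(s) = BETA_MIN + s * (BETA_MAX - BETA_MIN) over [0, t]."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t


def vp_kernel(t: float) -> Tuple[float, float]:
    """Mean scale and standard deviation of p_t(x_t | x_0) under the VP SDE."""
    b = integrated_beta(t)
    return math.exp(-0.5 * b), math.sqrt(1.0 - math.exp(-b))


def perturb_and_target(x0: Sequence[float], t: float,
                       rng: random.Random) -> Tuple[List[float], List[float]]:
    """Sample x_t from the kernel and return it with the score target."""
    mean_scale, std = vp_kernel(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [mean_scale * xi + std * e for xi, e in zip(x0, eps)]
    # Score of a Gaussian kernel: -(x_t - mean) / variance = -eps / std.
    target = [-e / std for e in eps]
    return x_t, target


def weighted_sq_error(predicted: Sequence[float], target: Sequence[float],
                      weight: float) -> float:
    """One-sample estimate of the inner term of Eq. (8) with weight lambda(t)."""
    return weight * sum((p - q) ** 2 for p, q in zip(predicted, target))


rng = random.Random(0)
x0 = [0.3, -1.2, 0.8]
t = 0.5
x_t, target = perturb_and_target(x0, t, rng)
loss_at_optimum = weighted_sq_error(target, target, weight=1.0)
```

In training, `predicted` would be the network output $s_{\bm{\phi}}(\bm{x}_{t},t,y)$, with $t$ sampled away from zero so that the kernel variance stays positive.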

4.2 Existing Design Editing

Due to potential inaccuracies of the surrogate model $f_{\bm{\theta}^{*}}(\cdot)$ in representing the black-box objective function, the synthetic dataset $\mathcal{D}^{\prime}$ might include noisy data. Therefore, directly generating from the pseudo-target distribution could lead to suboptimal new designs. Driven by the success of editing techniques in image synthesis tasks [25, 38], we explore the potential of creating new designs from top existing designs, instead of initiating from a random latent variable sampled from the standard Gaussian prior. We perturb a top existing design $\bm{x}_{top}$ by introducing noise at a specific time $m$ out of $\{1,\cdots,M\}$, given auxiliary noise levels $\beta_{1},\cdots,\beta_{M}$:

\bm{x}_{perturb}=\bm{x}_{top}+\sqrt{1-\bar{\alpha}_{m}}\epsilon,

(9)

where $\alpha_{m}=1-\beta_{m}$ , $\bar{\alpha}_{m}=\prod_{s=1}^{m}\alpha_{s}$ , and $\epsilon\sim\mathcal{N}(\bm{0},\textbf{I})$ . This results in a closed form that samples $\bm{x}_{perturb}\sim\mathcal{N}(\bm{x}_{top},(1-\bar{\alpha}_{m})\textbf{I})$ . The perturbed design is then used as the starting point. Given a target property score $\hat{y}$ , a new design is synthesized using a second-order Heun’s sampler [11] with the model $s_{\bm{\phi}^{*}}(\cdot)$ . To yield $K$ candidate optimal designs, we select the top $K$ designs from $\mathcal{D}$ to obtain various perturbed designs and denoise them conditioned on $\hat{y}$ . Lines $11$ to $16$ of Algorithm 1 present the process of this phase.
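The selection and perturbation steps can be sketched as below. The sketch applies Eq. (9) exactly as written, uses $M=1000$ and $m=400$ from the implementation details, and assumes a linear noise schedule from $10^{-4}$ to $0.02$, which the text does not specify. The toy dataset and all function names are hypothetical, and the denoising step with Heun's sampler is omitted.

```python
import math
import random
from typing import List, Sequence, Tuple

Design = List[float]


def linear_betas(M: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Assumed linear noise levels beta_1..beta_M (not specified in the text)."""
    step = (beta_end - beta_start) / (M - 1)
    return [beta_start + s * step for s in range(M)]


def alpha_bar(betas: Sequence[float], m: int) -> float:
    """Product of (1 - beta_s) for s = 1..m."""
    prod = 1.0
    for beta in betas[:m]:
        prod *= 1.0 - beta
    return prod


def top_k(dataset: Sequence[Tuple[Design, float]], k: int) -> List[Tuple[Design, float]]:
    """The k designs with the highest scores in the offline dataset."""
    return sorted(dataset, key=lambda pair: pair[1], reverse=True)[:k]


def perturb(x_top: Sequence[float], noise_std: float, rng: random.Random) -> Design:
    """Eq. (9): add Gaussian noise with standard deviation sqrt(1 - alpha_bar_m)."""
    return [xi + noise_std * rng.gauss(0.0, 1.0) for xi in x_top]


rng = random.Random(0)
betas = linear_betas(M=1000)
noise_std = math.sqrt(1.0 - alpha_bar(betas, m=400))

# Toy offline dataset of (design, score) pairs.
offline = [([0.1, 0.2], 0.3), ([0.5, 0.4], 0.9), ([0.7, 0.1], 0.6)]
starting_points = [perturb(x, noise_std, rng) for x, _ in top_k(offline, k=2)]
```

A larger $m$ yields a smaller $\bar{\alpha}_{m}$ and therefore more noise, so $m$ controls how far the edited design may drift from the original top design before denoising.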

Algorithm 1 Design Editing for Offline Model-based Optimization

Input: Offline dataset $\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{N}$, a target score $\hat{y}$, and a time $m$.
Output: $K$ candidate optimal designs.

1 /* Pseudo-target Distribution Generation */
2 Initialize a surrogate model $f_{\bm{\theta}}(\cdot)$ and optimize $\bm{\theta}$ with Eq. (5) to obtain $f_{\bm{\theta}^{*}}(\cdot)$
3 $\mathcal{D}^{\prime}=\{\}$
4 for $i=1,2,\cdots,N$
5     $\bm{x}_{i,0}\longleftarrow\bm{x}_{i}$
6     for $t=1,2,\cdots,T$
7         Update $\bm{x}_{i,t}$ with Eq. (6).
8     Append $(\bm{x}_{i,T},f_{\bm{\theta}^{*}}(\bm{x}_{i,T}))$ to $\mathcal{D}^{\prime}$
9 Initialize $s_{\bm{\phi}}(\cdot)$ and optimize $\bm{\phi}$ with Eq. (8) on $\mathcal{D}^{\prime}$ to obtain $s_{\bm{\phi}^{*}}(\cdot)$
10 /* Existing Design Editing */
11 Candidates = $\{\}$
12 for $k=1,2,\cdots,K$
13     Select design $\bm{x}_{top}$ with the $k$-th best score among all designs in $\mathcal{D}$
14     Perturb $\bm{x}_{top}$ with Eq. (9) and the given time $m$
15     Denoise $\bm{x}_{perturb}$ and generate $\bm{x}_{new}$ using Heun's method with $s_{\bm{\phi}^{*}}(\cdot)$ and $\hat{y}$
16     Append $\bm{x}_{new}$ to Candidates
17 return Candidates

5 Experiments

This section first describes the experiment setup, followed by the implementation details and results. We aim to answer the following questions in this section: (Q1) Is our proposed DEMO more effective than baseline methods in solving the offline MBO problem? (Q2) Are the two phases described in section 4 both necessary? (Q3) Compared to existing generative model-based approaches, can DEMO more reliably and consistently generate new higher-scoring designs?

5.1 Dataset and Tasks

We carry out experiments on $7$ tasks selected from Design-Bench [1] and BayesO Benchmarks [39], including $4$ continuous tasks and $3$ discrete tasks. The continuous tasks are as follows: (i) Superconductor (SuperC) [5], where the goal is to create a superconductor with $86$ continuous components to maximize critical temperature, using $17,010$ designs; (ii) Ant Morphology (Ant) [1, 40], where the objective is to design a four-legged ant with $60$ continuous components to increase crawling speed, based on $10,004$ designs; (iii) D’Kitty Morphology (D’Kitty) [1, 41], where the focus is on designing a four-legged D’Kitty with $56$ continuous components to enhance crawling speed, using $10,004$ designs; (iv) Inverse Levy Function (Levy) [39], where the aim is to maximize function values of the inverse black-box Levy function with $60$ input dimensions, using $15,000$ designs. The discrete tasks include: (v) TF Bind $8$ (TF8) [6], where the goal is to identify an $8$-unit DNA sequence that maximizes binding activity score, with $32,898$ designs; (vi) TF Bind $10$ (TF10) [6], where the objective is to find a $10$-unit DNA sequence that optimizes binding activity score, using $50,000$ designs; (vii) NAS [42], where the aim is to discover the optimal neural network architecture to improve test accuracy on the CIFAR-10 dataset [43], using $1,771$ designs.

5.2 Evaluation and Metrics

Following the evaluation protocol used in previous studies [1, 11, 22], we assume the budget $K=256$ and generate $256$ new designs for each method. The $100$ -th (max) percentile normalized ground-truth score is reported in section 5.5, and the $50$ -th (median) percentile score is provided in Appendix A.1. This normalized score is calculated as $y_{n}=\frac{y-y_{\text{min}}}{y_{\text{max}}-y_{\text{min}}},$ where $y_{\text{min}}$ and $y_{\text{max}}$ are the minimum and maximum scores in the entire offline dataset, respectively. For better comparison, we include the normalized score of the best design in the offline dataset, denoted as $\mathcal{D}(\textbf{best})$ . Additionally, we provide mean and median rankings across all $7$ tasks for a comprehensive performance evaluation.
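The normalization above is a plain min-max rescaling, sketched below with hypothetical raw scores. The function name and the numbers are ours and serve only to show that a normalized score is not confined to $[0,1]$: a design scoring above $y_{\text{max}}$ maps above $1$, and one scoring below $y_{\text{min}}$ maps below $0$, which is how values such as $1.005$ or $-1.895$ can appear in the result tables.

```python
def normalize_score(y: float, y_min: float, y_max: float) -> float:
    """Min-max normalization y_n = (y - y_min) / (y_max - y_min)."""
    if y_max <= y_min:
        raise ValueError("y_max must be strictly greater than y_min")
    return (y - y_min) / (y_max - y_min)


# Hypothetical raw scores, chosen only for illustration.
y_min, y_max = 10.0, 110.0
inside = normalize_score(60.0, y_min, y_max)   # halfway between the reference scores
beyond = normalize_score(120.0, y_min, y_max)  # better than y_max, so above 1
below = normalize_score(0.0, y_min, y_max)     # worse than y_min, so negative
```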

5.3 Comparison Methods

We benchmark DEMO against three groups of baseline approaches: (i) traditional methods, (ii) those utilizing gradient optimizations from current designs, and (iii) those employing generative models for sampling. Traditional methods include: (1) BO-qEI [44]: conducts Bayesian Optimization to maximize the surrogate, proposes designs using the quasi-Expected-Improvement acquisition function, and labels the designs using the surrogate model. (2) CMA-ES [45]: progressively adjusts the distribution toward the optimal design by altering the covariance matrix. (3) REINFORCE [46]: optimizes the distribution over the input space using the learned surrogate. The second category includes: (4) Grad: performs simple gradient ascent on existing designs to create new ones. (5) Mean: optimizes the average prediction of an ensemble of surrogate models. (6) Min: optimizes the lowest prediction from a group of learned objective functions. (7) COMs [18]: applies regularization to assign lower scores to designs derived through gradient ascent. (8) ROMA [17]: introduces smoothness regularization to the DNN. (9) NEMO [19]: limits the discrepancy between the surrogate and the black-box objective function using normalized maximum likelihood before performing gradient ascent. (10) BDI [21]: employs forward and backward mappings to transfer knowledge from the offline dataset to the design. (11) IOM [47]: ensures representation consistency between the training dataset and the optimized designs. Generative model-based methods include: (12) CbAS [8], which adapts a VAE model to steer the design distribution toward areas with higher scores. (13) Auto CbAS [24], which uses importance sampling to update a regression model based on CbAS. (14) MIN [9], which establishes a relationship between scores and designs and seeks optimal designs within this framework. (15) DDOM [11], which learns a generative diffusion model conditioned on the score values.

5.4 Implementation Details

We follow the training protocols from [18] for all comparative methods unless stated otherwise. A $3$-layer MLP with ReLU activation is used for both $f_{\bm{\theta}}(\cdot)$ and $s_{\bm{\phi}}(\cdot)$, with a hidden layer size of $2048$. In Algorithm 1, the iteration count, $T$, is set to $100$ for both continuous and discrete tasks. The Adam optimizer [48] is used to train the surrogate models over $200$ epochs with a batch size of $128$ and a learning rate of $1e-1$. The step size, $\eta$, in Eq. (6) is set to $1e-3$ for continuous tasks and $1e-1$ for discrete tasks. The conditional diffusion model, $s_{\bm{\phi}}(\cdot)$, is trained for $1000$ epochs with a batch size of $128$. For the existing design editing, following previous studies [49, 11], we set the target score, $\hat{y}$, to $1$ and $M$ to $1000$. The selected value of $m$ is $400$, with further elaboration provided in Appendix A.2. Results for traditional methods are taken from [1], and we conduct $8$ independent trials for other methods, reporting the mean and standard error. All experiments are conducted on a single NVIDIA V100 GPU, with execution times per trial ranging from $10$ minutes to $20$ hours, depending on the specific task.

Table 1: Experimental results on continuous tasks for comparison.

Method	Superconductor	Ant Morphology	D’Kitty Morphology	Levy
$\mathcal{D}$ (best)	$0.399$	$0.565$	$0.884$	$0.613$
BO-qEI	$0.402\pm 0.034$	$0.819\pm 0.000$	$0.896\pm 0.000$	$0.810\pm 0.016$
CMA-ES	$0.465\pm 0.024$	$\textbf{1.214}\pm\textbf{0.732}$	$0.724\pm 0.001$	$0.887\pm 0.025$
REINFORCE	$0.481\pm 0.013$	$0.266\pm 0.032$	$0.562\pm 0.196$	$0.564\pm 0.090$
Grad	$0.489\pm 0.018$	$0.927\pm 0.027$	$0.949\pm 0.014$	$0.948\pm 0.031$
Mean	$0.505\pm 0.013$	$0.940\pm 0.014$	$\textbf{0.956}\pm\textbf{0.014}$	$\textbf{0.984}\pm\textbf{0.023}$
Min	$0.501\pm 0.019$	$0.918\pm 0.034$	$0.942\pm 0.009$	$0.964\pm 0.023$
COMs	$0.481\pm 0.028$	$0.842\pm 0.037$	$0.926\pm 0.019$	$0.936\pm 0.025$
ROMA	$\textbf{0.509}\pm\textbf{0.015}$	$0.916\pm 0.030$	$0.929\pm 0.013$	$0.976\pm 0.019$
NEMO	$0.502\pm 0.002$	$0.955\pm 0.006$	$\textbf{0.952}\pm\textbf{0.004}$	$0.969\pm 0.019$
BDI	$0.513\pm 0.000$	$0.906\pm 0.000$	$0.919\pm 0.000$	$0.938\pm 0.000$
IOM	$\textbf{0.518}\pm\textbf{0.020}$	$0.922\pm 0.030$	$0.944\pm 0.012$	$\textbf{0.988}\pm\textbf{0.021}$
CbAS	$\textbf{0.503}\pm\textbf{0.069}$	$0.876\pm 0.031$	$0.892\pm 0.008$	$0.938\pm 0.037$
Auto CbAS	$0.421\pm 0.045$	$0.882\pm 0.045$	$0.906\pm 0.006$	$0.797\pm 0.033$
MIN	$0.499\pm 0.017$	$0.445\pm 0.080$	$0.892\pm 0.011$	$0.761\pm 0.037$
DDOM	$0.486\pm 0.013$	$0.952\pm 0.007$	$0.941\pm 0.006$	$0.927\pm 0.031$
DEMO (ours)	$\textbf{0.520}\pm\textbf{0.006}$	$\textbf{0.971}\pm\textbf{0.005}$	$\textbf{0.957}\pm\textbf{0.006}$	$\textbf{1.005}\pm\textbf{0.020}$

Table 2: Experimental results on discrete tasks, and ranking on all tasks for comparison.

Method	TF Bind $8$	TF Bind $10$	NAS	Rank Mean	Rank Median
$\mathcal{D}$ (best)	$0.439$	$0.467$	$0.436$
BO-qEI	$0.798\pm 0.083$	$0.652\pm 0.038$	$\textbf{1.079}\pm\textbf{0.059}$	$11.1/16$	$13/16$
CMA-ES	$0.953\pm 0.022$	$0.670\pm 0.023$	$0.985\pm 0.079$	$7.1/16$	$3/16$
REINFORCE	$0.948\pm 0.028$	$0.663\pm 0.034$	$-1.895\pm 0.000$	$12.1/16$	$16/16$
Grad	$0.898\pm 0.033$	$0.638\pm 0.022$	$0.611\pm 0.052$	$8.9/16$	$10/16$
Mean	$0.895\pm 0.020$	$0.654\pm 0.028$	$0.663\pm 0.058$	$6.4/16$	$5/16$
Min	$0.931\pm 0.036$	$0.634\pm 0.033$	$0.708\pm 0.027$	$8.0/16$	$8/16$
COMs	$0.474\pm 0.053$	$0.625\pm 0.010$	$0.796\pm 0.029$	$11.1/16$	$12/16$
ROMA	$0.921\pm 0.040$	$0.669\pm 0.035$	$0.934\pm 0.025$	$5.7/16$	$4/16$
NEMO	$0.942\pm 0.003$	$0.708\pm 0.010$	$0.735\pm 0.012$	$4.6/16$	$5/16$
BDI	$0.870\pm 0.000$	$0.605\pm 0.000$	$0.722\pm 0.000$	$9.7/16$	$10/16$
IOM	$0.870\pm 0.074$	$0.648\pm 0.025$	$0.411\pm 0.044$	$7.6/16$	$7/16$
CbAS	$0.927\pm 0.051$	$0.651\pm 0.060$	$0.683\pm 0.079$	$9.3/16$	$8/16$
Auto CbAS	$0.910\pm 0.044$	$0.630\pm 0.045$	$0.506\pm 0.074$	$12.4/16$	$13/16$
MIN	$0.905\pm 0.052$	$0.616\pm 0.021$	$0.717\pm 0.046$	$12.3/16$	$13/16$
DDOM	$\textbf{0.961}\pm\textbf{0.024}$	$0.640\pm 0.029$	$0.737\pm 0.014$	$7.3/16$	$7/16$
DEMO_(ours)	$\textbf{0.980}\pm\textbf{0.004}$	$\textbf{0.762}\pm\textbf{0.058}$	$0.766\pm 0.017$	$1.7/16$	$1/16$

5.5 Results

Performance in Continuous Tasks. Table 1 presents the results of the $4$ continuous tasks. DEMO reaches state-of-the-art performance on all of them. When compared to other generative model-based approaches, such as MIN and DDOM, DEMO generally outperforms them because these methods train models only on the offline dataset and may not learn characteristics of higher-scoring designs. DEMO achieves better performance by effectively mitigating this issue. Moreover, DEMO beats gradient-based methods, like Grad and COMs, by leveraging guidance from existing top designs and a higher target score simultaneously. This indicates that DEMO is effective for continuous tasks.

Performance in Discrete Tasks. Table 2 presents the results of the $3$ discrete tasks. DEMO attains the top performance on TF Bind $8$ and TF Bind $10$, and its result on TF Bind $10$ surpasses the other methods by a significant margin, suggesting that DEMO can solve discrete offline MBO tasks. Nonetheless, DEMO underperforms on NAS, for which we see two possible reasons. First, each neural network architecture is encoded as a sequence of one-hot vectors of length $64$. This encoding might not precisely represent all features of a given architecture, which degrades performance on NAS. Second, after inspecting the NAS offline dataset, we find that many existing designs share commonalities. This redundancy means that the NAS offline dataset contains less useful information than those of the other tasks, which further explains why DEMO is weaker on NAS.

Summary. These results on both continuous and discrete tasks answer Q $1$. DEMO attains the best overall ranking, with a mean rank of $1.7/16$ and a median rank of $1/16$ as detailed in Table 2 and Figure 3, and secures the top performance in $6$ of the $7$ tasks. We further run Welch’s t-test on the tasks where DEMO obtains state-of-the-art results, obtaining p-values of $0.007$ on SuperC, $0.00003$ on Ant, $0.08$ on D’Kitty, $0.005$ on Levy, $0.005$ on TF8, and $0.02$ on TF10. At the conventional $0.05$ level, DEMO therefore achieves statistically significant improvements in $5/7$ tasks; among the tested tasks, only D’Kitty falls short.
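The significance test mentioned above can be sketched from summary statistics alone. The snippet below is a minimal, standard-library illustration of Welch’s unequal-variance t-test; it is not the authors’ evaluation script. It assumes that each reported $\pm$ value is a sample standard deviation, and the number of runs `n_runs` is a hypothetical placeholder, so the printed p-value is not expected to reproduce the figures reported in the paper. The p-value is obtained by numerically integrating the Student-t density, which is adequate for illustration but loses precision for extremely small p-values.

```python
import math


def student_t_pdf(x, dof):
    """Density of the Student-t distribution with `dof` degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))


def two_sided_p(t, dof, steps=2000):
    """P(|T| >= |t|), via Simpson integration of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # `steps` must be even for Simpson's rule
    total = student_t_pdf(0.0, dof) + student_t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, dof)
    central_mass = 2.0 * total * h / 3.0  # probability of |T| < |t|
    return max(0.0, 1.0 - central_mass)


def welch_t_test(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Return (t statistic, degrees of freedom, two-sided p-value)."""
    var_a, var_b = std_a ** 2 / n_a, std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, dof, two_sided_p(t, dof)


# DEMO vs. DDOM on Ant, using the means and +/- values from Table 1.
n_runs = 8  # hypothetical number of independent runs per method
t, dof, p = welch_t_test(0.971, 0.005, n_runs, 0.952, 0.007, n_runs)
print(f"t={t:.2f}, dof={dof:.1f}, p={p:.2g}")
```

Welch’s variant is the natural choice here because the per-method standard deviations in Tables 1 and 2 differ considerably, so a pooled-variance t-test would rest on an assumption the data do not support.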

5.6 Ablation Study

Table 3: Ablation studies on two phases of DEMO.

Task	Input dim. $D$	DEMO	w/o pseudo-target	w/o editing
SuperC	$86$	$\textbf{0.520}\pm\textbf{0.006}$	$0.487\pm 0.012$	$0.482\pm 0.013$
Ant	$60$	$\textbf{0.971}\pm\textbf{0.005}$	$0.945\pm 0.016$	$0.963\pm 0.008$
D’Kitty	$56$	$\textbf{0.957}\pm\textbf{0.006}$	$0.955\pm 0.005$	$0.933\pm 0.002$
Levy	$60$	$\textbf{1.005}\pm\textbf{0.020}$	$0.901\pm 0.029$	$0.990\pm 0.020$
TF8	$8$	$\textbf{0.980}\pm\textbf{0.004}$	$0.757\pm 0.063$	$0.965\pm 0.008$
TF10	$10$	$\textbf{0.762}\pm\textbf{0.058}$	$0.626\pm 0.009$	$0.658\pm 0.019$
NAS	$64$	$\textbf{0.766}\pm\textbf{0.017}$	$0.741\pm 0.022$	$0.668\pm 0.084$

To assess the individual contributions of pseudo-target distribution generation (pseudo-target) and existing design editing (editing) within DEMO, we conduct ablation experiments that remove each phase in turn. Omitting the pseudo-target phase means training a conditional diffusion model only on the offline dataset and then applying the editing phase. In contrast, removing the editing phase means using the model trained during the pseudo-target phase to generate new designs starting from random Gaussian noise.

The results are summarized in Table 3. For the $4$ continuous tasks, DEMO consistently achieves higher performance than its ablated versions. For instance, on SuperC, DEMO achieves a score of $0.520\pm 0.006$, higher than the versions without the pseudo-target phase ($0.487\pm 0.012$) and without the editing phase ($0.482\pm 0.013$). Similar improvements are observed on Ant, D’Kitty, and Levy, although the margin over the version without the pseudo-target phase is small on D’Kitty ($0.957$ vs. $0.955$). On the discrete tasks TF8, TF10, and NAS, DEMO also outperforms both ablated versions; for example, removing the pseudo-target phase lowers the TF8 score from $0.980$ to $0.757$. Overall, the ablation studies validate the importance of pseudo-target distribution generation and existing design editing within DEMO, answering Q $2$: both phases are necessary. Together, these phases contribute to improvements across a range of tasks and input dimensions.

5.7 Reliability Study

As previously noted, generative model-based methods, which train solely on the offline dataset, often fail to generate new designs that consistently score higher. In this subsection, we assess the ability of DEMO to reliably produce superior designs compared to DDOM, the most recent and most robust of the generative model-based approaches. We also discuss the comparison to gradient-based approaches in Appendix A.3. To measure reliability, we compute the proportion of new designs whose scores exceed the best score in the offline dataset, $\mathcal{D}$ (best). The results are depicted in Figure 3. DEMO consistently outperforms DDOM across all tasks, achieving notable improvements, particularly in the SuperC and NAS tasks. This confirms DEMO’s enhanced reliability over the state-of-the-art generative model-based baseline in both continuous and discrete settings. The median scores included in Appendix A.1 further support these findings. DEMO achieves the top median-score rankings, affirming the reliability of DEMO and answering Q $3$.
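The reliability measure described above amounts to a simple proportion. The following sketch shows one way to compute it; the function name and the candidate scores are hypothetical and chosen only for illustration, and a strict inequality is assumed because the text speaks of designs that "exceed" the offline best.

```python
def proportion_above_best(new_scores, offline_best):
    """Fraction of generated designs scoring strictly above the offline best."""
    if not new_scores:
        raise ValueError("new_scores must be non-empty")
    return sum(score > offline_best for score in new_scores) / len(new_scores)


# Hypothetical normalized scores of five generated candidates (not from the paper).
candidate_scores = [0.41, 0.52, 0.38, 0.47, 0.40]
offline_best = 0.399  # D (best) for Superconductor, as listed in Table 4

print(proportion_above_best(candidate_scores, offline_best))  # prints 0.8
```

A proportion close to $1$ indicates that nearly every generated candidate improves on the offline dataset, whereas a low proportion means that a method depends on a few lucky samples to reach its best reported score.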

6 Conclusion and Discussion

In this study, we introduce Design Editing for Offline Model-based Optimization (DEMO), which consists of two phases. The first phase, pseudo-target distribution generation, involves training a surrogate model on the offline dataset and applying gradient ascent to create a synthetic dataset where the predicted scores serve as new labels. A conditional diffusion model is subsequently trained on this synthetic dataset to learn a pseudo-target distribution. The second phase, existing design editing, introduces random noise to existing top designs and employs the learned diffusion model to denoise them, conditioned on higher target scores. Overall, DEMO generates new designs by inheriting high-scoring features from top existing designs in the second phase and refines them with a more accurate conditional diffusion model obtained in the first phase. Extensive experiments on diverse offline MBO tasks validate that DEMO outperforms various baseline approaches, yielding state-of-the-art performance. The limitations and potential negative impacts of this study are discussed in Appendix A.4 and Appendix A.5, respectively.

7 Acknowledgement

This research is partly facilitated by the computational resources provided by Compute Canada and Mila Cluster.

References

[1] Brandon Trabucco, Xinyang Geng, Aviral Kumar, and Sergey Levine. Design-bench: Benchmarks for data-driven offline model-based optimization. arXiv preprint arXiv:2202.08450, 2022.
[2] Thomas Liao, Grant Wang, Brian Yang, Rene Lee, Kristofer Pister, Sergey Levine, and Roberto Calandra. Data-efficient learning of morphology and controller for a microrobot. arXiv preprint arXiv:1905.01334, 2019.
[3] Karen S Sarkisyan et al. Local fitness landscape of the green fluorescent protein. Nature, 2016.
[4] Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. In Proc. Int. Conf. Learning Rep. (ICLR), 2019.
[5] Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 2018.
[6] Luis A Barrera et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 2016.
[7] Paul J Sample, Ban Wang, David W Reid, Vlad Presnyak, Iain J McFadyen, David R Morris, and Georg Seelig. Human 5′ UTR design and variant effect prediction from a massively parallel translation assay. Nature Biotechnology, 2019.
[8] David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In Proc. Int. Conf. Machine Lea. (ICML), 2019.
[9] Aviral Kumar and Sergey Levine. Model inversion networks for model-based optimization. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2020.
[10] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
[11] Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Diffusion models for black-box optimization, 2023.
[12] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021.
[13] Chin-Wei Huang, Jae Hyun Lim, and Aaron Courville. A variational perspective on diffusion-based generative models and score matching, 2021.
[14] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
[15] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution, 2020.
[16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.
[17] Sihyun Yu, Sungsoo Ahn, Le Song, and Jinwoo Shin. RoMA: Robust model adaptation for offline model-based optimization. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2021.
[18] Brandon Trabucco, Aviral Kumar, Xinyang Geng, and Sergey Levine. Conservative objective models for effective offline model-based optimization, 2021.
[19] Justin Fu and Sergey Levine. Offline model-based optimization via normalized maximum likelihood estimation. Proc. Int. Conf. Learning Rep. (ICLR), 2021.
[20] Can Chen, Yingxue Zhang, Xue Liu, and Mark Coates. Bidirectional learning for offline model-based biological sequence design, 2023.
[21] Can Chen, Yingxue Zhang, Jie Fu, Xue Liu, and Mark Coates. Bidirectional learning for offline infinite-width model-based optimization, 2023.
[22] Ye Yuan, Can Chen, Zixuan Liu, Willie Neiswanger, and Xue Liu. Importance-aware co-teaching for offline model-based optimization, 2023.
[23] Can Chen, Christopher Beckham, Zixuan Liu, Xue Liu, and Christopher Pal. Parallel-mentoring for offline model-based optimization, 2023.
[24] Clara Fannjiang and Jennifer Listgarten. Autofocused oracles for model-based design. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2020.
[25] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations, 2022.
[26] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022.
[27] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
[28] Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. arXiv preprint arXiv:2305.16807, 2023.
[29] Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7378–7387, 2023.
[30] Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506, 2023.
[31] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
[32] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217, 2023.
[33] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954, 2023.
[34] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
[35] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023.
[36] Youyuan Zhang, Xuan Ju, and James J Clark. Fastvideoedit: Leveraging consistency models for efficient text-to-video editing. arXiv preprint arXiv:2403.06269, 2024.
[37] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
[38] Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. In International Conference on Learning Representations, 2023.
[39] Jungtaek Kim. BayesO Benchmarks: Benchmark functions for Bayesian optimization, 2023.
[40] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[41] Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine, and Vikash Kumar. ROBEL: Robotics benchmarks for learning with low-cost robots. In Conf. on Robot Lea. (CoRL), 2020.
[42] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2017.
[43] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[44] James T Wilson, Riccardo Moriconi, Frank Hutter, and Marc Peter Deisenroth. The reparameterization trick for acquisition functions. arXiv preprint arXiv:1712.00424, 2017.
[45] Nikolaus Hansen. The CMA evolution strategy: A comparing review. In Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms, 2006.
[46] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[47] Han Qi, Yi Su, Aviral Kumar, and Sergey Levine. Data-driven model-based optimization via invariant representation learning. In Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2022.
[48] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[49] Minsu Kim, Federico Berto, Sungsoo Ahn, and Jinkyoo Park. Bootstrapped training of score-conditioned generator for offline design of biological sequences, 2023.

Appendix A Appendix

A.1 Median Normalized Scores

Table 4: Experimental results on continuous tasks for comparison.

Method	Superconductor	Ant Morphology	D’Kitty Morphology	Levy
$\mathcal{D}$ (best)	$0.399$	$0.565$	$0.884$	$0.613$
BO-qEI	$0.300\pm 0.015$	$0.567\pm 0.000$	$\textbf{0.883}\pm\textbf{0.000}$	$0.643\pm 0.009$
CMA-ES	$0.379\pm 0.003$	$-0.045\pm 0.004$	$0.684\pm 0.016$	$0.410\pm 0.009$
REINFORCE	$\textbf{0.463}\pm\textbf{0.016}$	$0.138\pm 0.032$	$0.356\pm 0.131$	$0.377\pm 0.065$
Grad	$0.293\pm 0.010$	$0.463\pm 0.023$	$0.862\pm 0.007$	$0.613\pm 0.019$
Mean	$0.334\pm 0.004$	$0.569\pm 0.011$	$0.876\pm 0.005$	$0.561\pm 0.007$
Min	$0.364\pm 0.030$	$0.569\pm 0.021$	$0.873\pm 0.009$	$0.537\pm 0.006$
COMs	$0.316\pm 0.024$	$0.564\pm 0.002$	$0.881\pm 0.002$	$0.511\pm 0.012$
ROMA	$0.370\pm 0.019$	$0.477\pm 0.038$	$0.854\pm 0.007$	$0.558\pm 0.003$
NEMO	$0.320\pm 0.008$	$0.592\pm 0.000$	$\textbf{0.883}\pm\textbf{0.000}$	$0.538\pm 0.006$
BDI	$0.412\pm 0.000$	$0.474\pm 0.000$	$0.855\pm 0.000$	$0.534\pm 0.003$
IOM	$0.350\pm 0.023$	$0.513\pm 0.035$	$0.876\pm 0.006$	$0.562\pm 0.007$
CbAS	$0.111\pm 0.017$	$0.384\pm 0.016$	$0.753\pm 0.008$	$0.479\pm 0.020$
Auto CbAS	$0.131\pm 0.010$	$0.364\pm 0.014$	$0.736\pm 0.025$	$0.499\pm 0.022$
MIN	$0.336\pm 0.016$	$\textbf{0.618}\pm\textbf{0.040}$	$\textbf{0.887}\pm\textbf{0.004}$	$\textbf{0.681}\pm\textbf{0.030}$
DDOM	$0.346\pm 0.009$	$\textbf{0.615}\pm\textbf{0.007}$	$0.861\pm 0.003$	$0.595\pm 0.012$
DEMO_(ours)	$0.412\pm 0.008$	$\textbf{0.624}\pm\textbf{0.014}$	$0.875\pm 0.003$	$0.601\pm 0.006$

Table 5: Experimental results on discrete tasks, and ranking on all tasks for comparison.

Method	TF Bind $8$	TF Bind $10$	NAS	Rank Mean	Rank Median
$\mathcal{D}$ (best)	$0.439$	$0.467$	$0.436$
BO-qEI	$0.439\pm 0.000$	$0.467\pm 0.000$	$0.544\pm 0.099$	$6.7/16$	$7/16$
CMA-ES	$0.537\pm 0.014$	$0.484\pm 0.014$	$\textbf{0.591}\pm\textbf{0.102}$	$8.9/16$	$6/16$
REINFORCE	$0.462\pm 0.021$	$0.475\pm 0.008$	$-1.895\pm 0.000$	$11.3/16$	$15/16$
Grad	$0.556\pm 0.021$	$\textbf{0.562}\pm\textbf{0.017}$	$0.227\pm 0.110$	$7.9/16$	$9/16$
Mean	$0.539\pm 0.030$	$0.539\pm 0.010$	$0.494\pm 0.077$	$6.1/16$	$5/16$
Min	$0.569\pm 0.050$	$0.485\pm 0.021$	$\textbf{0.567}\pm\textbf{0.006}$	$5.3/16$	$5/16$
COMs	$0.439\pm 0.000$	$0.467\pm 0.002$	$0.525\pm 0.003$	$8.7/16$	$8/16$
ROMA	$0.555\pm 0.020$	$0.512\pm 0.020$	$0.525\pm 0.003$	$6.9/16$	$6/16$
NEMO	$0.438\pm 0.001$	$0.454\pm 0.001$	$\textbf{0.564}\pm\textbf{0.016}$	$8.1/16$	$9/16$
BDI	$0.439\pm 0.000$	$0.476\pm 0.000$	$0.517\pm 0.000$	$8.3/16$	$8/16$
IOM	$0.439\pm 0.000$	$0.477\pm 0.010$	$-0.050\pm 0.011$	$8.0/16$	$7/16$
CbAS	$0.428\pm 0.010$	$0.463\pm 0.007$	$0.292\pm 0.027$	$13.6/16$	$13/16$
Auto CbAS	$0.419\pm 0.007$	$0.461\pm 0.007$	$0.217\pm 0.005$	$14.3/16$	$14/16$
MIN	$0.421\pm 0.015$	$0.468\pm 0.006$	$0.433\pm 0.000$	$6.7/16$	$9/16$
DDOM	$0.401\pm 0.008$	$0.464\pm 0.006$	$0.306\pm 0.017$	$9.4/16$	$10/16$
DEMO_(ours)	$\textbf{0.826}\pm\textbf{0.005}$	$0.475\pm 0.004$	$0.541\pm 0.005$	$4.0/16$	$4/16$

Performance in Continuous Tasks. Table 4 reports the median normalized scores of all methods on the $4$ continuous tasks. Although DEMO does not always rank first, it performs robustly across these tasks and consistently outperforms several baseline methods. For example, on the Ant Morphology task, DEMO’s score of $0.624\pm 0.014$ is the highest among all approaches. This highlights DEMO’s capability to approximate the distribution of higher-scoring designs effectively. Notably, DEMO outperforms traditional generative models like CbAS and Auto CbAS by significant margins on all four tasks, and it remains competitive with more recent generative methods like MIN and DDOM.

Performance in Discrete Tasks. On the discrete tasks, detailed in Table 5, DEMO performs strongly on TF Bind $8$, substantially surpassing all baselines with a score of $0.826\pm 0.005$. However, on the more complex TF Bind $10$ and NAS tasks, DEMO is competitive but does not lead the field. This mixed performance may stem from DEMO’s methodology, which is effective at capturing a broad distribution of high-quality designs but might struggle on tasks with redundancy in design features.

Summary. The results presented in Tables 4 and 5 collectively validate DEMO’s efficacy across both continuous and discrete optimization tasks, providing further support for answering Q $1$ affirmatively. With a mean rank of $4.0/16$ and a median rank of $4/16$ in terms of the median normalized scores, DEMO stands out among $16$ competing methods. This comprehensive performance underscores DEMO’s capacity to integrate and leverage complex design distributions effectively, setting a new standard in generative optimization methods.
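To make the rank statistics concrete, the sketch below ranks methods per task and then aggregates the ranks by mean and median. It is an illustration under stated assumptions, not the authors’ ranking script: ties are resolved by competition ranking (a method’s rank is one plus the number of methods with a strictly higher score), which the paper does not specify, and only three of the $16$ methods are included, so the resulting ranks are out of $3$ rather than $16$.

```python
from statistics import mean, median


def rank_summary(scores):
    """Map each method to (mean rank, median rank) across tasks.

    `scores` maps a method name to its list of per-task scores;
    all lists must have the same length and higher scores are better.
    """
    methods = list(scores)
    n_tasks = len(next(iter(scores.values())))
    ranks = {method: [] for method in methods}
    for task in range(n_tasks):
        for method in methods:
            # Competition ranking: 1 + number of strictly better methods.
            better = sum(scores[other][task] > scores[method][task]
                         for other in methods)
            ranks[method].append(1 + better)
    return {method: (mean(r), median(r)) for method, r in ranks.items()}


# Median normalized scores on TF Bind 8, TF Bind 10, and NAS (from Table 5).
subset = {
    "DEMO": [0.826, 0.475, 0.541],
    "DDOM": [0.401, 0.464, 0.306],
    "CMA-ES": [0.537, 0.484, 0.591],
}

for method, (mean_rank, median_rank) in rank_summary(subset).items():
    print(f"{method}: mean rank {mean_rank:.2f}, median rank {median_rank}")
```

Reporting both statistics is useful because the mean rank is sensitive to a single poor task, while the median rank reflects typical standing across tasks.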

A.2 Sensitivity to the Choice of m

In Eq. (9), selecting a time $m$ close to $M$ results in $\bm{x}_{perturb}$ resembling random Gaussian noise, which introduces greater flexibility into the new design generation process. On the other hand, if $m$ is closer to $0$, the resulting design retains more characteristics of the existing top design. Thus, $m$ serves as a critical hyperparameter in our methodology. This section explores the robustness of DEMO to various choices of $m$. We perform experiments on one continuous task, SuperC, and one discrete task, TF8, with $m$ ranging from $0$ to $1000$ in increments of $100$. As illustrated in Figure 4, DEMO generally outperforms the baseline methods with different choices of $m$. Nevertheless, overly extreme values of $m$, whether too high or too low, can diminish performance. Selecting an excessively low $m$ causes the model to adhere too closely to the distribution of existing designs, while choosing an overly high $m$ biases the model towards the pseudo-target distribution, neglecting the guidance of existing top designs. Choosing $m$ from a mid-range effectively balances the influences from both the pseudo-target distribution and the top existing designs. Empirical results suggest that an $m$ within the range of $[200,600]$ yields optimal performance, leading us to set $m=400$ for all tasks.
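The effect of $m$ on the perturbed design can be illustrated with a small sketch. Because Eq. (9) is not restated in this appendix, the code assumes a variance-preserving forward process with a linear noise schedule, as in score-based diffusion models [12]; the schedule endpoints `beta_min` and `beta_max`, the value `M = 1000`, and the example design are assumptions made for illustration and may differ from the authors’ configuration.

```python
import math
import random

M = 1000  # assumed number of diffusion steps, matching the 0..1000 range of m


def vp_marginal(m, beta_min=0.1, beta_max=20.0):
    """Signal coefficient and noise std of an assumed VP process at step m."""
    t = m / M
    log_coef = -0.25 * t * t * (beta_max - beta_min) - 0.5 * t * beta_min
    return math.exp(log_coef), math.sqrt(1.0 - math.exp(2.0 * log_coef))


def perturb(x0, m, rng):
    """Return coef * x0 + std * eps with eps drawn from a standard Gaussian."""
    coef, std = vp_marginal(m)
    return [coef * x + std * rng.gauss(0.0, 1.0) for x in x0]


rng = random.Random(0)
top_design = [1.0, -0.5, 2.0]  # hypothetical top design from the offline dataset

for m in (0, 400, 1000):
    coef, std = vp_marginal(m)
    print(f"m={m}: signal={coef:.3f}, noise={std:.3f}, "
          f"x_perturb={perturb(top_design, m, rng)}")
```

Under these assumptions, $m=0$ returns the top design unchanged, $m=M$ leaves almost no signal from it, and intermediate values such as $m=400$ keep part of the original design while adding enough noise for the conditional model to move it toward higher target scores.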

A.3 Extension of Reliability Study

This section extends the reliability study in Section 5.7 by comparing DEMO with a gradient-based approach. Compared to Grad, DEMO demonstrates greater consistency in $5$ out of $7$ tasks. However, Grad outperforms DEMO on the Levy and TF Bind $10$ tasks, which can be attributed to the gradient-based method’s tendency to generate new designs within a narrower distribution. While Grad achieves a higher proportion of higher-scoring new designs on these two tasks, DEMO generates new designs within a wider distribution and thus produces candidates with higher maximum scores, as evidenced in Tables 1 and 2.

A.4 Limitations

We have demonstrated the effectiveness of DEMO across a wide range of tasks. However, some evaluation methods may not fully capture real-world complexities. For example, in the superconductor task [5], we follow traditional practice by using a random forest regression model as the oracle, as done in prior studies [1]. Unfortunately, this model might not entirely reflect the intricacies of real-world situations, which could lead to discrepancies between our oracle and actual ground-truth outcomes. Engaging with domain experts in the future could help enhance these evaluation approaches. Nevertheless, given DEMO’s straightforward approach and the empirical evidence supporting its robustness and efficacy across various tasks detailed in the Design-Bench [1] and BayesO Benchmarks [39], we remain confident in its ability to generalize effectively to different contexts.

A.5 Negative Impacts

This study seeks to advance the field of Machine Learning. However, it’s important to recognize that advanced optimization techniques can be used for either beneficial or detrimental purposes, depending on their application. For example, while these methods can contribute positively to society through the development of drugs and materials, they also have the potential to be misused to create harmful substances or products. As researchers, we must stay aware and ensure that our contributions promote societal betterment, while also carefully assessing potential risks and ethical concerns.