PROUD: PaRetO-gUided Diffusion Model for Multi-objective Generation

Yinghua Yao (eva.yh.yao@gmail.com), Yuangang Pan [1,2] (yuangang.pan@gmail.com), Jing Li (j.lee9383@gmail.com), Ivor Tsang (ivor.tsang@gmail.com), Xin Yao (xinyao@ln.edu.hk)

[1] Centre for Frontier AI Research, Agency for Science, Technology and Research (A*STAR), 138632, Singapore

[2] Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), 138632, Singapore

[3] Department of Computing and Decision Sciences, Lingnan University, Hong Kong

Abstract

Recent advancements in the realm of deep generative models focus on generating samples that satisfy multiple desired properties. However, prevalent approaches optimize these property functions independently, thus omitting the trade-offs among them. In addition, the property optimization is often improperly integrated into the generative models, resulting in an unnecessary compromise on generation quality (i.e., the quality of generated samples). To address these issues, we formulate a constrained optimization problem. It seeks to optimize generation quality while ensuring that generated samples reside at the Pareto front of multiple property objectives. Such a formulation enables the generation of samples that cannot be further improved simultaneously on the conflicting property functions and preserves good quality of generated samples. Building upon this formulation, we introduce the PaRetO-gUided Diffusion model (PROUD), wherein the gradients in the denoising process are dynamically adjusted to enhance generation quality while the generated samples adhere to Pareto optimality. Experimental evaluations on image generation and protein generation tasks demonstrate that our PROUD consistently maintains superior generation quality while approaching Pareto optimality across multiple property functions compared to various baselines.

Keywords: Multi-objective generation, diffusion model, Pareto optimality, generative model

1 Introduction

Deep generative models have advanced rapidly over the last decade, with progress in variational autoencoders [27], generative adversarial networks [18, 61], normalizing flows [37], energy-based models [46], and diffusion models [44, 22]. In particular, controllable generative models can generate samples that satisfy multiple properties of interest, showing great promise in various applications, such as material design [26, 50] and controlled text/image generation [8, 32]. These properties of interest vary depending on the specific application domain. For example, in protein design, the properties can refer to specified structural or functional characteristics, such as solubility or binding affinity [56]. In image generation, the properties can refer to certain attributes or features, such as a specified hairstyle and makeup [55], or specified color patches [34]. In addition, generated samples should reside on the same data manifold as the training samples, for data naturalness concerns [19]. (Footnote 1: This relates to the manifold hypothesis, which states that many real-world high-dimensional datasets lie on low-dimensional latent manifolds in the high-dimensional space [15].)

Before delving into details, we first establish the problem setting. We are given a dataset $X\subseteq\mathcal{X}$, where $\mathcal{X}\subset\mathbb{R}^{d}$ denotes a low-dimensional manifold in the high-dimensional space $\mathbb{R}^{d}$. Suppose we have $m$ objective functions $F(x)=[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)]$, each of which returns a property value for the sample $x\in\mathcal{X}$. The aim of multi-objective generation is to learn a generative model that produces samples optimized to achieve the best values across these functions while ensuring that the generated samples remain within the manifold $\mathcal{X}$ (green cross in Fig. 1(a)), namely, ensuring that the quality of generated samples (dubbed generation quality) is good. (Footnote 2: In other words, the generated samples are as realistic as the samples in the given dataset $X$.)

The multi-objective generation problem introduced above inherently requires reconciling the optimization challenges in two spaces: the functionality space and the sample space as shown in Fig. 1(a). Given the need to deal with multiple conflicting objectives in order to achieve the generation with desired properties, one challenge is how to produce samples that cannot be further improved simultaneously across the objectives, a.k.a. Pareto optimality [6] (the Pareto front in Fig. 1(a)). The second challenge arises from the manifold assumption that the generated samples should lie within the data manifold $\mathcal{X}$ , namely, generated samples are supposed to be of good quality [40]. Optimizing multiple objectives without considering generation quality could result in Pareto solutions outside of the data manifold (i.e., invalid samples on the Pareto front of Fig. 1(a)). The third challenge relates to the coordination of generation quality and multi-property optimization. To guarantee generation quality, generative models typically define a divergence between the distribution of generated data and that of real training data $X$ [58, 18], which tends to disperse the generated data throughout the whole data manifold $\mathcal{X}$ (the purple plane in Fig. 1(a)). However, since only a limited fraction of the samples on the data manifold lie on the Pareto front, there inevitably exists some distribution gap between the generated data and the training data, leading to compromise of generation quality, when achieving Pareto optimality.

A large number of studies [28, 11, 54, 31] attempt to design controllable generative models with multiple properties by simply assuming that these properties are independent and aggregating the multiple property objectives into a single one, $\sum_{i=1}^{m}f_{i}$, for controlled generation. Notably, a very recent study [19] takes into consideration the trade-offs between multiple properties by incorporating multi-objective optimization techniques into generative models. It modifies the sampling gradient of vanilla diffusion models into a linear combination of the original diffusion gradient and the gradient obtained by multi-objective Bayesian optimization. However, a fixed combination coefficient struggles to coordinate generation quality with the optimization of multiple property objectives. This results in an unnecessary loss of generation quality while achieving Pareto optimality for the property objectives.

Figure 1: (a) Diagram of multi-objective generation (best viewed in color). Our multi-objective generation aims to produce samples that simultaneously lie on the Pareto front in the functionality space (Left Panel) and remain within the manifold $\mathcal{X}$ in the sample space (Right Panel), i.e., the green cross. (b) Visualization of the image generation task optimized with two objectives on CIFAR10. Images are directly taken from the original CIFAR10 dataset (see full resolution images in Fig. 12), whose objective values lie on the Pareto front, namely, $\{x|x\in X,F(x)=[f_{1}^{\ast},f_{2}^{\ast}]\in F^{\ast}\}$, where $F^{\ast}$ denotes the points on the Pareto front.

In this work, we propose PaRetO-gUided Diffusion model (PROUD) for multi-objective generation. PROUD is formulated as a constrained optimization that minimizes the Kullback–Leibler (KL) divergence between the distribution of the generated data and that of the training data, where the distribution of the generated data is also constrained to be close to the distribution of Pareto solutions under the KL divergence. This guarantees that generated samples are moved towards the Pareto set and then the quality of these generated samples is optimized to the best within a neighborhood of the Pareto set. Specifically, constrained optimization is implemented during the generative process of a pre-trained unconditional diffusion model. Multiple gradient descents for the multiple objectives and the original diffusion gradient are adaptively weighted to denoise samples. The contributions of this work are summarized as follows:

• We propose a novel constrained optimization formulation for controllable generation adhering to multiple properties, defined as multi-objective generation, which can better coordinate the generation quality and the optimization of multiple objectives.

• A new controllable diffusion model (PROUD) is introduced to solve the constrained optimization formulation. The guidance of multiple objectives is adaptively integrated with that of data likelihood, which can reduce the needless compromise of generation quality while achieving Pareto optimality in terms of multiple property objectives.

• We apply our PROUD to optimizing multiple objectives in the tasks of controllable image generation and protein design. Additionally, we establish various baselines based on diffusion models to demonstrate the superiority of our PROUD.

2 Related Work

In this section, we summarize the related works based on their strategies for integrating the optimization of multiple property objectives into deep generative models.

Single-objective generation (SOG) refers to approaches that simply combine multiple objectives into a single one to guide the generation. Extensive efforts have been devoted to controllable generation with multiple properties independent of each other [28, 20, 26, 11, 54, 31]. Nevertheless, these methods fail to capture the correlation between properties and ignore the conflicting nature among properties, leading to an insufficient exploration of the solution space.

Multi-objective Generation (MOG) refers to approaches that introduce multi-objective optimization techniques into generative models. Wang et al [53] adopted a weighted-sum strategy to deal with the trade-offs between properties. This strategy works only for convex Pareto fronts, and a uniformly distributed grid of weights cannot guarantee uniformly distributed points on the Pareto front [41, 33]. Stanton et al [49] proposed LaMBO (Latent Multi-objective Bayesian Optimization), which applies multi-objective Bayesian optimization in the latent space of a denoising autoencoder to optimize the generated samples with multiple black-box objectives. Although it can characterize the Pareto front, the data generated by the denoising autoencoder is of inferior quality. Gruver et al [19] further applied LaMBO to the latent space of discrete diffusion models. It generalized classifier-guided diffusion models [14] by replacing the classifier gradient with the gradient obtained by LaMBO. The combination of the score estimate of a diffusion model and the classifier gradient necessitates manual tuning of the combination coefficient, which is theoretically inappropriate for non-convex functions [17]. Tagasovska et al [50] proposed to use multiple gradient descent [12] for sampling within compositional energy-based models (EBMs), where each EBM is conditioned on one specific property, but training multiple conditional EBMs requires much more supervision than training discriminative models. Moreover, this paradigm does not allow post-hoc control of pre-trained unconditional generative models. Multi-objective generative flow networks (GFlowNets) [25] fully integrate guidance from multiple objectives into the training process. Consequently, they must be retrained whenever the objectives change and are not suitable for use with pre-trained generative models. In addition, models of this kind are usually difficult to train [42].

Diffusion models [22, 43, 44, 48] represent the state-of-the-art (SOTA) in deep generative models. Therefore, we build our multiple-objective generation model based on diffusion models. While most related works design their methods based on other deep generative models, we apply their ideas to the diffusion model as much as possible for the sake of comparison. Please refer to Section 5 for more details.

3 Preliminaries

Before delving into our method, we introduce the technical background about multi-objective optimization in Section 3.1 and diffusion models in Section 3.2, respectively.

3.1 Multi-objective Optimization

Let $x\in\mathbb{R}^{d}$ be a decision variable, and let $F(x)=\left[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)\right]$ be a set of $m$ objective functions, each of which represents a property and is preferred to have a smaller value. The multi-objective optimization problem [6, 9] can be conventionally expressed as:

\min_{x\in\mathbb{R}^{d}}F(x)=\min_{x\in\mathbb{R}^{d}}\left[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)\right].

(1)

In this context, for $x_{1},x_{2}\in\mathbb{R}^{d}$ , $x_{1}$ is said to dominate $x_{2}$ , i.e., $x_{1}\prec x_{2}$ , iff $f_{i}(x_{1})\leq f_{i}(x_{2}),\forall i=1,2,\ldots,m$ , and $F(x_{1})\neq F(x_{2})$ .

Definition 1 (Pareto optimality).

A point $x^{\ast}\in\mathbb{R}^{d}$ is called Pareto optimal iff there exists no other $x^{\prime}\in\mathbb{R}^{d}$ such that $x^{\prime}\prec x^{\ast}$. The collection of Pareto optimal points is called the Pareto set, denoted as $\mathcal{P}^{\ast}$. The collection of function values $F(x^{\ast})$ of the Pareto set is called the Pareto front [52, 4].
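The dominance relation and Definition 1 can be illustrated on a finite set of objective vectors. The following is a minimal sketch under the minimization convention of Eq.(1); the function names `dominates` and `pareto_front` and the sample values are illustrative assumptions, not part of the original method.

```python
def dominates(fa, fb):
    """Return True if objective vector fa dominates fb (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(fa, fb))
    return no_worse and tuple(fa) != tuple(fb)


def pareto_front(points):
    """Keep the objective vectors that no other vector in `points` dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical objective values F(x) for four candidate samples.
    values = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
    # (3.0, 3.0) is dominated by (2.0, 2.0); the other three are mutually non-dominated.
    print(pareto_front(values))  # [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
```

This brute-force filter is quadratic in the number of points, which is usually acceptable for small candidate sets.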

Definition 2 (Pareto stationarity).

Pareto stationarity is a necessary condition for Pareto optimality. A point $x$ is called Pareto stationary if there exists a set of scalars $\omega_{i},i=1,2,\ldots,m$, such that $\sum_{i=1}^{m}\omega_{i}\nabla f_{i}(x)=\mathbf{0},\ \sum_{i=1}^{m}\omega_{i}=1,\ \omega_{i}>0,\ \forall i=1,2,\ldots,m$.

Désidéri [12] proposed Multiple Gradient Descent (MGD) to find the Pareto optimal solutions of Eq.(1). To be specific, given any initial point $x\in\mathbb{R}^{d}$ , we can iteratively update $x$ according to:

x_{t+1}=x_{t}-\eta v_{t},

(2)

where $t$ is the iteration step. The update direction $v_{t}$ is expected to be close to each gradient $\nabla f_{i}(x)$ $\forall i=1,2,\ldots,m$ as much as possible, which is therefore formulated into the following problem:

\underset{v\in\mathbb{R}^{d}}{\max}\left\{\min_{i}\nabla f_{i}\left(x\right)^{\top}v-\frac{1}{2}\|v\|^{2}\right\}.

(3)

Through Lagrange strong duality, the solution to Eq.(3) can be framed into

v(x)=\nabla F(x)=\sum_{i=1}^{m}\omega_{i}^{\ast}\nabla f_{i}\left(x\right),

(4)

where $\{\omega_{i}^{\ast}\}_{i=1}^{m}=\arg\min\limits_{\{\omega_{i}\}_{i=1}^{m}}\|\sum_{i=1}^{m}\omega_{i}\nabla f_{i}\left(x\right)\|^{2}$ under the constraint that $\sum_{i=1}^{m}\omega_{i}=1,\ \omega_{i}>0,\ \forall i=1,2,\ldots,m$.
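For two objectives, the min-norm problem behind Eq.(4) has a closed-form solution, which makes the behavior of MGD easy to inspect. The sketch below is an illustration only: it assumes $m=2$ and lets the weights lie on the closed simplex (a weight may reach zero at the boundary), and the helper names are hypothetical.

```python
def min_norm_weights(g1, g2):
    """Weights (w1, w2) minimizing ||w1*g1 + w2*g2||^2 with w1 + w2 = 1, w1, w2 in [0, 1]."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: every convex combination has the same norm.
        return 0.5, 0.5
    # Unconstrained minimizer of the quadratic in w1, then clipped to [0, 1].
    w1 = sum((b - a) * b for a, b in zip(g1, g2)) / denom
    w1 = min(1.0, max(0.0, w1))
    return w1, 1.0 - w1


def mgd_direction(g1, g2):
    """Common descent direction of Eq.(4) for two objectives."""
    w1, w2 = min_norm_weights(g1, g2)
    return [w1 * a + w2 * b for a, b in zip(g1, g2)]


if __name__ == "__main__":
    print(mgd_direction([1.0, 0.0], [0.0, 1.0]))   # orthogonal gradients: [0.5, 0.5]
    print(mgd_direction([1.0, 0.0], [-1.0, 0.0]))  # opposing gradients: [0.0, 0.0]
```

A zero direction, as in the second call, signals that the point is Pareto stationary in the sense of Definition 2: no step can decrease both objectives at once.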

3.2 Diffusion Models

The idea of diffusion models is to progressively diffuse data to noise, and then learn to reverse this process for sample generation. Considering a sequence of prescribed noise scales $0<\beta_{1}<\beta_{2}<\ldots<\beta_{T}<1$, the Denoising Diffusion Probabilistic Model (DDPM) [22] diffuses data $x_{0}\sim q_{\text{data}}(x)$ to noise by constructing a discrete Markov chain $\{x_{0},x_{1},\ldots,x_{T}\}$, where $q(x_{t}|x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}\mathbf{I})$ and $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. This process is called the forward process or diffusion process. In particular, $q(x_{t}|x_{0})=\mathcal{N}(x_{t};\sqrt{\alpha_{t}}x_{0},(1-\alpha_{t})\mathbf{I})$, where $\alpha_{t}=\prod_{i=1}^{t}\left(1-\beta_{i}\right)$.

The key of diffusion-based generative models is to train a reverse Markov chain so that we can generate data starting from a Gaussian noise $p(x_{T})\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ . The training loss of the reverse diffusion process, a.k.a. generative process, is to minimize a simplified variational bound of negative log likelihood. Namely,

\mathbb{E}_{x_{0}\sim q_{\text{data}}(x),\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})}\left[\|\epsilon-\epsilon_{\theta}\left(\sqrt{\alpha_{t}}x_{0}+\sqrt{1-\alpha_{t}}\epsilon,t\right)\|^{2}\right],

(5)

where $\epsilon_{\theta}(x_{t},t)$ is a neural network-based approximator to predict the noise $\epsilon$ from $x_{t}=\sqrt{\alpha_{t}}x_{0}+\sqrt{1-\alpha_{t}}\epsilon$ .

After training the neural network parameterized by $\theta$ to obtain the optimal $\epsilon_{\theta}^{\ast}(x_{t},t)$ , samples can be generated by starting from $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and reversing the Markov chain:

x_{t-1}=\frac{1}{\sqrt{1-\beta_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\alpha_{t}}}\epsilon_{\theta}^{\ast}\left(x_{t},t\right)\right)+\sqrt{\beta_{t}}z_{t},

(6)

where $z_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and $t=T,T-1,\ldots,1$ . More variants of diffusion models can be seen in Yang et al [58].
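The reverse chain of Eq.(6) can be exercised on a toy problem where the optimal noise predictor is known analytically. The sketch below is an illustration under stated assumptions: the data are one-dimensional Gaussian, $x_{0}\sim\mathcal{N}(\mu,s^{2})$, so that $\epsilon^{\ast}(x_{t},t)=\mathbb{E}[\epsilon\,|\,x_{t}]$ has a closed form and no network needs to be trained; the linear $\beta$ schedule, $T=1000$, and the values of `MU` and `S` are hypothetical choices rather than settings from the paper.

```python
import math
import random

T = 1000
# Assumed linear noise schedule beta_1..beta_T.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
# alpha_t = prod_{i<=t} (1 - beta_i), as in the forward process.
alphas, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alphas.append(running)

MU, S = 2.0, 0.5  # toy data distribution N(MU, S^2)


def eps_star(x_t, t):
    """Closed-form E[eps | x_t] for Gaussian data; stands in for a trained network."""
    a = alphas[t - 1]
    return math.sqrt(1.0 - a) * (x_t - math.sqrt(a) * MU) / (a * S * S + 1.0 - a)


def sample(rng):
    """Run the reverse Markov chain of Eq.(6) from x_T ~ N(0, 1) down to x_0."""
    x = rng.gauss(0.0, 1.0)
    for t in range(T, 0, -1):
        b, a = betas[t - 1], alphas[t - 1]
        mean = (x - b / math.sqrt(1.0 - a) * eps_star(x, t)) / math.sqrt(1.0 - b)
        x = mean + math.sqrt(b) * rng.gauss(0.0, 1.0)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [sample(rng) for _ in range(500)]
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    print(mean, std)  # should land near MU and S, up to sampling error
```

Many implementations omit the noise term at the final step $t=1$; the sketch follows Eq.(6) as written, where that term is very small because $\beta_{1}$ is small.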

Existing attempts to incorporate multiple desired properties into the diffusion model [19] can be summarized as straightforwardly adding the derived MGD direction $\nabla F(x)$ in Eq.(4) to the noise predictor $\epsilon_{\theta}^{\ast}(x_{t},t)$ at each denoising step, namely,

x_{t-1}=\frac{1}{\sqrt{1-\beta_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\alpha_{t}}}\Big(\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\lambda\nabla F(x)\Big)\right)+\sqrt{\beta_{t}}z_{t},

(7)

where $t=T,T-1,\ldots,1$, and $\lambda$ is a trade-off hyper-parameter that balances the generation quality (i.e., the noise predictor $\epsilon_{\theta}^{\ast}(x_{t},t)$) and the multiple objectives (i.e., the MGD direction $\nabla F(x)$). Note that an inappropriate $\lambda$ may lead to unsatisfactory samples, which either suffer from low quality or fail to possess the required properties (refer to the experimental observations in Section 5).

4 Multi-Objective Generation

As discussed above, optimizing generative models in terms of $m$ objectives aims to produce samples that cannot be simultaneously improved for all objectives, namely, Pareto optimality (see Definition 1). Meanwhile, the generated samples are required to be as realistic as the training samples, which is usually achieved by enforcing distribution alignment between the generated samples and the training samples.

MOG compared with MOO

As shown in Table 1, both the MOO and MOG share the same objectives $F(x)$ but differ in the space that $x$ resides in, which is termed as “decision space” or “solution space” in the MOO problem [6] and is termed as “data space” in the MOG problem [19, 54]. To be specific, the decision space of the MOO problem is defined as the whole space of $\mathbb{R}^{d}$ [5], while the data space of the MOG problem only resides in a low-dimensional manifold $\mathcal{X}$ embedded in $\mathbb{R}^{d}$ (a.k.a. the ambient space) [15, 38, 35]. Such a difference highlights that the objectives to be optimized for MOG are only meaningful within the data manifold. When simply applying MOO algorithms to search for solutions in the high-dimensional sample space, the obtained solutions cannot guarantee residing within the data manifold, thus resulting in very low data quality (i.e., invalid samples in Fig. 1(a)) and a loss of practicability [40].

To sum up, the necessity to concurrently consider generation quality distinguishes the MOG problem from the MOO problem. Specifically, a dataset with real samples is required to define the data manifold on which the generated samples are expected to reside (Eq.(8)).

Table 1: The MOO problem vs. the MOG problem. The generation quality in MOG is usually modeled based on the given dataset $X\subset\mathcal{X}$, where $\mathcal{X}$ denotes a low-dimensional manifold embedded in the high-dimensional space $\mathbb{R}^{d}$.

problem	objectives	decision/data space	generation quality
MOO	$F(x)=[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)]$	$x\in\mathbb{R}^{d}$	✗
MOG	$F(x)=[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)]$	$x\in\mathcal{X},\mathcal{X}\subset\mathbb{R}^{d}$	✓

4.1 Constrained Optimization for MOG

A straightforward solution to MOG is to treat generation quality as an additional objective and formulate an $(m+1)$-objective problem. However, the heterogeneity between multi-objective optimization (usually defined w.r.t. a single sample) and distribution alignment (defined w.r.t. a dataset) makes the resulting MOO problem difficult to optimize. Although it is feasible to simplify the distribution divergence w.r.t. a dataset into quality scores for individual samples in some deep generative models [3], it is still challenging to obtain desired solutions that achieve Pareto optimality on the $m$ objectives from the optimization of $m+1$ objectives, which explores a much larger space, as empirically verified in the experiments. In addition, the complexity of multi-objective optimization increases significantly with the number of objectives [23].

Instead of formulating a complex and ineffective $(m+1)$-objective problem, we implement multi-objective generation through a tailored constrained optimization problem over the $m$ property objectives. Such a formulation also allows us to stress the respective significance of data generation and $m$-objective optimization, instead of treating them as equally important. Specifically, let $p_{\theta}(x)$ denote the target data distribution parameterized by $\theta$, and let $p_{0}$ denote the distribution of the solution samples on the Pareto front. Our constrained optimization problem is formulated as follows:

\displaystyle\min_{\theta}D\left[q_{\text{data}}(x)||p_{\theta}(x)\right]\quad s.t.\ D\left[p_{0}(x)||p_{\theta}(x)\right]\leq\varepsilon,

(8)

where $D[\cdot||\cdot]$ denotes the distribution divergence and $\varepsilon$ is a small positive value.

The loss in Eq.(8) controls the generation quality, keeping the generated data as realistic as possible. The constraint in Eq.(8) ensures that the generated data $x\sim p_{\theta}(x)$ are Pareto optimal (up to a small tolerable error). Overall, Eq.(8) provides a degree of quality assurance while obtaining samples that approach Pareto optimality of multiple property objectives.

4.2 Langevin Dynamics for Data Distribution Approximation

It is difficult to directly solve Eq.(8) when both $q_{\text{data}}(x)$ and $p_{0}(x)$ are unknown. Motivated by those widely-developed techniques of sampling algorithms for approximating data distribution [2, 44, 33], we develop Langevin dynamic-based sampling techniques to solve Eq.(8). Specifically, Langevin dynamics are capable of generating samples from a given probability distribution $q(x)$ solely by utilizing its score function $\nabla\log q(x)$ . Given an initial value $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ , the Langevin method recursively computes the following:

x_{t-1}=x_{t}-\kappa g(x_{t})+\sqrt{2\kappa}z,\quad t=T,T-1,\ldots,1,

(9)

where $\kappa$ is the step size and can be fixed or dynamic, $z$ is sampled from the standard normal distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$, and $g(x_{t})$ is the update direction for $x_{t}$, equal to the negative score $-\nabla\log q(x_{t})$, so that the step $-\kappa g(x_{t})$ in Eq.(9) ascends the log-density. The distribution of $x_{0}$ will be close to the given data distribution $q(x)$ when $\kappa\rightarrow 0$ and $T\rightarrow\infty$ under some regularity conditions [57].
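The recursion in Eq.(9) can be checked on a target whose score is known. The following minimal sketch assumes a one-dimensional Gaussian target $q=\mathcal{N}(\mu,\sigma^{2})$, for which the negative score is $(x-\mu)/\sigma^{2}$; the function names, the fixed step size, and the number of steps are illustrative assumptions.

```python
import math
import random


def langevin_sample(g, x_init, kappa, steps, rng):
    """Iterate x <- x - kappa * g(x) + sqrt(2 * kappa) * z, with z ~ N(0, 1)."""
    x = x_init
    for _ in range(steps):
        x = x - kappa * g(x) + math.sqrt(2.0 * kappa) * rng.gauss(0.0, 1.0)
    return x


MU, SIGMA = 3.0, 1.0  # hypothetical target distribution N(MU, SIGMA^2)


def neg_score(x):
    """Update direction g(x): the negative score of the Gaussian target."""
    return (x - MU) / SIGMA ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [langevin_sample(neg_score, rng.gauss(0.0, 1.0), 0.01, 1000, rng)
          for _ in range(300)]
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    print(mean, std)  # should land near MU and SIGMA, up to sampling error
```

With a finite step size the chain carries a small discretization bias, which is why the convergence statement requires $\kappa\rightarrow 0$.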

Before deriving the proper gradient $g(x_{t})$ to approximate the distribution optimized in Eq.(8) as a whole, we investigate the gradient-based strategies to optimize $D\left[q_{\text{data}}(x)||p_{\theta}(x)\right]$ and $D\left[p_{0}(x)||p_{\theta}(x)\right]$ via Langevin dynamics, separately.

Optimization of $D\left[q_{\text{data}}(x)||p_{\theta}(x)\right]$ in Eq.(8). Various generative models are derived to approximately minimize the KL divergence between the data distribution $q_{\text{data}}(x)$ and the model distribution $p_{\theta}(x)$ [27, 47, 37]. Here, we choose diffusion models as the representative for optimizing $D\left[q_{\text{data}}(x)||p_{\theta}(x)\right]$, given their equivalent form to Eq.(9) [22, 48]. In particular, the time-dependent predicted noise $\epsilon_{\theta}^{\ast}\left(x_{t},t\right)$ in Eq.(6) is the update direction $g(x_{t})$ in annealed Langevin dynamics with a dynamic step size $\eta_{t}$:

x_{t-1}=x_{t}-\eta_{t}\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sqrt{2\eta_{t}}z.

(10)

Consequently, the distribution $p_{\theta}(x_{0})$ will approach $q_{\text{data}}(x)$ [47].

Optimization of $D\left[p_{0}(x)||p_{\theta}(x)\right]$ in Eq.(8). On the other hand, we can integrate MGD (Eq.(4)) into Langevin dynamics to optimize $D\left[p_{0}(x)||p_{\theta}(x)\right]$ , aiming to approximate the distribution of the Pareto set $p_{0}(x)$ upon convergence. Namely,

x_{t-1}=x_{t}-\eta\nabla F(x_{t})+\sqrt{2\eta}z,

(11)

where $\eta$ is a fixed step size. The distribution of $x_{0}$ will converge to $p_{0}(x)$ , as demonstrated in Theorem 3.3 of Liu et al [33].

4.3 Pareto-guided Diffusion Model

Based on the above analysis, the key to solving the constrained optimization problem (Eq.(8)) is to design a proper strategy for unifying the optimization of $D\left[q_{\text{data}}(x)||p_{\theta}(x)\right]$ and $D\left[p_{0}(x)||p_{\theta}(x)\right]$ within the framework of Langevin dynamic sampling. Therefore, we can indirectly solve Eq.(8) by designing the following strategies to update the gradient $g(x_{t})$ in Eq.(9):

1) If the sample $x_{t}$ is far away from the Pareto front (constraint violation), $g(x_{t})$ is chosen to assure Pareto improvement (i.e., decreasing all the $m$ objectives) to $x_{t}$. The amount of Pareto improvement is determined by the distance of $x_{t}$ to the Pareto front.

2) If there are multiple directions that can yield Pareto improvement (constraint violation), the direction of Pareto improvement that decreases $D\left[q_{\text{data}}(x)||p_{\theta}(x)\right]$ most (reducing loss) is chosen as $g(x_{t})$.

3) If $x_{t}$ is close to the Pareto front (constraint satisfaction), i.e., having a small $\|\nabla F\left(x_{t}\right)\|$ according to Definition 2, $g(x_{t})$ is chosen to fully optimize $D\left[q_{\text{data}}(x)||p_{\theta}(x)\right]$ (reducing loss).

Following Ye and Liu [60], we design a new objective based on the gradients to achieve the above conditions. To be specific, since $\epsilon_{\theta}^{\ast}\left(x_{t},t\right)$ is the gradient for optimizing $D\left[q_{\text{data}}(x)||p_{\theta}(x)\right]$ , and $\nabla F(x)$ is the gradient for optimizing $D\left[p_{0}(x)||p_{\theta}(x)\right]$ , the integrated gradient $g(x_{t})$ can be solved by the following objective:

\begin{aligned}g(x_{t})=&\arg\min_{g}\frac{1}{2}\|g-\epsilon_{\theta}^{\ast}\left(x_{t},t\right)\|^{2}\\ &s.t.\quad\nabla f_{i}(x_{t})^{\top}g\geq\phi_{t},\quad\forall i=1,2,\ldots,m,\\ &\phi_{t}=\begin{cases}\alpha\|\nabla F\left(x_{t}\right)\|&\text{if }\|\nabla F\left(x_{t}\right)\|>e\\ -\infty&\text{otherwise}\end{cases},\end{aligned}

(12)

where $\alpha$ and $e$ are positive hyper-parameters. The constraint in Eq.(8) can be approximated by a small gradient norm $\|\nabla F(x)\|$ due to Pareto stationarity (Definition 2). In particular, when $\|\nabla F\left(x_{t}\right)\|>e$, $\phi_{t}$ is set to be proportional to $\|\nabla F\left(x_{t}\right)\|$. This encourages the gradient $g(x_{t})$ to have positive inner products with all $\nabla f_{i}(x)$, approximating $\nabla F(x)$. Meanwhile, the amount of Pareto improvement is based on the distance of $x_{t}$ to the Pareto front. If $\|\nabla F\left(x_{t}\right)\|$ is very small, which means that the sample $x_{t}$ is close to the Pareto front, we have $g(x_{t})=\epsilon_{\theta}^{\ast}\left(x_{t},t\right)$ with $\phi_{t}=-\infty$. Therefore, samples are updated with a pure gradient descent step on $D\left[q_{\text{data}}(x)||p_{\theta}(x)\right]$ without taking into account the $m$ objectives $\{f_{i}(x)\}_{i=1}^{m}$, namely, $\lambda_{i,t}=0,\forall i\in[m]$.

When $\|\nabla F\left(x_{t}\right)\|>e$, the solution $g(x_{t})$ of Eq.(12) is expressed as:

g(x_{t})=\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t}),

(13)

where $\{\lambda_{i,t}\}_{i=1}^{m}$ is the solution of the following dual problem:

\max_{\lambda_{i,t}\in\mathbb{R}_{+}^{m}}-\frac{1}{2}\|\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t})\|^{2}+\sum_{i=1}^{m}\lambda_{i,t}\phi_{t}.

(14)
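For completeness, the step from Eq.(12) to Eqs.(13) and (14) can be sketched with standard Lagrangian duality. The derivation below is a reconstruction under the assumption that strong duality holds for this convex quadratic program; the symbol $\mathcal{L}$ for the Lagrangian is introduced here for illustration.

```latex
% Lagrangian of Eq.(12) with multipliers \lambda_{i,t} \ge 0:
\mathcal{L}(g,\lambda)
  = \frac{1}{2}\|g-\epsilon_{\theta}^{\ast}(x_{t},t)\|^{2}
  - \sum_{i=1}^{m}\lambda_{i,t}\left(\nabla f_{i}(x_{t})^{\top}g-\phi_{t}\right).

% Stationarity in g recovers Eq.(13):
\nabla_{g}\mathcal{L}=0
  \;\Longrightarrow\;
  g = \epsilon_{\theta}^{\ast}(x_{t},t)+\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t}).

% Substituting back gives the dual function:
\min_{g}\mathcal{L}(g,\lambda)
  = -\frac{1}{2}\Big\|\epsilon_{\theta}^{\ast}(x_{t},t)
      +\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t})\Big\|^{2}
    +\sum_{i=1}^{m}\lambda_{i,t}\phi_{t}
    +\frac{1}{2}\|\epsilon_{\theta}^{\ast}(x_{t},t)\|^{2}.

% The last term does not depend on \lambda, so maximizing over
% \lambda_{i,t} \ge 0 is exactly the dual problem in Eq.(14).
```

The dual has only $m$ variables, one per objective, so it is much cheaper to solve than the $d$-dimensional primal problem in Eq.(12).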

Substituting the derived gradient $g(x_{t})$ (Eq.(13)) into Eq.(9) and adopting a dynamic step size $\eta_{t}$, we obtain a new controllable diffusion model, named the PaRetO-gUided Diffusion model (PROUD):

x_{t-1}=x_{t}-\eta_{t}\left(\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t})\right)+\sqrt{2\eta_{t}}z.

(15)

PROUD does not modify the training process of diffusion models but only updates gradients during the generative process, as summarized in Algorithm 1. Therefore, our PROUD can be plugged into any pre-trained diffusion model to gain post-hoc control during the generative process.

In contrast to existing methods that crudely combine generative models with multi-objective optimization techniques using a predefined balance coefficient, our constrained optimization formulation (Eq.(8)) allows the balance coefficients to be inferred dynamically (Eq.(14)), prioritizing the guarantee of Pareto optimality.

Algorithm 1 Pareto-guided Reverse Diffusion Process for a Single Sample

Input: a pre-trained unconditional diffusion model $\epsilon_{\theta}^{\ast}$, the dynamic step sizes $\{\eta_{t}\}_{t=1}^{T}$, multiple property objectives $\{f_{i}\}_{i=1}^{m}$
Hyper-parameters: $\alpha$ and $e$ in Eq.(12).
Initialize: $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
for $t=T,T-1,\ldots,1$ do
    calculate the multiple gradient descent direction $\nabla F(x_{t})$ based on Eq.(4);
    if $\|\nabla F(x_{t})\|>e$ then   # calculate the weight coefficients
        $\{\lambda_{i,t}\}_{i=1}^{m}$ takes the solution of Eq.(14) with $\phi_{t}=\alpha\|\nabla F(x_{t})\|$;
    else
        $\lambda_{i,t}=0,\forall i\in[m]$;
    end if
    calculate the denoising gradient $g(x_{t})=\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t})$ as in Eq.(13);
    sample $z\sim\mathcal{N}(\mathbf{0},\mathbf{I})$;
    denoise the sample: $x_{t-1}=x_{t}-\eta_{t}g(x_{t})+\sqrt{2\eta_{t}}z$;
end for
Output: the sample $x_{0}$, which meets Pareto optimality of the $m$ objectives.
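The loop body of Algorithm 1 can be sketched for two objectives as follows. This is an illustration under stated assumptions rather than the authors' implementation: the noise prediction `eps` and the objective gradients `grads` are passed in as plain lists, $\nabla F$ uses the closed-form min-norm solution for $m=2$, and the dual problem of Eq.(14) is solved by projected gradient ascent, a solver choice made here for simplicity. All function names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_direction(g1, g2):
    """Min-norm convex combination of two gradients (Eq.(4) with m = 2)."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    w1 = 0.5 if denom == 0.0 else min(1.0, max(0.0, -dot(diff, g2) / denom))
    return [w1 * a + (1.0 - w1) * b for a, b in zip(g1, g2)]


def pareto_guided_gradient(eps, grads, alpha, e, iters=500):
    """Return (g, lambdas) for one denoising step, following Eqs.(12)-(14)."""
    lambdas = [0.0] * len(grads)
    v = min_norm_direction(grads[0], grads[1])
    mgd_norm = math.sqrt(dot(v, v))
    if mgd_norm <= e:
        # Close to the Pareto front: follow the diffusion gradient only.
        return list(eps), lambdas
    phi = alpha * mgd_norm
    # Step size 1 / trace(Gram matrix), a safe bound for the dual's curvature.
    step = 1.0 / sum(dot(gi, gi) for gi in grads)
    g = list(eps)
    for _ in range(iters):
        # Dual gradient w.r.t. lambda_i is phi - grad_i^T g; project onto lambda >= 0.
        lambdas = [max(0.0, lam + step * (phi - dot(gi, g)))
                   for lam, gi in zip(lambdas, grads)]
        # Primal recovery, Eq.(13).
        g = [ek + sum(lam * gi[k] for lam, gi in zip(lambdas, grads))
             for k, ek in enumerate(eps)]
    return g, lambdas


def proud_step(x, eps, grads, eta, alpha, e, rng):
    """One update x_{t-1} = x_t - eta * g + sqrt(2 * eta) * z, as in Eq.(15)."""
    g, _ = pareto_guided_gradient(eps, grads, alpha, e)
    return [xk - eta * gk + math.sqrt(2.0 * eta) * rng.gauss(0.0, 1.0)
            for xk, gk in zip(x, g)]


if __name__ == "__main__":
    eps = [1.0, 0.0]                    # hypothetical noise prediction
    grads = [[0.0, 1.0], [1.0, 1.0]]    # hypothetical gradients of f_1 and f_2
    g, lam = pareto_guided_gradient(eps, grads, alpha=0.5, e=0.1)
    print(g, lam)  # g is close to [1.0, 0.5]; only the first constraint is active
    print(proud_step([0.0, 0.0], eps, grads, 0.01, 0.5, 0.1, random.Random(0)))
```

When $\|\nabla F(x_{t})\|>e$, the direction $\alpha\nabla F(x_{t})/\|\nabla F(x_{t})\|$ satisfies every constraint of Eq.(12), because the min-norm point obeys $\nabla f_{i}(x_{t})^{\top}\nabla F(x_{t})\geq\|\nabla F(x_{t})\|^{2}$ on the closed simplex. The quadratic program is therefore feasible and the dual ascent has a finite optimum.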

4.4 Diversity Regularization for Diversified Pareto Solutions

In practice, MGD integrated with Langevin dynamics fails to obtain diversified Pareto solutions, although it is guaranteed to obtain solutions on the Pareto front [33]. To distribute the solutions evenly on the Pareto front, we add a diversity regularization, which can be enforced either in the sample space or in the functionality space. Because we are interested in high-dimensional data generation, imposing larger distances between samples can be challenging. Furthermore, a significant separation between samples does not necessarily ensure a substantial distinction between their respective functionalities. Therefore, we define the diversity regularization based on the objective values.

Suppose there are $N$ particles $\{x^{1},x^{2},\ldots,x^{N}\}$ in each step of our PROUD. We omit the subscript $t$ of the time step for simplicity. The diversity loss is defined to encourage the dissimilarity of the objective values:

l(x^{1},x^{2},\ldots,x^{N})=\sum_{i\neq j}\frac{1}{\|F(x^{i})-F(x^{j})\|^{2}}.

(16)

The diversity loss Eq.(16) is added to the main objective in Eq.(8) with a weight coefficient $\gamma$ .
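The diversity loss of Eq.(16) is straightforward to compute from the objective values of the particles. The following minimal sketch assumes the objective vectors $F(x^{i})$ have already been evaluated; the small constant `eps` guarding against division by zero is an addition made here for numerical safety and is not part of Eq.(16).

```python
def diversity_loss(objective_values, eps=1e-12):
    """Sum over ordered pairs i != j of 1 / ||F(x^i) - F(x^j)||^2, as in Eq.(16)."""
    total = 0.0
    n = len(objective_values)
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2
                              for a, b in zip(objective_values[i], objective_values[j]))
                total += 1.0 / (sq_dist + eps)
    return total


if __name__ == "__main__":
    spread = [(0.0, 0.0), (1.0, 0.0)]      # hypothetical well-separated objective values
    clustered = [(0.0, 0.0), (0.1, 0.0)]   # hypothetical nearly identical objective values
    print(diversity_loss(spread), diversity_loss(clustered))
```

Particles whose objective values nearly coincide dominate the sum, so minimizing the loss pushes them apart in the functionality space.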

5 Experiments

In this section, we evaluate the effectiveness of our PROUD in optimizing image generation and protein generation with multiple conflicting objectives. We study white-box multi-objectives in this work and particularly focus on using MGD as the MOO technique to obtain the gradient from multi-objectives. The exploration of the black-box setting, as mentioned in [49], is discussed in the conclusion and remains for future work.

Dataset. In the task of image generation, we use the CIFAR10 [29] dataset, which consists of 60,000 color images, each with a size of $3\times 32\times 32$ , distributed across 10 classes. Regarding protein generation, following Gruver et al [19], the experiment was conducted on the paired Observed Antibody Space (pOAS) dataset [36], which comprises $90,990$ antibody sequences, each processed to a fixed length of 300.

Baselines. First, we include the most closely related SOTA work in MOG, which applies the MOO technique to a deep generative model [19]. This baseline is termed “DM+ $m$ -MGD”, where the MGD of $m$ objectives is used to guide the generation of diffusion models (DM). We also include a single-objective generation baseline, termed “DM+single”, which fuses multiple objectives into a single objective and uses its gradient to guide the generation of diffusion models. Another baseline is “ $m+1$ -MGD”, which treats the objective of the diffusion model as an additional objective and formulates multi-objective generation as the optimization of $m+1$ objectives; MGD is then applied directly to the resulting $m+1$ objectives. To stress the necessity of quality assurance in the generation problem, which is the core difference between MOG and MOO, we include the MGD of $m$ objectives alone as a baseline, called “ $m$ -MGD”.

For all methods equipped with MGD, the diversity regularization (Eq.(16)) is included except for $m+1$ -MGD since its extra objective $f_{m+1}(x)$ , i.e., data likelihood, is not accessible for the diffusion models.

Metrics. In terms of generation quality, the Frechet Inception Distance (FID) [21] is adopted as the metric for image quality, while the log-likelihood assigned by ProtGPT [16] is considered as the metric for the quality of protein sequences following Gruver et al [19]. Concerning Pareto optimality, Hypervolume (HV) [62] is adopted to measure how well the methods approximate the Pareto set.

5.1 Image Generation

We follow Liu et al [34]³ to optimize CIFAR10 images with objectives that force the middle of an image to be a square of a specified color.

³ As demonstrated in Section 3 and Figure 3(b) of their study, an objective that forces the center of generated images to be a black square can be used for constrained sampling on CIFAR10. Accordingly, they obtain samples that lie on the CIFAR10 data manifold and exhibit the black square in the middle, such as “black plane” and “black dog” images which contain a black square (smaller than the object) in the middle. This task can be considered as image outpainting [59], namely, extrapolating images based on specified color patches on CIFAR10.

(1) Controllable generation on CIFAR10 with two objectives (Fig. 1(b)):

$\bullet$ $f_{1}(x)=\|x_{\Omega}-1_{\Omega}\|_{2}^{2}$ , where $x$ represents the entire image and $x_{\Omega}\subseteq x$ is the image patch in the region $\Omega$ , corresponding to the square at the center of the image. Similar to the practical relevance shown in Liu et al [34], this objective restricts the center of the generated images to be a white square, i.e., it samples CIFAR10 images that are white in the middle. The patch size is set to $3\times 8\times 8$ in the experiment.
$\bullet$ $f_{2}(x)=\|x_{\Omega}-0.5_{\Omega}\|_{2}^{2}$ with a similar setting. This objective constrains the center to be a grey square.

The desired generation for these two objectives would be those CIFAR10-like images whose middle patches have normalized RGB color values⁴ between [0.5, 0.5, 0.5] (grey) and [1, 1, 1] (white), according to Ishibuchi et al [24] and Li et al [30]. Please refer to Appendix B for more details.

⁴ RGB values [0, 255] are divided by 255.

(2) Controllable generation on CIFAR10 with three objectives:

$\bullet$ $f_{1}(x)=\|x_{\Omega}-a_{\Omega}\|_{2}^{2}$ , where $x$ represents the entire image and $x_{\Omega}\subseteq x$ is the image patch in the region $\Omega$ , corresponding to the square at the center of the image. This objective restricts the center of the generated images to be a black square, with $a_{\Omega}=[0,0,0]_{8\times 8}$ . The patch size is set to $3\times 8\times 8$ in the experiment.
$\bullet$ $f_{2}(x)=\|x_{\Omega}-b_{\Omega}\|_{2}^{2}$ with a similar setting. This objective constrains the center to be a deep red square, with $b_{\Omega}=[0.5,0,0]_{8\times 8}$ .
$\bullet$ $f_{3}(x)=\|x_{\Omega}-c_{\Omega}\|_{2}^{2}$ with a similar setting. This objective constrains the center to be a deep yellow square, with $c_{\Omega}=[0.5,0.5,0]_{8\times 8}$ .

The desired generation for these three objectives would be those CIFAR10-like images whose middle patches have normalized RGB color values inside the triangle (convex hull) formed by the points [0, 0, 0] (black), [0.5, 0, 0] (deep red) and [0.5, 0.5, 0] (deep yellow). Please refer to Appendix B for more details. We adopt the diffusion model used in Song and Ermon [45] as the backbone for CIFAR10 image generation.
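The patch objectives above can be sketched as follows in plain Python. The helper names `center_patch` and `patch_objective`, the nested-list image layout `image[channel][row][col]`, and the centering rule are assumptions of this sketch. The squared distance is additionally averaged over the patch entries; this normalization is an assumption, chosen so that the values fall on the scale of the compromise point (0.0625, 0.0625) and the reference point [0.25, 0.25] quoted elsewhere in the paper.

```python
def center_patch(image, size=8):
    """Return the size x size center patch of image[channel][row][col]."""
    height = len(image[0])
    width = len(image[0][0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [[row[left:left + size] for row in channel[top:top + size]]
            for channel in image]


def patch_objective(image, target_rgb, size=8):
    """Mean squared distance between the center patch and a constant color.

    target_rgb : normalized (r, g, b) target, e.g. (1, 1, 1) for white
    Averaging over entries (instead of summing) is an assumption of this sketch.
    """
    patch = center_patch(image, size)
    total, count = 0.0, 0
    for channel, target in zip(patch, target_rgb):
        for row in channel:
            for value in row:
                total += (value - target) ** 2
                count += 1
    return total / count


# A 3 x 32 x 32 image filled with 0.75 sits halfway between white and grey.
image = [[[0.75] * 32 for _ in range(32)] for _ in range(3)]
f1 = patch_objective(image, (1.0, 1.0, 1.0))   # white target
f2 = patch_objective(image, (0.5, 0.5, 0.5))   # grey target
```

The three-objective setting reuses the same function with the targets `(0, 0, 0)`, `(0.5, 0, 0)` and `(0.5, 0.5, 0)`.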

We sample images from our PROUD and other baselines using the same seeds for the sake of comparison. From Fig. 2, we observe that: (1) our PROUD and two baselines, DM+ $m$ -MGD and $m$ -MGD, can successfully generate harmonious images consistent with the patch-level constraints imposed by two conflicting objectives. Among them, the generated images of our PROUD exhibit better quality than those of DM+ $m$ -MGD in some instances, as the latter tends to sacrifice generation quality to excessively meet Pareto optimality of the objectives, lacking a mechanism to emphasize the quality of generated samples. (2) $m+1$ -MGD fails to generate satisfactory images consistent with the patch-level constraints, as the new objective (i.e., generation quality) biases the optimization of the original two objectives. Although the Pareto set of the original $m$ objectives resides within that of the $m+1$ objectives [51], the proportion is negligible even when sampling a large number of images. Refer to Fig. 3(d)&(f) and Fig. 4(d)&(f) for more details. (3) $m$ -MGD, which does not consider generation quality in its optimization, generates meaningless images, because the optimization of multiple objectives in the data generation task is only meaningful within the data manifold: image data usually concentrate on low-dimensional manifolds embedded in a high-dimensional space.

For the MOG setting on CIFAR10 optimized with two objectives, we randomly select 1,000 generated images for each method and calculate their objective values $[f_{1}(x),f_{2}(x)]$ . Fig. 3 shows that: (1) our PROUD (Fig. 3(a)) and two baselines, DM+ $m$ -MGD (Fig. 3(b)) and $m$ -MGD (Fig. 3(e)), successfully generate samples that cover the entire Pareto front. Among them, our PROUD and $m$ -MGD spread more evenly over the Pareto front. (2) DM+single only covers part of the Pareto front, as shown in Fig. 3(c), because simply averaging multiple objectives into a single objective fails to explore the trade-off between them and leads to insufficient solutions. (3) As discussed for Fig. 2, $m+1$ -MGD explores a much larger solution space (Fig. 3(f)), while only a few of its samples are located on the Pareto front of the original $m$ objectives (Fig. 3(d)).

For the MOG setting on CIFAR10 optimized with three objectives, we randomly select 5,000 generated images for each method and calculate their objective values $[f_{1}(x),f_{2}(x),f_{3}(x)]$ . Fig. 4 shows that our PROUD exhibits significant superiority in evenly covering the Pareto front under this more challenging setting. This is because our constrained optimization formulation can better coordinate the generation quality and the optimization of multiple objectives, while ensuring sample diversity (Eq.(8), Eq.(16)). Although it is possible to force the two baselines DM+ $m$ -MGD and $m$ -MGD to exhibit better diversity by setting a large diversity coefficient $\gamma$ , this would cause the samples they generate to violate Pareto optimality, as shown in Fig. 8 and Fig. 9 in the Appendix.

Table 2: Quantitative evaluation for Pareto approximation and generation quality. Bolded values and underlined values indicate the best results and the second best results, respectively. The Friedman & Nemenyi test in Appendix B demonstrates that our PROUD is significantly better than other baselines. “-” denotes that the value is not available as no valid data are generated.

Method	CIFAR10 (2-obj)		CIFAR10 (3-obj)		pOAS
Method	HV $\uparrow$ ( $10^{-2}$ )	FID $\downarrow$	HV $\uparrow$ ( $10^{-3}$ )	FID $\downarrow$	HV $\uparrow$	ProtGPT $\uparrow$
PROUD (ours)	5.21 $\pm$ 0.00	31.39 $\pm$ 0.05	3.26 $\pm$ 0.00	44.22 $\pm$ 0.13	2472.55 $\pm$ 60.15	-645.93 $\pm$ 0.99
DM+ $m$ -MGD	5.20 $\pm$ 0.01	38.72 $\pm$ 0.36	3.26 $\pm$ 0.01	49.90 $\pm$ 0.14	2289.61 $\pm$ 65.12	-692.80 $\pm$ 0.34
DM+single	4.77 $\pm$ 0.01	36.35 $\pm$ 0.47	2.21 $\pm$ 0.00	57.77 $\pm$ 0.05	2302.21 $\pm$ 58.25	-682.26 $\pm$ 0.49
$m+1$ -MGD	5.17 $\pm$ 0.00	11.21 $\pm$ 0.10	2.87 $\pm$ 0.03	11.80 $\pm$ 0.05	838.74 $\pm$ 14.08	-662.86 $\pm$ 0.76
$m$ -MGD	5.21 $\pm$ 0.00	-	3.26 $\pm$ 0.01	-	-	-

To further demonstrate the superiority of our PROUD on multi-objective generation, we report the quantitative evaluation for Pareto approximation and image quality in the left part of Table 2, obtained by sampling $50,000$ images. Our PROUD achieves the best or the second-best value on both metrics, i.e., HV for Pareto approximation and FID for image quality. This supports our claim that PROUD can provide a degree of quality assurance for generated samples approaching the Pareto set of multiple properties. In contrast, both the single-objective and the multi-objective generation baselines, i.e., DM+single and DM+ $m$ -MGD, sacrifice generation quality to excessively optimize the objectives.

5.2 Protein Sequence Generation

To further verify our model in a more challenging application, we design a multi-objective generation task on the pOAS dataset, which aims to optimize two conflicting objectives for antibody sequences:

• $f_{1}(x)$ , the solvent accessible surface area (SASA) of the protein’s predicted structure. Please refer to Ruffolo et al [39] for the detailed procedure for calculating the SASA value from protein sequences.
• $f_{2}(x)$ , the percentage of beta sheets (%Sheets), which is measured on protein sequences directly [7].

The ground-truth Pareto front is not available due to the complexity of the property objectives. Since the evaluation functions for SASA and %Sheets are not differentiable, we adopt network predictors as differentiable surrogate functions for all methods. We apply the ground-truth evaluation functions when calculating the HV values on the generated samples. We adopt the discrete diffusion model in Gruver et al [19] as the backbone for protein sequence generation.

To demonstrate the superiority of our PROUD in multi-objective protein generation, we first sample $5,000$ protein sequences for each method and collect the non-dominated samples based on their two target properties, as depicted in Fig. 5. The observations are as follows: (1) DM+single exhibits a wide coverage of the objective values. This could be attributed to the fact that the noise in discrete diffusion models can bring about large diversity [19]. By incorporating MGD into diffusion models, PROUD and DM+ $m$ -MGD achieve larger coverage of the objective values, which verifies the superiority of MOG over SOG. Our PROUD and DM+ $m$ -MGD each emphasize Pareto improvement of the objectives; nevertheless, Table 2 shows that our PROUD achieves a better HV. (2) Similar to the image generation task, $m+1$ -MGD yields a much poorer approximation of the Pareto front for the original $m$ objectives. Meanwhile, $m$ -MGD even fails to generate any valid protein sequences, as the SASA evaluation ( $f_{1}(x)$ ) is undefined for all of its generated samples. This further highlights the difference between MOG and MOO.
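Collecting the non-dominated samples from a batch of evaluated sequences can be sketched as below. The sketch assumes that every objective is to be minimized; an objective that should be maximized would first be negated. The function names `dominates` and `non_dominated` are illustrative, and the quadratic-time scan is chosen for clarity rather than speed.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    Assumes all objectives are minimized.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Keep the points that no other point in the list dominates."""
    kept = []
    for i, p in enumerate(points):
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i):
            kept.append(p)
    return kept


# Toy objective vectors (f1, f2); the values are purely illustrative.
samples = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (2.0, 5.0)]
front = non_dominated(samples)
```

For a few thousand samples and two objectives this direct scan is usually fast enough; sorting by the first objective would reduce the cost if larger batches were needed.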

Furthermore, we report the quantitative evaluation for Pareto approximation and protein quality in the right part of Table 2, obtained by sampling $5,000$ protein sequences⁵. Benefiting from our constrained-optimization formulation, our PROUD can avoid unnecessary loss of protein quality compared to its MOG/SOG counterparts, DM+ $m$ -MGD and DM+single. This improvement increases the practicality of its generated samples.

⁵ We only sample $5,000$ protein sequences since the computation cost of SASA values is very high.

Table 3: Sensitivity analysis on $\alpha$ and $e$ in Eq.(12). We retain more decimal places here to demonstrate the subtle differences between results.

Metric	$\gamma=0.2,\;e=0.03$			$\gamma=0.2,\;\alpha=0.5$
Metric	$\alpha=0.1$	$\alpha=0.5$	$\alpha=1$	$e=0.01$	$e=0.03$	$e=0.05$
FID	31.58963073	31.48232218	31.5896311	31.58966697	31.48232218	31.58966696
HV	0.05211343	0.05211350	0.05211343	0.05211343	0.05211350	0.05211343

Table 4: Sensitivity analysis on $\gamma$ . $\alpha$ and $e$ are set to $0.5$ and $0.03$ , respectively. The best results are marked in bold.

Metric	$\gamma=0$	$\gamma=0.01$	$\gamma=0.1$	$\gamma=0.2$	$\gamma=0.3$	$\gamma=1$
FID	34.80	30.98	31.80	31.48	31.63	33.59
HV	0.0483	0.0498	0.0521	0.0521	0.0521	0.0521

5.3 Hyper-parameter Sensitivity Study

We study PROUD with different configurations of the hyper-parameters, namely, $\alpha$ and $e$ in Eq.(12) as well as the diversity coefficient $\gamma$ in Eq.(16). The experiments are conducted on CIFAR10, with the same setting as Section 5.1.

We set $\alpha$ as $0.1,0.5,1$ and $e$ as $0.01,0.03,0.05$ , respectively. We observe in Table 3 that PROUD is not sensitive to the choice of the hyper-parameters $\alpha$ and $e$ .

We set $\gamma$ as 0, 0.1, 0.2, 1. The results are summarized in Fig. 6(a) to Fig. 6(d), showing that: (1) With an appropriate diversity coefficient, our PROUD covers the Pareto front well. (2) Without the diversity regularization, PROUD only obtains a small set of Pareto solutions. This demonstrates the necessity of the diversity loss, consistent with the finding in prior work [33]. (3) With too large a value of $\gamma$ , the generated samples can fall outside the Pareto front. The effect of the diversity coefficient on DM+ $m$ -MGD (Fig. 6(e) to Fig. 6(h)) is similar.

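To further investigate the effect of the diversity coefficient on generation quality, we collect FID and HV results in Table 4. With $\gamma=0.2$ , PROUD attains the highest HV (0.0521) and the lowest FID (31.48) among the settings that reach this HV; $\gamma=0.01$ yields a slightly lower FID (30.98) but a clearly lower HV (0.0498). We therefore use $\gamma=0.2$ in Section 5.1.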

To demonstrate that single-objective generation fails to cover the Pareto front even with a uniform grid of weights, we vary the weight coefficient $w$ that combines the two objectives into the single objective of DM+single, “ $w\times f_{1}(x)+(1-w)\times f_{2}(x)$ ”, from 0 to 1 with a step of $0.1$ . We show the results for $w=0,0.1,0.5,1$ in Fig. 6(i) to Fig. 6(l) and the rest in Appendix A. With $w=0,0.1,0.2,0.3,0.4$ , the single objective is dominated by $f_{2}(x)$ . Consequently, the generated samples achieve the smallest value for $f_{2}(x)$ but the largest one for $f_{1}(x)$ ; the reverse holds when $w>0.5$ . With an equal weight, the generated samples are supposed to attain the compromise value between the two objectives, i.e., (0.0625, 0.0625). We notice that the generated samples cover a small range around this point. This diversity could result from the diffusion noise in diffusion models.
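The compromise value (0.0625, 0.0625) for the equal-weight case can be checked with a short calculation. The sketch below assumes that each objective is averaged over the patch entries, so that a single patch entry $v$ is enough to carry the argument; this normalization is an assumption made here to match the quoted value.

```latex
% Assumption: f_1 and f_2 are averaged over the patch, so one entry v suffices.
\begin{align*}
\tfrac{1}{2}f_{1}(v)+\tfrac{1}{2}f_{2}(v)
  &= \tfrac{1}{2}(v-1)^{2}+\tfrac{1}{2}(v-0.5)^{2},\\
\frac{d}{dv}\Big[\tfrac{1}{2}(v-1)^{2}+\tfrac{1}{2}(v-0.5)^{2}\Big]
  &= (v-1)+(v-0.5)=0
  \;\Longrightarrow\; v^{\ast}=0.75,\\
f_{1}(v^{\ast}) &= (0.75-1)^{2}=0.0625,\qquad
f_{2}(v^{\ast}) = (0.75-0.5)^{2}=0.0625.
\end{align*}
```

The minimizer $v^{\ast}=0.75$ is the midpoint between the grey and white targets, which is why both objectives take the same value there.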

6 Conclusion

This paper studies the problem of optimizing deep generative models with multiple conflicting objectives. We highlight this problem setting by treating the optimization of samples with multiple properties and the process of sample generation as a unified task. By analyzing the connections and differences from multi-objective optimization, we introduce a constrained optimization formulation to solve the multi-objective generation problem, based on which we developed PROUD. Our experiments demonstrate the efficacy of PROUD in both image and protein sequence generation. While we explored the white-box multi-objectives in this work, it would be interesting to explore our PROUD in the black-box setting in the future. The multiple gradient descent technique used can be replaced by methods such as Bayesian multi-objective optimization [49].

Declarations

Funding. This work was supported by the Agency for Science, Technology and Research (A*STAR) Centre for Frontier AI Research, the A*STAR GAP project (Grant No.I23D1AG079), and the AISG Grand Challenge in AI for Materials Discovery (Grant No. AISG2-GC-2023-010).
Competing interests. The authors have no financial or non-financial interests to disclose that are relevant to the content of this article.
Ethics approval. Not applicable.
Consent to participate. Not applicable.
Consent to publish. Not applicable.
Availability of data and materials. All datasets used in this work are available online and clearly cited.
Code availability. The code of this work will be publicly released on GitHub.
Authors’ contributions. Idea: YY; Methodology & Experiment: YY, YP; Writing - comments/edits: all.

References

Afshari et al [2019] Afshari H, Hare W, Tesfamariam S (2019) Constrained multi-objective optimization algorithms: Review and comparison with application in reinforced concrete structures. Applied Soft Computing 83:105631. 10.1016/J.ASOC.2019.105631
Andrieu et al [2003] Andrieu C, De Freitas N, Doucet A, et al (2003) An introduction to mcmc for machine learning. Machine learning 50:5–43. 10.1023/A:1020281327116
Arjovsky et al [2017] Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp 214–223, URL https://proceedings.mlr.press/v70/arjovsky17a.html
Borghi et al [2023] Borghi G, Herty M, Pareschi L (2023) An adaptive consensus based method for multi-objective optimization with uniform pareto front approximation. Applied Mathematics & Optimization 88(2):58. 10.1007/s00245-023-10036-y
Cheng et al [2017] Cheng R, Li M, Tian Y, et al (2017) A benchmark test suite for evolutionary many-objective optimization. Complex & Intelligent Systems 3:67–81. 10.1007/s40747-017-0039-7
Chinchuluun and Pardalos [2007] Chinchuluun A, Pardalos PM (2007) A survey of recent developments in multiobjective optimization. Annals of Operations Research 154(1):29–50. 10.1007/S10479-007-0186-0
Cock et al [2009] Cock PJ, Antao T, Chang JT, et al (2009) Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11):1422–1423. 10.1093/bioinformatics/btp163
Dathathri et al [2020] Dathathri S, Madotto A, Lan J, et al (2020) Plug and play language models: A simple approach to controlled text generation. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=H1edEyBKDS
Deb [2001] Deb K (2001) Multi-objective optimization using evolutionary algorithms, vol 16. John Wiley & Sons
Demšar [2006] Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. The Journal of Machine learning research 7:1–30. URL http://jmlr.org/papers/v7/demsar06a.html
Deng et al [2020] Deng Y, Yang J, Chen D, et al (2020) Disentangled and controllable face image generation via 3d imitative-contrastive learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5154–5163, 10.1109/CVPR42600.2020.00520
Désidéri [2012] Désidéri JA (2012) Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus Mathematique 350(5-6):313–318. 10.1016/j.crma.2012.03.014
Désidéri [2018] Désidéri JA (2018) Quasi-riemannian multiple gradient descent algorithm for constrained multiobjective differential optimization. PhD thesis, Inria Sophia-Antipolis; Project-Team Acumes, URL https://inria.hal.science/hal-01740075
Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. In: Advances in Neural Information Processing Systems, pp 8780–8794, URL https://proceedings.neurips.cc/paper_files/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf
Fefferman et al [2016] Fefferman C, Mitter S, Narayanan H (2016) Testing the manifold hypothesis. Journal of the American Mathematical Society 29(4):983–1049. 10.1090/jams/852
Ferruz et al [2022] Ferruz N, Schmidt S, Höcker B (2022) Protgpt2 is a deep unsupervised language model for protein design. Nature Communications 13(1):4348. 10.1038/s41467-022-32007-7
Gong et al [2021] Gong C, Liu X, Liu Q (2021) Bi-objective trade-off with dynamic barrier gradient descent. In: Advances in Neural Information Processing Systems, pp 29630–29642, URL https://proceedings.neurips.cc/paper_files/paper/2021/file/f7b027d45fd7484f6d0833823b98907e-Paper.pdf
Goodfellow et al [2014] Goodfellow IJ, Pouget-Abadie J, Mirza M, et al (2014) Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp 2672–2680, URL https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
Gruver et al [2023] Gruver N, Stanton S, Frey NC, et al (2023) Protein design with guided discrete diffusion. In: Advances in Neural Information Processing Systems, pp 12489–12517, URL https://proceedings.neurips.cc/paper_files/paper/2023/file/29591f355702c3f4436991335784b503-Paper-Conference.pdf
Guo et al [2020] Guo X, Du Y, Zhao L (2020) Property controllable variational autoencoder via invertible mutual dependence. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=tYxG_OMs9WE
Heusel et al [2017] Heusel M, Ramsauer H, Unterthiner T, et al (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems, pp 6629–6640, URL https://proceedings.neurips.cc/paper_files/paper/2017/file/8a1d694707eb0fefe65871369074926d-Paper.pdf
Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems, pp 6840–6851, URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
Ishibuchi et al [2008] Ishibuchi H, Tsukamoto N, Nojima Y (2008) Evolutionary many-objective optimization: A short review. In: IEEE Congress on Evolutionary Computation, pp 2419–2426, 10.1109/CEC.2008.4631121
Ishibuchi et al [2013] Ishibuchi H, Yamane M, Akedo N, et al (2013) Many-objective and many-variable test problems for visual examination of multiobjective search. In: IEEE Congress on Evolutionary Computation, pp 1491–1498, 10.1109/CEC.2013.6557739
Jain et al [2023] Jain M, Raparthy SC, Hernández-García A, et al (2023) Multi-objective gflownets. In: International Conference on Machine Learning, pp 14631–14653, URL https://proceedings.mlr.press/v202/jain23a.html
Jin et al [2020] Jin W, Barzilay R, Jaakkola T (2020) Multi-objective molecule generation using interpretable substructures. In: International Conference on Machine Learning, pp 4849–4859, URL http://proceedings.mlr.press/v119/jin20b.html
Kingma and Welling [2014] Kingma DP, Welling M (2014) Auto-encoding variational bayes. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=33X9fd2-9FyZd
Klys et al [2018] Klys J, Snell J, Zemel R (2018) Learning latent subspaces in variational autoencoders. In: Advances in Neural Information Processing Systems, pp 6445–6455, URL https://proceedings.neurips.cc/paper_files/paper/2018/file/73e5080f0f3804cb9cf470a8ce895dac-Paper.pdf
Krizhevsky et al [2009] Krizhevsky A, Hinton G, et al (2009) Learning multiple layers of features from tiny images URL https://www.cs.utoronto.ca/~kriz/learning-features-2009-TR.pdf
Li et al [2017] Li M, Grosan C, Yang S, et al (2017) Multiline distance minimization: A visualized many-objective test problem suite. IEEE Transactions on Evolutionary Computation 22(1):61–78. 10.1109/TEVC.2017.2655451
Li et al [2022] Li S, Liu M, Walder C (2022) Editvae: Unsupervised parts-aware controllable 3d point cloud shape generation. In: AAAI Conference on Artificial Intelligence, pp 1386–1394, 10.1609/AAAI.V36I2.20027
Liao et al [2020] Liao Y, Schwarz K, Mescheder L, et al (2020) Towards unsupervised learning of generative models for 3d controllable image synthesis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5871–5880, 10.1109/CVPR42600.2020.00591
Liu et al [2021a] Liu X, Tong X, Liu Q (2021a) Profiling pareto front with multi-objective stein variational gradient descent. In: Advances in Neural Information Processing Systems, pp 14721–14733, URL https://proceedings.neurips.cc/paper/2021/file/7bb16972da003e87724f048d76b7e0e1-Paper.pdf
Liu et al [2021b] Liu X, Tong X, Liu Q (2021b) Sampling with trustworthy constraints: A variational gradient framework. In: Advances in Neural Information Processing Systems, pp 23557–23568, URL https://papers.nips.cc/paper/2021/file/c61aed648da48aa3893fb3eaadd88a7f-Paper.pdf
McInnes et al [2018] McInnes L, Healy J, Melville J (2018) Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint URL http://arxiv.org/abs/1802.03426
Olsen et al [2022] Olsen TH, Boyles F, Deane CM (2022) Observed antibody space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science 31(1):141–146. 10.1002/pro.4205
Papamakarios et al [2021] Papamakarios G, Nalisnick E, Rezende DJ, et al (2021) Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research 22(57):1–64. URL http://jmlr.org/papers/v22/19-1028.html
Roweis and Saul [2000] Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326. 10.1126/science.290.5500.2323
Ruffolo et al [2023] Ruffolo JA, Chu LS, Mahajan SP, et al (2023) Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nature Communications 14(1):2389. 10.5281/zenodo.7709609
Sanchez-Lengeling and Aspuru-Guzik [2018] Sanchez-Lengeling B, Aspuru-Guzik A (2018) Inverse molecular design using machine learning: Generative models for matter engineering. Science 361(6400):360–365. 10.1126/science.aat2663
Sener and Koltun [2018] Sener O, Koltun V (2018) Multi-task learning as multi-objective optimization. In: Advances in Neural Information Processing Systems, pp 525–536, URL https://proceedings.neurips.cc/paper/2018/file/432aca3a1e345e339f35a30c8f65edce-Paper.pdf
Shen et al [2023] Shen MW, Bengio E, Hajiramezanali E, et al (2023) Towards understanding and improving gflownet training. In: International Conference on Machine Learning, pp 30956–30975, URL https://proceedings.mlr.press/v202/shen23a.html
Sohl-Dickstein et al [2015] Sohl-Dickstein J, Weiss E, Maheswaranathan N, et al (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning, pp 2256–2265, URL http://proceedings.mlr.press/v37/sohl-dickstein15.html
Song and Ermon [2019] Song Y, Ermon S (2019) Generative modeling by estimating gradients of the data distribution. In: Advances in Neural Information Processing Systems, pp 11918–11930, URL https://proceedings.neurips.cc/paper_files/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf
Song and Ermon [2020] Song Y, Ermon S (2020) Improved techniques for training score-based generative models. In: Advances in Neural Information Processing Systems, pp 12438–12448, URL https://papers.neurips.cc/paper_files/paper/2020/file/92c3b916311a5517d9290576e3ea37ad-Paper.pdf
Song and Kingma [2021] Song Y, Kingma DP (2021) How to train your energy-based models. arXiv preprint URL https://arxiv.org/abs/2101.03288
Song et al [2021a] Song Y, Durkan C, Murray I, et al (2021a) Maximum likelihood training of score-based diffusion models. In: Advances in Neural Information Processing Systems, pp 1415–1428, URL https://papers.nips.cc/paper/2021/file/0a9fdbb17feb6ccb7ec405cfb85222c4-Paper.pdf
Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations, URL https://openreview.net/forum?id=PxTIG12RRHS
Stanton et al [2022] Stanton S, Maddox W, Gruver N, et al (2022) Accelerating bayesian optimization for biological sequence design with denoising autoencoders. In: International Conference on Machine Learning, pp 20459–20478, URL https://proceedings.mlr.press/v162/stanton22a.html
Tagasovska et al [2022] Tagasovska N, Frey NC, Loukas A, et al (2022) A pareto-optimal compositional energy-based model for sampling and optimization of protein sequences. In: NeurIPS 2022 Workshop AI for Science: Progress and Promises, URL https://openreview.net/forum?id=U2rNXaTTXPQ
Tanabe and Ishibuchi [2020] Tanabe R, Ishibuchi H (2020) An easy-to-use real-world multi-objective optimization problem suite. Applied Soft Computing 89:106078. 10.1016/J.ASOC.2020.106078
Van Veldhuizen et al [1998] Van Veldhuizen DA, Lamont GB, et al (1998) Evolutionary computation and convergence to a pareto front. In: Late breaking papers at the genetic programming 1998 conference, pp 221–228, URL https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=f329eb18a4549daa83fae28043d19b83fe8356fa
Wang et al [2022] Wang S, Guo X, Lin X, et al (2022) Multi-objective deep data generation with correlated property control. In: Advances in Neural Information Processing Systems, pp 28889–28901, URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b9c2e8a0bbed5fcfaf62856a3a719ada-Paper-Conference.pdf
Wang et al [2024] Wang S, Du Y, Guo X, et al (2024) Controllable data generation by deep learning: A review. ACM Comput Surv 56(9). 10.1145/3648609
Wang et al [2023] Wang Z, Zhao L, Xing W (2023) Stylediffusion: Controllable disentangled style transfer via diffusion models. In: IEEE/CVF International Conference on Computer Vision, pp 7677–7689, 10.1109/ICCV51070.2023.00706
Watson et al [2023] Watson JL, Juergens D, Bennett NR, et al (2023) De novo design of protein structure and function with rfdiffusion. Nature 620(7976):1089–1100. 10.1038/s41586-023-06415-8
Welling and Teh [2011] Welling M, Teh YW (2011) Bayesian learning via stochastic gradient langevin dynamics. In: International Conference on Machine Learning, pp 681–688, URL https://icml.cc/2011/papers/398_icmlpaper.pdf
Yang et al [2023] Yang L, Zhang Z, Song Y, et al (2023) Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys 56(4):1–39. 10.1145/3626235
Yao et al [2022] Yao K, Gao P, Yang X, et al (2022) Outpainting by queries. In: European Conference on Computer Vision, pp 153–169, 10.1007/978-3-031-20050-2_10
Ye and Liu [2022] Ye M, Liu Q (2022) Pareto navigation gradient descent: a first-order algorithm for optimization in pareto set. In: Uncertainty in Artificial Intelligence, pp 2246–2255, URL https://proceedings.mlr.press/v180/ye22a.html
Zhang et al [2023] Zhang S, Qian Z, Huang K, et al (2023) Robust generative adversarial network. Machine Learning 112:5135–5161. 10.1007/s10994-023-06367-0
Zitzler and Thiele [1999] Zitzler E, Thiele L (1999) Multiobjective evolutionary algorithms: a comparative case study and the strength pareto approach. IEEE transactions on Evolutionary Computation 3(4):257–271. 10.1109/4235.797969

Appendix A Complete sensitivity analysis for single-objective generation

We vary the weight coefficient $w$ for combining two objectives in DM+single, “ $w\times f_{1}(x)+(1-w)\times f_{2}(x)$ ”, from 0 to 1 with a step of $0.1$ . The results are shown in Fig. 7:

• When $w<0.5$ , the final objective is dominated by $f_{2}(x)$ . Consequently, this leading objective is optimized to the best: all the generated samples have the smallest value for $f_{2}(x)$ but the largest one for $f_{1}(x)$ .

• When $w>0.5$ , the final objective is dominated by $f_{1}(x)$ . Therefore, the generated samples achieve the smallest value for the first objective but the largest one for the second objective.

• When $w=0.5=\frac{1}{m}$ , the generated samples are expected to attain the compromise value between $f_{1}(x)$ and $f_{2}(x)$ , i.e., (0.0625, 0.0625). We notice that the generated samples cover a small range around this point. This diversity could result from the noise injected during diffusion sampling.
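The compromise value quoted for $w=0.5$ can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes the two objectives are per-element mean squared distances of the patch to white (1) and to grey (0.5). That normalization is inferred from the reported value (0.0625, 0.0625) and is not stated in this appendix, and the function names are hypothetical.

```python
def mean_sq_dist(patch, target):
    """Mean squared distance between a flat patch and a constant target value."""
    return sum((v - target) ** 2 for v in patch) / len(patch)


def objectives(patch):
    # Assumed forms (not confirmed by the text): f1 pulls the patch toward
    # white (1.0), f2 pulls it toward grey (0.5).
    return mean_sq_dist(patch, 1.0), mean_sq_dist(patch, 0.5)


def weighted_sum(patch, w):
    """Scalarized objective w * f1 + (1 - w) * f2 used by the DM+single baseline."""
    f1, f2 = objectives(patch)
    return w * f1 + (1.0 - w) * f2


# Hypothetical patch with three elements (one RGB pixel) at the midpoint 0.75.
midpoint_patch = [0.75, 0.75, 0.75]
f1_mid, f2_mid = objectives(midpoint_patch)
print(f1_mid, f2_mid)
```

Under these assumptions the midpoint patch is equally far from both targets, and any constant patch slightly brighter or darker than 0.75 gives a larger equal-weight sum.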

Appendix B More Experimental Settings and Analyses

Image Generation

According to Ishibuchi et al [24] and Li et al [30], we can obtain the following. (Our problem setting is slightly different, as we take the squared distance in order to obtain a non-linear shape of the Pareto front. We also refer the reader to Example 1 in Liu et al [33], which defines the same two-objective problem but with a 1-D decision variable for easier understanding.) (1) The Pareto solutions of the two-objective setting are the points on the line segment between $1_{\Omega}$ and $0.5_{\Omega}$ . Namely, the Pareto solutions are $\{x|x_{\Omega}=\kappa_{\Omega},\kappa_{\Omega}\in[0.5_{\Omega},1_{\Omega}]\}$ , where $[0.5_{\Omega},1_{\Omega}]$ denotes image patches with normalized RGB color values between [0.5, 0.5, 0.5] (grey) and [1, 1, 1] (white). When taking images from CIFAR10 based on the Pareto set (Fig. 12), we follow Liu et al [34] and sample images in a small neighborhood around $\kappa_{\Omega}$ , namely, $\|x_{\Omega}-\kappa_{\Omega}\|_{2}^{2}\leq\epsilon$ , where $\epsilon=8\times 10^{-4}$ . (2) The Pareto solutions of the three-objective setting are the points on the convex polygon formed by the three points $a_{\Omega},b_{\Omega},c_{\Omega}$ . For ease of understanding, we assume $\Omega=3\times 1\times 1$ , which effectively constrains the middle point of CIFAR10 images to certain colors.
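The neighborhood test used to pick CIFAR10 images near the two-objective Pareto set can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the patch $x_{\Omega}$ is given as a flat list of normalized values and that $\kappa_{\Omega}$ is a constant patch with a single grey level in $[0.5, 1]$. The function names are hypothetical.

```python
def nearest_pareto_level(patch, lo=0.5, hi=1.0):
    """Grey level kappa in [lo, hi] whose constant patch is closest to `patch`.

    Projecting onto the direction of the all-ones vector gives the mean;
    clipping the mean projects onto the segment between lo and hi.
    """
    mean = sum(patch) / len(patch)
    return min(max(mean, lo), hi)


def in_pareto_neighborhood(patch, eps=8e-4):
    """True if the squared L2 distance to the nearest Pareto patch is at most eps."""
    kappa = nearest_pareto_level(patch)
    sq_dist = sum((v - kappa) ** 2 for v in patch)
    return sq_dist <= eps


# Hypothetical patches for Omega = 3 x 1 x 1 (one RGB pixel).
near_grey = [0.80, 0.81, 0.79]   # almost a uniform light grey
dark = [0.20, 0.20, 0.20]        # uniform, but darker than 0.5
colorful = [0.90, 0.50, 0.70]    # mean is in range, but far from grey
print(in_pareto_neighborhood(near_grey),
      in_pareto_neighborhood(dark),
      in_pareto_neighborhood(colorful))
```

Because the distance is summed over all elements of the patch, a larger region $\Omega$ makes the same threshold $\epsilon$ harder to satisfy.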

We visualize the Pareto fronts of these two settings in Fig. 11. Specifically, for the two-objective setting, the Pareto optimal points lie on the line between [1, 1, 1] and [0.5, 0.5, 0.5] (Fig. 11(a)), which physically denote RGB values (normalized, i.e., RGB values in [0, 255] divided by 255). Then, we calculate the objective values $[f_{1}(x),f_{2}(x)]$ for these points accordingly, as shown in Fig. 11(b). Fig. 11(c) and (d) are plotted for the three-objective setting in a similar way. According to their Pareto fronts, we select [0.25, 0.25] and [0.2, 0.1, 0.2] as reference points to calculate the hypervolume (HV) for the two-objective setting and the three-objective setting in Table 2, respectively.
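For the two-objective case, the hypervolume with respect to a reference point can be computed with a simple sweep. The sketch below is a generic textbook routine for minimization problems and is not taken from the authors' implementation. The function name and the sample points are hypothetical; only the reference point [0.25, 0.25] comes from the text.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Only points that strictly improve on the reference in both objectives count.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:  # sweep in ascending f1
        if f2 < best_f2:  # point is non-dominated among those seen so far
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


ref_point = (0.25, 0.25)
# Hypothetical objective vectors; the last one is dominated by the second.
samples = [(0.0, 0.2), (0.0625, 0.0625), (0.2, 0.0), (0.1, 0.1)]
hv = hypervolume_2d(samples, ref_point)
print(hv)
```

A larger HV means the generated samples sit closer to, and spread more widely along, the Pareto front relative to the chosen reference point.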

We sample CIFAR10 images using the constraint with different patch sizes to demonstrate its effect in Fig. 13. With a smaller region $\Omega$ , more CIFAR10 images meet the constraint.

Protein Sequence Generation

Our experiments in Section 5.2 adopt the same dataset and objectives as those in Section 5.2 of Gruver et al [19]. Note that we do not include their other experiments, because the experiment in their Section 5.1 is not a generation task equipped with property optimization, and the dataset for the experiments in their Sections 5.3 and 5.4 has not been released because it contains private data. We select $[1\times 10^{4},0]$ as the reference point to calculate the HV for this task.

Justification of Our Experiment Designs

Our experiment designs can appropriately justify the motivation of the MOG problem. Both CIFAR10 and protein datasets are real-world datasets whose data lie on low-dimensional manifolds in high-dimensional space [29, 19], thus applicable to our MOG problem setting. Meanwhile, the objectives considered for CIFAR10 are indeed benchmark multi-objective optimization problems with clear evaluations [24]; the objectives considered for the protein design task represent real-world scenarios [19]. Lastly, Fig. 2 and Table 2 demonstrate the necessity of considering generation quality, as the generation quality of all baseline methods suffers to some extent when optimizing multiple properties.

Significance Test

We apply the Friedman test under the null hypothesis positing that all methods perform similarly, alongside the Nemenyi post-hoc test for pairwise comparisons among the four methods [10]. The number of factors was set to four, given the failure of $m$ -MGD to produce qualified samples, leading to its exclusion. The dataset comprised 30 instances, with each of the four methods independently evaluated five times across three datasets, employing two evaluation criteria. The Friedman test shows that $\tau_{F}=18.24$ , greater than the critical value $F_{3,87}=2.709$ when $\alpha=0.05$ . Therefore, the null hypothesis is rejected, which signifies a statistically significant difference among the four methods at the significance level of 0.05. Subsequent analysis via the Nemenyi post-hoc test in Fig. 14 unequivocally demonstrates that our PROUD exhibits marked superiority over the three baseline methods.

Appendix C Discussions

The constrained MOO problem defines its decision space $S$ on a constrained space expressed using specified linear, nonlinear, or box constraints [1, 13] in $\mathbb{R}^{d}$ . Consequently, it is different from our MOG problems, whose manifold is delineated by a given dataset $\mathcal{X}$ . Nevertheless, MOG problems could be understood as a type of constrained MOO problem in a broader context.

Table 5: Comparison of the MOG problem with the relevant MOO problems. The generation quality in MOG is usually modeled based on a given dataset $X\subset\mathcal{X}$ , where $\mathcal{X}$ denotes a low-dimensional manifold embedded in a high-dimensional space $\mathbb{R}^{d}$ . $F(x)=[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)]$ .

| | objectives | decision/data space | generation quality |
| MOO | $F(x)$ | $x\in\mathbb{R}^{d}$ | ✗ |
| Constrained MOO | $F(x)$ | $x\in S,S\subset\mathbb{R}^{d}$ defined by (non)linear or box constraints | ✗ |
| MOG | $F(x)$ | $x\in\mathcal{X},\mathcal{X}\subset\mathbb{R}^{d}$ | ✓ |