
PROUD: PaRetO-gUided Diffusion Model for Multi-objective Generation

Yinghua Yao (eva.yh.yao@gmail.com), Yuangang Pan [1,2] (yuangang.pan@gmail.com), Jing Li (j.lee9383@gmail.com), Ivor Tsang (ivor.tsang@gmail.com), Xin Yao (xinyao@ln.edu.hk)

[1] Centre for Frontier AI Research, Agency for Science, Technology and Research (A*STAR), 138632, Singapore

[2] Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), 138632, Singapore

[3] Department of Computing and Decision Sciences, Lingnan University, Hong Kong

Abstract

Recent advancements in the realm of deep generative models focus on generating samples that satisfy multiple desired properties. However, prevalent approaches optimize these property functions independently, thus omitting the trade-offs among them. In addition, the property optimization is often improperly integrated into the generative models, resulting in an unnecessary compromise on generation quality (i.e., the quality of generated samples). To address these issues, we formulate a constrained optimization problem. It seeks to optimize generation quality while ensuring that generated samples reside at the Pareto front of multiple property objectives. Such a formulation enables the generation of samples that cannot be further improved simultaneously on the conflicting property functions and preserves good quality of generated samples. Building upon this formulation, we introduce the PaRetO-gUided Diffusion model (PROUD), wherein the gradients in the denoising process are dynamically adjusted to enhance generation quality while the generated samples adhere to Pareto optimality. Experimental evaluations on image generation and protein generation tasks demonstrate that our PROUD consistently maintains superior generation quality while approaching Pareto optimality across multiple property functions compared to various baselines.

keywords:
Multi-objective generation, diffusion model, Pareto optimality, generative model

1 Introduction

Deep generative models have advanced rapidly over the last decade, with progress in variational autoencoders [27], generative adversarial networks [18, 61], normalizing flows [37], energy-based models [46], and diffusion models [44, 22]. In particular, controllable generative models can generate samples that satisfy multiple properties of interest, showing great promise in various applications, such as material design [26, 50] and controlled text/image generation [8, 32]. These properties of interest vary with the application domain. For example, in protein design, the properties can refer to specified structural or functional characteristics, such as solubility or binding affinity [56]. In image generation, the properties can refer to certain attributes or features, such as a specified hairstyle and makeup [55], or specified color patches [34]. In addition, it is considered imperative that generated samples reside in the same data manifold as the training samples, for the sake of data naturalness [19]. (Footnote 1: This relates to the manifold hypothesis that many real-world high-dimensional datasets lie on low-dimensional latent manifolds in the high-dimensional space [15].)

Before delving into details, we first establish the problem setting. We are given a dataset $X\subseteq\mathcal{X}$, where $\mathcal{X}\subset\mathbb{R}^{d}$ denotes a low-dimensional manifold in the high-dimensional space $\mathbb{R}^{d}$. Suppose we have $m$ objective functions $F(x)=[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)]$, each of which returns a property value for the sample $x\in\mathcal{X}$. The aim of multi-objective generation is to learn a generative model that produces samples optimized to achieve the best values across these functions while ensuring that the generated samples remain within the manifold $\mathcal{X}$ (green cross in Fig. 1(a)), namely, ensuring that the quality of generated samples (dubbed generation quality) is good. (Footnote 2: In other words, the generated samples are as realistic as the samples in the given dataset $X$.)

The multi-objective generation problem introduced above inherently requires reconciling optimization challenges in two spaces: the functionality space and the sample space, as shown in Fig. 1(a). Given the need to deal with multiple conflicting objectives in order to achieve generation with the desired properties, the first challenge is how to produce samples that cannot be further improved simultaneously across the objectives, a.k.a. Pareto optimality [6] (the Pareto front in Fig. 1(a)). The second challenge arises from the manifold assumption that the generated samples should lie within the data manifold $\mathcal{X}$, namely, generated samples are supposed to be of good quality [40]. Optimizing multiple objectives without considering generation quality could result in Pareto solutions outside of the data manifold (i.e., invalid samples on the Pareto front of Fig. 1(a)). The third challenge relates to the coordination of generation quality and multi-property optimization. To guarantee generation quality, generative models typically define a divergence between the distribution of generated data and that of the real training data $X$ [58, 18], which tends to disperse the generated data throughout the whole data manifold $\mathcal{X}$ (the purple plane in Fig. 1(a)). However, since only a limited fraction of the samples on the data manifold lie on the Pareto front, a distribution gap inevitably exists between the generated data and the training data, so achieving Pareto optimality comes at some cost in generation quality.

A large number of studies [28, 11, 54, 31] attempt to design controllable generative models with multiple properties by simply assuming that these properties are independent and aggregating the multiple property objectives into a single one, $\sum_{i=1}^{m}f_{i}$, for controlled generation. Notably, a very recent study [19] takes the trade-offs between multiple properties into consideration by incorporating multi-objective optimization techniques into generative models. It modifies the sampling gradient of vanilla diffusion models into a linear combination of the original diffusion gradient and the gradient obtained by multi-objective Bayesian optimization. However, the fixed combination coefficient it adopts makes it difficult to coordinate generation quality with the optimization of multiple property objectives. This results in an unnecessary loss of generation quality while achieving Pareto optimality for the property objectives.

Figure 1: (a) Diagram of multi-objective generation (best viewed in color). Our multi-objective generation aims to produce samples that simultaneously lie on the Pareto front in the functionality space (Left Panel) and remain within the manifold $\mathcal{X}$ in the sample space (Right Panel), i.e., the green cross. (b) Visualization of the image generation task optimized with two objectives on CIFAR10. Images are directly taken from the original CIFAR10 dataset (see full resolution images in Fig. 12), whose objective values lie on the Pareto front, namely, $\{x \mid x\in X,\ F(x)=[f_{1}^{\ast},f_{2}^{\ast}]\in F^{\ast}\}$, where $F^{\ast}$ denotes the points on the Pareto front.

In this work, we propose PaRetO-gUided Diffusion model (PROUD) for multi-objective generation. PROUD is formulated as a constrained optimization that minimizes the Kullback–Leibler (KL) divergence between the distribution of the generated data and that of the training data, where the distribution of the generated data is also constrained to be close to the distribution of Pareto solutions under the KL divergence. This guarantees that generated samples are moved towards the Pareto set and then the quality of these generated samples is optimized to the best within a neighborhood of the Pareto set. Specifically, constrained optimization is implemented during the generative process of a pre-trained unconditional diffusion model. Multiple gradient descents for the multiple objectives and the original diffusion gradient are adaptively weighted to denoise samples. The contributions of this work are summarized as follows:

  • We propose a novel constrained optimization formulation for controllable generation adhering to multiple properties, defined as multi-objective generation, which can better coordinate the generation quality and the optimization for multi-objectives.

  • A new controllable diffusion model (PROUD) is introduced to solve the constrained optimization formulation. The guidance of multiple objectives is adaptively integrated with that of data likelihood, which can reduce the needless compromise of generation quality while achieving Pareto optimality in terms of multiple property objectives.

  • We apply our PROUD to optimizing multiple objectives in the tasks of controllable image generation and protein design. Additionally, we establish various baselines based on diffusion models to demonstrate the superiority of our PROUD.

2 Related Work

In this section, we summarize related work according to the strategy used to integrate the optimization of multiple property objectives into deep generative models.

Single-objective generation (SOG) refers to approaches that simply combine multiple objectives into a single one to guide the generation. Extensive efforts have been devoted to controllable generation with multiple properties independent of each other [28, 20, 26, 11, 54, 31]. Nevertheless, these methods fail to capture the correlation between properties and ignore the conflicting nature among properties, leading to an insufficient exploration of the solution space.

Multi-objective Generation (MOG) refers to approaches that introduce multi-objective optimization techniques into generative models. Wang et al [53] adopted a weighted-sum strategy to deal with the trade-offs between properties. This strategy only works for convex Pareto fronts, and a uniformly distributed grid of weights cannot guarantee uniformly distributed points on the Pareto front [41, 33]. Stanton et al [49] proposed LaMBO (Latent Multi-objective Bayesian Optimization), which applies multi-objective Bayesian optimization in the latent space of a denoising autoencoder to optimize the generated samples with multiple black-box objectives. Although it can characterize the Pareto front, the data generated by the denoising autoencoder is of inferior quality. Gruver et al [19] further applied LaMBO to the latent space of discrete diffusion models, generalizing classifier-guided diffusion models [14] by replacing the classifier gradient with the gradient obtained by LaMBO. Combining the score estimate of a diffusion model with the classifier gradient necessitates manual tuning of the combination coefficient, which is theoretically inappropriate for non-convex functions [17]. Tagasovska et al [50] proposed to use multiple gradient descent [12] for sampling within compositional energy-based models (EBMs), where each EBM is conditioned on one specific property; however, training multiple conditional EBMs requires much more supervision than training discriminative models. Moreover, this paradigm does not allow post-hoc control of pre-trained unconditional generative models. Multi-objective generative flow networks (GFlowNets) [25] fully integrate guidance from multiple objectives into the training process. They must therefore be retrained whenever the objectives change and are not suitable for use with pre-trained generative models. In addition, models of this kind are usually difficult to train [42].
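The limitation of the weighted-sum strategy on non-convex fronts can be illustrated with a toy example. The sketch below uses three hypothetical objective vectors (not taken from any cited work) that are mutually non-dominated, with the middle one lying in a non-convex region of the front; the helper name `weighted_sum_choice` is an assumption made purely for illustration.

```python
from typing import List, Tuple


def weighted_sum_choice(values: List[Tuple[float, float]], w: float) -> int:
    """Index of the candidate minimizing w * f1 + (1 - w) * f2."""
    scores = [w * f1 + (1.0 - w) * f2 for f1, f2 in values]
    return min(range(len(values)), key=scores.__getitem__)


# Hypothetical objective vectors (both objectives are minimized).
# No candidate dominates another, so all three are Pareto optimal
# within this finite set.
values = [(0.0, 1.0), (0.6, 0.6), (1.0, 0.0)]

# Sweep a uniform grid of weights in [0, 1].
chosen = {weighted_sum_choice(values, k / 100) for k in range(101)}
print(sorted(chosen))
```

Under these assumed values, the sweep only ever returns the two extreme candidates: the middle trade-off has a weighted score of about 0.6 for every weight, while one of the extremes always scores at most 0.5.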

Diffusion models [22, 43, 44, 48] represent the state-of-the-art (SOTA) in deep generative models. Therefore, we build our multiple-objective generation model based on diffusion models. While most related works design their methods based on other deep generative models, we apply their ideas to the diffusion model as much as possible for the sake of comparison. Please refer to Section 5 for more details.

3 Preliminaries

Before delving into our method, we introduce the technical background about multi-objective optimization in Section 3.1 and diffusion models in Section 3.2, respectively.

3.1 Multi-objective Optimization

Let $x\in\mathbb{R}^{d}$ be a decision variable, and let $F(x)=\left[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)\right]$ be a set of $m$ objective functions, each of which represents a property and is preferred to have a smaller value. The multi-objective optimization problem [6, 9] can be conventionally expressed as:

$$\min_{x\in\mathbb{R}^{d}}F(x)=\min_{x\in\mathbb{R}^{d}}\left[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)\right]. \tag{1}$$

In this context, for $x_{1},x_{2}\in\mathbb{R}^{d}$, $x_{1}$ is said to dominate $x_{2}$, i.e., $x_{1}\prec x_{2}$, iff $f_{i}(x_{1})\leq f_{i}(x_{2}),\ \forall i=1,2,\ldots,m$, and $F(x_{1})\neq F(x_{2})$.

Definition 1 (Pareto optimality).

A point $x^{\ast}\in\mathbb{R}^{d}$ is called Pareto optimal iff there exists no other $x^{\prime}\in\mathbb{R}^{d}$ such that $x^{\prime}\prec x^{\ast}$. The collection of Pareto optimal points is called the Pareto set, denoted as $\mathcal{P}^{\ast}$. The collection of function values $F(x^{\ast})$ of the Pareto set is called the Pareto front [52, 4].
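The dominance relation and Definition 1 translate directly into a few lines of code when the search is restricted to a finite candidate set. The following is a minimal sketch under that restriction (the definition itself ranges over all of the decision space); the function names and the example objective values are hypothetical.

```python
from typing import List, Sequence


def dominates(f_a: Sequence[float], f_b: Sequence[float]) -> bool:
    """True if objective vector f_a dominates f_b (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(f_a, f_b))
    differs = any(a != b for a, b in zip(f_a, f_b))
    return no_worse and differs


def pareto_indices(values: List[Sequence[float]]) -> List[int]:
    """Indices of candidates that no other candidate in the list dominates."""
    return [
        i
        for i, f_i in enumerate(values)
        if not any(dominates(f_j, f_i) for j, f_j in enumerate(values) if j != i)
    ]


# Hypothetical objective values F(x) for four candidates, two objectives each.
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_indices(candidates)
print(front)  # (3.0, 3.0) is dominated by (2.0, 2.0) and is filtered out
```

The pairwise check costs a number of comparisons quadratic in the number of candidates, which is acceptable for a sketch but not for large candidate sets.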

Definition 2 (Pareto stationarity).

Pareto stationarity is a necessary condition for Pareto optimality. A point $x$ is called Pareto stationary if there exists a set of scalars $\omega_{i},\ i=1,2,\ldots,m$, such that $\sum_{i=1}^{m}\omega_{i}\nabla f_{i}(x)=\mathbf{0}$, $\sum_{i=1}^{m}\omega_{i}=1$, and $\omega_{i}>0,\ \forall i=1,2,\ldots,m$.

Désidéri [12] proposed Multiple Gradient Descent (MGD) to find the Pareto optimal solutions of Eq.(1). To be specific, given any initial point $x\in\mathbb{R}^{d}$, we can iteratively update $x$ according to:

$$x_{t+1}=x_{t}-\eta v_{t}, \tag{2}$$

where $t$ is the iteration step and $\eta$ is the step size. The update direction $v_{t}$ is expected to be as close as possible to each gradient $\nabla f_{i}(x)$, $\forall i=1,2,\ldots,m$, which is therefore formulated as the following problem:

$$\max_{v\in\mathbb{R}^{d}}\left\{\min_{i}\nabla f_{i}(x)^{\top}v-\frac{1}{2}\|v\|^{2}\right\}. \tag{3}$$

Through Lagrange strong duality, the solution to Eq.(3) can be framed into

$$v(x)=\nabla F(x)=\sum_{i=1}^{m}\omega_{i}^{\ast}\nabla f_{i}(x), \tag{4}$$

where $\{\omega_{i}^{\ast}\}_{i=1}^{m}=\arg\min_{\{\omega_{i}\}_{i=1}^{m}}\left\|\sum_{i=1}^{m}\omega_{i}\nabla f_{i}(x)\right\|^{2}$ under the constraint that $\sum_{i=1}^{m}\omega_{i}=1$ and $\omega_{i}>0,\ \forall i=1,2,\ldots,m$.
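For two objectives, the min-norm problem behind Eq.(4) reduces to a one-dimensional quadratic with a closed-form solution. The sketch below illustrates that special case only. It assumes the weights are clipped to the closed interval $[0,1]$ (the closure of the constraint set stated above), and the names `min_norm_two` and `mgd_direction` are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def min_norm_two(g1: Sequence[float], g2: Sequence[float]) -> Tuple[float, float]:
    """Weights (w1, w2), w1 + w2 = 1, minimizing ||w1 * g1 + w2 * g2||^2.

    Assumption: weights are clipped to [0, 1], so a weight may reach 0.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:  # identical gradients: any convex combination works
        return 0.5, 0.5
    # Unconstrained minimizer of ||g2 + w1 * (g1 - g2)||^2, then clipped.
    w1 = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    w1 = min(1.0, max(0.0, w1))
    return w1, 1.0 - w1


def mgd_direction(g1: Sequence[float], g2: Sequence[float]) -> List[float]:
    """Common direction v = w1 * g1 + w2 * g2 in the form of Eq.(4)."""
    w1, w2 = min_norm_two(g1, g2)
    return [w1 * a + w2 * b for a, b in zip(g1, g2)]


v_orthogonal = mgd_direction([1.0, 0.0], [0.0, 1.0])   # balanced combination
v_opposed = mgd_direction([1.0, 0.0], [-1.0, 0.0])     # zero vector
v_aligned = mgd_direction([1.0, 0.0], [2.0, 0.0])      # shorter gradient wins
```

When the two gradients point in exactly opposite directions with equal norm, the combined direction vanishes, which matches the Pareto stationarity condition in Definition 2. For more than two objectives an iterative solver would be needed, which this sketch does not cover.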

3.2 Diffusion Models

The idea of diffusion models is to progressively diffuse data to noise, and then learn to reverse this process for sample generation. Considering a sequence of prescribed noise scales $0<\beta_{1}<\beta_{2}<\ldots<\beta_{T}<1$, the Denoising Diffusion Probabilistic Model (DDPM) [22] diffuses data $x_{0}\sim q_{\text{data}}(x)$ to noise by constructing a discrete Markov chain $\{x_{0},x_{1},\ldots,x_{T}\}$, where $q(x_{t}|x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}\mathbf{I})$ and $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. This process is called the forward process or diffusion process. In particular, $q(x_{t}|x_{0})=\mathcal{N}(x_{t};\sqrt{\alpha_{t}}x_{0},(1-\alpha_{t})\mathbf{I})$, where $\alpha_{t}=\prod_{i=1}^{t}\left(1-\beta_{i}\right)$.
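The closed-form marginal $q(x_{t}|x_{0})$ means a noisy sample at any step can be drawn directly, without simulating the chain. The sketch below illustrates this with plain Python lists; the linear schedule from $10^{-4}$ to $0.02$ over $T=1000$ steps is an assumed, commonly used choice rather than something specified here, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def cumulative_alphas(betas: Sequence[float]) -> List[float]:
    """alpha_t = prod_{i<=t} (1 - beta_i); entry t-1 of the result holds alpha_t."""
    alphas, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas


def q_sample(
    x0: Sequence[float], alpha_t: float, rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Draw x_t ~ N(sqrt(alpha_t) * x0, (1 - alpha_t) * I); also return the noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(alpha_t) * x + math.sqrt(1.0 - alpha_t) * e
        for x, e in zip(x0, eps)
    ]
    return x_t, eps


T = 1000
# Assumed linear noise schedule (illustrative values).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [1.0, -1.0]
x_T, eps = q_sample(x0, alphas[-1], rng)  # nearly pure noise at t = T
```

With this schedule, $\alpha_{T}$ is close to zero, so $x_{T}$ retains almost no information about $x_{0}$.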

The key to diffusion-based generative models is to train a reverse Markov chain so that we can generate data starting from Gaussian noise $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. The reverse diffusion process, a.k.a. the generative process, is trained by minimizing a simplified variational bound on the negative log-likelihood, namely,

$$\mathbb{E}_{x_{0}\sim q_{\text{data}}(x),\,\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})}\left[\left\|\epsilon-\epsilon_{\theta}\left(\sqrt{\alpha_{t}}x_{0}+\sqrt{1-\alpha_{t}}\epsilon,t\right)\right\|^{2}\right], \tag{5}$$

where $\epsilon_{\theta}(x_{t},t)$ is a neural network-based approximator to predict the noise $\epsilon$ from $x_{t}=\sqrt{\alpha_{t}}x_{0}+\sqrt{1-\alpha_{t}}\epsilon$.
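A Monte Carlo estimate of the objective in Eq.(5) can be sketched as follows. Several things here are assumptions for illustration: the step $t$ is drawn uniformly from $1,\ldots,T$, the noise schedule is a short toy one, and the two "models" are hand-written stand-ins rather than neural networks. The toy dataset places all mass at the origin, for which the exact noise is recoverable as $x_{t}/\sqrt{1-\alpha_{t}}$.

```python
import math
import random
from typing import Callable, List, Sequence

EpsModel = Callable[[Sequence[float], int], List[float]]


def cumulative_alphas(betas: Sequence[float]) -> List[float]:
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def ddpm_loss(
    x0_batch: Sequence[Sequence[float]],
    alphas: Sequence[float],
    eps_model: EpsModel,
    rng: random.Random,
) -> float:
    """Batch average of ||eps - eps_model(x_t, t)||^2, with t sampled uniformly."""
    total = 0.0
    for x0 in x0_batch:
        t = rng.randint(1, len(alphas))            # 1-based step index (assumed uniform)
        a = alphas[t - 1]
        eps = [rng.gauss(0.0, 1.0) for _ in x0]    # regression target
        x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
        eps_hat = eps_model(x_t, t)
        total += sum((e - h) ** 2 for e, h in zip(eps, eps_hat))
    return total / len(x0_batch)


T = 10
betas = [0.01 + 0.09 * t / (T - 1) for t in range(T)]  # toy schedule
alphas = cumulative_alphas(betas)


def zero_model(x_t: Sequence[float], t: int) -> List[float]:
    """Stand-in predictor that always outputs zero noise."""
    return [0.0 for _ in x_t]


def oracle_model(x_t: Sequence[float], t: int) -> List[float]:
    """Exact noise for data concentrated at the origin (x0 = 0)."""
    return [x / math.sqrt(1.0 - alphas[t - 1]) for x in x_t]


batch = [[0.0, 0.0] for _ in range(64)]
loss_zero = ddpm_loss(batch, alphas, zero_model, random.Random(0))
loss_oracle = ddpm_loss(batch, alphas, oracle_model, random.Random(0))
```

The oracle drives the loss to numerical zero, whereas the zero predictor leaves a loss on the order of the noise dimension; a trained network would sit between these two extremes.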

After training the neural network parameterized by $\theta$ to obtain the optimal $\epsilon_{\theta}^{\ast}(x_{t},t)$, samples can be generated by starting from $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and reversing the Markov chain:

$$x_{t-1}=\frac{1}{\sqrt{1-\beta_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\alpha_{t}}}\epsilon_{\theta}^{\ast}\left(x_{t},t\right)\right)+\sqrt{\beta_{t}}z_{t}, \tag{6}$$

where $z_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and $t=T,T-1,\ldots,1$. More variants of diffusion models can be found in Yang et al [58].
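The reverse update in Eq.(6) can be sketched as a single function applied for $t=T,\ldots,1$. In the sketch below the trained network is replaced by a hand-written oracle that is exact for a toy dataset concentrated at the origin; the schedule, the oracle, and all function names are assumptions for illustration. Noise is added at every step, following Eq.(6) as written, although many implementations omit it at $t=1$.

```python
import math
import random
from typing import Callable, List, Sequence

EpsModel = Callable[[Sequence[float], int], List[float]]


def ddpm_reverse_step(
    x_t: Sequence[float],
    t: int,
    betas: Sequence[float],
    alphas: Sequence[float],
    eps_model: EpsModel,
    rng: random.Random,
) -> List[float]:
    """One denoising step x_t -> x_{t-1} in the form of Eq.(6); t is 1-based."""
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    eps_hat = eps_model(x_t, t)
    coef = beta_t / math.sqrt(1.0 - alpha_t)
    mean = [(x - coef * e) / math.sqrt(1.0 - beta_t) for x, e in zip(x_t, eps_hat)]
    # Fresh Gaussian noise z_t scaled by sqrt(beta_t).
    return [m + math.sqrt(beta_t) * rng.gauss(0.0, 1.0) for m in mean]


T = 50
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alphas, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas.append(running)


def oracle_eps(x_t: Sequence[float], t: int) -> List[float]:
    """Stand-in for the trained predictor: exact when all data equals 0."""
    return [x / math.sqrt(1.0 - alphas[t - 1]) for x in x_t]


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(3)]  # x_T
for t in range(T, 0, -1):
    sample = ddpm_reverse_step(sample, t, betas, alphas, oracle_eps, rng)
```

Under these assumptions the final sample lands close to the origin, the only point in the toy dataset, up to the residual noise of scale $\sqrt{\beta_{1}}$ injected at the last step.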

Existing attempts to incorporate multiple desired properties into the diffusion model [19] can be summarized as straightforwardly adding the MGD direction $\nabla F(x)$ derived in Eq.(4) to the noise predictor $\epsilon_{\theta}^{\ast}(x_{t},t)$ at each denoising step, namely,

$$x_{t-1}=\frac{1}{\sqrt{1-\beta_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\alpha_{t}}}\Big(\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\lambda\nabla F(x)\Big)\right)+\sqrt{\beta_{t}}z_{t}, \tag{7}$$

where $t=T,T-1,\ldots,1$, and $\lambda$ is a trade-off hyper-parameter that balances the generation quality (i.e., the noise predictor $\epsilon_{\theta}^{\ast}(x_{t},t)$) and the multiple objectives (i.e., the MGD direction $\nabla F(x)$). Note that an inappropriate $\lambda$ may lead to unsatisfactory samples that either suffer from low quality or fail to possess the required properties (refer to the experimental observations in Section 5).
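The role of the fixed coefficient $\lambda$ in Eq.(7) can be made concrete with a single guided step. The sketch below takes the noise prediction and the MGD direction as precomputed inputs (hypothetical numbers, not outputs of any real model) and assumes $\nabla F$ is evaluated at the current sample; the function name is likewise an assumption.

```python
import math
from typing import List, Sequence


def guided_reverse_step(
    x_t: Sequence[float],
    beta_t: float,
    alpha_t: float,
    eps_hat: Sequence[float],
    grad_F: Sequence[float],
    lam: float,
    z: Sequence[float],
) -> List[float]:
    """One step in the form of Eq.(7): the noise prediction is shifted by lam * grad_F."""
    coef = beta_t / math.sqrt(1.0 - alpha_t)
    scale = 1.0 / math.sqrt(1.0 - beta_t)
    return [
        scale * (x - coef * (e + lam * g)) + math.sqrt(beta_t) * n
        for x, e, g, n in zip(x_t, eps_hat, grad_F, z)
    ]


# Hypothetical inputs for one denoising step.
x_t = [1.0, -1.0]
beta_t, alpha_t = 0.02, 0.5
eps_hat = [0.1, 0.2]   # stand-in for the noise predictor output
grad_F = [1.0, 0.0]    # stand-in for the MGD direction of Eq.(4)
z = [0.0, 0.0]         # noise fixed to zero to isolate the effect of lam

plain = guided_reverse_step(x_t, beta_t, alpha_t, eps_hat, grad_F, 0.0, z)
guided = guided_reverse_step(x_t, beta_t, alpha_t, eps_hat, grad_F, 2.0, z)
shift = [p - g for p, g in zip(plain, guided)]
```

Setting `lam` to zero recovers the unguided update of Eq.(6). For any other value the sample is displaced along the negative MGD direction by an amount proportional to $\lambda$, which is the same at every step regardless of how far the sample is from the Pareto set or from the data manifold.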

4 Multi-Objective Generation

As discussed above, optimizing generative models in terms of $m$ objectives aims to produce samples that cannot be simultaneously improved for all objectives, namely, Pareto optimality (see Definition 1). Meanwhile, the generated samples are required to be as realistic as the training samples, which is usually achieved by enforcing distribution alignment between the generated samples and the training samples.

MOG compared with MOO

As shown in Table 1, MOO and MOG share the same objectives $F(x)$ but differ in the space in which $x$ resides, which is termed the “decision space” or “solution space” in the MOO problem [6] and the “data space” in the MOG problem [19, 54]. Specifically, the decision space of the MOO problem is the whole space $\mathbb{R}^{d}$ [5], whereas the data space of the MOG problem is a low-dimensional manifold $\mathcal{X}$ embedded in $\mathbb{R}^{d}$ (a.k.a. the ambient space) [15, 38, 35]. This difference means that the objectives optimized in MOG are only meaningful within the data manifold. When MOO algorithms are simply applied to search for solutions in the high-dimensional sample space, the obtained solutions are not guaranteed to reside within the data manifold, which results in very low data quality (i.e., invalid samples in Fig. 1(a)) and a loss of practicability [40].

To sum up, the necessity to concurrently consider generation quality distinguishes the MOG problem from the MOO problem. Specifically, a dataset with real samples is required to define the data manifold on which the generated samples are expected to reside (Eq.(8)).

Table 1: The MOO problem vs. the MOG problem. The generation quality in MOG is usually modeled based on the given dataset $X\subset\mathcal{X}$, where $\mathcal{X}$ denotes a low-dimensional manifold embedded in the high-dimensional space $\mathbb{R}^{d}$.

| problem | objectives | decision/data space | generation quality |
|---|---|---|---|
| MOO | $F(x)=[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)]$ | $x\in\mathbb{R}^{d}$ | not required |
| MOG | same $F(x)$ as MOO | $x\in\mathcal{X},\ \mathcal{X}\subset\mathbb{R}^{d}$ | required |

4.1 Constrained Optimization for MOG

A straightforward solution to MOG is to treat generation quality as an additional objective and formulate an $(m+1)$-objective problem. However, the heterogeneity between the multiple property objectives (usually defined w.r.t. a single sample) and the distribution alignment (defined w.r.t. a dataset) makes the resulting MOO problem difficult to optimize. Although it is feasible in some deep generative models to simplify the distribution divergence w.r.t. a dataset into quality scores for individual samples [3], it remains challenging to obtain solutions that achieve Pareto optimality on the $m$ objectives by optimizing $m+1$ objectives, which explores a much larger space, as empirically verified in the experiments. In addition, the complexity of multi-objective optimization increases significantly with the number of objectives [23].

Instead of formulating a complex and ineffective $(m+1)$-objective problem, we implement multi-objective generation through a tailor-designed constrained optimization problem over the $m$ property objectives. Such a formulation also allows us to stress the respective significance of data generation and $m$-objective optimization, instead of treating them as equally important. Specifically, let $p_{\theta}(x)$ denote the target data distribution parameterized by $\theta$, and let $p_{0}$ denote the distribution of the solution samples on the Pareto front. Our constrained optimization problem is then formulated as follows:

$$\min_{\theta}D\left[q_{\text{data}}(x)\,\|\,p_{\theta}(x)\right]\quad\text{s.t.}\quad D\left[p_{0}(x)\,\|\,p_{\theta}(x)\right]\leq\varepsilon, \tag{8}$$

where $D[\cdot\,\|\,\cdot]$ denotes the distribution divergence and $\varepsilon$ is a small positive value.

The loss in Eq.(8) controls the generation quality, encouraging the generated data to be as realistic as possible. The constraint in Eq.(8) ensures that the generated data $x\sim p_{\theta}(x)$ are Pareto optimal (up to a small tolerable error). Overall, Eq.(8) provides a degree of quality assurance while obtaining samples that approach Pareto optimality on the multiple property objectives.

4.2 Langevin Dynamics for Data Distribution Approximation

It is difficult to solve Eq.(8) directly because both $q_{\text{data}}(x)$ and $p_{0}(x)$ are unknown. Motivated by the widely developed sampling algorithms for approximating data distributions [2, 44, 33], we develop Langevin dynamics-based sampling techniques to solve Eq.(8). Specifically, Langevin dynamics can generate samples from a given probability distribution $q(x)$ using only its score function $\nabla\log q(x)$. Given an initial value $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, the Langevin method recursively computes the following:

$$x_{t-1}=x_{t}-\kappa g(x_{t})+\sqrt{2\kappa}z,\quad t=T,T-1,\ldots,0, \tag{9}$$

where $\kappa$ is the step size, which can be fixed or dynamic, $z$ is sampled from the standard normal distribution $\mathcal{N}(\mathbf{0},\mathbf{I})$, and $g(x_{t})$ is the update direction for $x_{t}$, equal to $-\nabla\log q(x_{t})$ under the sign convention of Eq.(9). The distribution of $x_{0}$ will be close to the given data distribution $q(x)$ when $\kappa\rightarrow 0$ and $T\rightarrow\infty$ under some regularity conditions [57].
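To make the recursion in Eq.(9) concrete, here is a minimal NumPy sketch that samples from a one-dimensional Gaussian target. The helper `langevin_sample` and the toy target $\mathcal{N}(\mu,\sigma^{2})$ are assumptions for illustration only; the update direction is taken as the negative score so that the minus sign in Eq.(9) gives the usual Langevin step.

```python
import numpy as np

def langevin_sample(g, x_init, step, n_steps, rng):
    """Iterate Eq.(9): x <- x - step * g(x) + sqrt(2 * step) * z."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x - step * g(x) + np.sqrt(2.0 * step) * z
    return x

# Toy target q = N(mu, sigma^2): log q(x) = -(x - mu)^2 / (2 sigma^2) + const,
# so the negative score is (x - mu) / sigma^2.
mu, sigma = 3.0, 0.5
neg_score = lambda x: (x - mu) / sigma**2

rng = np.random.default_rng(0)
x_T = rng.standard_normal(5000)      # 5000 independent chains started at N(0, 1)
x_0 = langevin_sample(neg_score, x_T, step=1e-3, n_steps=3000, rng=rng)
print(x_0.mean(), x_0.std())         # should land near mu and sigma
```

A small fixed step with many iterations is used here to mimic the regime $\kappa\rightarrow 0$, $T\rightarrow\infty$; a larger step would introduce a visible discretization bias in the sample variance.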

Before deriving the proper gradient $g(x_{t})$ to approximate the distribution optimized in Eq.(8) as a whole, we investigate gradient-based strategies for optimizing $D\left[q_{\text{data}}(x)\,\|\,p_{\theta}(x)\right]$ and $D\left[p_{0}(x)\,\|\,p_{\theta}(x)\right]$ via Langevin dynamics separately.

Optimization of $D\left[q_{\text{data}}(x)\,\|\,p_{\theta}(x)\right]$ in Eq.(8). Various generative models have been shown to approximately minimize the KL divergence between the data distribution $q_{\text{data}}(x)$ and the model distribution $p_{\theta}(x)$ [27, 47, 37]. Here, we choose diffusion models as the representative for optimizing $D\left[q_{\text{data}}(x)\,\|\,p_{\theta}(x)\right]$, given their equivalent form to Eq.(9) [22, 48]. In particular, the time-dependent predicted noise $\epsilon_{\theta}^{\ast}\left(x_{t},t\right)$ in Eq.(6) is the update direction $g(x_{t})$ in annealed Langevin dynamics with a dynamic step size $\eta_{t}$:

$$x_{t-1}=x_{t}-\eta_{t}\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sqrt{2\eta_{t}}z. \tag{10}$$

Consequently, the distribution $p_{\theta}(x_{0})$ will approach $q_{\text{data}}(x)$ [47].

Optimization of $D\left[p_{0}(x)\,\|\,p_{\theta}(x)\right]$ in Eq.(8). On the other hand, we can integrate MGD (Eq.(4)) into Langevin dynamics to optimize $D\left[p_{0}(x)\,\|\,p_{\theta}(x)\right]$, aiming to approximate the distribution of the Pareto set $p_{0}(x)$ upon convergence. Namely,

$$x_{t-1}=x_{t}-\eta\nabla F(x_{t})+\sqrt{2\eta}z, \tag{11}$$

where $\eta$ is a fixed step size. The distribution of $x_{0}$ will converge to $p_{0}(x)$, as demonstrated in Theorem 3.3 of Liu et al. [33].

4.3 Pareto-guided Diffusion Model

Based on the above analysis, the key to solving the constrained optimization problem (Eq.(8)) is to design a proper strategy for unifying the optimization of $D\left[q_{\text{data}}(x)\,\|\,p_{\theta}(x)\right]$ and $D\left[p_{0}(x)\,\|\,p_{\theta}(x)\right]$ within the framework of Langevin dynamics sampling. Therefore, we can indirectly solve Eq.(8) by designing the following strategies to update the gradient $g(x_{t})$ in Eq.(9):

  1) If the sample $x_{t}$ is far away from the Pareto front (constraint violation), $g(x_{t})$ is chosen to ensure Pareto improvement (i.e., decreasing all the $m$ objectives) for $x_{t}$. The amount of Pareto improvement is determined by the distance of $x_{t}$ to the Pareto front.

  2) If there are multiple directions that can yield Pareto improvement (constraint violation), the direction of Pareto improvement that decreases $D\left[q_{\text{data}}(x)\,\|\,p_{\theta}(x)\right]$ the most (reducing loss) is chosen as $g(x_{t})$.

  3) If $x_{t}$ is close to the Pareto front (constraint satisfaction), i.e., it has a small $\|\nabla F\left(x_{t}\right)\|$ according to Definition 2, $g(x_{t})$ is chosen to fully optimize $D\left[q_{\text{data}}(x)\,\|\,p_{\theta}(x)\right]$ (reducing loss).

Following Ye and Liu [60], we design a new objective based on the gradients to achieve the above conditions. Specifically, since $\epsilon_{\theta}^{\ast}\left(x_{t},t\right)$ is the gradient for optimizing $D\left[q_{\text{data}}(x)\,\|\,p_{\theta}(x)\right]$, and $\nabla F(x)$ is the gradient for optimizing $D\left[p_{0}(x)\,\|\,p_{\theta}(x)\right]$, the integrated gradient $g(x_{t})$ can be obtained by solving the following problem:

$$\begin{aligned}
g(x_{t})=\ &\arg\min_{g}\frac{1}{2}\|g-\epsilon_{\theta}^{\ast}\left(x_{t},t\right)\|^{2}\\
&\text{s.t.}\quad\nabla f_{i}(x_{t})^{T}g\geq\phi_{t},\quad\forall i=1,2,\ldots,m,\\
&\phi_{t}=\begin{cases}\alpha\|\nabla F\left(x_{t}\right)\| & \text{if }\|\nabla F\left(x_{t}\right)\|>e\\ -\infty & \text{otherwise}\end{cases},
\end{aligned} \tag{12}$$

where $\alpha$ and $e$ are positive hyper-parameters. The constraint in Eq.(8) can be approximated by a small gradient norm $\|\nabla F(x)\|$ due to Pareto stationarity (Definition 2). In particular, when $\|\nabla F\left(x_{t}\right)\|>e$, $\phi_{t}$ is set to be proportional to $\|\nabla F\left(x_{t}\right)\|$. This encourages the gradient $g(x_{t})$ to have positive inner products with all $\nabla f_{i}(x)$, approximating $\nabla F(x)$. Meanwhile, the amount of Pareto improvement is based on the distance of $x_{t}$ to the Pareto front. If $\|\nabla F\left(x_{t}\right)\|$ is very small, which means that the sample $x_{t}$ is close to the Pareto front, we have $g(x_{t})=\epsilon_{\theta}^{\ast}\left(x_{t},t\right)$ with $\phi_{t}=-\infty$. Therefore, samples will be updated with pure gradient descent on $D\left[q_{\text{data}}(x)\,\|\,p_{\theta}(x)\right]$ without taking into account the $m$ objectives $\{f_{i}(x)\}_{i=1}^{m}$, namely, $\lambda_{i,t}=0,\ \forall i\in[m]$ in Eq.(13).

When $\|\nabla F\left(x_{t}\right)\|>e$, the solution $g(x_{t})$ of Eq.(12) is expressed as:

$$g(x_{t})=\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t}), \tag{13}$$

where $\{\lambda_{i,t}\}_{i=1}^{m}$ is the solution of the following dual problem:

$$\max_{\lambda_{i,t}\in\mathbb{R}_{+}^{m}}-\frac{1}{2}\Big\|\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t})\Big\|^{2}+\sum_{i=1}^{m}\lambda_{i,t}\phi_{t}. \tag{14}$$
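One way to solve the dual in Eq.(14) numerically is projected gradient ascent over $\lambda\geq 0$, sketched below in NumPy. The function names `solve_dual` and `pareto_guided_direction`, the iteration count, and the choice of solver are assumptions for this example and not necessarily what the authors use; the MGD norm $\|\nabla F(x_t)\|$ from Eq.(4) is passed in as `mgd_norm` rather than computed here, and the constraints of Eq.(12) are assumed to be feasible.

```python
import numpy as np

def solve_dual(eps, grads, phi, n_iter=500):
    """Projected gradient ascent on the dual in Eq.(14).

    eps:   (d,) predicted noise.
    grads: (m, d) array whose rows are the gradients grad f_i(x_t).
    phi:   scalar threshold phi_t.
    Returns lam with shape (m,) and lam >= 0.
    """
    gram = grads @ grads.T                              # (m, m) Gram matrix
    lip = max(np.linalg.eigvalsh(gram)[-1], 1e-12)      # Lipschitz constant of the dual gradient
    lam = np.zeros(grads.shape[0])
    for _ in range(n_iter):
        dual_grad = -(grads @ eps + gram @ lam) + phi   # derivative of Eq.(14) w.r.t. lam
        lam = np.maximum(lam + dual_grad / lip, 0.0)    # ascent step, then projection onto lam >= 0
    return lam

def pareto_guided_direction(eps, grads, alpha, e, mgd_norm):
    """Eq.(12)-(13): returns g(x_t) and the weights lambda."""
    if mgd_norm <= e:                                   # phi_t = -inf, constraints are inactive
        return eps.copy(), np.zeros(grads.shape[0])
    lam = solve_dual(eps, grads, alpha * mgd_norm)
    return eps + grads.T @ lam, lam
```

For instance, with `grads = np.eye(2)`, `eps = np.array([-1.0, 2.0])`, and $\phi_t=0.5$, the first constraint is violated by `eps` and the second is not, so only the first weight becomes positive and the returned direction is the closest point to `eps` that satisfies both constraints.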

Substituting the derived gradient $g(x_{t})$ (Eq.(13)) into Eq.(9) and adopting a dynamic step size $\eta_{t}$, we obtain a new controllable diffusion model, named the PaRetO-gUided Diffusion model (PROUD):

$$x_{t-1}=x_{t}-\eta_{t}\left(\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t})\right)+\sqrt{2\eta_{t}}z. \tag{15}$$

PROUD does not modify the training process of diffusion models but only updates gradients during the generative process, as summarized in Algorithm 1. Therefore, our PROUD can be plugged into any pre-trained diffusion model to gain post-hoc control during the generative process.

In contrast to existing methods that crudely combine generative models with multi-objective optimization techniques using a predefined balance coefficient, our constrained optimization formulation (Eq.(8)) allows the balance coefficients to be inferred dynamically (Eq.(14)), prioritizing the guarantee of Pareto optimality.

Algorithm 1 Pareto-guided Reverse Diffusion Process for a Single Sample
Input: a pre-trained unconditional diffusion model $\epsilon_{\theta}^{\ast}$, the dynamic step sizes $\{\eta_{t}\}_{t=1}^{T}$, multiple property objectives $\{f_{i}\}_{i=1}^{m}$.
Hyper-parameters: $\alpha$ and $e$ in Eq.(12).
Initialize: $x_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$.
for $t=T,T-1,\ldots,0$ do
    calculate the multiple gradient descent direction $\nabla F(x_{t})$ based on Eq.(4);
    if $\|\nabla F(x_{t})\|>e$ then    # calculate the weight coefficients
        $\{\lambda_{i,t}\}_{i=1}^{m}$ takes the solution of Eq.(14) with $\phi_{t}=\alpha\|\nabla F(x_{t})\|$;
    else
        $\lambda_{i,t}=0,\ \forall i\in[m]$;
    end if
    calculate the denoising gradient $g(x_{t})=\epsilon_{\theta}^{\ast}\left(x_{t},t\right)+\sum_{i=1}^{m}\lambda_{i,t}\nabla f_{i}(x_{t})$ as in Eq.(13);
    sample $z\sim\mathcal{N}(\mathbf{0},\mathbf{I})$;
    denoise the sample: $x_{t-1}=x_{t}-\eta_{t}g(x_{t})+\sqrt{2\eta_{t}}z$;
end for
Output: the sample $x_{0}$, which meets Pareto optimality of the $m$ objectives.
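The loop in Algorithm 1 can be sketched on a two-dimensional toy problem as follows. Everything specific here is an assumption for illustration: the stand-in noise predictor `eps_model(x, t) = x` (the negative score of $\mathcal{N}(\mathbf{0},\mathbf{I})$), the two quadratic objectives, the constant step sizes, the closed-form two-gradient min-norm combination used in place of Eq.(4), and the active-set enumeration used to solve Eq.(12) exactly for small $m$.

```python
import numpy as np
from itertools import combinations

def mgd_two(g1, g2):
    """Min-norm convex combination of two gradients (assumed form of Eq.(4))."""
    diff = g2 - g1
    denom = diff @ diff
    gamma = 0.5 if denom == 0 else np.clip((g2 @ diff) / denom, 0.0, 1.0)
    return gamma * g1 + (1.0 - gamma) * g2

def solve_lambda(eps, grads, phi):
    """Solve the small QP in Eq.(12) by enumerating active constraint sets."""
    m = grads.shape[0]
    for k in range(m + 1):
        for active in combinations(range(m), k):
            idx = list(active)
            lam = np.zeros(m)
            if idx:
                A = grads[idx]
                lam[idx] = np.linalg.lstsq(A @ A.T, phi - A @ eps, rcond=None)[0]
            g = eps + grads.T @ lam
            if np.all(lam >= -1e-9) and np.all(grads @ g >= phi - 1e-9):
                return lam                     # KKT conditions hold
    return np.zeros(m)                         # infeasible case: fall back to eps

def proud_sample(eps_model, grad_fs, etas, alpha, e, dim, rng):
    """Algorithm 1 for one sample; etas[t] plays the role of eta_t."""
    x = rng.standard_normal(dim)
    for t in range(len(etas) - 1, -1, -1):
        eps = eps_model(x, t)
        grads = np.stack([gf(x) for gf in grad_fs])
        norm = np.linalg.norm(mgd_two(grads[0], grads[1]))
        lam = solve_lambda(eps, grads, alpha * norm) if norm > e else np.zeros(len(grad_fs))
        g = eps + grads.T @ lam                # Eq.(13)
        x = x - etas[t] * g + np.sqrt(2.0 * etas[t]) * rng.standard_normal(dim)
    return x

# Toy setup: f1 = ||x - a||^2 and f2 = ||x - b||^2, whose Pareto set is the segment from b to a.
a, b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
grad_fs = [lambda x: 2 * (x - a), lambda x: 2 * (x - b)]
eps_model = lambda x, t: x                     # hypothetical stand-in for a trained noise predictor
etas = np.full(200, 0.01)
rng = np.random.default_rng(0)
guided = np.array([proud_sample(eps_model, grad_fs, etas, 5.0, 0.05, 2, rng) for _ in range(50)])
plain = np.array([proud_sample(eps_model, grad_fs, etas, 5.0, np.inf, 2, rng) for _ in range(50)])
print(np.abs(guided[:, 1]).mean(), np.abs(plain[:, 1]).mean())
```

Setting `e = np.inf` disables the constraints and recovers plain Langevin sampling, so comparing the second coordinate of the two sample sets gives a rough view of how much closer the guided samples stay to the Pareto segment in this toy setting.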

4.4 Diversity Regularization for Diversified Pareto Solutions

In practice, MGD integrated with Langevin dynamics fails to obtain diversified Pareto solutions, although it is guaranteed to obtain solutions on the Pareto front [33]. To distribute the solutions evenly on the Pareto front, we add a diversity regularization, which can be enforced either in the sample space or in the functionality space. Because we are interested in high-dimensional data generation, imposing larger distances between samples can be challenging. Furthermore, a significant separation between samples does not necessarily ensure a substantial distinction between their respective functionalities. Therefore, we define the diversity regularization based on the objective values.

Suppose there are $N$ particles $\{x^{1},x^{2},\ldots,x^{N}\}$ in each step of our PROUD. We omit the time-step subscript $t$ for simplicity. The diversity loss is defined to encourage dissimilarity among the objective values:

$$l(x^{1},x^{2},\ldots,x^{N})=\sum_{i\neq j}\frac{1}{\|F(x^{i})-F(x^{j})\|^{2}}. \tag{16}$$

The diversity loss in Eq.(16) is added to the main objective in Eq.(8) with a weight coefficient $\gamma$.
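The diversity loss in Eq.(16) can be evaluated directly from the matrix of objective values, as in the NumPy sketch below. The function name `diversity_loss` and the small constant `tiny` (a guard against division by zero when two particles share identical objective values) are assumptions for this example and are not part of Eq.(16).

```python
import numpy as np

def diversity_loss(F_values, tiny=1e-12):
    """Eq.(16): sum over ordered pairs i != j of 1 / ||F(x^i) - F(x^j)||^2.

    F_values: (N, m) array, row i holds the m objective values of particle x^i.
    """
    diff = F_values[:, None, :] - F_values[None, :, :]   # (N, N, m) pairwise differences
    sq_dist = np.sum(diff**2, axis=-1)                   # (N, N) squared distances
    off_diag = ~np.eye(len(F_values), dtype=bool)        # exclude the i == j terms
    return np.sum(1.0 / (sq_dist[off_diag] + tiny))
```

Because each unordered pair appears twice in the sum over $i\neq j$, two particles at unit distance in objective space contribute a loss of 2, and the loss shrinks as the particles spread apart.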

5 Experiments

In this section, we evaluate the effectiveness of our PROUD in optimizing image generation and protein generation with multiple conflicting objectives. We study white-box multi-objectives in this work and particularly focus on using MGD as the MOO technique to obtain the gradient from multi-objectives. The exploration of the black-box setting, as mentioned in [49], is discussed in the conclusion and remains for future work.

Dataset. For image generation, we use the CIFAR10 [29] dataset, which consists of 60,000 color images, each of size $3\times 32\times 32$, distributed across 10 classes. For protein generation, following Gruver et al. [19], the experiment is conducted on the paired Observed Antibody Space (pOAS) dataset [36], which comprises 90,990 antibody sequences, each processed to a fixed length of 300.

Baselines. First, we include the most closely related SOTA work in MOG, which applies an MOO technique to a deep generative model [19]. This baseline is termed “DM+$m$-MGD”, where the MGD of the $m$ objectives is used to guide the generation of diffusion models (DM). We also include a single-objective generation baseline, termed “DM+single”, which fuses the multiple objectives into a single objective and uses its gradient to guide the generation of diffusion models. Another baseline is “$m+1$-MGD”, which treats the objective of the diffusion model as an additional objective and formulates multi-objective generation as the optimization of $m+1$ objectives; MGD is then applied directly to the resulting $m+1$ objectives. To stress the necessity of quality assurance in the generation problem, which is the core difference between MOG and MOO, we also include the MGD of the $m$ objectives alone as a baseline, called “$m$-MGD”.

For all methods equipped with MGD, the diversity regularization (Eq.(16)) is included except for m+1𝑚1m+1italic_m + 1-MGD since its extra objective fm+1(x)subscript𝑓𝑚1𝑥f_{m+1}(x)italic_f start_POSTSUBSCRIPT italic_m + 1 end_POSTSUBSCRIPT ( italic_x ), i.e., data likelihood, is not accessible for the diffusion models.
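To make the MGD-based baselines more concrete, the following is a minimal sketch of the multiple-gradient-descent direction for the special case of two objectives, using the standard closed-form min-norm solution over the convex hull of the two gradients. It is an illustration under stated assumptions, not the authors' implementation: the function name `mgd_direction_two` is hypothetical, gradients are assumed to be flattened into 1-D arrays, and neither the diffusion score nor the diversity term of Eq.(16) is included.

```python
import numpy as np

def mgd_direction_two(g1, g2):
    """Min-norm convex combination of two gradients (two-objective MGD).

    Solves min_{alpha in [0, 1]} ||alpha * g1 + (1 - alpha) * g2||^2 in closed form.
    Returns the combined direction and the weight alpha placed on g1.
    Hypothetical helper for illustration only.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    diff = g2 - g1
    denom = float(diff @ diff)
    if denom == 0.0:
        # Identical gradients: any convex combination equals g1.
        return g1.copy(), 1.0
    # Unconstrained minimizer, then clipped to the feasible interval [0, 1].
    alpha = float(np.clip((g2 @ diff) / denom, 0.0, 1.0))
    return alpha * g1 + (1.0 - alpha) * g2, alpha

# Two orthogonal unit gradients are weighted equally.
direction, alpha = mgd_direction_two([1.0, 0.0], [0.0, 1.0])
```

Stepping against the returned direction does not increase either objective to first order, which is the property that MGD-based guidance relies on.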

Metrics. In terms of generation quality, the Fréchet Inception Distance (FID) [21] is adopted as the metric for image quality, while the log-likelihood assigned by ProtGPT [16] is used as the metric for the quality of protein sequences, following Gruver et al [19]. Concerning Pareto optimality, Hypervolume (HV) [62] is adopted to measure how well the methods approximate the Pareto set.
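As a rough illustration of the HV metric, the sketch below computes the hypervolume of a set of two-objective points with respect to a reference point. It assumes that both objectives are minimized and that only points strictly better than the reference point contribute; the function name `hypervolume_2d` and the toy points are hypothetical, and this is not the implementation used in [62] or in the experiments.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Keep only points that are strictly better than the reference point.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:  # ascending in f1
        if f2 < best_f2:  # the point is non-dominated among those seen so far
            # Add the new horizontal slab this point contributes.
            hv += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv

# Toy usage with hypothetical objective values and reference point.
toy_points = [(0.1, 0.2), (0.2, 0.1), (0.2, 0.2)]
print(hypervolume_2d(toy_points, ref=(0.25, 0.25)))
```

A larger HV indicates that the non-dominated samples lie closer to the Pareto front and cover it more widely, which is why the metric is reported with an upward arrow in Table 2.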

5.1 Image Generation

We follow Liu et al [34] (see footnote 3) to optimize CIFAR10 images with objectives that force the middle of an image to be a square of a specified color.

Footnote 3: As demonstrated in Section 3 and Figure 3(b) of their study, an objective that forces the center of generated images to be a black square can be used for constrained sampling on CIFAR10. Accordingly, they obtain samples that lie on the CIFAR10 data manifold and exhibit the black square in the middle, such as “black plane” and “black dog” images which contain a black square (smaller than the object) in the middle. This task can be considered as image outpainting [59], namely, extrapolating images based on specified color patches on CIFAR10.

(1) Controllable generation on CIFAR10 with two objectives (Fig. 1(b)):

  • $f_1(x)=\|x_{\Omega}-1_{\Omega}\|_2^2$, where $x$ represents the entire image, and $x_{\Omega}\subseteq x$ is the image patch in the region $\Omega$, corresponding to the square at the center of the image. Similar to the practical relevance shown in Liu et al [34], this objective restricts the center of the generated images to be a white square, i.e., it samples CIFAR10 images that exhibit white color in their middle. The patch size is set to $3\times 8\times 8$ in the experiment.

  • $f_2(x)=\|x_{\Omega}-0.5_{\Omega}\|_2^2$, with the same setting. This objective constrains the center to be a grey square.

The desired generation for these two objectives would be those CIFAR10-like images whose middle patches take normalized RGB color values (RGB values in [0, 255] divided by 255) between [0.5, 0.5, 0.5] (grey) and [1, 1, 1] (white), according to Ishibuchi et al [24], Li et al [30]. Please refer to Appendix B for more details.
(2) Controllable generation on CIFAR10 with three objectives:

  • $f_1(x)=\|x_{\Omega}-a_{\Omega}\|_2^2$, where $x$ represents the entire image, and $x_{\Omega}\subseteq x$ is the image patch in the region $\Omega$, corresponding to the square at the center of the image. This objective restricts the center of the generated images to be a black square, with $a_{\Omega}=[0,0,0]_{8\times 8}$. The patch size is set to $3\times 8\times 8$ in the experiment.

  • $f_2(x)=\|x_{\Omega}-b_{\Omega}\|_2^2$, with the same setting. This objective constrains the center to be a deep red square, with $b_{\Omega}=[0.5,0,0]_{8\times 8}$.

  • $f_3(x)=\|x_{\Omega}-c_{\Omega}\|_2^2$, with the same setting. This objective constrains the center to be a deep yellow square, with $c_{\Omega}=[0.5,0.5,0]_{8\times 8}$.

The desired generation for these three objectives would be those CIFAR10-like images with patches in normalized RGB color values belonging to the convex triangle formed by the points [0, 0, 0] (black), [0.5, 0, 0] (deep red) and [0.5, 0.5, 0] (deep yellow). Please refer to Appendix B for more details. We adopt the diffusion model used in Song and Ermon [45] as the backbone for CIFAR10 image generation.
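The patch-based objectives above can be sketched in a few lines of NumPy. The helper names `center_patch` and `patch_objective` are hypothetical, images are assumed to be arrays of shape (channels, height, width) with values normalized to [0, 1], and the squared norm is assumed to be averaged over the patch entries. The averaging is an assumption, chosen because it reproduces the compromise value (0.0625, 0.0625) quoted in Section 5.3 for a patch of value 0.75.

```python
import numpy as np

PATCH = 8  # side length of the centre patch (the text uses a 3 x 8 x 8 patch)

def center_patch(x, size=PATCH):
    """Return the central size x size patch of a (C, H, W) image."""
    _, h, w = x.shape
    top, left = (h - size) // 2, (w - size) // 2
    return x[:, top:top + size, left:left + size]

def patch_objective(x, target_rgb, size=PATCH):
    """Mean squared distance between the centre patch and a constant RGB colour.

    Assumption: the squared L2 norm is averaged over all patch entries.
    """
    target = np.asarray(target_rgb, dtype=float).reshape(-1, 1, 1)
    return float(np.mean((center_patch(x, size) - target) ** 2))

# Two-objective setting: white and grey targets.
x = np.full((3, 32, 32), 0.75)            # hypothetical image with a uniform value
f1 = patch_objective(x, [1.0, 1.0, 1.0])  # distance to the white patch
f2 = patch_objective(x, [0.5, 0.5, 0.5])  # distance to the grey patch

# Three-objective setting: black, deep red and deep yellow targets.
targets = {"black": [0.0, 0.0, 0.0], "deep red": [0.5, 0.0, 0.0], "deep yellow": [0.5, 0.5, 0.0]}
f_three = [patch_objective(x, rgb) for rgb in targets.values()]
```

Because every objective only touches the centre patch, the rest of the image is left to the diffusion backbone, which is what keeps the samples on the CIFAR10 manifold.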

Figure 2: Generated images from our PROUD and various baselines on CIFAR10 under (a) two and (b) three conflicting patch-based objectives. The scores under each image refer to its objective values $[f_1(x), f_2(x)]$ and $[f_1(x), f_2(x), f_3(x)]$, respectively; objective values that do not reside on the Pareto front are marked in red.

We sample images from our PROUD and other baselines using the same seeds for the sake of comparison. From Fig. 2, we can observe that: (1) our PROUD and two baselines, DM+$m$-MGD and $m$-MGD, can successfully generate harmonious images consistent with the patch-level constraints imposed by two conflicting objectives. Among them, the generated images of our PROUD exhibit better quality than those of DM+$m$-MGD in some instances, as the latter tends to sacrifice generation quality to excessively meet Pareto optimality of the objectives due to the lack of a mechanism to emphasize the quality of generated samples. (2) $m+1$-MGD fails to generate satisfactory images consistent with the patch-level constraints, as the new objective (i.e., generation quality) biases the optimization of the original two objectives. Although the Pareto set of the original $m$ objectives resides within that of the $m+1$ objectives [51], the proportion is negligible even when sampling a large number of images. Refer to Fig. 3(d)&(f) and Fig. 4(d)&(f) for more details. (3) $m$-MGD, which does not consider generation quality in its optimization, generates meaningless images because the optimization of multiple objectives in the data generation task is only meaningful within the data manifold, as image data usually concentrate on low-dimensional manifolds embedded in a high-dimensional space.

Figure 3: Approximation of the Pareto front by various methods on CIFAR10 optimized with two objectives. Panels: (a) PROUD; (b) DM+$m$-MGD; (c) DM+single; (d) $m+1$-MGD (cropped); (e) $m$-MGD; (f) $m+1$-MGD (full). Each point denotes a generated sample, 1,000 in total, where the coordinate corresponds to its objective values. The depth of color represents sample density: the deeper the color, the higher the density.

For the MOG setting on CIFAR10 optimized with two objectives, we randomly select 1,000 generated images for each method and calculate their objective values $[f_1(x), f_2(x)]$. Fig. 3 shows that: (1) our PROUD (Fig. 3(a)) and two baselines, DM+$m$-MGD (Fig. 3(b)) and $m$-MGD (Fig. 3(e)), successfully generate samples that cover the entire Pareto front. Among them, our PROUD and $m$-MGD spread more evenly over the Pareto front. (2) DM+single only covers part of the Pareto front, as shown in Fig. 3(c), because simply averaging multiple objectives into a single objective fails to explore the trade-off between the objectives and leads to insufficient solutions. (3) As discussed for Fig. 2, $m+1$-MGD explores a much larger solution space (Fig. 3(f)), while only a few of its samples are located at the Pareto front of the original $m$ objectives (Fig. 3(d)).

For the MOG setting on CIFAR10 optimized with three objectives, we randomly select 5,000 generated images for each method and calculate their objective values $[f_1(x), f_2(x), f_3(x)]$. Fig. 4 shows that our PROUD exhibits significant superiority in evenly covering the Pareto front under this more challenging setting. This is because our constrained optimization formulation can better coordinate the generation quality and the optimization of the multiple objectives, while ensuring sample diversity (Eq.(8), Eq.(16)). Although it is possible to force the two baselines DM+$m$-MGD and $m$-MGD to exhibit better diversity by setting a large diversity coefficient $\gamma$, this would cause the samples they generate to violate Pareto optimality, as shown in Fig. 8 and Fig. 9 in the Appendix.

Figure 4: Approximation of the Pareto front by various methods on CIFAR10 optimized with three objectives. Panels, with the earth mover's distance between the generated samples and the ground-truth Pareto solutions in brackets: (a) PROUD ($9.83\times 10^{-5}$); (b) DM+$m$-MGD ($3.10\times 10^{-4}$); (c) DM+single ($6.89\times 10^{-4}$); (d) $m+1$-MGD (cropped, $2.28\times 10^{-3}$); (e) $m$-MGD ($3.27\times 10^{-4}$); (f) $m+1$-MGD (full, $2.50\times 10^{-3}$). Each point denotes a generated sample, 5,000 in total, where the coordinate corresponds to its objective values. The depth of color represents sample density: the deeper the color, the higher the density. We add the earth mover's distance to indicate that our generated samples are indeed close to the Pareto front given the 3D visualization.
Table 2: Quantitative evaluation for Pareto approximation and generation quality. Bolded values and underlined values indicate the best results and the second best results, respectively. The Friedman & Nemenyi test in Appendix B demonstrates that our PROUD is significantly better than other baselines. “-” denotes that the value is not available as no valid data are generated.

| Method | CIFAR10 (2-obj): HV↑ ($10^{-2}$) | CIFAR10 (2-obj): FID↓ | CIFAR10 (3-obj): HV↑ ($10^{-3}$) | CIFAR10 (3-obj): FID↓ | pOAS: HV↑ | pOAS: ProtGPT↑ |
| --- | --- | --- | --- | --- | --- | --- |
| PROUD (ours) | 5.21±0.00 | 31.39±0.05 | 3.26±0.00 | 44.22±0.13 | 2472.55±60.15 | -645.93±0.99 |
| DM+$m$-MGD | 5.20±0.01 | 38.72±0.36 | 3.26±0.01 | 49.90±0.14 | 2289.61±65.12 | -692.80±0.34 |
| DM+single | 4.77±0.01 | 36.35±0.47 | 2.21±0.00 | 57.77±0.05 | 2302.21±58.25 | -682.26±0.49 |
| $m+1$-MGD | 5.17±0.00 | 11.21±0.10 | 2.87±0.03 | 11.80±0.05 | 838.74±14.08 | -662.86±0.76 |
| $m$-MGD | 5.21±0.00 | - | 3.26±0.01 | - | - | - |

To further demonstrate the superiority of our PROUD on multi-objective generation, we collect the quantitative evaluation for Pareto approximation and image quality in the left part of Table 2 by sampling 50,000 images. It shows that our PROUD achieves the best or the second best values in both metrics, i.e., HV for Pareto approximation and FID for image quality. This supports our claim that our PROUD can provide a certain quality assurance for generated samples approaching the Pareto set of multiple properties. In contrast, both the single-objective and the multi-objective generation baselines, i.e., DM+single and DM+$m$-MGD, sacrifice generation quality to excessively optimize the objectives.

5.2 Protein Sequence Generation

To further verify our model in more challenging applications, we design a multi-objective generation task on the pOAS dataset, which aims to optimize two conflicting objectives for antibody sequences:

  • $f_1(x)$, the solvent accessible surface area (SASA) of the protein’s predicted structure. Please refer to Ruffolo et al [39] for detailed procedures of calculating the SASA value using the protein sequences.

  • $f_2(x)$, the percentage of beta sheets (%Sheets), which is measured on protein sequences directly [7].

The ground-truth Pareto front is not available due to the complexity of the property objectives. Since the evaluation functions for SASA and %Sheets are not differentiable, we adopt network predictors as differentiable surrogate functions for all methods. We apply the ground-truth evaluation functions for calculating the HV values on the generated samples. We adopt the discrete diffusion model in Gruver et al [19] as the backbone for protein sequence generation.

Figure 5: The approximation of the Pareto front (i.e., generated protein sequences) by various methods. We cannot visualize the results of $m$-MGD because all its generated protein sequences are invalid, resulting in nonexistent SASA evaluations ($f_1$).

To demonstrate the superiority of our PROUD in multi-objective protein generation, we initially sample 5,000 protein sequences for each method and collect the non-dominated samples based on their two target properties, as depicted in Fig. 5. The observations are as follows: (1) DM+single exhibits a wide coverage of the objective values. This could be attributed to the fact that the noise in discrete diffusion models can bring out large diversity [19]. By incorporating MGD into diffusion models, PROUD and DM+$m$-MGD achieve larger coverage of the objective values. This verifies the superiority of MOG over SOG. Our PROUD and DM+$m$-MGD emphasize respective Pareto improvement of the objectives. Nevertheless, Table 2 shows that our PROUD achieves a better HV. (2) Similar to the image generation task, $m+1$-MGD demonstrates a much poorer approximation of the Pareto front for the original $m$ objectives. Meanwhile, $m$-MGD even fails to generate any valid protein sequences, as the SASA evaluation ($f_1(x)$) for all its generated samples is nonexistent. This further highlights the difference between MOG and MOO.
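The step of collecting non-dominated samples can be sketched as below. This is a generic Pareto filter, not the authors' code: the function name `non_dominated_mask` is hypothetical, and all objectives are assumed to be minimized, so any property that should be maximized would need to be negated before filtering.

```python
import numpy as np

def non_dominated_mask(objective_values):
    """Boolean mask of the non-dominated rows of an (n, m) array (minimization)."""
    F = np.asarray(objective_values, dtype=float)
    mask = np.ones(F.shape[0], dtype=bool)
    for i in range(F.shape[0]):
        # Row j dominates row i if it is no worse in every objective
        # and strictly better in at least one.
        dominates_i = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        mask[i] = not dominates_i.any()
    return mask

# Hypothetical objective values for four samples.
values = [[1.0, 2.0], [2.0, 1.0], [2.0, 2.0], [1.0, 2.0]]
front = np.asarray(values)[non_dominated_mask(values)]
```

The loop costs O(n²m) time, which is negligible for 5,000 samples with two objectives; exact duplicates are all kept because none of them strictly improves on the others.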

Furthermore, we collect the quantitative evaluation for Pareto approximation and protein quality in the right part of Table 2 by sampling 5,000 protein sequences (we only sample 5,000 protein sequences since the computation cost of SASA values is very high). Benefiting from our constrained-optimization formulation, our PROUD can avoid unnecessary loss of protein quality compared to the other MOG/SOG counterparts, DM+$m$-MGD and DM+single. This improvement increases the practicality of its generated samples.

Table 3: Sensitivity analysis on $\alpha$ and $e$ in Eq.(12). We retain more decimal places here to demonstrate the subtle differences between results. The first three result columns vary $\alpha$ with $\gamma=0.2$, $e=0.03$; the last three vary $e$ with $\gamma=0.2$, $\alpha=0.5$.

| Metric | $\alpha=0.1$ | $\alpha=0.5$ | $\alpha=1$ | $e=0.01$ | $e=0.03$ | $e=0.05$ |
| --- | --- | --- | --- | --- | --- | --- |
| FID | 31.58963073 | 31.48232218 | 31.5896311 | 31.58966697 | 31.48232218 | 31.58966696 |
| HV | 0.05211343 | 0.05211350 | 0.05211343 | 0.05211343 | 0.05211350 | 0.05211343 |

Table 4: Sensitivity analysis on $\gamma$. $\alpha$ and $e$ are set to 0.5 and 0.03, respectively. The best results are marked in bold.

| Metric | $\gamma=0$ | $\gamma=0.01$ | $\gamma=0.1$ | $\gamma=0.2$ | $\gamma=0.3$ | $\gamma=1$ |
| --- | --- | --- | --- | --- | --- | --- |
| FID | 34.80 | 30.98 | 31.80 | 31.48 | 31.63 | 33.59 |
| HV | 0.0483 | 0.0498 | 0.0521 | 0.0521 | 0.0521 | 0.0521 |

5.3 Hyper-parameter Sensitivity Study

Figure 6: Analysis of the effects of the diversity coefficient $\gamma$ in Eq.(16) on our PROUD (1st row, panels (a)-(d): $\gamma=0, 0.1, 0.2, 1$) and DM+$m$-MGD (2nd row, panels (e)-(h): $\gamma=0, 0.1, 0.2, 1$). As DM+single (3rd row, panels (i)-(l): $w=0, 0.1, 0.5, 1$) degenerates to SOG and does not have the diversity regularization, we conduct sensitivity analysis on its weight coefficient for combining two objectives, i.e., $(1-w)f_1+wf_2$. The depth of color represents sample density: the deeper the color, the higher the density.

We study PROUD with different configurations of the hyper-parameters, namely, $\alpha$ and $e$ in Eq.(12) as well as the diversity coefficient $\gamma$ in Eq.(16). The experiments are conducted on CIFAR10, with the same setting as Section 5.1.

We set $\alpha$ to 0.1, 0.5, 1 and $e$ to 0.01, 0.03, 0.05, respectively. We observe in Table 3 that PROUD is not sensitive to the choice of the hyper-parameters $\alpha$ and $e$.

We set $\gamma$ to 0, 0.1, 0.2, 1. The results are summarized in Fig. 6(a) to Fig. 6(d), showing that: (1) With an appropriate diversity coefficient, our PROUD can cover the Pareto front well. (2) Without the diversity regularization, PROUD can only obtain a small set of Pareto solutions. This demonstrates the necessity of the diversity loss, consistent with the finding in the former work [33]. (3) With too large a value of $\gamma$, the generated samples could fall outside the Pareto front. The effect of the diversity coefficient on DM+$m$-MGD (Fig. 6(e) to Fig. 6(h)) is similar.

To further investigate the effects of the diversity coefficient on the generation quality, we collect FID results in Table 4. With $\gamma=0.2$, PROUD attains the best HV (0.0521) together with the lowest FID (31.48) among the settings that reach this HV; $\gamma=0.01$ gives a lower FID (30.98) but a lower HV (0.0498). We therefore use $\gamma=0.2$ as the hyper-parameter in Section 5.1.

To demonstrate that single-objective generation fails to cover the Pareto front even with a uniform grid of weights, we vary the weight coefficient $w$ used in DM+single to combine the two objectives into a single objective “$w\times f_1(x)+(1-w)\times f_2(x)$” from 0 to 1 with a step of 0.1. We show the results for $w=0, 0.1, 0.5, 1$ in Fig. 6(i) to Fig. 6(l) and the rest in Appendix A. With $w=0,0.1,0.2,0.3,0.4$, the single objective is dominated by $f_2(x)$. Consequently, the generated samples achieve the smallest value for $f_2(x)$ but the largest one for $f_1(x)$; the converse holds when $f_1(x)$ dominates. With an equal weight, the generated samples are supposed to obtain the compromise value between the two objectives, i.e., (0.0625, 0.0625). We notice that the generated samples cover a small range around this point. This diversity could result from the diffusion noise in diffusion models.
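The compromise value (0.0625, 0.0625) can be checked with a short calculation. The sketch below rests on two simplifying assumptions that are not stated in the text: the patch takes a single constant value $\kappa$, and the squared norm is averaged over the patch entries. It also ignores the diffusion prior and considers only the minimizer of the weighted sum itself.

```latex
\begin{align*}
J_w(\kappa) &= w\,(\kappa-1)^2 + (1-w)\,(\kappa-0.5)^2, \\
\frac{\mathrm{d}J_w}{\mathrm{d}\kappa} &= 2w\,(\kappa-1) + 2(1-w)\,(\kappa-0.5) = 0
  \;\Longrightarrow\; \kappa^\star = 0.5 + 0.5\,w, \\
w = 0.5:\quad \kappa^\star &= 0.75, \qquad
  f_1 = (0.75-1)^2 = 0.0625, \qquad f_2 = (0.75-0.5)^2 = 0.0625 .
\end{align*}
```

Under these assumptions each weight $w$ selects exactly one patch value, so a single run of DM+single targets one point of the Pareto front rather than the whole front.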

6 Conclusion

This paper studies the problem of optimizing deep generative models with multiple conflicting objectives. We highlight this problem setting by treating the optimization of samples with multiple properties and the process of sample generation as a unified task. By analyzing the connections and differences from multi-objective optimization, we introduce a constrained optimization formulation to solve the multi-objective generation problem, based on which we developed PROUD. Our experiments demonstrate the efficacy of PROUD in both image and protein sequence generation. While we explored the white-box multi-objectives in this work, it would be interesting to explore our PROUD in the black-box setting in the future. The multiple gradient descent technique used can be replaced by methods such as Bayesian multi-objective optimization [49].

Declarations

Funding. This work was supported by the Agency for Science, Technology and Research (A*STAR) Centre for Frontier AI Research, the A*STAR GAP project (Grant No.I23D1AG079), and the AISG Grand Challenge in AI for Materials Discovery (Grant No. AISG2-GC-2023-010).
Competing interests. The authors have no financial or non-financial interests to disclose that are relevant to the content of this article.
Ethics approval. Not applicable.
Consent to participate. Not applicable.
Consent to publish. Not applicable.
Availability of data and materials. All datasets used in this work are available online and clearly cited.
Code availability. The code of this work will be publicly released on github.
Authors’ contributions. Idea: YY; Methodology & Experiment: YY, YP; Writing - comments/edits: all.

References

Appendix A Complete sensitivity analysis for single-objective generation

We vary the weight coefficient $w$ for combining two objectives in DM+single, “$w\times f_1(x)+(1-w)\times f_2(x)$”, from 0 to 1 with a step of 0.1. The results are shown in Fig. 7:

  • when $w<0.5$, the resultant final objective is dominated by $f_2(x)$. Consequently, the leading objective is optimized to the best, and all the generated samples have the smallest value for $f_2(x)$ but the largest one for $f_1(x)$.

  • when $w>0.5$, the resultant final objective is dominated by $f_1(x)$. Therefore, the generated samples achieve the smallest value for the first objective but the largest one for the second objective.

  • when $w=0.5=\frac{1}{m}$, the generated samples are supposed to obtain the compromise value between $f_1(x)$ and $f_2(x)$, i.e., (0.0625, 0.0625). We notice that the generated samples cover a small range around this point. This diversity could result from the diffusion noise in diffusion models.

Figure 7: Sensitivity analysis on the weight coefficient for combining two objectives, i.e., $(1-w)f_1+wf_2$, in DM+single. Panels (a)-(k) correspond to $w=0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1$. The depth of color represents sample density: the deeper the color, the higher the density.

Figure 8: Different diversity coefficients $\gamma$ for DM+$m$-MGD on CIFAR10 optimized with three objectives. Panels: (a) $\gamma=0.8$; (b) $\gamma=1$; (c) $\gamma=1.5$. 1,000 generated samples are randomly selected for visualization.

Figure 9: Different diversity coefficients $\gamma$ for $m$-MGD on CIFAR10 optimized with three objectives. Panels: (a) $\gamma=0.01$; (b) $\gamma=0.05$; (c) $\gamma=0.1$. 1,000 generated samples are randomly selected for visualization.

Figure 10: Approximation of the Pareto front by various methods on CIFAR10 optimized with three objectives. Panels: (a) PROUD; (b) DM+$m$-MGD; (c) $m$-MGD. The first row presents 50,000 generated samples, while the second row presents the non-dominated points out of the 50,000 sample points, verifying the HV results obtained in Table 2.

Appendix B More Experimental Settings and Analyses

Image Generation

According to Ishibuchi et al [24], Li et al [30] (see footnote 6), we can obtain that: (1) the Pareto solutions of the two-objective setting are the points on the line segment between $1_{\Omega}$ and $0.5_{\Omega}$. Namely, the Pareto solutions are $\{x \mid x_{\Omega}=\kappa_{\Omega},\ \kappa_{\Omega}\in[0.5_{\Omega},1_{\Omega}]\}$ (see footnote 7). When taking images from CIFAR10 based on the Pareto set (Fig. 12), we follow Liu et al [34] to sample images in a small neighborhood around $\kappa_{\Omega}$, namely, $\|x_{\Omega}-\kappa_{\Omega}\|_2^2\leq\epsilon$, where $\epsilon=8\times 10^{-4}$. (2) The Pareto solutions of the three-objective setting are the points in the convex polygon formed by the three points $a_{\Omega}, b_{\Omega}, c_{\Omega}$. For easy understanding, we assume $\Omega=3\times 1\times 1$, which amounts to constraining the middle point of CIFAR10 images to be certain colors.

Footnote 6: Our problem setting is slightly different, as we take the squared distance in order to obtain a non-linear shape of the Pareto front. We also refer the reader to Example 1 in Liu et al [33], which defines the same two-objective problem but with a 1-D decision variable for easy understanding.

Footnote 7: We use $[0.5_{\Omega},1_{\Omega}]$ to denote image patches in normalized RGB color values between [0.5, 0.5, 0.5] (grey) and [1, 1, 1] (white).

We visualize the Pareto front of these two settings in Fig. 11. Specifically, for the two-objective setting, the Pareto optimal points lie on the line between [1, 1, 1] and [0.5, 0.5, 0.5] (Fig. 11(a)), which physically denote RGB values (normalized, i.e., RGB values in [0, 255] divided by 255). Then, we calculate the objective values $[f_1(x), f_2(x)]$ for these points accordingly, shown in Fig. 11(b). Fig. 11(c) and (d) are plotted for the three-objective setting in a similar way. According to their Pareto fronts, we select [0.25, 0.25] and [0.2, 0.1, 0.2] as reference points to calculate the hypervolume (HV) for the two-objective setting and the three-objective setting in Table 2, respectively.
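For the two-objective setting, the shape of the Pareto front and the largest attainable HV for the reference point [0.25, 0.25] can be worked out by hand. The derivation below is a sketch under the assumption that the squared norm is averaged over the patch entries, so that a patch of constant value $\kappa\in[0.5,1]$ gives $f_1=(1-\kappa)^2$ and $f_2=(\kappa-0.5)^2$.

```latex
\begin{align*}
\sqrt{f_1} + \sqrt{f_2} &= (1-\kappa) + (\kappa-0.5) = 0.5
  \;\Longrightarrow\; f_2 = \bigl(0.5-\sqrt{f_1}\bigr)^2 = 0.25 - \sqrt{f_1} + f_1, \\
\mathrm{HV}_{\max} &= \int_0^{0.25} \bigl(0.25 - f_2\bigr)\,\mathrm{d}f_1
  = \int_0^{0.25} \bigl(\sqrt{f_1} - f_1\bigr)\,\mathrm{d}f_1 \\
  &= \Bigl[\tfrac{2}{3} f_1^{3/2} - \tfrac{1}{2} f_1^{2}\Bigr]_0^{0.25}
  = \tfrac{1}{12} - \tfrac{1}{32} = \tfrac{5}{96} \approx 0.0521 .
\end{align*}
```

Under this assumption the value is close to the HV of about $5.21\times 10^{-2}$ reported for PROUD and $m$-MGD in Table 2, which suggests that these methods cover nearly the whole front.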

We sample CIFAR10 images under the constraint with different patch sizes to demonstrate its effect in Fig. 13. The smaller the region $\Omega$, the more CIFAR10 images meet the constraint.
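The neighborhood constraint $\|x_{\Omega}-\kappa_{\Omega}\|_{2}^{2}\leq\epsilon$ can be sketched as a simple filter over images. The snippet below is only an illustration under several assumptions: the region $\Omega$ is taken to be a centered square patch, $\kappa_{\Omega}$ is a single RGB color repeated over the patch, pixel values are normalized to [0, 1], and the squared norm is summed without averaging. The helper names `center_patch` and `satisfies_constraint` are hypothetical.

```python
def center_patch(image, size):
    """Return the centered size x size patch of an image stored as [C][H][W] nested lists."""
    height, width = len(image[0]), len(image[0][0])
    top, left = (height - size) // 2, (width - size) // 2
    return [[row[left:left + size] for row in channel[top:top + size]]
            for channel in image]


def satisfies_constraint(image, kappa, size, eps=8e-4):
    """Check ||x_Omega - kappa_Omega||_2^2 <= eps on the centered patch.

    kappa: one target value per channel (assumed constant over the patch).
    """
    patch = center_patch(image, size)
    squared_distance = sum(
        (value - kappa[c]) ** 2
        for c, channel in enumerate(patch)
        for row in channel
        for value in row
    )
    return squared_distance <= eps


# Toy 3 x 4 x 4 "image": uniformly gray except for one corner pixel.
toy = [[[0.5] * 4 for _ in range(4)] for _ in range(3)]
toy[0][0][0] = 1.0
gray = (0.5, 0.5, 0.5)
small_region_ok = satisfies_constraint(toy, gray, size=2)  # corner lies outside the patch
large_region_ok = satisfies_constraint(toy, gray, size=4)  # corner lies inside the patch
```

In this toy case the image passes the check for the small region but not for the large one, which matches the trend that fewer images qualify as $\Omega$ grows.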

Figure 11: Pareto front of two and three objectives in data space and functionality space optimized for CIFAR10 image generation. Panels: (a) two objectives (data space); (b) two objectives (functionality space); (c) three objectives (data space); (d) three objectives (functionality space).

Protein Sequence Generation

Our experiments in Section 5.2 adopted the same dataset and objectives as those in Section 5.2 of Gruver et al. [19]. Note that we did not include their other experiments, because the experiment in their Section 5.1 is not a generation task equipped with property optimization, and the dataset for the experiments in their Sections 5.3 and 5.4 has not been released because it is private. We select $[1\times 10^{4},0]$ as the reference point to calculate the HV for this task.

Justification of Our Experiment Designs

Our experimental design supports the motivation of the MOG problem. Both the CIFAR10 and protein datasets are real-world datasets whose data lie on low-dimensional manifolds in high-dimensional space [29, 19], so they fit our MOG problem setting. Meanwhile, the objectives considered for CIFAR10 are benchmark multi-objective optimization problems with clear evaluations [24], while the objectives considered for the protein design task represent real-world scenarios [19]. Lastly, Fig. 2 and Table 2 demonstrate the necessity of considering generation quality, as the generation quality of all baseline methods suffers to some extent when optimizing multiple properties.

Significance Test

We apply the Friedman test under the null hypothesis that all methods perform similarly, followed by the Nemenyi post-hoc test for pairwise comparisons among the four methods [10]. The number of factors was set to four because $m$-MGD failed to produce qualified samples and was therefore excluded. The dataset comprised 30 instances: each of the four methods was independently evaluated five times across three datasets under two evaluation criteria ($5\times 3\times 2=30$). The Friedman test gives $\tau_{F}=18.24$, greater than the critical value $F_{3,87}=2.709$ at $\alpha=0.05$. Therefore, the null hypothesis is rejected, indicating a statistically significant difference among the four methods at the 0.05 significance level. The subsequent Nemenyi post-hoc test in Fig. 14 shows that our PROUD significantly outperforms the three baseline methods.
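For readers who wish to reproduce this kind of analysis, the sketch below follows the standard rank-based procedure: average ranks per method, the Friedman chi-square statistic, its F-distributed variant with $(k-1,(k-1)(N-1))$ degrees of freedom, and the Nemenyi critical difference. It is a minimal illustration rather than our evaluation script; the function names are hypothetical, and the constant $q_{0.05}=2.569$ for $k=4$ is the commonly tabulated Nemenyi value, supplied here as an assumption.

```python
import math


def average_ranks(scores, higher_is_better=True):
    """scores: N rows, each holding k method scores. Returns k average ranks (1 = best)."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=higher_is_better)
        i = 0
        while i < k:
            j = i
            # Extend the group while the next method ties with the current one.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1  # tied methods share the average rank
            for t in range(i, j + 1):
                totals[order[t]] += shared_rank
            i = j + 1
    return [total / len(scores) for total in totals]


def friedman_f(avg_ranks, n):
    """Return (chi-square statistic, F statistic, degrees of freedom) for n instances."""
    k = len(avg_ranks)
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    denominator = n * (k - 1) - chi2
    # The F variant diverges when every instance ranks the methods identically.
    f_stat = float("inf") if denominator <= 1e-12 else (n - 1) * chi2 / denominator
    return chi2, f_stat, (k - 1, (k - 1) * (n - 1))


def nemenyi_cd(k, n, q_alpha):
    """Critical difference: two methods differ if their average ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# With k = 4 methods and N = 30 instances, the degrees of freedom are (3, 87).
_, _, dof = friedman_f([1.5, 2.5, 2.5, 3.5], n=30)
cd = nemenyi_cd(k=4, n=30, q_alpha=2.569)  # q_alpha is an assumed tabulated constant
```

The null hypothesis is rejected when the F statistic exceeds the critical value of the F distribution with the returned degrees of freedom, which corresponds to comparing $\tau_{F}$ with $F_{3,87}$ above.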

Figure 12: Full-resolution CIFAR10 images ($3\times 32\times 32$) in Fig. 1(b) of the manuscript. The red box denotes the region $\Omega$ ($3\times 8\times 8$) in the two objectives in Section 5.1. Panel labels, in order: [0, 0.25]; [0.025, 0.1225]; [0.0625, 0.0625]; [0.140625, 0.015625]; [0.180625, 0.005625]; [0.25, 0].

Figure 13: Sampling CIFAR-10 images with regions of different patch sizes. Panels: $\Omega$ ($3\times 6\times 6$, 19 images); $\Omega$ ($3\times 8\times 8$, 7 images); $\Omega$ ($3\times 10\times 10$, 3 images).

Figure 14: Nemenyi post-hoc test over four methods.

Appendix C Discussions

The constrained MOO problem defines its decision space $S$ on a constrained space expressed using specified linear, nonlinear, or box constraints [1, 13] in $\mathbb{R}^{d}$. Consequently, it is different from our MOG problems, whose manifold is delineated by a given dataset $\mathcal{X}$. Nevertheless, MOG problems could be understood as a type of constrained MOO problem in a broader context.

Table 5: Comparison of the MOG problem with the relevant MOO problems. The generation quality in MOG is usually modeled based on a given dataset $X\subset\mathcal{X}$, where $\mathcal{X}$ denotes a low-dimensional manifold embedded in a high-dimensional space $\mathbb{R}^{d}$. $F(x)=[f_{1}(x),f_{2}(x),\ldots,f_{m}(x)]$.

| problem | objectives | decision/data space | generation quality |
| MOO | $F(x)$ | $x\in\mathbb{R}^{d}$ | |
| Constrained MOO | $F(x)$ | $x\in S$, $S\subset\mathbb{R}^{d}$ defined by (non)linear or box constraints | |
| MOG | $F(x)$ | $x\in\mathcal{X}$, $\mathcal{X}\subset\mathbb{R}^{d}$ | |