Improved Noise Schedule for Diffusion Training

Tiankai Hang
Southeast University
tkhang@seu.edu.cn
Shuyang Gu
Microsoft Research Asia
shuyanggu@microsoft.com
Long-term researcher intern at Microsoft Research Asia.
Abstract

Diffusion models have emerged as the de facto choice for generating visual signals. However, training a single model to predict noise across various levels poses significant challenges, necessitating numerous iterations and incurring significant computational costs. Various approaches, such as loss weighting strategy design and architectural refinements, have been introduced to expedite convergence. In this study, we propose a novel approach to design the noise schedule for enhancing the training of diffusion models. Our key insight is that the importance sampling of the logarithm of the Signal-to-Noise ratio ($\log\text{SNR}$), theoretically equivalent to a modified noise schedule, is particularly beneficial for training efficiency when increasing the sample frequency around $\log\text{SNR}=0$. We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule. Furthermore, we highlight the advantages of our noise schedule design on the ImageNet benchmark, showing that the designed schedule consistently benefits different prediction targets.

1 Introduction

Diffusion models have emerged as a pivotal technique for generating visual signals across diverse domains, such as image synthesis Ramesh et al. (2022); Saharia et al. (2022); Rombach et al. (2022) and video generation Brooks et al. (2024). They are particularly adept at approximating complex distributions, where Generative Adversarial Networks (GANs) may encounter difficulties. However, they require substantial computational resources and numerous training iterations to converge, so improving their training efficiency is essential for large-scale applications such as high-resolution image synthesis and video generation.

Architectural enhancements offer a promising path to improve both the training speed and performance of diffusion models. For instance, the use of Adaptive Layer Normalization Gu et al. (2022), when combined with zero initialization in the Transformer architecture as demonstrated by Peebles & Xie (2023), represents such an improvement. Similarly, the adoption of U-shaped skip connections within Transformers, as outlined in previous works Hoogeboom et al. (2023); Bao et al. (2022); Crowson et al. (2024), also boosts efficiency. In a parallel development, Karras et al. (2024) have contributed to this endeavor by reengineering the layers of ADM UNet Dhariwal & Nichol (2021) to preserve the magnitudes of activations, weights, and updates, ensuring a more efficient learning process.

Concurrently, various loss weighting designs have been implemented to accelerate the convergence of training. Previous works, such as eDiff-I Balaji et al. (2022) and Min-SNR Hang et al. (2023), found that the training of diffusion models may encounter conflicts among various noise intensities. Choi et al. (2022) prioritize specific noise levels during training to enhance learning of visual concepts. Min-SNR Hang et al. (2023) reduces the weights of noisy tasks, pursuing Pareto optimality across different denoising tasks, and validates its effectiveness on multiple datasets and architectures. A softer version of this approach, aiming to further enhance high-resolution image synthesis within hourglass diffusion models, was introduced by Crowson et al. (2024). SD3 Esser et al. (2024) empirically found that it is crucial to increase the weight of the intermediate noise intensities when training diffusion models.

In this study, we present a novel method to enhance the training of diffusion models by strategically redefining the noise schedule, which is equivalent to importance sampling of the noise across different intensities. Although this is theoretically equivalent to reweighting the loss, empirical evidence suggests that allocating more computation (FLOPs) to mid-range noise levels (around $\log\text{SNR}=0$) yields superior performance compared to increasing the loss weights at the same noise levels, particularly under constrained computational budgets. We experimentally analyze the performance of several different noise schedules, including Laplace, Cauchy, and the Cosine Shifted/Scaled, which are visualized in Figure 1. Notably, the Laplace schedule exhibits favorable performance, and we recommend it for future work.

We demonstrate the effectiveness of this approach using the ImageNet benchmark, with a consistent training budget of 500K iterations. Evaluated using the FID metric, our results reveal that noise schedules with a concentrated probability density around $\log\text{SNR}=0$ consistently surpass others, as evidenced at both $256\times 256$ and $512\times 512$ resolutions with different prediction targets. This research contributes to the advancement of efficient training techniques for diffusion models.

Figure 1: Illustration of the probability density functions of different noise schedules.

2 Method

2.1 Preliminaries

Diffusion models Ho et al. (2020); Yang et al. (2021) learn to generate data by iteratively reversing the diffusion process. We denote the distribution of data points as $\mathbf{x}\sim p_{\text{data}}(\mathbf{x})$. The diffusion process progressively adds noise to the data, which is defined as:

$$\mathbf{x}_t=\alpha_t\mathbf{x}+\sigma_t\bm{\epsilon},\quad\text{where}\quad\bm{\epsilon}\sim\mathcal{N}(0,\mathbf{I}), \tag{1}$$

where $\alpha_t$ and $\sigma_t$ are the coefficients of the adding noise process, essentially representing the noise schedule. For the commonly used prediction target velocity: $\mathbf{v}_t=\alpha_t\bm{\epsilon}-\sigma_t\mathbf{x}$ Salimans & Ho (2022), the diffusion model $\theta$ is trained through the Mean Squared Error (MSE) loss:

$$\mathcal{L}(\theta)=\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}(\mathbf{x})}\mathbb{E}_{t\sim p(t)}\left[w(t)\lVert\mathbf{v}_{\theta}(\alpha_t\mathbf{x}+\sigma_t\bm{\epsilon},t,\mathbf{c})-\mathbf{v}_t\rVert_2^2\right], \tag{2}$$

where $w(t)$ is the loss weight, $\mathbf{c}$ denotes the condition information. Common practices sample $t$ from the uniform distribution $\mathcal{U}[0,1]$. Kingma et al. (2021) introduced the Signal-to-Noise ratio as $\text{SNR}(t)=\frac{\alpha_t^2}{\sigma_t^2}$ to measure the noise level of different states. To simplify, we denote $\lambda=\log\text{SNR}$ to indicate the noise intensities. In the Variance Preserving (VP) setting, the coefficients in Equation 1 can be calculated by $\alpha_t^2=\frac{\exp(\lambda)}{\exp(\lambda)+1}$, $\sigma_t^2=\frac{1}{\exp(\lambda)+1}$.
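The relations above can be sketched in a few lines of NumPy. This is only an illustrative sketch under the VP setting: the function names `vp_coefficients` and `noisy_sample_and_velocity` are hypothetical helpers, not part of any released implementation, and the toy arrays stand in for latents.

```python
import numpy as np

def vp_coefficients(log_snr):
    """Return (alpha, sigma) for the variance-preserving setting.

    alpha^2 = exp(l) / (exp(l) + 1) and sigma^2 = 1 / (exp(l) + 1),
    written here in the equivalent sigmoid form.
    """
    log_snr = np.asarray(log_snr, dtype=np.float64)
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-log_snr)))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(log_snr)))
    return alpha, sigma

def noisy_sample_and_velocity(x, eps, log_snr):
    """Build x_t (Equation 1) and the velocity target v_t for given noise levels."""
    alpha, sigma = vp_coefficients(log_snr)
    x_t = alpha * x + sigma * eps   # Equation 1
    v_t = alpha * eps - sigma * x   # velocity prediction target
    return x_t, v_t

# Toy data (assumption): 4 samples with 8 features, one noise level per sample.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
eps = rng.standard_normal((4, 8))
log_snr = np.array([[-4.0], [-1.0], [0.0], [3.0]])

x_t, v_t = noisy_sample_and_velocity(x, eps, log_snr)
alpha, sigma = vp_coefficients(log_snr)
```

Because $\alpha_t^2+\sigma_t^2=1$ in this setting, the clean data can be recovered as $\alpha_t\mathbf{x}_t-\sigma_t\mathbf{v}_t$, which is a convenient sanity check for an implementation.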

2.2 Improved Noise Schedule Design

Given that the timestep $t$ is a random variable sampled from uniform distribution, the noise schedule implicitly defines the distribution of importance sampling on various noise levels. The sampling probability of noise intensity $\lambda$ is:

$$p(\lambda)=p(t)\left|\frac{\mathrm{d}t}{\mathrm{d}\lambda}\right|. \tag{3}$$

Considering that $t$ satisfies uniform distribution, and $\lambda$ is monotonically decreasing with $t$, we have:

$$p(\lambda)=-\frac{\mathrm{d}t}{\mathrm{d}\lambda}. \tag{4}$$

We take cosine noise schedule Nichol & Dhariwal (2021) as an example, where $\alpha_t=\cos\left(\frac{\pi t}{2}\right)$, $\sigma_t=\sin\left(\frac{\pi t}{2}\right)$. Then we can deduce that $\lambda=-2\log\tan(\pi t/2)$ and $t=\frac{2}{\pi}\arctan e^{-\lambda/2}$. Thus the distribution of $\lambda$ is: $p(\lambda)=-\mathrm{d}t/\mathrm{d}\lambda=\text{sech}(\lambda/2)/(2\pi)$. This derivation illustrates the process of obtaining $p(\lambda)$ from a noise schedule $\lambda(t)$. On the other hand, we can derive the noise schedule from the sampling probability of different noise intensities $p(\lambda)$. By integrating Equation 4, we have:

$$t=1-\int_{-\infty}^{\lambda}p(\lambda)\,\mathrm{d}\lambda=\mathcal{P}(\lambda), \tag{5}$$

$$\lambda=\mathcal{P}^{-1}(t), \tag{6}$$

where $\mathcal{P}(\lambda)$ is the complementary cumulative distribution function of $\lambda$, i.e., one minus its CDF. Thus we can obtain the noise schedule $\lambda$ by applying the inverse function $\mathcal{P}^{-1}$. In conclusion, during the training process, the importance sampling of varying noise intensities essentially equates to the modification of the noise schedules.
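The two directions of this correspondence can be checked numerically. The sketch below is an illustration, not the authors' code: it integrates the cosine density $\text{sech}(\lambda/2)/(2\pi)$ on a truncated grid, applies Equation 5, inverts the result by interpolation, and compares it with the closed form $\lambda=-2\log\tan(\pi t/2)$. The helper name `schedule_from_density` and the grid limits are assumptions chosen for this example.

```python
import numpy as np

def cosine_log_snr(t):
    """Closed form for the cosine schedule: lambda = -2 log tan(pi t / 2)."""
    return -2.0 * np.log(np.tan(np.pi * t / 2.0))

def cosine_density(lam):
    """p(lambda) = sech(lambda / 2) / (2 pi)."""
    return 1.0 / (2.0 * np.pi * np.cosh(lam / 2.0))

def schedule_from_density(density, lam_grid):
    """Numerically build lambda(t) from p(lambda) using t = 1 - CDF(lambda)."""
    p = density(lam_grid)
    # Cumulative trapezoidal integral of p over the grid.
    steps = 0.5 * (p[1:] + p[:-1]) * np.diff(lam_grid)
    cdf = np.concatenate([[0.0], np.cumsum(steps)])
    cdf /= cdf[-1]            # renormalise the mass lost by truncating the grid
    t_of_lam = 1.0 - cdf      # Equation 5, decreasing in lambda

    def lam_of_t(t):
        # np.interp expects increasing sample points, so reverse both arrays.
        return np.interp(t, t_of_lam[::-1], lam_grid[::-1])

    return lam_of_t

lam_grid = np.linspace(-40.0, 40.0, 200001)
lam_of_t = schedule_from_density(cosine_density, lam_grid)

t = np.linspace(0.05, 0.95, 19)
max_err = np.max(np.abs(lam_of_t(t) - cosine_log_snr(t)))
print(f"max |numeric - closed form| = {max_err:.2e}")
```

The same routine accepts any strictly positive density on the grid, so it can be used to tabulate $\lambda(t)$ for a density that has no closed-form inverse.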

| Noise Schedule | $p(\lambda)$ | $\lambda(t)$ |
|---|---|---|
| Cosine | $\text{sech}\left(\lambda/2\right)/(2\pi)$ | $2\log\cot\left(\frac{\pi t}{2}\right)$ |
| Laplace | $e^{-\frac{\lvert\lambda-\mu\rvert}{b}}/(2b)$ | $\mu-b\,\text{sgn}(0.5-t)\log(1-2\lvert t-0.5\rvert)$ |
| Cauchy | $\frac{1}{\pi}\frac{\gamma}{(\lambda-\mu)^2+\gamma^2}$ | $\mu+\gamma\tan\left(\frac{\pi}{2}(1-2t)\right)$ |
| Cosine Shifted | $\frac{1}{2\pi}\text{sech}\left(\frac{\lambda-\mu}{2}\right)$ | $\mu+2\log\left(\cot\left(\frac{\pi t}{2}\right)\right)$ |
| Cosine Scaled | $\frac{s}{2\pi}\text{sech}\left(\frac{s\lambda}{2}\right)$ | $\frac{2}{s}\log\left(\cot\left(\frac{\pi t}{2}\right)\right)$ |

Table 1: Overview of various Noise Schedules. The table categorizes them into five distinct types: Cosine, Laplace, Cauchy, and two variations of Cosine schedules. The second column $p(\lambda)$ denotes the sampling probability at different noise intensities $\lambda$. The last column $\lambda(t)$ indicates how to sample noise intensities for training. We derived their relationship in Equation 4 and Equation 6.
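The $\lambda(t)$ column of Table 1 translates directly into code. The following is a minimal sketch rather than a reference implementation: the function names, the clipping constant `t_eps` (used to keep $t$ away from 0 and 1, where $\lambda$ diverges), and the choice to fold the Cosine, Cosine Shifted, and Cosine Scaled rows into one function with both `mu` and `s` arguments are all assumptions made for brevity.

```python
import numpy as np

def log_snr_cosine(t, mu=0.0, s=1.0):
    """Cosine (mu=0, s=1), Cosine Shifted (mu != 0) or Cosine Scaled (s != 1)."""
    return mu + (2.0 / s) * np.log(1.0 / np.tan(np.pi * t / 2.0))

def log_snr_laplace(t, mu=0.0, b=0.5):
    """Laplace row: mu - b * sgn(0.5 - t) * log(1 - 2|t - 0.5|)."""
    return mu - b * np.sign(0.5 - t) * np.log(1.0 - 2.0 * np.abs(t - 0.5))

def log_snr_cauchy(t, mu=0.0, gamma=1.0):
    """Cauchy row: mu + gamma * tan(pi/2 * (1 - 2t))."""
    return mu + gamma * np.tan(np.pi / 2.0 * (1.0 - 2.0 * t))

def sample_training_log_snr(batch_size, rng, schedule=log_snr_laplace,
                            t_eps=1e-5, **kwargs):
    """Draw t ~ U(t_eps, 1 - t_eps) and map it to a noise intensity lambda."""
    t = rng.uniform(t_eps, 1.0 - t_eps, size=batch_size)
    return schedule(t, **kwargs)

rng = np.random.default_rng(0)
lam = sample_training_log_snr(200_000, rng, schedule=log_snr_laplace, mu=0.0, b=0.5)
# For a Laplace distribution the mean absolute deviation from mu equals b.
print(f"empirical E|lambda - mu| = {np.mean(np.abs(lam)):.3f}")
```

Each sampled $\lambda$ can then be converted to $(\alpha_t,\sigma_t)$ with the VP relations from Section 2.1 before noising a training example.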

2.3 Unified Formulation for Diffusion Training

VDM++ Kingma & Gao (2023) proposes a unified formulation that encompasses recent prominent frameworks and loss weighting strategies for training diffusion models, as detailed below:

$$\mathcal{L}_w(\theta)=\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim\mathcal{D},\bm{\epsilon}\sim\mathcal{N}(0,\mathbf{I}),\lambda\sim p(\lambda)}\left[\frac{w(\lambda)}{p(\lambda)}\left\lVert\hat{\bm{\epsilon}}_{\theta}(\mathbf{x}_{\lambda};\lambda)-\bm{\epsilon}\right\rVert_2^2\right], \tag{7}$$

where $\mathcal{D}$ signifies the training dataset, noise $\bm{\epsilon}$ is drawn from a standard Gaussian distribution, and $p(\lambda)$ is the distribution of noise intensities. Different predicting targets, such as $\mathbf{x}_0$ and $\mathbf{v}$, can also be re-parameterized to $\bm{\epsilon}$-prediction. $w(\lambda)$ denotes the loss weighting strategy. Although adjusting $w(\lambda)$ is theoretically equivalent to altering $p(\lambda)$, in practical training, directly modifying $p(\lambda)$ to concentrate computational resources on training specific noise levels is more effective than enlarging the loss weight on specific noise levels. Therefore, we focus on how to design $p(\lambda)$.
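The equivalence in expectation can be illustrated with a toy Monte Carlo experiment. In the sketch below, `toy_loss` is a hypothetical stand-in for the per-noise-level squared error (a real network is not involved), the target weighting $w(\lambda)$ is taken to be a Laplace density with $\mu=0$, $b=0.5$, and the constant factor $\frac{1}{2}$ of Equation 7 is omitted. Option (a) samples $\lambda$ from the cosine density and multiplies by $w/p$; option (b) samples $\lambda$ from the Laplace density directly, so $w/p=1$. Both estimate the same integral $\int w(\lambda)\,\ell(\lambda)\,\mathrm{d}\lambda$, which equals 0.5 for this symmetric toy loss.

```python
import numpy as np

def cosine_density(lam):
    return 1.0 / (2.0 * np.pi * np.cosh(lam / 2.0))

def laplace_density(lam, mu=0.0, b=0.5):
    return np.exp(-np.abs(lam - mu) / b) / (2.0 * b)

def toy_loss(lam):
    # Hypothetical per-noise-level loss, bounded in (0, 1); not a real model.
    return 1.0 / (1.0 + np.exp(lam))

rng = np.random.default_rng(0)
n = 200_000

# (a) Loss reweighting: lambda ~ cosine schedule, weight w(lambda) / p(lambda).
t = rng.uniform(1e-6, 1.0 - 1e-6, size=n)
lam_cos = 2.0 * np.log(1.0 / np.tan(np.pi * t / 2.0))
weights = laplace_density(lam_cos) / cosine_density(lam_cos)
est_reweight = np.mean(weights * toy_loss(lam_cos))

# (b) Modified noise schedule: lambda ~ Laplace, so w(lambda) / p(lambda) = 1.
lam_lap = rng.laplace(loc=0.0, scale=0.5, size=n)
est_resample = np.mean(toy_loss(lam_lap))

print(f"reweighted estimate: {est_reweight:.3f}")
print(f"resampled estimate:  {est_resample:.3f}")
```

The experiment only shows that the two objectives agree in expectation; it says nothing about optimization behavior, which is what the experiments in Section 3 measure.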

2.4 Practical Settings

Stable Diffusion 3 Esser et al. (2024), EDM Karras et al. (2022), and Min-SNR Hang et al. (2023); Crowson et al. (2024) find that the denoising tasks with medium noise intensity are the most critical to the overall performance of diffusion models. Therefore, we increase the probability density $p(\lambda)$ where $\lambda$ is of moderate size, and obtain a new noise schedule according to Section 2.2.

Specifically, we investigate four novel noise strategies, named Cosine Shifted, Cosine Scaled, Cauchy, and Laplace, respectively. The detailed settings are listed in Table 1. Cosine Shifted uses the hyperparameter $\mu$ to explore where the maximum probability should be placed. Cosine Scaled explores how much the probability density around the peak should be increased under the Cosine strategy to achieve better results. The Cauchy distribution provides another functional form that can adjust both amplitude and offset simultaneously. The Laplace distribution, characterized by its mean $\mu$ and scale $b$, controls both the magnitude of the probability and the degree of concentration of the distribution. These strategies contain several hyperparameters, which we explore in Section 3.5. Unless otherwise stated, we report the best hyperparameter results.

By re-allocating the computation resources at different noise intensities, we can train the complete denoising process. During sampling, we standardize the sampled SNR to align with the cosine schedule, thereby focusing our exploration solely on the impact of different strategies during training. It is important to note that, from the perspective of the noise schedule, how to allocate the computation resources during inference is also worth reconsidering. We do not explore this question in this paper and leave it as future work.

3 Experiments

3.1 Implementation Details

Dataset. We conduct experiments on ImageNet Deng et al. (2009) with $256\times 256$ and $512\times 512$ resolution. For each image, we follow the preprocessing in Rombach et al. (2022) to center crop and encode images to latents. The shape of compressed latent feature is $32\times 32\times 4$ for $256^2$ images and $64\times 64\times 4$ for $512^2$ images.

Network Architecture. We adopt DiT-B from Peebles & Xie (2023) as our backbone. We replace the last AdaLN Linear layer with a vanilla linear layer. All other components are kept the same as in the original implementation.

Training Settings. We adopt the Adam optimizer with learning rate $1\times 10^{-4}$. We set the batch size to 256, following Peebles & Xie (2023); Hang et al. (2023). Each model is trained for 500K iterations unless otherwise specified. Our implementation is mainly built on OpenDiT Zhao et al. (2024) and experiments are mainly conducted on 8$\times$16G V100 GPUs.

Baselines and Metrics. We compare our proposed noise schedule with several baseline settings in Table 2. For each setting, we sample images using DDIM Song et al. (2021) with 50 steps. Although the noise strategy may differ between settings during training, we ensure that the noise levels are the same at each sampling step. This approach is adopted to exclusively investigate the impact of the noise strategy during the training phase. Moreover, we report results with different classifier-free guidance (CFG) scales, and the FID is calculated using 10K generated images.

| Method | $w(\lambda)$ | $p(\lambda)$ |
|---|---|---|
| Cosine | $e^{-\lambda/2}$ | $\text{sech}(\lambda/2)$ |
| Min-SNR | $e^{-\lambda/2}\cdot\min\{1,\gamma e^{-\lambda}\}$ | $\text{sech}(\lambda/2)$ |
| Soft-Min-SNR | $e^{-\lambda/2}\cdot\gamma/(e^{\lambda}+\gamma)$ | $\text{sech}(\lambda/2)$ |
| FM-OT | $(1+e^{-\lambda})\,\text{sech}^2(\lambda/4)$ | $\text{sech}^2(\lambda/4)/8$ |
| EDM | $(1+e^{-\lambda})(0.5^2+e^{-\lambda})\,\mathcal{N}(\lambda;2.4,2.4^2)$ | $(0.5^2+e^{-\lambda})\,\mathcal{N}(\lambda;2.4,2.4^2)$ |

Table 2: Comparison of different methods and related loss weighting strategies. The $w(\lambda)$ is introduced in Equation 7.

3.2 Comparison with baselines and loss weight designs

This section details the principal findings from our experiments on the ImageNet-256 dataset, focusing on the comparative effectiveness of various noise schedules and loss weightings in the context of CFG values. Table 3 illustrates these comparisons, showcasing the performance of each method in terms of the FID-10K score.

The experiments reveal that our proposed noise schedules, particularly Laplace, achieve the most notable improvements over the traditional cosine schedule, as indicated by the bolded best scores and the blue numbers representing the reductions compared to baseline’s best score of 10.85.

We also provide a comparison with methods that adjust the loss weight, including Min-SNR and Soft-Min-SNR. We find that although these methods can achieve better results than the baseline, they are still not as effective as our method of modifying the noise schedule. This indicates that deciding where to allocate more computational resources is more efficient than adjusting the loss weight. Compared with other noise schedules such as EDM Karras et al. (2022) and Flow Matching Lipman et al. (2022), our results are significantly better under the same number of training iterations, regardless of the CFG value.

| Method | CFG=1.5 | CFG=2.0 | CFG=3.0 |
|---|---|---|---|
| Cosine Nichol & Dhariwal (2021) | 17.79 | 10.85 | 11.06 |
| EDM Karras et al. (2022) | 26.11 | 15.09 | 11.56 |
| FM-OT Lipman et al. (2022) | 24.49 | 14.66 | 11.98 |
| Min-SNR Hang et al. (2023) | 16.06 | 9.70 | 10.43 |
| Soft-Min-SNR Crowson et al. (2024) | 14.89 | 9.07 | 10.66 |
| Cosine Shifted Hoogeboom et al. (2023) | 19.34 | 11.67 | 11.13 |
| Cosine Scaled | 12.74 | 8.04 | 11.02 |
| Cauchy | 12.91 | 8.14 | 11.02 |
| Laplace | 16.69 | 9.04 | 7.96 (-2.89) |
Table 3: Comparison of various noise schedules and loss weightings on ImageNet-256, showing the performance (in terms of FID-10K) of different methods under different CFG values. The best results are highlighted in bold, and the blue numbers represent the improvement compared with the baseline FID of 10.85. The line in gray is our suggested noise schedule.

Furthermore, we investigate the convergence speed of these methods, and the results are shown in Figure 2. It can be seen that adjusting the noise schedule converges faster than adjusting the loss weight. We also notice that the optimal training method may vary when using different CFG values for inference, but adjusting the noise schedule generally yields better results.

Figure 2: Comparison between adjusting the noise schedule, adjusting the loss weights and baseline setting. The Laplace noise schedule yields the best results and the fastest convergence speed.

3.3 Robustness on different predicting targets

We evaluate the effectiveness of our designed noise schedule across three commonly adopted prediction targets: $\bm{\epsilon}$, $\mathbf{x}_0$ and $\mathbf{v}$. The results are shown in Table 4.

We observe that, regardless of the prediction target, our proposed Laplace strategy significantly outperforms the Cosine strategy. It is noteworthy that, because the Laplace strategy focuses the computation on medium noise levels during training, the extreme noise levels are trained less, which could potentially affect the overall performance. Therefore, we slightly modify the inference strategy of DDIM to start sampling from $t_{\max}=0.99$.
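To make the role of $t_{\max}$ concrete, the sketch below builds 50 sampling steps under the cosine schedule starting from $t_{\max}=0.99$ and maps each step's $\lambda$ to the timestep that the Laplace training schedule ($\mu=0$, $b=0.5$) associates with the same noise level. Several details are assumptions made for illustration only: the uniform spacing of steps in $t$, the helper names, and the idea of converting between schedules through $\lambda$; the paper does not specify how the alignment is implemented.

```python
import numpy as np

def cosine_log_snr(t):
    return 2.0 * np.log(1.0 / np.tan(np.pi * t / 2.0))

def laplace_log_snr(t, mu=0.0, b=0.5):
    return mu - b * np.sign(0.5 - t) * np.log(1.0 - 2.0 * np.abs(t - 0.5))

def laplace_time_from_log_snr(lam, mu=0.0, b=0.5):
    """Inverse of laplace_log_snr: t = 1 - CDF_Laplace(lambda)."""
    z = (lam - mu) / b
    tail = 0.5 * np.exp(-np.abs(z))
    # lambda >= mu lies in the low-noise half (t <= 0.5), otherwise t > 0.5.
    return np.where(z >= 0.0, tail, 1.0 - tail)

num_steps, t_max = 50, 0.99
# Assumed spacing: uniform in t from t_max down to (but excluding) t = 0.
t_cos = np.linspace(t_max, 0.0, num_steps + 1)[:-1]
lam_steps = cosine_log_snr(t_cos)            # noise level at each sampling step
t_lap = laplace_time_from_log_snr(lam_steps) # same noise levels, Laplace timesteps

print(f"first step: lambda = {lam_steps[0]:.2f}, 1 - t_laplace = {1.0 - t_lap[0]:.1e}")
```

Under these assumptions, the first sampling step already corresponds to a Laplace-schedule timestep very close to 1, which is consistent with the observation that the most extreme noise levels receive few training samples.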

| Predict Target | Noise Schedule | 100K | 200K | 300K | 400K | 500K |
|---|---|---|---|---|---|---|
| $\mathbf{x}_0$ | Cosine | 35.20 | 17.60 | 13.37 | 11.84 | 11.16 |
| $\mathbf{x}_0$ | Laplace (Ours) | 21.78 | 10.86 | 9.44 | 8.73 | 8.48 |
| $\mathbf{v}$ | Cosine | 25.70 | 14.01 | 11.78 | 11.26 | 11.06 |
| $\mathbf{v}$ | Laplace (Ours) | 18.03 | 9.37 | 8.31 | 8.07 | 7.96 |
| $\bm{\epsilon}$ | Cosine | 28.63 | 15.80 | 12.49 | 11.14 | 10.46 |
| $\bm{\epsilon}$ | Laplace (Ours) | 27.98 | 13.92 | 11.01 | 10.00 | 9.53 |
Table 4: Effectiveness evaluated using FID-10K score on different predicting targets. The proposed Laplace schedule performs better than the baseline Cosine schedule throughout training.

3.4 Robustness on high resolution images

To explore the robustness of the adjusted noise schedule to different resolutions, we also design experiments on ImageNet-512. As pointed out by Chen (2023), the noising strategy causes more severe signal leakage as the resolution increases. Therefore, we need to adjust the hyperparameters of the noise schedule according to the resolution.

Specifically, the baseline Cosine schedule achieves the best performance when the CFG value equals 3, so we choose this CFG value for inference. Through systematic experimentation, we explored the appropriate values for the Laplace schedule's parameter $b$, testing within the range {0.5, 0.75, 1.0}, and determined that $b=0.75$ was the most effective, resulting in an FID score of 9.09. This indicates that, despite the need for hyperparameter tuning, adjusting the noise schedule still brings stable performance improvements.

| Noise Schedule | Cosine | Laplace |
|---|---|---|
| FID-10K | 11.91 | 9.09 (-2.82) |
Table 5: FID-10K results on ImageNet-512. All models are trained for 500K iterations.

3.5 Ablation Study

We conduct an ablation study to analyze the impact of hyperparameters on various distributions of $p(\lambda)$, which are enumerated below.

Laplace distribution is easy to implement and we adjust the scale to make the peak at the middle timestep. We conduct experiments with different Laplace distribution scales $b\in\{0.25,0.5,1.0,2.0,3.0\}$. The results are shown in Figure 3. The baseline with standard cosine schedule achieves FID scores of 17.79 with CFG=1.5, 10.85 with CFG=2.0, and 11.06 with CFG=3.0 after 500K iterations. We can see that the model with Laplace distribution scale $b=0.5$ achieves the best performance, 7.96 with CFG=3.0, a relative improvement of 26.6% over the baseline's best score of 10.85.

Figure 3: FID-10K results on ImageNet-256 with different Laplace distribution scales $b$ in $\{0.25,0.5,1.0,2.0,3.0\}$. The location parameter $\mu$ is fixed to 0. Baseline denotes standard cosine schedule.
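The relative improvements quoted in this section follow from the FID values in Table 3. A short check, using the baseline's best score of 10.85 as the reference (the variable and function names are illustrative):

```python
baseline_best = 10.85      # Cosine schedule, best FID-10K across CFG values (Table 3)
laplace_fid = 7.96         # Laplace schedule, CFG=3.0
cosine_scaled_fid = 8.04   # Cosine Scaled schedule, CFG=2.0

def relative_improvement(baseline, fid):
    """Fractional FID reduction relative to the baseline (lower FID is better)."""
    return (baseline - fid) / baseline

rel_laplace = relative_improvement(baseline_best, laplace_fid)
rel_scaled = relative_improvement(baseline_best, cosine_scaled_fid)

print(f"Laplace:       {rel_laplace:.1%}")   # prints 26.6%
print(f"Cosine Scaled: {rel_scaled:.1%}")    # prints 25.9%
```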

Cauchy distribution is another heavy-tailed distribution that can be used for noise schedule design. The distribution is not symmetric about $\lambda=0$ when the location parameter is not 0. We conduct experiments with different Cauchy distribution parameters and the results are shown in Table 6. Cauchy(0, 0.5) means $\frac{1}{\pi}\frac{\gamma}{(\lambda-\mu)^2+\gamma^2}$ with $\mu=0,\gamma=0.5$. We can see that, with $\gamma$ fixed to 1, the model with $\mu=0$ achieves better performance than the other two settings at CFG=1.5 and CFG=2.0. This means that the model with more probability mass around $\lambda=0$ performs better than those biased toward the negative or positive direction.

| | Cauchy(0, 0.5) | Cauchy(0, 1) | Cauchy(-1, 1) | Cauchy(1, 1) |
|---|---|---|---|---|
| CFG=1.5 | 12.91 | 14.32 | 18.12 | 16.60 |
| CFG=2.0 | 8.14 | 8.93 | 10.38 | 10.19 |
| CFG=3.0 | 11.02 | 11.26 | 10.81 | 10.94 |
Table 6: FID-10k results on ImageNet-256 with different Cauchy distribution parameters.

Cosine Shifted Hoogeboom et al. (2023) is the shifted version of the standard cosine schedule. We evaluate the schedules with both positive and negative $\mu$ values. Shifted with $\mu=1$ achieves FID-10K scores of $\{19.34,11.67,11.13\}$ with CFG $\{1.5,2.0,3.0\}$. Results with shifted value $\mu=-1$ are $\{19.30,11.48,11.28\}$. Both settings perform worse than the baseline cosine schedule ($\mu=0$). Together with the results in Table 6, this suggests that concentrating the sampling density around $\lambda=0$ improves the results the most.

Cosine Scaled is also a modification of the cosine schedule. When $s=1$, it reduces to the standard cosine schedule. $s>1$ means sampling more heavily around $\lambda=0$, while $s<1$ means sampling $\lambda$ more uniformly. We report the results in Table 7. Larger values of $s$ ($s>1$) outperform the baseline; however, $s$ should not be excessively large and must remain within a valid range. A model trained with $s=2$ attains an FID-10k of 8.04 at CFG=2.0, representing a $\mathbf{25.9\%}$ improvement over the baseline.

1/s       1.3     1.1     0.5     0.25
CFG=1.5   39.74   22.60   12.74   15.83
CFG=2.0   23.38   12.98   8.04    8.64
CFG=3.0   13.94   11.16   11.02   8.26
Table 7: FID-10k results on ImageNet-256 with different scales of Cosine Scaled distribution.
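
The role of $s$ can be illustrated numerically. The sketch below assumes the scaled schedule has the form $\lambda(t) = -\frac{2}{s}\log\tan(\pi t/2)$, which reduces to the plain cosine schedule at $s=1$, and measures the fraction of uniformly drawn $t$ whose log-SNR lands in $[-1, 1]$. The window and the helper names are assumptions made for illustration.

```python
import math


def cosine_scaled_log_snr(t, s=1.0):
    """log-SNR of the (assumed) scaled cosine schedule at t in (0, 1)."""
    return -(2.0 / s) * math.log(math.tan(math.pi * t / 2.0))


def t_from_log_snr(lam, s=1.0):
    """Inverse of cosine_scaled_log_snr: the t that yields log-SNR lam."""
    return (2.0 / math.pi) * math.atan(math.exp(-s * lam / 2.0))


def mass_near_zero(s, half_width=1.0):
    """Fraction of uniform t whose log-SNR lies in [-half_width, half_width]."""
    return t_from_log_snr(-half_width, s) - t_from_log_snr(half_width, s)


# 1/s values from Table 7, plus 1/s = 1.0 for the plain cosine schedule.
for inv_s in (1.3, 1.1, 1.0, 0.5, 0.25):
    s = 1.0 / inv_s
    print(f"1/s = {inv_s:<4}  s = {s:.2f}  P(|lambda| <= 1) = {mass_near_zero(s):.3f}")
```

The fraction grows monotonically with $s$, which matches the description of $s>1$ as sampling more heavily around $\lambda=0$.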

4 Related Work

Efficient Diffusion Training

Generally speaking, a diffusion model uses a network with shared parameters to denoise inputs at different noise intensities. However, the different noise levels may introduce conflicts during training, which slows convergence. Min-SNR Hang et al. (2023) seeks the Pareto-optimal direction across the different tasks and achieves better convergence on different prediction targets. HDiT Crowson et al. (2024) proposes a soft version of Min-SNR to further improve efficiency on high-resolution image synthesis. Stable Diffusion 3 Esser et al. (2024) puts more weight on the middle timesteps by weighting them with a logit-normal distribution. On the other hand, architecture modifications have also been explored to improve diffusion training. DiT Peebles & Xie (2023) proposes adaptive Layer Normalization with zero initialization to improve the training of Transformer architectures. A more robust ADM UNet with better training dynamics is proposed in EDM2 Karras et al. (2024), which preserves activation, weight, and update magnitudes.
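
For contrast with schedule design, the loss-weighting approach of Min-SNR can be sketched as follows. This is a minimal illustration of the clamped weights described in Hang et al. (2023), not the authors' code; the function name is hypothetical and the truncation value `gamma=5.0` is an assumed default.

```python
def min_snr_weight(snr, gamma=5.0, predict_target="noise"):
    """Per-timestep loss weight with the SNR clamped at gamma (illustrative)."""
    clipped = min(snr, gamma)
    if predict_target == "x0":
        return clipped
    if predict_target == "noise":
        return clipped / snr
    if predict_target == "v":
        return clipped / (snr + 1.0)
    raise ValueError(f"unknown predict_target: {predict_target!r}")


for snr in (0.1, 1.0, 5.0, 100.0):
    weights = {k: min_snr_weight(snr, predict_target=k) for k in ("x0", "noise", "v")}
    print(snr, weights)
```

Such weights rescale the loss at each noise level but leave the sampling frequency of noise levels unchanged, which is the quantity the noise schedule controls.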

Noise Schedule Design for Diffusion Models

The design of the noise schedule plays a critical role in training diffusion models. In DDPM, Ho et al. (2020) propose a linear schedule for the noise level, which is later used in Stable Diffusion Rombach et al. (2022) versions 1.5 and 2.0. iDDPM Nichol & Dhariwal (2021) introduces a cosine schedule aimed at bringing the sample with the highest noise level closer to pure Gaussian noise. EDM Karras et al. (2022) proposes a new continuous framework and samples the logarithm of the noise intensity from a Gaussian distribution. Flow matching with optimal transport Lipman et al. (2022); Liu et al. (2022) linearly interpolates between the noise and the data point to form the input of flow-based models. Chen (2023) underscores the need to adapt the noise schedule to the token length, and several other works Lin et al. (2024); Tang et al. (2023) emphasize the importance of preventing signal leakage in the final step.
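
These schedules can be compared on a common axis by converting each one to log-SNR. The sketch below is illustrative only and rests on several assumptions: the DDPM linear schedule uses 1000 steps with betas from 1e-4 to 0.02, the cosine schedule is taken in its simplified form without offset, and flow matching is written as $\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\bm{\epsilon}$ with $t=0$ denoting clean data.

```python
import math


def ddpm_linear_log_snr(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """log-SNR after each step of the (assumed) DDPM linear beta schedule."""
    log_snrs, alpha_bar = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bar *= 1.0 - beta  # cumulative product of (1 - beta)
        log_snrs.append(math.log(alpha_bar / (1.0 - alpha_bar)))
    return log_snrs


def cosine_log_snr(t):
    """Simplified cosine schedule: alpha = cos(pi t / 2), sigma = sin(pi t / 2)."""
    return -2.0 * math.log(math.tan(math.pi * t / 2.0))


def flow_matching_log_snr(t):
    """Linear interpolation: alpha = 1 - t, sigma = t, so SNR = ((1 - t) / t) ** 2."""
    return 2.0 * math.log((1.0 - t) / t)


ddpm = ddpm_linear_log_snr()
print(f"DDPM linear, final step: log-SNR = {ddpm[-1]:.2f}")
for t in (0.1, 0.5, 0.9):
    print(f"t = {t}: cosine {cosine_log_snr(t):+.2f}, "
          f"flow matching {flow_matching_log_snr(t):+.2f}")
```

Under these assumptions the linear schedule ends at a finite log-SNR, so a small amount of signal remains at the last step, whereas the cosine and linear-interpolation forms diverge to $-\infty$ as $t \to 1$.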

5 Conclusion

In this technical report, we present a novel method for enhancing diffusion model training by redefining the noise schedule. We show theoretically that this approach is equivalent to performing importance sampling on the noise. Empirical results show that our proposed Laplace noise schedule, which focuses computational resources on mid-range noise steps, outperforms adjusting the loss weights under constrained budgets. This study contributes to the development of efficient training techniques for diffusion models and offers potential for future large-scale applications.

References

  • Balaji et al. (2022) Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  • Bao et al. (2022) Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. arXiv preprint arXiv:2209.12152, 2022.
  • Brooks et al. (2024) Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
  • Chen (2023) Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
  • Choi et al. (2022) Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11472–11481, 2022.
  • Crowson et al. (2024) Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Forty-first International Conference on Machine Learning, 2024.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.  248–255. Ieee, 2009.
  • Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
  • Gu et al. (2022) Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10696–10706, 2022.
  • Hang et al. (2023) Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.  7441–7451, October 2023.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Hoogeboom et al. (2023) Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pp. 13213–13232. PMLR, 2023.
  • Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7FuTOWMOc7.
  • Karras et al. (2024) Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proc. CVPR, 2024.
  • Kingma et al. (2021) Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
  • Kingma & Gao (2023) Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=NnMEadcdyD.
  • Lin et al. (2024) Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp.  5404–5411, 2024.
  • Lipman et al. (2022) Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2022.
  • Liu et al. (2022) Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2022.
  • Nichol & Dhariwal (2021) Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
  • Peebles & Xie (2023) William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4195–4205, 2023.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10684–10695, 2022.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=08Yk-n5l2Al.
  • Salimans & Ho (2022) Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI.
  • Song et al. (2021) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  • Tang et al. (2023) Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023.
  • Yang et al. (2021) S. Yang, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  • Zhao et al. (2024) Xuanlei Zhao, Zhongkai Zhao, Ziming Liu, Haotian Zhou, Qianli Ma, and Yang You. Opendit: An easy, fast and memory-efficient system for dit training and inference. https://github.com/NUS-HPC-AI-Lab/OpenDiT, 2024.

Appendix A: Detailed Implementation for Noise Schedule

We provide a simple PyTorch implementation for the Laplace noise schedule and its application in training. This example can be adapted to other noise schedules, such as the Cauchy distribution, by replacing the laplace_noise_schedule function. The model accepts noisy samples $\mathbf{x}_{t}$, timestep $t$, and an optional condition tensor $\mathbf{c}$ as inputs. This implementation supports prediction of $\{\mathbf{x}_{0}, \mathbf{v}, \bm{\epsilon}\}$.

import torch


def laplace_noise_schedule(mu=0.0, b=0.5):
    """Return alpha(t) and sigma(t) for the Laplace schedule (refer to Table 1)."""

    def log_snr(t):
        # lambda(t) = mu - b * sign(0.5 - t) * log(1 - 2 * |0.5 - t|)
        return mu - b * torch.sign(0.5 - t) * torch.log(1 - 2 * torch.abs(0.5 - t))

    # alpha^2 = SNR / (1 + SNR) = sigmoid(lambda); sigma^2 = 1 / (1 + SNR) = sigmoid(-lambda).
    # The sigmoid form stays finite at t = 0, where SNR = exp(lambda) overflows to inf.
    alpha_func = lambda t: torch.sqrt(torch.sigmoid(log_snr(t)))
    sigma_func = lambda t: torch.sqrt(torch.sigmoid(-log_snr(t)))

    return alpha_func, sigma_func


def training_losses(model, x, timestep, condition, noise=None,
                    predict_target="v", mu=0.0, b=0.5):
    """Mean squared error for the chosen target; timestep holds one t in [0, 1] per sample."""
    if noise is None:
        noise = torch.randn_like(x)

    alpha_func, sigma_func = laplace_noise_schedule(mu, b)
    # reshape to (B, 1, 1, 1) so the coefficients broadcast over (B, C, H, W) inputs
    alphas = alpha_func(timestep).view(-1, 1, 1, 1)
    sigmas = sigma_func(timestep).view(-1, 1, 1, 1)

    # add noise to sample
    x_t = alphas * x + sigmas * noise
    # velocity
    v_t = alphas * noise - sigmas * x

    model_output = model(x_t, timestep, condition)
    if predict_target == "v":
        loss = (v_t - model_output) ** 2
    elif predict_target == "x0":
        loss = (x - model_output) ** 2
    elif predict_target == "noise":
        loss = (noise - model_output) ** 2
    else:
        raise ValueError(f"unknown predict_target: {predict_target!r}")

    return loss.mean()
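
As an illustration of swapping in another schedule, a Cauchy variant might look like the sketch below. The function name `cauchy_noise_schedule` is hypothetical, and the log-SNR mapping is an assumption: it is the standard Cauchy inverse CDF, flipped so that the log-SNR decreases as $t$ goes from 0 to 1. It is meant as a drop-in for laplace_noise_schedule, returning the same pair of functions.

```python
import math

import torch


def cauchy_noise_schedule(mu=0.0, gamma=1.0):
    """Hypothetical drop-in replacement for laplace_noise_schedule."""

    def log_snr(t):
        # Inverse CDF of Cauchy(mu, gamma), flipped so lambda decreases with t.
        return mu + gamma * torch.tan(math.pi * (0.5 - t))

    # alpha^2 = sigmoid(lambda) and sigma^2 = sigmoid(-lambda), so alpha^2 + sigma^2 = 1.
    alpha_func = lambda t: torch.sqrt(torch.sigmoid(log_snr(t)))
    sigma_func = lambda t: torch.sqrt(torch.sigmoid(-log_snr(t)))
    return alpha_func, sigma_func


alpha_func, sigma_func = cauchy_noise_schedule(mu=0.0, gamma=0.5)
t = torch.rand(8)  # one uniformly sampled timestep per training example
alphas, sigmas = alpha_func(t), sigma_func(t)
print(alphas ** 2 + sigmas ** 2)  # variance-preserving: each entry is close to 1
```

Because only the mapping from $t$ to log-SNR changes, the rest of the training loss can stay as it is.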

Appendix B: Details for Sampling Process

As mentioned before, the choice of noise schedule for sampling is worth exploring. In this paper, we focus on what kind of noise schedule is needed for training. Therefore, we adopt the same inference strategy as the cosine schedule to ensure a fair comparison. Specifically, we first sample $\{t_{0}, t_{1}, \ldots, t_{s}\}$ from the uniform distribution $\mathcal{U}[0,1]$, then obtain the corresponding SNRs from the cosine schedule: $\{\frac{\alpha^{2}_{t_{0}}}{\sigma^{2}_{t_{0}}}, \frac{\alpha^{2}_{t_{1}}}{\sigma^{2}_{t_{1}}}, \ldots, \frac{\alpha^{2}_{t_{s}}}{\sigma^{2}_{t_{s}}}\}$. According to Equation 6, we get the corresponding $\{t^{\prime}_{0}, t^{\prime}_{1}, \ldots, t^{\prime}_{s}\}$ by inverting these SNR values through the respective noise schedules. Finally, we use DDIM Song et al. (2021) to sample with these newly calculated $\{t^{\prime}\}$.
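
The timestep remapping described above can be sketched as follows for the Laplace schedule. This is a minimal illustration under stated assumptions: the cosine schedule is taken in its simplified form $\lambda = -2\log\tan(\pi t/2)$, the inverse is derived from the Laplace log-SNR formula used in Appendix A, and an evenly spaced grid of interior points stands in for the sampled timesteps. The helper names are hypothetical.

```python
import math


def cosine_log_snr(t):
    """log-SNR of the simplified cosine schedule at t in (0, 1)."""
    return -2.0 * math.log(math.tan(math.pi * t / 2.0))


def laplace_t_from_log_snr(lam, mu=0.0, b=0.5):
    """Invert lambda = mu - b * sign(0.5 - t) * log(1 - 2 * |0.5 - t|) for t."""
    half_tail = 0.5 * math.exp(-abs(lam - mu) / b)
    return half_tail if lam >= mu else 1.0 - half_tail


num_steps = 10
# Evenly spaced interior points; the endpoints are skipped because the
# log-SNR is infinite there.
cosine_ts = [(i + 0.5) / num_steps for i in range(num_steps)]
laplace_ts = [laplace_t_from_log_snr(cosine_log_snr(t)) for t in cosine_ts]
for t, t_prime in zip(cosine_ts, laplace_ts):
    print(f"t = {t:.2f} -> t' = {t_prime:.6f}")
```

Each pair $(t, t^{\prime})$ shares the same SNR, so a model trained with the Laplace schedule is queried at the noise levels a cosine-schedule sampler would visit.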