Improved Noise Schedule for Diffusion Training

Tiankai Hang
Southeast University
tkhang@seu.edu.cn
Shuyang Gu
Microsoft Research Asia
shuyanggu@microsoft.com
Long-term researcher intern at Microsoft Research Asia.

Abstract

Diffusion models have emerged as the de facto choice for generating visual signals. However, training a single model to predict noise across various levels poses significant challenges, necessitating numerous iterations and incurring significant computational costs. Various approaches, such as loss weighting strategy design and architectural refinements, have been introduced to expedite convergence. In this study, we propose a novel approach to design the noise schedule for enhancing the training of diffusion models. Our key insight is that the importance sampling of the logarithm of the Signal-to-Noise ratio ( $\log\text{SNR}$ ), theoretically equivalent to a modified noise schedule, is particularly beneficial for training efficiency when increasing the sample frequency around $\log\text{SNR}=0$ . We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule. Furthermore, we highlight the advantages of our noise schedule design on the ImageNet benchmark, showing that the designed schedule consistently benefits different prediction targets.

1 Introduction

Diffusion models have emerged as a pivotal technique for generating visual signals across diverse domains, such as image synthesis Ramesh et al. (2022); Saharia et al. (2022); Rombach et al. (2022) and video generation Brooks et al. (2024). They are particularly adept at approximating complex distributions, where Generative Adversarial Networks (GANs) may encounter difficulties. Because convergence demands substantial computational resources and many training iterations, improving the training efficiency of diffusion models is essential for their application in large-scale scenarios, such as high-resolution image synthesis and video generation.

Architectural enhancements offer a promising path to improve both the training speed and performance of diffusion models. For instance, the use of Adaptive Layer Normalization Gu et al. (2022), when combined with zero initialization in the Transformer architecture as demonstrated by Peebles & Xie (2023), represents such an improvement. Similarly, the adoption of U-shaped skip connections within Transformers, as outlined in previous works Hoogeboom et al. (2023); Bao et al. (2022); Crowson et al. (2024), also boosts efficiency. In a parallel development, Karras et al. (2024) have contributed to this endeavor by reengineering the layers of ADM UNet Dhariwal & Nichol (2021) to preserve the magnitudes of activations, weights, and updates, ensuring a more efficient learning process.

Concurrently, various loss weighting designs have been implemented to accelerate the convergence of training. Previous works, such as eDiff-I Balaji et al. (2022) and Min-SNR Hang et al. (2023), found that the training of diffusion models may encounter conflicts among various noise intensities. Choi et al. (2022) prioritize specific noise levels during training to enhance learning of visual concepts. Min-SNR Hang et al. (2023) reduces the weights of noisy tasks, pursuing Pareto optimality across denoising tasks, and validates its effectiveness on multiple datasets and architectures. A softer version of this approach, aiming to further enhance high-resolution image synthesis within hourglass diffusion models, was introduced by Crowson et al. (2024). SD3 Esser et al. (2024) empirically found that it is crucial to increase the weight of intermediate noise intensities, which has proven effective when training diffusion models.

In this study, we present a novel method to enhance the training of diffusion models by strategically redefining the noise schedule, which is equivalent to importance sampling of the noise across different intensities. Empirical evidence suggests that allocating more computation (FLOPs) to mid-range noise levels (around $\log\text{SNR}=0$ ) yields superior performance compared to increasing the loss weights over the same range, particularly under constrained computational budgets. We experimentally analyze the performance of several different noise schedules, including Laplace, Cauchy, and the Cosine Shifted/Scaled, which are visualized in Figure 1. Notably, the Laplace schedule exhibits favorable performance, and we recommend it for future work.

We demonstrate the effectiveness of this approach using the ImageNet benchmark, with a consistent training budget of 500K iterations. Evaluated using the FID metric, our results reveal that noise schedules with a concentrated probability density around $\log\text{SNR}=0$ consistently surpass others, as evidenced at both $256\times 256$ and $512\times 512$ resolutions with different prediction targets. This research contributes to the advancement of efficient training techniques for diffusion models.

Figure 1: Illustration of the probability density functions of different noise schedules.

2 Method

2.1 Preliminaries

Diffusion models Ho et al. (2020); Yang et al. (2021) learn to generate data by iteratively reversing the diffusion process. We denote the distribution of data points as $\mathbf{x}\sim p_{\text{data}}(\mathbf{x})$ . The diffusion process progressively adds noise to the data, which is defined as:

$$\mathbf{x}_{t}=\alpha_{t}\mathbf{x}+\sigma_{t}\bm{\epsilon},\quad\text{where}\quad\bm{\epsilon}\sim\mathcal{N}(0,\mathbf{I}), \tag{1}$$

where $\alpha_{t}$ and $\sigma_{t}$ are the coefficients of the noising process, which together define the noise schedule. For the commonly used velocity prediction target $\mathbf{v}_{t}=\alpha_{t}\bm{\epsilon}-\sigma_{t}\mathbf{x}$ Salimans & Ho (2022), the diffusion model with parameters $\theta$ is trained with the Mean Squared Error (MSE) loss:

$$\mathcal{L}(\theta)=\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}(\mathbf{x})}\mathbb{E}_{t\sim p(t)}\left[w(t)\lVert\mathbf{v}_{\theta}(\alpha_{t}\mathbf{x}+\sigma_{t}\bm{\epsilon},t,\mathbf{c})-\mathbf{v}_{t}\rVert_{2}^{2}\right], \tag{2}$$

where $w(t)$ is the loss weight, $\mathbf{c}$ denotes the condition information. Common practices sample $t$ from the uniform distribution $\mathcal{U}[0,1]$ . Kingma et al. (2021) introduced the Signal-to-Noise ratio as $\text{SNR}(t)=\frac{\alpha_{t}^{2}}{\sigma_{t}^{2}}$ to measure the noise level of different states. To simplify, we denote $\lambda=\log\text{SNR}$ to indicate the noise intensities. In the Variance Preserving (VP) setting, the coefficients in Equation 1 can be calculated by $\alpha^{2}_{t}=\frac{\exp(\lambda)}{\exp(\lambda)+1}$ , $\sigma^{2}_{t}=\frac{1}{\exp(\lambda)+1}$ .
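As a small illustration of the VP relations above, the following Python sketch converts a log-SNR value $\lambda$ into the pair $(\alpha_t, \sigma_t)$. The helper names `sigmoid` and `vp_coefficients` are ours, not part of the paper; the sigmoid form is algebraically identical to the two fractions in the text and is used here only because it avoids overflow for large $|\lambda|$.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def vp_coefficients(lmbda):
    """Return (alpha, sigma) for a log-SNR value under the VP constraint."""
    alpha = math.sqrt(sigmoid(lmbda))   # alpha^2 = e^lambda / (e^lambda + 1)
    sigma = math.sqrt(sigmoid(-lmbda))  # sigma^2 = 1 / (e^lambda + 1)
    return alpha, sigma


# At log SNR = 0 the signal and noise variances are balanced (both 1/2).
alpha, sigma = vp_coefficients(0.0)
print(alpha ** 2, sigma ** 2)
```

By construction $\alpha_t^2+\sigma_t^2=1$ and $\log(\alpha_t^2/\sigma_t^2)=\lambda$ for any input.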

2.2 Improved Noise Schedule Design

Given that the timestep $t$ is a random variable sampled from a uniform distribution, the noise schedule implicitly defines an importance-sampling distribution over noise levels. The sampling probability of noise intensity $\lambda$ is:

$$p(\lambda)=p(t)\left|\frac{\mathrm{d}t}{\mathrm{d}\lambda}\right|. \tag{3}$$

Since $t$ is uniformly distributed on $[0,1]$ , so that $p(t)=1$ , and $\lambda$ is monotonically decreasing in $t$ , we have:

$$p(\lambda)=-\frac{\mathrm{d}t}{\mathrm{d}\lambda}. \tag{4}$$

We take the cosine noise schedule Nichol & Dhariwal (2021) as an example, where $\alpha_{t}=\cos\left(\frac{\pi t}{2}\right)$ , $\sigma_{t}=\sin\left(\frac{\pi t}{2}\right)$ . Then we can deduce that $\lambda=-2\log\tan(\pi t/2)$ and $t=\frac{2}{\pi}\arctan e^{-\lambda/2}$ . Thus the distribution of $\lambda$ is: $p(\lambda)=-\mathrm{d}t/\mathrm{d}\lambda=\text{sech}(\lambda/2)/(2\pi)$ . This derivation illustrates the process of obtaining $p(\lambda)$ from a noise schedule $\lambda(t)$ . Conversely, we can derive the noise schedule from the sampling probability of different noise intensities $p(\lambda)$ . By integrating Equation 4, we have:

$$t=1-\int_{-\infty}^{\lambda}p(\lambda)\mathrm{d}\lambda=\mathcal{P}(\lambda), \tag{5}$$

$$\lambda=\mathcal{P}^{-1}(t), \tag{6}$$

where $\mathcal{P}(\lambda)$ is the complementary cumulative distribution function of $\lambda$ (one minus the CDF, because $t$ decreases as $\lambda$ increases). Thus we can obtain the noise schedule $\lambda(t)$ by applying the inverse function $\mathcal{P}^{-1}$ . In conclusion, during training, importance sampling over noise intensities is equivalent to modifying the noise schedule.
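The cosine example in this subsection can be checked numerically in both directions. The sketch below is only a sanity check under our own choices (function names, finite-difference step, integration bounds and grid size are all assumptions): it compares the closed-form density with a finite-difference estimate of $-\mathrm{d}t/\mathrm{d}\lambda$ (Equation 4), and recovers $t$ from the density by a trapezoid-rule approximation of Equation 5.

```python
import math


def t_of_lambda(lmbda):
    """Cosine schedule: timestep as a function of log-SNR."""
    return 2.0 / math.pi * math.atan(math.exp(-lmbda / 2.0))


def p_cosine(lmbda):
    """Closed-form density sech(lambda / 2) / (2 * pi)."""
    return 1.0 / (2.0 * math.pi * math.cosh(lmbda / 2.0))


def p_numeric(lmbda, h=1e-5):
    """Equation 4: central finite difference of -dt/dlambda."""
    return -(t_of_lambda(lmbda + h) - t_of_lambda(lmbda - h)) / (2.0 * h)


def t_from_density(lmbda, lo=-60.0, n=20000):
    """Equation 5: t = 1 - integral of p up to lambda (trapezoid rule).

    The lower bound `lo` stands in for -infinity; the tail below it is negligible.
    """
    h = (lmbda - lo) / n
    total = 0.5 * (p_cosine(lo) + p_cosine(lmbda))
    total += sum(p_cosine(lo + i * h) for i in range(1, n))
    return 1.0 - total * h


for lmbda in (-3.0, 0.0, 1.5):
    print(lmbda, p_cosine(lmbda), p_numeric(lmbda),
          t_of_lambda(lmbda), t_from_density(lmbda))
```

In each printed row, the second and third columns should agree, as should the fourth and fifth, up to discretization error.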

Noise Schedule	$p(\lambda)$	$\lambda(t)$
Cosine	$\text{sech}\left(\lambda/2\right)/(2\pi)$	$2\log\cot\left(\frac{\pi t}{2}\right)$
Laplace	$e^{-\frac{|\lambda-\mu|}{b}}/(2b)$	$\mu-b\,\text{sgn}(0.5-t)\log(1-2|t-0.5|)$
Cauchy	$\frac{1}{\pi}\frac{\gamma}{(\lambda-\mu)^{2}+\gamma^{2}}$	$\mu+\gamma\tan\left(\frac{\pi}{2}(1-2t)\right)$
Cosine Shifted	$\frac{1}{2\pi}\text{sech}\left(\frac{\lambda-\mu}{2}\right)$	$\mu+2\log\left(\cot\left(\frac{\pi t}{2}\right)\right)$
Cosine Scaled	$\frac{s}{2\pi}\text{sech}\left(\frac{s\lambda}{2}\right)$	$\frac{2}{s}\log\left(\cot\left(\frac{\pi t}{2}\right)\right)$

Table 1: Overview of various noise schedules. The table categorizes them into five distinct types: Cosine, Laplace, Cauchy, and two variations of Cosine schedules. The second column $p(\lambda)$ denotes the sampling probability at different noise intensities $\lambda$ . The last column $\lambda(t)$ indicates how to sample noise intensities for training. We derived their relationship in Equation 4 and Equation 6.
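To make the $\lambda(t)$ column of Table 1 concrete, here is a minimal scalar Python sketch of the schedules. Function names and the helper `mass_near_zero` are ours; folding the shifted and scaled cosine variants into a single `lambda_cosine(t, mu, s)` is a convenience of this sketch, not something the table prescribes. The helper pushes an evenly spaced grid of $t$ values through a schedule and reports the fraction that lands in $|\lambda|<1$, which is one simple way to see how strongly each schedule concentrates training around $\log\text{SNR}=0$.

```python
import math


def lambda_cosine(t, mu=0.0, s=1.0):
    """Cosine family; mu=0, s=1 gives the standard cosine schedule."""
    return mu + (2.0 / s) * math.log(1.0 / math.tan(math.pi * t / 2.0))


def lambda_laplace(t, mu=0.0, b=0.5):
    """Laplace schedule: inverse of t = 1 - CDF of Laplace(mu, b)."""
    sign = math.copysign(1.0, 0.5 - t)
    return mu - b * sign * math.log(1.0 - 2.0 * abs(t - 0.5))


def lambda_cauchy(t, mu=0.0, gamma=1.0):
    """Cauchy schedule: inverse of t = 1 - CDF of Cauchy(mu, gamma)."""
    return mu + gamma * math.tan(math.pi / 2.0 * (1.0 - 2.0 * t))


def mass_near_zero(schedule, n=10000, width=1.0):
    """Fraction of an even grid of t in (0, 1) that maps to |lambda| < width."""
    ts = [(i + 0.5) / n for i in range(n)]
    return sum(abs(schedule(t)) < width for t in ts) / n


print("cosine       :", mass_near_zero(lambda_cosine))
print("laplace b=0.5:", mass_near_zero(lambda_laplace))
print("cauchy g=0.5 :", mass_near_zero(lambda t: lambda_cauchy(t, gamma=0.5)))
```

All three functions require $t$ strictly inside $(0,1)$; at the endpoints $\lambda$ diverges, so a practical implementation would need to clamp $t$.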

2.3 Unified Formulation for Diffusion Training

VDM++ Kingma & Gao (2023) proposes a unified formulation that encompasses recent prominent frameworks and loss weighting strategies for training diffusion models, as detailed below:

$$\mathcal{L}_{w}(\theta)=\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim\mathcal{D},\bm{\epsilon}\sim\mathcal{N}(0,\mathbf{I}),\lambda\sim p(\lambda)}\left[\frac{w(\lambda)}{p(\lambda)}\left\lVert\hat{\bm{\epsilon}}_{\theta}(\mathbf{x}_{\lambda};\lambda)-\bm{\epsilon}\right\rVert_{2}^{2}\right], \tag{7}$$

where $\mathcal{D}$ signifies the training dataset, noise $\bm{\epsilon}$ is drawn from a standard Gaussian distribution, and $p(\lambda)$ is the distribution of noise intensities. Different prediction targets, such as $\mathbf{x}_{0}$ and $\mathbf{v}$ , can also be re-parameterized to $\bm{\epsilon}$ -prediction. $w(\lambda)$ denotes the loss weighting strategy. Although adjusting $w(\lambda)$ is theoretically equivalent to altering $p(\lambda)$ , in practical training, directly modifying $p(\lambda)$ to concentrate computational resources on specific noise levels is more effective than enlarging the loss weight on those noise levels. Therefore, we focus on how to design $p(\lambda)$ .
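The role of the factor $w(\lambda)/p(\lambda)$ in Equation 7 can be illustrated with a toy calculation: for a fixed $w(\lambda)$ the expected objective does not depend on $p(\lambda)$, because dividing by $p(\lambda)$ cancels the sampling density; $p(\lambda)$ only decides where samples, and hence compute, are spent. The sketch below is an illustration under our own assumptions, not the training code: `w` is an arbitrary Gaussian-shaped weight, `per_level_loss` is a hypothetical stand-in for the expected squared error at each noise level, and the expectation over $t\sim\mathcal{U}[0,1]$ is approximated on an even grid.

```python
import math


def lambda_cosine(t):
    return 2.0 * math.log(1.0 / math.tan(math.pi * t / 2.0))


def p_cosine(lmbda):
    return 1.0 / (2.0 * math.pi * math.cosh(lmbda / 2.0))


def lambda_laplace(t, b=0.5):
    sign = math.copysign(1.0, 0.5 - t)
    return -b * sign * math.log(1.0 - 2.0 * abs(t - 0.5))


def p_laplace(lmbda, b=0.5):
    return math.exp(-abs(lmbda) / b) / (2.0 * b)


def w(lmbda):
    """Hypothetical loss weight, chosen only for illustration."""
    return math.exp(-0.5 * lmbda * lmbda)


def per_level_loss(lmbda):
    """Hypothetical stand-in for the expected error at noise level lambda."""
    return 1.0 / (1.0 + math.exp(-lmbda))


def weighted_objective(schedule, density, n=100000):
    """Grid estimate of E_{lambda ~ p}[ w / p * loss ] with lambda = schedule(t)."""
    total = 0.0
    for i in range(n):
        lmbda = schedule((i + 0.5) / n)
        total += w(lmbda) / density(lmbda) * per_level_loss(lmbda)
    return total / n


print(weighted_objective(lambda_cosine, p_cosine))
print(weighted_objective(lambda_laplace, p_laplace))
```

Both estimates approximate the same integral $\int w(\lambda)\,\ell(\lambda)\,\mathrm{d}\lambda$, even though the two schedules place their samples very differently. What differs in practice is the variance of the stochastic gradient and the share of training steps each noise level receives.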

2.4 Practical Settings

Stable Diffusion 3 Esser et al. (2024), EDM Karras et al. (2022), and Min-SNR Hang et al. (2023); Crowson et al. (2024) find that the denoising tasks with medium noise intensity are the most critical to the overall performance of diffusion models. Therefore, we increase $p(\lambda)$ where $\lambda$ is of moderate size, and obtain a new noise schedule according to Section 2.2.

Specifically, we investigate four novel noise strategies, named Cosine Shifted, Cosine Scaled, Cauchy, and Laplace respectively. The detailed settings are listed in Table 1. Cosine Shifted uses the hyperparameter $\mu$ to explore where the maximum probability should be placed. Cosine Scaled explores how much the probability around the peak should be increased under the Cosine strategy to achieve better results. The Cauchy distribution provides another functional form that can adjust both amplitude and offset simultaneously. The Laplace distribution, characterized by its mean $\mu$ and scale $b$ , controls both the magnitude of the probability and the degree of concentration of the distribution. These strategies contain several hyperparameters, which we explore in Section 3.5. Unless otherwise stated, we report the best hyperparameter results.

By re-allocating the computational resources across noise intensities, we can still train the complete denoising process. During the sampling process, we standardize the sampled SNR to align with the cosine schedule, thereby focusing our exploration solely on the impact of different strategies during training. It is important to note that, from the perspective of the noise schedule, how to allocate computation during inference is also worth reconsidering. We do not explore it in this paper and leave it as future work.

3 Experiments

3.1 Implementation Details

Dataset. We conduct experiments on ImageNet Deng et al. (2009) with $256\times 256$ and $512\times 512$ resolution. For each image, we follow the preprocessing in Rombach et al. (2022) to center crop and encode images to latents. The shape of compressed latent feature is $32\times 32\times 4$ for $256^{2}$ images and $64\times 64\times 4$ for $512^{2}$ images.

Network Architecture. We adopt DiT-B from Peebles & Xie (2023) as our backbone. We replace the last AdaLN linear layer with a vanilla linear layer. All other components are kept the same as in the original implementation.

Training Settings. We adopt the Adam optimizer with learning rate $1\times 10^{-4}$ . We set the batch size to 256 as Peebles & Xie (2023); Hang et al. (2023). Each model is trained for 500K iterations if not specified. Our implementation is mainly built on OpenDiT Zhao et al. (2024) and experiments are mainly conducted on 8 $\times$ 16G V100 GPUs.

Baselines and Metrics. We compare our proposed noise schedule with several baseline settings in Table 2. For each setting, we sample images using DDIM Song et al. (2021) with 50 steps. Although the noise strategies used for training may differ across settings, we ensure that the noise level at each sampling step is the same. This approach is adopted to exclusively investigate the impact of the noise strategy during the training phase. Moreover, we report results with different classifier-free guidance scales, and the FID is calculated using 10K generated images.

Method	$w(\lambda)$	$p(\lambda)$
Cosine	$e^{-\lambda/2}$	$\text{sech}(\lambda/2)$
Min-SNR	$e^{-\lambda/2}\cdot\min\{1,\gamma e^{-\lambda}\}$	$\text{sech}(\lambda/2)$
Soft-Min-SNR	$e^{-\lambda/2}\cdot\gamma/(e^{\lambda}+\gamma)$	$\text{sech}(\lambda/2)$
FM-OT	$(1+e^{-\lambda})\text{sech}^{2}(\lambda/4)$	$\text{sech}^{2}(\lambda/4)/8$
EDM	$(1+e^{-\lambda})(0.5^{2}+e^{-\lambda})\mathcal{N}(\lambda;2.4,2.4^{2})$	$(0.5^{2}+e^{-\lambda})\mathcal{N}(\lambda;2.4,2.4^{2})$

Table 2: Comparison of different methods and related loss weighting strategies. The $w(\lambda)$ is introduced in Equation 7.
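One way to read the Cosine row of Table 2, on our interpretation, is as the unweighted $\mathbf{v}$ -prediction loss of Equation 2 rewritten in the $\bm{\epsilon}$ -prediction form of Equation 7. From Equation 1 and the definition of $\mathbf{v}_{t}$ , a velocity estimate $\hat{\mathbf{v}}$ implies $\hat{\bm{\epsilon}}=\sigma_{t}\mathbf{x}_{t}+\alpha_{t}\hat{\mathbf{v}}$ , so $\lVert\hat{\mathbf{v}}-\mathbf{v}_{t}\rVert^{2}=(1+e^{-\lambda})\lVert\hat{\bm{\epsilon}}-\bm{\epsilon}\rVert^{2}$ , and the table's ratio $w(\lambda)/p(\lambda)=e^{-\lambda/2}/\text{sech}(\lambda/2)=(1+e^{-\lambda})/2$ matches this up to a constant. The Python sketch below checks the identity on random vectors; the perturbed `v_hat` is a hypothetical stand-in for a network output, and the value of $\lambda$ is arbitrary.

```python
import math
import random

random.seed(0)
lmbda = 1.3  # arbitrary log-SNR for the check
alpha = math.sqrt(1.0 / (1.0 + math.exp(-lmbda)))
sigma = math.sqrt(1.0 / (1.0 + math.exp(lmbda)))

dim = 16
x = [random.gauss(0.0, 1.0) for _ in range(dim)]
eps = [random.gauss(0.0, 1.0) for _ in range(dim)]

# Equation 1 and the velocity target.
x_t = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]
v = [alpha * ei - sigma * xi for xi, ei in zip(x, eps)]

# Hypothetical imperfect velocity prediction.
v_hat = [vi + random.gauss(0.0, 0.1) for vi in v]

# Noise estimate implied by the velocity prediction.
eps_hat = [sigma * xt + alpha * vh for xt, vh in zip(x_t, v_hat)]

v_err = sum((a - b) ** 2 for a, b in zip(v_hat, v))
eps_err = sum((a - b) ** 2 for a, b in zip(eps_hat, eps))

ratio = v_err / eps_err  # should equal 1 + exp(-lambda)
table_ratio = 2.0 * math.exp(-lmbda / 2.0) * math.cosh(lmbda / 2.0)  # 2 * w / p
print(ratio, 1.0 + math.exp(-lmbda), table_ratio)
```

The three printed numbers should coincide up to floating-point error.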

3.2 Comparison with baselines and loss weight designs

This section details the principal findings from our experiments on the ImageNet-256 dataset, focusing on the comparative effectiveness of various noise schedules and loss weightings in the context of CFG values. Table 3 illustrates these comparisons, showcasing the performance of each method in terms of the FID-10K score.

The experiments reveal that our proposed noise schedules, particularly Laplace, achieve the most notable improvements over the traditional cosine schedule, as indicated by the bolded best scores and the blue numbers representing the reductions compared to baseline’s best score of 10.85.

We also provide a comparison with methods that adjust the loss weight, including Min-SNR and Soft-Min-SNR. We find that although these methods achieve better results than the baseline, they are still not as effective as our method of modifying the noise schedule. This indicates that deciding where to allocate more computational resources is more efficient than adjusting the loss weight. Compared with other noise schedules such as EDM Karras et al. (2022) and Flow Matching Lipman et al. (2022), we find that, regardless of the CFG value, our results significantly surpass theirs under the same number of training iterations.

Method	CFG=1.5	CFG=2.0	CFG=3.0
Cosine Nichol & Dhariwal (2021)	17.79	10.85	11.06
EDM Karras et al. (2022)	26.11	15.09	11.56
FM-OT Lipman et al. (2022)	24.49	14.66	11.98
Min-SNR Hang et al. (2023)	16.06	9.70	10.43
Soft-Min-SNR Crowson et al. (2024)	14.89	9.07	10.66
Cosine Shifted Hoogeboom et al. (2023)	19.34	11.67	11.13
Cosine Scaled	12.74	8.04	11.02
Cauchy	12.91	8.14	11.02
Laplace	16.69	9.04	7.96 (-2.89)

Table 3: Comparison of various noise schedules and loss weightings on ImageNet-256, showing the performance (in terms of FID-10K) of different methods under different CFG values. The best results are highlighted in bold, and the blue numbers represent the improvement compared with the baseline FID of 10.85. The line in gray is our suggested noise schedule.

Furthermore, we investigate the convergence speed of these methods, and the results are shown in Figure 2. Adjusting the noise schedule converges faster than adjusting the loss weight. We also notice that the optimal training method may vary with the CFG value used for inference, but adjusting the noise schedule generally yields better results.

3.3 Robustness on different predicting targets

We evaluate the effectiveness of our designed noise schedule across three commonly adopted prediction targets: $\bm{\epsilon}$ , $\mathbf{x}_{0}$ and $\mathbf{v}$ . The results are shown in Table 4.

We observe that, regardless of the prediction target, our proposed Laplace strategy significantly outperforms the Cosine strategy. It is noteworthy that, because the Laplace strategy focuses computation on medium noise levels during training, the extreme noise levels are less well trained, which could potentially affect the overall performance. Therefore, we slightly modify the DDIM inference strategy to start sampling from $t_{\max}=0.99$ .

Prediction Target	Noise Schedule	100K	200K	300K	400K	500K
$\mathbf{x}_{0}$	Cosine	35.20	17.60	13.37	11.84	11.16
	Laplace (Ours)	21.78	10.86	9.44	8.73	8.48
$\mathbf{v}$	Cosine	25.70	14.01	11.78	11.26	11.06
	Laplace (Ours)	18.03	9.37	8.31	8.07	7.96
$\bm{\epsilon}$	Cosine	28.63	15.80	12.49	11.14	10.46
	Laplace (Ours)	27.98	13.92	11.01	10.00	9.53

Table 4: Effectiveness evaluated using the FID-10K score on different prediction targets. The proposed Laplace schedule performs better than the baseline Cosine schedule at every evaluated training iteration.

3.4 Robustness on high resolution images

To explore the robustness of the adjusted noise schedule to different resolutions, we also design experiments on ImageNet-512. As pointed out by Chen (2023), the noising strategy causes more severe signal leakage as the resolution increases. Therefore, we need to adjust the hyperparameters of the noise schedule according to the resolution.

Specifically, the baseline Cosine schedule achieves its best performance when the CFG value equals 3, so we choose this CFG value for inference. Through systematic experimentation, we explored appropriate values for the Laplace schedule’s parameter $b$ , testing within the set {0.5, 0.75, 1.0}, and determined that $b=0.75$ was the most effective, resulting in an FID score of 9.09. This indicates that, despite the need for hyperparameter tuning, adjusting the noise schedule still brings stable performance improvements.

Noise Schedule	Cosine	Laplace
FID-10K	11.91	9.09 (-2.82)

Table 5: FID-10K results on ImageNet-512. All models are trained for 500K iterations.

3.5 Ablation Study

We conduct an ablation study to analyze the impact of hyperparameters on various distributions of $p(\lambda)$ , which are enumerated below.

The Laplace distribution is easy to implement, and we choose its parameters so that the density peaks at the middle timestep. We conduct experiments with different Laplace distribution scales $b\in\{0.25,0.5,1.0,2.0,3.0\}$ . The results are shown in Figure 3. The baseline with the standard cosine schedule achieves an FID score of 17.79 with CFG=1.5, 10.85 with CFG=2.0, and 11.06 with CFG=3.0 after 500K iterations. The model with Laplace distribution scale $b=0.5$ achieves the best performance, 7.96 with CFG=3.0, which is a relative improvement of 26.6% over the best baseline score of 10.85.

The Cauchy distribution is another heavy-tailed distribution that can be used for noise schedule design. The distribution is not symmetric about $\lambda=0$ when the location parameter $\mu$ is nonzero. We conduct experiments with different Cauchy distribution parameters, and the results are shown in Table 6. Cauchy(0, 0.5) means $\frac{1}{\pi}\frac{\gamma}{(\lambda-\mu)^{2}+\gamma^{2}}$ with $\mu=0,\gamma=0.5$ . The model with $\mu=0$ achieves better performance than the other two settings when fixing $\gamma$ to 1. This means that a model with more probability mass around $\lambda=0$ performs better than models biased towards the negative or positive direction.

	Cauchy(0, 0.5)	Cauchy(0, 1)	Cauchy(-1, 1)	Cauchy(1, 1)
CFG=1.5	12.91	14.32	18.12	16.60
CFG=2.0	8.14	8.93	10.38	10.19
CFG=3.0	11.02	11.26	10.81	10.94

Table 6: FID-10k results on ImageNet-256 with different Cauchy distribution parameters.

Cosine Shifted Hoogeboom et al. (2023) is the shifted version of the standard cosine schedule. We evaluate the schedules with both positive and negative $\mu$ values. Shifting with $\mu=1$ achieves FID-10K scores of $\{19.34,11.67,11.13\}$ with CFG $\{1.5,2.0,3.0\}$ . Results with shift $\mu=-1$ are $\{19.30,11.48,11.28\}$ . Both settings perform worse than the baseline cosine schedule ( $\mu=0$ ). Together with the data presented in Table 6, this suggests that concentrating the probability mass around $\lambda=0$ improves the results the most.

Cosine Scaled is also a modification of the Cosine schedule. When $s$ is equal to 1, it reduces to the standard Cosine version. $s>1$ means sampling more heavily around $\lambda=0$ , while $s<1$ means sampling more uniformly over all $\lambda$ . We report the related results in Table 7. Larger values of $s$ ( $s>1$ ) outperform the baseline; however, $s$ should not be excessively large and must remain within a valid range. A model trained with $s=2$ attains a score of 8.04, representing a $\mathbf{25.9\%}$ improvement over the baseline.
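The relative improvements quoted in this section follow directly from the reported FID values. As a quick arithmetic check in Python (the helper name `relative_improvement` is ours), using the best baseline score of 10.85:

```python
def relative_improvement(baseline, new):
    """Relative FID reduction in percent (lower FID is better)."""
    return 100.0 * (baseline - new) / baseline


baseline_fid = 10.85  # standard cosine schedule, best score (CFG=2.0)

# Laplace, b=0.5, CFG=3.0
print(round(relative_improvement(baseline_fid, 7.96), 1))  # 26.6
# Cosine Scaled, s=2, CFG=2.0
print(round(relative_improvement(baseline_fid, 8.04), 1))  # 25.9
```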

$1/s$	1.3	1.1	0.5	0.25
CFG=1.5	39.74	22.60	12.74	15.83
CFG=2.0	23.38	12.98	8.04	8.64
CFG=3.0	13.94	11.16	11.02	8.26

Table 7: FID-10k results on ImageNet-256 with different scales of Cosine Scaled distribution.

4 Related Work

Efficient Diffusion Training

Generally speaking, a diffusion model uses a network with shared parameters to denoise inputs at different noise intensities. However, the different noise levels may introduce conflicts during training, which slows convergence. Min-SNR Hang et al. (2023) seeks the Pareto optimal direction for different tasks and achieves better convergence on different prediction targets. HDiT Crowson et al. (2024) proposes a soft version of Min-SNR to further improve the efficiency of high-resolution image synthesis. Stable Diffusion 3 Esser et al. (2024) puts more weight on the middle timesteps by sampling them from a logit-normal distribution. On the other hand, architecture modification is also explored to improve diffusion training. DiT Peebles & Xie (2023) proposes adaptive Layer Normalization with zero initialization to improve the training of Transformer architectures. A more robust ADM UNet with better training dynamics is proposed in EDM2 Karras et al. (2024) by preserving activation, weight, and update magnitudes.

Noise Schedule Design for Diffusion Models

The design of the noise schedule plays a critical role in training diffusion models. In DDPM, Ho et al. (2020) propose a linear schedule for the noise level, which is later used in Stable Diffusion Rombach et al. (2022) versions 1.5 and 2.0. iDDPM Nichol & Dhariwal (2021) introduces a cosine schedule aimed at bringing the sample with the highest noise level closer to pure Gaussian noise. EDM Karras et al. (2022) proposes a new continuous framework and samples the logarithm of the noise intensity from a Gaussian distribution. Flow matching with optimal transport Lipman et al. (2022); Liu et al. (2022) linearly interpolates the noise and data point as the input of flow-based models. Chen (2023) underscores the need for adapting the noise schedule according to the token length, and several other works Lin et al. (2024); Tang et al. (2023) emphasize that it is important to prevent signal leakage in the final step.

5 Conclusion

In this technical report, we present a novel method for enhancing diffusion model training by redefining the noise schedule. We theoretically analyzed that this approach equates to performing importance sampling on the noise. Empirical results show that our proposed Laplace noise schedule, focusing computational resources on mid-range steps, yields superior performance compared to the adjustment of loss weights under constrained budgets. This study not only contributes significantly to developing efficient training techniques for diffusion models but also offers potential for future large-scale applications.

References

Balaji et al. (2022) Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
Bao et al. (2022) Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. arXiv preprint arXiv:2209.12152, 2022.
Brooks et al. (2024) Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
Chen (2023) Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
Choi et al. (2022) Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11472–11481, 2022.
Crowson et al. (2024) Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Forty-first International Conference on Machine Learning, 2024.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee, 2009.
Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
Gu et al. (2022) Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10696–10706, 2022.
Hang et al. (2023) Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7441–7451, October 2023.
Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
Hoogeboom et al. (2023) Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pp. 13213–13232. PMLR, 2023.
Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7FuTOWMOc7.
Karras et al. (2024) Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proc. CVPR, 2024.
Kingma et al. (2021) Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
Kingma & Gao (2023) Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=NnMEadcdyD.
Lin et al. (2024) Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 5404–5411, 2024.
Lipman et al. (2022) Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2022.
Liu et al. (2022) Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2022.
Nichol & Dhariwal (2021) Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171. PMLR, 2021.
Peebles & Xie (2023) William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=08Yk-n5l2Al.
Salimans & Ho (2022) Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI.
Song et al. (2021) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
Tang et al. (2023) Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023.
Yang et al. (2021) S. Yang, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
Zhao et al. (2024) Xuanlei Zhao, Zhongkai Zhao, Ziming Liu, Haotian Zhou, Qianli Ma, and Yang You. Opendit: An easy, fast and memory-efficient system for dit training and inference. https://github.com/NUS-HPC-AI-Lab/OpenDiT, 2024.

Appendix A: Detailed Implementation for Noise Schedule

We provide a simple PyTorch implementation for the Laplace noise schedule and its application in training. This example can be adapted to other noise schedules, such as the Cauchy distribution, by replacing the laplace_noise_schedule function. The model accepts noisy samples $\mathbf{x}_{t}$ , timestep $t$ , and an optional condition tensor $\mathbf{c}$ as inputs. This implementation supports prediction of $\{\mathbf{x}_{0},\mathbf{v},\bm{\epsilon}\}$ .

```python
import torch


def laplace_noise_schedule(mu=0.0, b=0.5):
    # lambda(t) for the Laplace schedule, refer to Table 1
    lmb = lambda t: mu - b * torch.sign(0.5 - t) * \
        torch.log(1 - 2 * torch.abs(0.5 - t))
    snr_func = lambda t: torch.exp(lmb(t))
    alpha_func = lambda t: torch.sqrt(snr_func(t) / (1 + snr_func(t)))
    sigma_func = lambda t: torch.sqrt(1 / (1 + snr_func(t)))

    return alpha_func, sigma_func


def training_losses(model, x, timestep, condition, noise=None,
                    predict_target="v", mu=0.0, b=0.5):
    if noise is None:
        noise = torch.randn_like(x)

    alpha_func, sigma_func = laplace_noise_schedule(mu, b)
    alphas = alpha_func(timestep)
    sigmas = sigma_func(timestep)

    # add noise to sample
    x_t = alphas.view(-1, 1, 1, 1) * x + sigmas.view(-1, 1, 1, 1) * noise
    # velocity
    v_t = alphas.view(-1, 1, 1, 1) * noise - sigmas.view(-1, 1, 1, 1) * x

    model_output = model(x_t, timestep, condition)
    if predict_target == "v":
        loss = (v_t - model_output) ** 2
    elif predict_target == "x0":
        loss = (x - model_output) ** 2
    else:  # predict_target == "noise"
        loss = (noise - model_output) ** 2

    return loss.mean()
```

Appendix B: Details for Sampling Process

As mentioned before, the choice of noise schedule for sampling is worth exploring. In this paper, we focus on what kind of noise schedule is needed for training. Therefore, we adopt the same inference strategy as the cosine schedule to ensure a fair comparison. Specifically, we first sample $\{t_{0},t_{1},\ldots,t_{s}\}$ from the uniform distribution $\mathcal{U}[0,1]$ , then obtain the corresponding SNRs from the Cosine schedule: $\{\frac{\alpha^{2}_{t_{0}}}{\sigma^{2}_{t_{0}}},\frac{\alpha^{2}_{t_{1}}}{\sigma^{2}_{t_{1}}},\ldots,\frac{\alpha^{2}_{t_{s}}}{\sigma^{2}_{t_{s}}}\}$ . According to Equation 6, we obtain the corresponding $\{t^{\prime}_{0},t^{\prime}_{1},\ldots,t^{\prime}_{s}\}$ by inverting these SNR values through the respective noise schedules. Finally, we use DDIM Song et al. (2021) to sample with the newly calculated $\{t^{\prime}\}$ .
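The remapping described above can be sketched for the Laplace schedule as follows. This is a minimal scalar illustration rather than the code used in the experiments: the function names, the evenly spaced grid, and the lower endpoint `t_min=0.01` are assumptions of this sketch ( $t_{\max}=0.99$ follows Section 3.3), and the mapping from $\lambda$ to $t^{\prime}$ uses $t=1-\text{CDF}(\lambda)$ from Equation 5 with the Laplace CDF.

```python
import math


def cosine_log_snr(t):
    """log SNR of the cosine schedule at timestep t in (0, 1)."""
    return 2.0 * math.log(1.0 / math.tan(math.pi * t / 2.0))


def laplace_t_from_log_snr(lmbda, mu=0.0, b=0.5):
    """Timestep t' at which the Laplace schedule reaches log-SNR lmbda."""
    z = (lmbda - mu) / b
    if z >= 0:
        return 0.5 * math.exp(-z)
    return 1.0 - 0.5 * math.exp(z)


def remap_timesteps(num_steps=50, t_min=0.01, t_max=0.99, mu=0.0, b=0.5):
    """Pairs (t, t') on a descending grid; t' is fed to the Laplace-trained model."""
    step = (t_max - t_min) / (num_steps - 1)
    ts = [t_max - i * step for i in range(num_steps)]
    return [(t, laplace_t_from_log_snr(cosine_log_snr(t), mu, b)) for t in ts]


for t, t_prime in remap_timesteps(num_steps=5):
    print(f"t = {t:.3f} -> t' = {t_prime:.6f}")
```

Each pair shares the same SNR by construction, so the sampler visits the noise levels of the cosine schedule while the model receives the timestep $t^{\prime}$ it associates with that noise level.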