TASR: Timestep-Aware Diffusion Model for Image Super-Resolution

Qinwei Lin^1,†,, Xiaopeng Sun^2,†, Yu Gao^2,†, Yujie Zhong^2,³³3Corresponding Author., Dengjie Li²,
Zheng Zhao², Haoqian Wang^1,³³3Corresponding Author.
¹Tsinghua University, ²Meituan Inc.
Work done during an internship at Meituan.

Abstract

Diffusion models have recently achieved outstanding results in the field of image super-resolution. These methods typically inject low-resolution (LR) images via ControlNet. In this paper, we first explore the temporal dynamics of information infusion through ControlNet, revealing that the input from LR images predominantly influences the initial stages of the denoising process. Leveraging this insight, we introduce a novel timestep-aware diffusion model that adaptively integrates features from both ControlNet and the pre-trained Stable Diffusion (SD). Our method enhances the transmission of LR information in the early stages of diffusion to guarantee image fidelity and stimulates the generation ability of the SD model itself more in the later stages to enhance the detail of generated images. To train this method, we propose a timestep-aware training strategy that adopts distinct losses at varying timesteps and acts on disparate modules. Experiments on benchmark datasets demonstrate the effectiveness of our method. Code: https://github.com/SleepyLin/TASR

²²footnotetext: Equal contribution.

1 Introduction

Image super-resolution (ISR) aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. The effectiveness of methods based on generative adversarial networks (GANs) has been demonstrated in previous works [40, 49, 15, 39, 52]. However, when dealing with severely degraded LR images, the HR images generated by these methods contain numerous visual artifacts and lack realistic details, resulting in low visual quality.

Recently, denoising diffusion probabilistic models (DDPMs) [11, 7, 31, 22, 5, 32, 29, 20, 27, 8] have achieved remarkable performance in the field of image generation, gradually replacing GANs [9] in a series of downstream image generation tasks. Therefore, some works [43, 4, 16, 47, 46, 25, 33, 37, 48, 17, 44] have leveraged large-scale pre-trained diffusion models to solve the ISR tasks. These models are mainly based on the ControlNet [50], which is used to inject the LR image as a condition into the latent feature space of a pre-trained diffusion model. This category of diffusion-based methods typically requires sampling over multiple timesteps during the generation process. Previous works [6, 12, 1, 19, 34] have pointed out that diffusion models primarily generate low-frequency semantic content in the initial stages of the denoising process, while gradually generating high-frequency details in the later stages. However, it remains unclear whether ControlNet exhibits similar patterns of conditional information integration when applied to diffusion-based ISR.

Refer to caption — Figure 1: Effectiveness of ControlNet at different timesteps in the denoising process. (a) Generated HR images at different conditional inference steps. “LR” denotes the given LR images, “20” indicates that the ControlNet features are introduced only in the first 20 steps of the inference process. (b) Analysis of the generated HR images under different conditional inference steps based on PSNR and CLIPIQA [36] metrics.

To this end, we conduct a simple experiment based on DiffBIR [16] to explore the effectiveness of ControlNet at different timesteps in the denoising process. As shown in Fig. 1, as the conditional steps of ControlNet increase, the fidelity of the generated HR images correspondingly improves, as indicated by a gradual increase in PSNR. On the other hand, the visual quality may deteriorate, as indicated by a decrease in the CLIPIQA [36, 26] scores. As shown in the visual examples, introducing ControlNet during the initial inference timesteps significantly improves the structural consistency between the generated HR image and the given LR image. Interestingly, disabling ControlNet during the late stages of inference (e.g., the last 10 timesteps) has minimal impact on the visual fidelity of the generated outputs. In some cases, the image details (e.g., windows on buildings) are even better. The above experiments indicate that ControlNet primarily affects the structural information of the generated HR images in the early stages of the denoising process. However, in the later stages of the denoising process, this constraining effect diminishes and may potentially impede the generation of intricate details.

The above observations inspire us to design a timestep-aware adapter that adaptively integrates ControlNet features with diffusion features at different timesteps. Based on the role of different timesteps, we anticipate that the adapter will emphasize the structural and color information of the image by increasing the weights of ControlNet features in the early stages of denoising. Meanwhile, in the later stages, the adapter is expected to enhance the generation of fine-grained image details by focusing more on the diffusion model features. To achieve this goal, we propose a timestep-aware training strategy to optimize our method, applying L1 loss functions from early timesteps to ensure image fidelity and introducing the CLIPIQA score in the later stages to enhance visual quality. Furthermore, by recognizing the characteristics corresponding to each loss function, different loss functions are used to separately optimize the corresponding modules within the model. Overall, the main contributions are summarized as follows:

$\bullet$

We propose a novel diffusion-based method for image super-resolution, which designs an adapter to dynamically control the feature fusion process between the ControlNet features and the diffusion features. This control is guided by the timesteps in the denoising process, allowing for a more nuanced integration of information.
$\bullet$

We introduce a timestep-aware training strategy that employs distinct loss functions to separately optimize the ControlNet and Adapter modules at different timesteps.
$\bullet$

Experiments on benchmark datasets demonstrate the effectiveness and superiority of our method.

2 Related Work

2.1 Diffusion-based Image Super-Resolution

Diffusion models have shown excellent performance in the field of image generation. Therefore, recent research works [16, 43, 46, 37, 47, 48, 33, 25, 44] utilized powerful pre-trained text-to-image diffusion models [28, 7, 24] as generative priors to tackle image super-resolution tasks. DiffBIR [16] initially proposed a restoration module to remove degradation noise from the LR images, then utilized a pre-trained SD [28] as a generative module. The denoised LR images are used as control signals for ControlNet [50] to generate the final HR image. SeeSR [43] and PASD [46] both proposed a degradation-aware prompt extractor to extract semantic prompts from LR images and use these prompts as auxiliary conditional information to guide the denoising process together with the LR images. SUPIR [47] collected a large-scale dataset as training data and used a more powerful pre-trained text-to-image diffusion model, SDXL [24], as generative prior. It also leveraged a multimodal large language model (MLLM) [18] to extract textual descriptions to guide the generation of HR images. From the perspective of model architecture, most of these works are based on ControlNet [50], which receives LR images as conditions and passes ControlNet features into the pre-trained diffusion model to guide the generation of corresponding HR images. Whereas, these methods do not take into account the role of ControlNet in the generation of HR images at different timesteps. Therefore, how to enhance the visual quality of generated images from a temporal perspective based on the ControlNet architecture is the main objective of this paper.

2.2 Temporal Analysis of Diffusion Model

During inference, diffusion models start from random noise and generate corresponding images through multiple steps of sampling and denoising. Recently, some studies [3, 6, 12, 1, 19, 34] have found that at different timesteps, diffusion models focus on different aspects during the denoising process: in the early stages of denoising, model primarily generates low-frequency information, such as the semantics and structure of the image, while in the later stages, model tends to generate high-frequency information, such as the edges and details of the image. ELLA [12] proposed a timestep-aware semantic connector that dynamically extracted control information of different frequencies from LLM text features at various denoising stages. T-GATE [19] investigated the role of attention in the denoising process from a temporal perspective and improved computational efficiency by caching and reusing attention operations at different timesteps during inference. MATTE [1] decomposed multiple attributes (e.g., color, object, layout, style) of the reference image from both the temporal dimension and the network layer dimension, injecting layout and color attributes at different timesteps during the denoising process to achieve attribute-guided image synthesis. Inspired by these works, we analyze the role of ControlNet from the temporal dimension and propose a timestep-aware adapter between ControlNet and the pre-trained diffusion model to adaptively fuse ControlNet features with diffusion features.

3 Method

3.1 Overview

In this work, we propose a timestep-aware SR (TASR) method aimed at improving the quality of generated HR images by employing adaptively feature fusion between ControlNet and the diffusion model. As shown in Fig. 2, our proposed TASR mainly consists of a pre-trained SD model, corresponding ControlNet, and a timestep-aware adapter. The parameters of the pre-trained SD model are frozen during the entire training stage. The VAE [28] encoded LR image features are used as the condition input for ControlNet. Meanwhile, we use the RAM [53] to extract text prompts from the given LR images and obtain the text features encoded by the frozen CLIP text encoder [28]. These CLIP text features are fed into the model as additional semantic prompts. In addition, the designed timestep-aware Adapter is inserted between ControlNet and the pre-trained SD UNet decoder. The specific design of each module is detailed in Sec. 3.2. The entire training process is divided into two stages to ensure the effectiveness and stability of ControlNet and the adapter during training. In the first stage, we optimize only the ControlNet parameters using the SR training dataset, thereby ensuring the effectiveness of ControlNet features. In the second stage, based on the patterns observed in the denoising process, we design a timestep-aware training strategy to optimize ControlNet and the adapter separately with different loss functions, as described in Sec. 3.3.

3.2 TASR

ControlNet. Similar to previous work [16, 46, 43], we utilize the powerful pre-trained large-scale text-to-image SD model as the image generation module in our model and employ the ControlNet to inject the VAE-encoded LR image feature as an additional condition into the decoder of SD to generate the corresponding HR image. Following [50], ControlNet is constructed by creating trainable copies of the pre-trained U-Net encoder and middle blocks, and injecting the conditional ControlNet features into the pre-trained U-Net decoder blocks via zero convolution layer. To fully leverage the guidance of text prompts on image generation in SD, we utilize a pre-trained image tagging model [53] to extract semantic prompt information from LR images. These prompts are encoded into text features $\boldsymbol{c}$ with the frozen CLIP text encoder and injected into the SD model to guide the denoising process.

Timestep-Aware Adapter Based on the empirical observations in Fig. 1, injecting ControlNet conditional features during the early stages of the denoising process, i.e., semantic-planning phase [19], can improve the structural consistency of the generated HR images. However, in the later stages of denoising, i.e., the fidelity-improving phase [19], the role of ControlNet features weakens and may even negatively impact the generation of HR image details. These observations inspire us to evaluate the role of ControlNet at different timesteps and how to inject ControlNet features into SD in a timestep-aware manner. To this end, we design a timestep-aware adapter that predicts a control weight map based on the current timestep to achieve the dynamic feature fusion of ControlNet and SD. The specific structure of the timestep-aware adapter is illustrated in Fig. 2 (right). Each adapter consists of two stacked convolutional layers with ReLU layers and normalization layers. The timesteps are injected through the AdaLN [12, 22, 23] layer, and finally, the adapter outputs the control weight map via a sigmoid function. The adapter takes the U-Net feature $\boldsymbol{f_{d}}$ from the previous layer, the skip connection feature $\boldsymbol{f}_{cond}$ from ControlNet and the timestep $t$ as inputs to predict the corresponding control weight $\boldsymbol{\alpha}$ for dynamically injecting the ControlNet feature by $\boldsymbol{f_{d}}+\boldsymbol{f}_{cond}*\boldsymbol{\alpha}$ .

3.3 Optimization Objective

Stage I. To ensure the control effectiveness of the ControlNet and improve training stability, we train only the ControlNet in the first stage. During training, the HR image $\mathbf{I}_{HR}$ and the LR image $\mathbf{I}_{LR}$ are encoded by pre-trained VAE encoder into latent representations $\boldsymbol{z}_{0}$ and $\boldsymbol{z}_{lr}$ , respectively. In addition, we leverage a tag model [43] to extract text prompts from the LR images and the pre-trained CLIP text encoder to obtain the text features $\boldsymbol{c}$ . The diffusion process [11] generates the noisy latent $\boldsymbol{z}_{t}$ by adding Gaussian noise with variance $\beta_{t}\in(0,1)$ at timestep $t$ to $\boldsymbol{z}_{0}$ :

\boldsymbol{z}_{t}=\sqrt{\bar{\alpha}_{t}}\boldsymbol{z}_{0}+\sqrt{1-\bar{% \alpha}_{t}}\epsilon,\quad\epsilon\sim\mathcal{N}(\boldsymbol{0},\mathbf{I}),

(1)

where $\epsilon$ represents a noise map sampled from a normal Gaussian distribution, $\alpha_{t}=1-\beta_{t}$ , and $\bar{\alpha}_{t}=\prod^{t}_{s=1}\alpha_{s}$ .

The ControlNet is initialized as a trainable copy of the pre-trained UNet encoder and middle block, and $\boldsymbol{z}_{t}$ and $\boldsymbol{z}_{lr}$ are concatenated together as the input to the ControlNet. Given the diffusion timestep $\boldsymbol{t}$ , LR latent $\boldsymbol{z}_{l}$ , noisy latent $\boldsymbol{z}_{t}$ , text features $\boldsymbol{c}$ , we trained our model $\epsilon_{\theta}$ using denoising loss $\mathcal{L}_{d}$ to predict the noise added to $\boldsymbol{z}_{t}$ , as follows:

\mathcal{L}_{d}=\mathbb{E}_{\boldsymbol{z}_{l},\boldsymbol{z}_{t},\boldsymbol{% c},\boldsymbol{t},\epsilon\sim\mathcal{N}(0,1)}\left[\left\|\epsilon-\epsilon_% {\theta}(\boldsymbol{z}_{t},\boldsymbol{t},\boldsymbol{c},\boldsymbol{z}_{l})% \right\|\right].

(2)

Stage II. After the first training stage, ControlNet can learn the correct conditional information from the LR image. However, in the second stage, an obvious question arises: how to properly train the adapter to weigh the information from ControlNet in a timestep-aware manner based on the pattern of the diffusion model in the denoising process. A naive approach is to optimize the adapter using the same training data and the denoising loss as in the previous stage. However, since the weights of ControlNet have already been trained in the first stage, the adapter will infinitely approach an identity function under the same training data and denoising loss function.

As observed in Fig. 1, during the early stages of denoising, the model tends to learn image structures and other information from the control information, i.e., $\boldsymbol{z}_{lr}$ , while in the later stages of denoising, it focuses on generating high-frequency image details. Therefore, we propose a timestep-aware training strategy that introduces different loss functions based on the contribution of different stages of the denoising process to guide the image generation process. Specifically, in addition to the denoising loss $\mathcal{L}_{d}$ , we introduce L1 Loss $\mathcal{L}_{1}$ to supervise ControlNet from the early stages of denoising (i.e., $0\leq\boldsymbol{t}\leq 800$ ):

	$\displaystyle\mathcal{L}_{1}$	$\displaystyle=\left\\|\mathbf{I}_{HR}-\mathbf{\hat{I}}\right\\|,$		(3)
	$\displaystyle\text{where}\quad\mathbf{\hat{I}}$	$\displaystyle=\mathbb{D}\left(\frac{x_{t}-\sqrt{1-\alpha_{t}}\epsilon_{\theta}% (\boldsymbol{z}_{t},\boldsymbol{t},\boldsymbol{c},\boldsymbol{z}_{l})}{\sqrt{% \alpha}_{t}}\right),$		(4)

$\mathbb{D}$ denotes the pre-trained VAE decoder.

In the later stages of denoising (i.e., $0\leq\boldsymbol{t}\leq 200$ ), we use the non-reference metric CLIP-IQA [36] to evaluate the visual quality of the generated HR images $\mathbf{\hat{I}}$ and use its results as a perception reward to encourage the model to improve the image quality of the generated results. The specific detailed reward loss is as follows:

\mathcal{L}_{clipiqa}=1-\mathbb{R}(\mathbf{\hat{I}}),

(5)

where $\mathbb{R}$ denotes the CLIP-IQA model.

Without introducing the adapter, directly using the $\mathcal{L}{1}$ and $\mathcal{L}_{clipiqa}$ to optimize the ControlNet is another option. We conduct experiments on this scheme, and the experimental results show that the generated images tend to exhibit specific styles, as shown in Fig. 3 (a). We found that this is due to the reward hacking [30] caused by $\mathcal{L}_{clipiqa}$ , resulting in a high CLIPIQA score but poor visual quality. As shown in Fig. 3, ControlNet has a larger optimization space compared to the proposed adapter, whose optimization space is constrained by the sigmoid function. When directly using CLIPIQA as the perceptual reward to train ControlNet, the model can easily fall into the local optimum that aligns with the preference of CLIPIQA. The Timestep-Aware Adapter guides image generation by predicting the control weight map $\boldsymbol{\alpha}$ of ControlNet features, with the $\boldsymbol{\alpha}$ values constrained between 0 and 1. If only the adapter is optimized, its optimization space is smaller, and the reward hacking point mentioned above falls outside this space. Therefore, we only utilize $\mathcal{L}_{clipiqa}$ to optimize the parameters of the adapter, thus avoiding perception reward traps. However, compared to perception reward loss $\mathcal{L}_{clipiqa}$ , the L1 Loss $\mathcal{L}_{1}$ is applied to measure the absolute error between images in a pixel-wise manner, representing the structural differences between images (e.g. color distribution). Thus, using L1 Loss $\mathcal{L}_{1}$ to optimize the ControlNet with large parameter space can achieve better fitting results. To this end, we apply the L1 Loss $\mathcal{L}_{1}$ to optimize ControlNet parameters, while utilizing the perception reward $\mathcal{L}_{clipiqa}$ to optimize the timestep-aware adapter parameters. Additionally, to enhance the stability of the training process, we adopt an alternating training approach inspired by GANs [49, 40], optimizing the parameters of ControlNet and the adapter in alternate iteration steps to prevent interference between the two modules and allows each to learn more effectively. Specifically, we fix the parameters of one module while updating the other, alternately training ControlNet and adapter. The final loss function is as follows:

\mathcal{L}=\begin{cases}\mathcal{L}_{d},&\text{if }\begin{aligned} t&\in[t_{1% },1000]\end{aligned}\\[5.0pt] \mathcal{L}_{d}+\lambda_{1}\mathcal{L}_{1},&\text{if }\begin{aligned} t&\in[t_% {clipiqa},t_{1}]\end{aligned}\\[5.0pt] \mathcal{L}_{d}+\lambda_{1}\mathcal{L}_{1}+\lambda_{clipiqa}\mathcal{L}_{% clipiqa},&\text{if }\begin{aligned} t&\in[0,t_{clipiqa}]\end{aligned}\end{cases}

(6)

where $\lambda_{1}$ and $\lambda_{clipiqa}$ are hyperparameters, both set to 0.01. The values of $t_{1}$ and $t_{clipiqa}$ are set to 800 and 200, respectively. Further discussions regarding the selection of these two timesteps are provided in the Appendix.

4 Experiments

Datasets	Metrics	GAN-based methods			Diffusion-based methods
Datasets	Metrics	RealESRGAN	BSRGAN	SwinIR	SeeSR	PASD	ResShift	DiffBIR	SUPIR	Ours
DIV2K-val	$\text{PSNR}\uparrow$	22.34	22.84	23.29	21.53	21.46	22.02	21.24	20.61	20.92
	$\text{SSIM}\uparrow$	0.5768	0.5969	0.6083	0.5356	0.5321	0.5485	0.5168	0.4717	0.5174
	$\text{LPIPS}\downarrow$	0.3370	0.4851	0.4951	0.3311	0.4408	0.3644	0.3691	0.4203	0.3762
	$\text{MANIQA}\uparrow$	0.3892	0.2500	0.2306	0.5094	0.3558	0.3540	0.4573	0.4861	0.6007
	$\text{MUSIQ}\uparrow$	57.50	37.71	31.32	67.62	52.63	55.56	67.43	54.97	68.14
	$\text{CLIPIQA}\uparrow$	0.5365	0.2869	0.3110	0.6987	0.4917	0.5479	0.7110	0.6374	0.7681
RealSR	$\text{PSNR}\uparrow$	25.69	27.27	27.42	25.18	26.56	26.42	25.22	23.74	23.79
	$\text{SSIM}\uparrow$	0.7618	0.7983	0.7999	0.7200	0.7614	0.7569	0.7028	0.6631	0.6650
	$\text{LPIPS}\downarrow$	0.2172	0.2312	0.2440	0.2354	0.2217	0.2385	0.2577	0.2871	0.2986
	$\text{MANIQA}\uparrow$	0.3743	0.3269	0.2816	0.5427	0.3875	0.3969	0.4583	0.5025	0.6113
	$\text{MUSIQ}\uparrow$	60.17	53.12	45.22	69.78	59.10	60.18	66.62	61.56	69.94
	$\text{CLIPIQA}\uparrow$	0.4444	0.2952	0.3166	0.6611	0.4829	0.5563	0.6700	0.6565	0.7076
DRealSR	$\text{PSNR}\uparrow$	28.64	29.91	30.36	28.17	29.05	28.78	26.87	25.00	27.25
	$\text{SSIM}\uparrow$	0.8052	0.8394	0.8496	0.7674	0.793	0.7878	0.7116	0.6416	0.7381
	$\text{LPIPS}\downarrow$	0.2121	0.2569	0.2571	0.2346	0.2371	0.2508	0.3045	0.3340	0.2869
	$\text{MANIQA}\uparrow$	0.3448	0.2771	0.2621	0.5146	0.3753	0.3517	0.4548	0.4980	0.5551
	$\text{MUSIQ}\uparrow$	54.17	41.76	36.48	64.93	52.60	52.49	63.34	59.66	65.23
	$\text{CLIPIQA}\uparrow$	0.4418	0.3145	0.3460	0.6810	0.5035	0.5429	0.6680	0.6791	0.7155

Table 1: Quantitative comparison results on both synthetic and real-world benchmark datasets. Red and blue represent the best and the second best performance, respectively.

\downarrow

represents the smaller values are better, while

\uparrow

represents the larger values are better.

4.1 Dataset and Evaluation Metric

Datasets. Following [43, 46, 16], we train our method on DIV2K [2], DIV8K [10], Flickr2K [35], OST [38], and the first 10,000 face images from FFHQ [13], and use the degradation pipeline of RealESR-GAN [40] to generate HR-LR image pairs for training. We evaluate the performance of our method on both synthetic and real-world datasets. The synthetic dataset is generated from the DIV2K validation set, where we use the same degradation pipeline in the training process to randomly crop 3K image patches to $128\times 128$ as LR images. For the real-world test datasets, we evaluate the DrealSR [42] and RealSR [42] datasets, where each image is center-cropped to obtain LR images of $128\times 128$ resolution. The resolution of each HR image in both the training and test sets is $512\times 512$ .

Metrics. We perform a comprehensive and effective quantitative evaluation of ISR methods using a series of widely used reference and non-reference metrics. Among the reference-based metrics, PSNR and SSIM [41] (calculated on the Y channel in the YCbCr space) are fidelity metrics, while LPIPS [51] are quality assessment metrics. MANIQA [45], MUSIQ [45], and CLIPIQA [36] are non-reference image quality assessment (IQA) metrics.

4.2 Implementation Details

We use the pre-trained SD-v2.1 [28] model as the base SD model. In the first stage of training, we fine-tune the ControlNet module for 20K iterations, and in the second stage, we fine-tune both the IRControlNet and Adapter for 100K iterations. During training, we use the AdamW [14] optimizer with a weight decay of 1e-2, a batch size of 32, and a learning rate of 1e-5. All experiments are conducted at a resolution of 512×512 on 8 NVIDIA A100 GPUs. During inference, we employ a classifier-free guidance strategy, generating higher-quality images through negative prompts without additional training. The guidance scale for classifier-free guidance is set to 4.5, and we use a spaced DDPM sampling schedule [21] with 20 timesteps.

4.3 Comparison with State-of-the-art Methods

To validate the effectiveness of our method, we compared it with several state-of-the-art GAN-based and diffusion-based ISR methods, namely RealESRGAN [40], BSRGAN [49], SwinIR [15], SeeSR [43], PASD [46], ResShift [48], DiffBIR [16], SUPIR [47].

Quantitative Comparisons. Tab. 1 presents the quantitative comparison on both synthetic and real-world test datasets. As observed, our method significantly outperforms other state-of-the-art (SOTA) methods on non-reference metrics such as MANIQA, MUSIQ, and CLIPIQA across all datasets, generating higher-quality HR images. For example, on the DRealSR dataset, TASR outperforms the second-best method, SeeSR by 7.8%, 0.4%, and 5.0% on MANIQA, MUSIQ, and CLIPIQA metrics, respectively. Furthermore, GAN-based methods surpass diffusion-based methods on reference metrics such as PSNR and SSIM. We attribute this phenomenon to the fact that diffusion-based methods leverage powerful generative priors to generate details that are more perceptually realistic to humans, but this comes at the expense of the fidelity to the LR images, as noted in previous works [43, 46].

Qualitative Comparisons. In Fig. 4, we present some visual comparison results on both synthetic and real-world test datasets. As observed, GAN-based approaches such as Real-ESRGAN and SwinIR fail to generate fine image details compared to diffusion-based methods, and the resulting HR images still exhibit a certain degree of degradation. Meanwhile, our method outperforms other diffusion-based methods in terms of image structural fidelity and detail richness. As shown in Fig. 4 (a,d,f), the HR images generated by other diffusion-based methods still contain a certain degree of blurring and artifacts in the complex region. In contrast, our method effectively removes these degradations and generates more refined image details, such as the realistic ear of wheat, the edge details of the building, and clearer urban landscapes. Furthermore, compared to other methods, our approach generates image details with more accurate semantics. As shown in Fig. 4 (b), RealESR-GAN, SeeSR, and SUPIR all fail to generate accurate animal hair textures. In Fig. 4 (e), SUPIR mistakenly generates the black spots on the feathers as eyes. In contrast, employing the proposed timestep-aware training strategy, our method can generate HR images with richer details while maintaining structural consistency with the LR images.

4.4 Ablation Study

To validate the effectiveness of our proposed method, we conduct experiments on the DIV2K-val test dataset.

Model Architecture. We employ various architectures to validate the effectiveness of our proposed model structure for the adapter. Firstly, we removed all time-adaptive normalization layers from the adapter and directly utilized the U-Net features $\boldsymbol{f}_{d}$ and the skip connection features $\boldsymbol{f}_{cond}$ to predict the control weight map $\boldsymbol{\alpha}$ . This modification is denoted as ‘w/o timestep’. As shown in Tab. 2, by incorporating timesteps into the adapter, our method achieves better performance in all the metrics compared to the variant without timesteps. In addition, we replace our adapter structure with a transformer-based architecture similar to ELLA [12], denoted as ‘Transformer’. The variant with the Transformer block has declined in the non-reference metrics.

Training for Different Module. Firstly, we remove the timestep-aware adapter and optimize only the parameters of ControlNet during the training process, denoted as ‘ControlNet’. As shown in Tab. 3 and Fig. 5, when the limitation of the adapter on the optimization space is removed, the method is prone to reward hacking during training. Although there is a significant improvement in perceptual metrics such as CLIPIQA and MUSIQ, the generated images tend to align with the preferences of these metrics rather than human perception. This can result in images that are not sufficiently realistic and lack fidelity, such as the strange feathers and the human eye in the sky background in Fig. 5 (a). Similarly, we add the proposed adapter but optimize only the parameters of the adapter during the training process, denoted as ‘Adapter’. Our method shows a better trade-off between the reference and non-reference metrics. When only the Adapter is optimized, the parameter space available for optimization is constrained, resulting in a 6.3% and 2.0% decrease in the MANIQA, and MUSIQ metrics, respectively. As can be seen from Fig. 5, compared to our method, the model obtained by optimizing only the Adapter struggles to generate more refined feathers.

Loss Function. We begin by evaluating the impact of the chosen loss functions on the results. As shown in Tab. 4, when only $\mathcal{L}_{clipiqa}$ is added as the loss function (row 2), the no-reference image quality assessment metrics MANIQA improves by 8.3% compared to using only the denoising loss (row 1). However, reference metrics such as PSNR and SSIM decrease by 4.3% and 10.6% respectively, indicating lower fidelity in the generated HR images. Similarly, when only $\mathcal{L}_{1}$ is added as the loss function (row 3), the reference metrics improve while the non-reference metrics correspondingly decline. In contrast, our proposed training strategy, applying $\mathcal{L}_{1}$ and $\mathcal{L}_{clipiqa}$ , allows us to enhance image perceptual quality while ensuring structural consistency and fidelity. The visualization in Fig. 6 confirms this as well. In the initial stages of denoising, the adapter encourages the integration of ControlNet features by increasing the control weight map. Subsequently, it progressively reduces these control weights to suppress ControlNet constraints, thereby guiding the generation of high-frequency details to enhance the visual quality of results.

It is noteworthy that when applying $\mathcal{L}_{clipiqa}$ , not only does the CLIPIQA metric improve, but all no-reference image quality assessment metrics show an increase. We believe that adopting a better perceptual quality as a reward signal could further enhance the results, and this lies beyond the scope of our current research.

Timestep-Aware training Strategy. We conducted further experiments to explore the impact of using different reward functions at various time steps. Specifically, we first added both $\mathcal{L}_{1}$ and $\mathcal{L}_{clipiqa}$ across all time steps (0-1000 steps) in the training process, denoted as ”all steps”. As shown in Tab. 5 and Fig. 7 (a), when image rewards are introduced at all time steps, the model becomes unstable during training and fails to generate the correct HR image. Consequently, all reference and non-reference metrics significantly decrease. Similarly, when we applied the inverse training strategy that introduces $\mathcal{L}_{clipiqa}$ in the early denoising stages (0-800 steps) and $\mathcal{L}_{1}$ in the later stages (0-200 steps), the model also suffered from instability during training, resulting in lower quality high-resolution images. These experimental results demonstrate that our training strategy is crucial for achieving optimal outcomes.

Exp	$\text{PSNR}\uparrow$	$\text{SSIM}\uparrow$	$\text{LPIPS}\downarrow$	$\text{MANIQA}\uparrow$	$\text{MUSIQ}\uparrow$	$\text{CLIPIQA}\uparrow$
w/o timestep	20.66	0.4969	0.3965	0.5856	67.38	0.7668
Transformer	21.09	0.5235	0.3619	0.5790	67.64	0.7583
Ours	20.92	0.5174	0.3762	0.6007	68.14	0.7681

Table 2: Ablations of Model Architecture.

exps	$\text{PSNR}\uparrow$	$\text{SSIM}\uparrow$	$\text{LPIPS}\downarrow$	$\text{MANIQA}\uparrow$	$\text{MUSIQ}\uparrow$	$\text{CLIPIQA}\uparrow$
ControlNet	19.12	0.4344	0.4632	0.6876	74.90	0.9345
Adapter	21.42	0.5342	0.3561	0.5628	66.75	0.7501
Ours	20.92	0.5174	0.3762	0.6007	68.14	0.7681

Table 3: Ablations of Training for Different Module.

$\mathcal{L}_{1}$	$\mathcal{L}_{clipiqa}$	$\text{PSNR}\uparrow$	$\text{SSIM}\uparrow$	$\text{LPIPS}\downarrow$	$\text{MANIQA}\uparrow$	$\text{MUSIQ}\uparrow$	$\text{CLIPIQA}\uparrow$
✗	✗	20.88	0.5117	0.3639	0.5781	67.63	0.7566
✗	✓	19.98	0.4570	0.4110	0.6265	69.73	0.7687
✓	✗	21.02	0.5224	0.3650	0.5586	67.12	0.7366
✓	✓	20.92	0.5174	0.3762	0.6007	68.14	0.7681

Table 4: Ablations of Loss Function.

exps	$\text{PSNR}\uparrow$	$\text{SSIM}\uparrow$	$\text{LPIPS}\downarrow$	$\text{MANIQA}\uparrow$	$\text{MUSIQ}\uparrow$	$\text{CLIPIQA}\uparrow$
all steps	17.97	0.4079	0.6299	0.5302	65.12	0.6790
inverse steps	18.80	0.4397	0.5830	0.5863	67.48	0.7508
Ours	20.92	0.5174	0.3762	0.6007	68.14	0.7681

Table 5: Ablations of Timestep-Aware training Strategy.

5 Conclusion

In this paper, we proposed a timestep-aware image super-resolution method that introduces a timestep-aware adapter to dynamically integrate ControlNet and diffusion features. In addition, we designed a timestep-aware training strategy to train each module separately based on the generative pattern of the denoising process. Extensive experiments on benchmark datasets demonstrate the effectiveness of our method compared with the state-of-the-art methods.

References

Agarwal et al. [2023] Aishwarya Agarwal, Srikrishna Karanam, Tripti Shukla, and Balaji Vasan Srinivasan. An image is worth multiple words: Multi-attribute inversion for constrained text-to-image synthesis. arXiv preprint arXiv:2311.11919, 2023.
Agustsson and Timofte [2017] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 126–135, 2017.
Balaji et al. [2022] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
Chen et al. [2024] Haolan Chen, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, and Wei Hu. Cassr: Activating image power for real-world image super-resolution. arXiv preprint arXiv:2403.11451, 2024.
Chen et al. [2023] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
Choi et al. [2022] Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models. In IEEE Conf. Comput. Vis. Pattern Recog., pages 11472–11481, 2022.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
Esser et al. [2024] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
Gu et al. [2019] Shuhang Gu, Andreas Lugmayr, Martin Danelljan, Manuel Fritsche, Julien Lamour, and Radu Timofte. Div8k: Diverse 8k resolution image dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3512–3516. IEEE, 2019.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Adv. Neural Inform. Process. Syst., 33:6840–6851, 2020.
Hu et al. [2024] Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.
Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019.
Kinga et al. [2015] D Kinga, Jimmy Ba Adam, et al. A method for stochastic optimization. In Int. Conf. Learn. Represent., page 6. San Diego, California;, 2015.
Liang et al. [2021] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Int. Conf. Comput. Vis., pages 1833–1844, 2021.
Lin et al. [2024] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, and Chao Dong. Diffbir: Towards blind image restoration with generative diffusion prior, 2024.
Liu et al. [2022] Anran Liu, Yihao Liu, Jinjin Gu, Yu Qiao, and Chao Dong. Blind image super-resolution: A survey and beyond. IEEE Trans. Pattern Anal. Mach. Intell., 45(5):5461–5480, 2022.
Liu et al. [2024a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024a.
Liu et al. [2024b] Haozhe Liu, Wentian Zhang, Jinheng Xie, Francesco Faccio, Mengmeng Xu, Tao Xiang, Mike Zheng Shou, Juan-Manuel Perez-Rua, and Jürgen Schmidhuber. Faster diffusion via temporal attention decomposition. arXiv e-prints, pages arXiv–2404, 2024b.
Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. pages 8162–8171. PMLR, 2021.
Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Int. Conf. Comput. Vis., pages 4195–4205, 2023.
Perez et al. [2018] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. 2018.
Podell et al. [2023] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
Qi et al. [2023] Chenyang Qi, Zhengzhong Tu, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Qifeng Chen, and Hossein Talebi. Tip: Text-driven image processing with semantic and restoration instructions. arXiv:2312.11595, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Rombach et al. [2022a] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10684–10695, 2022a.
Rombach et al. [2022b] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10684–10695, 2022b.
Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inform. Process. Syst., 35:36479–36494, 2022.
Skalse et al. [2022] Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Adv. Neural Inform. Process. Syst., 35:9460–9471, 2022.
Song et al. [2020a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020a.
Song et al. [2020b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
Sun et al. [2024] Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, and Yujiu Yang. Coser: Bridging image and language for cognitive super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., pages 25868–25878, 2024.
Sun et al. [2023] Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Hongwei Yong, and Lei Zhang. Improving the stability of diffusion models for content consistent super-resolution. arXiv e-prints, pages arXiv–2401, 2023.
Timofte et al. [2017] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 114–125, 2017.
Wang et al. [2023] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In AAAI, 2023.
Wang et al. [2024] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. 2024.
Wang et al. [2018a] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In IEEE Conf. Comput. Vis. Pattern Recog., pages 606–615, 2018a.
Wang et al. [2018b] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018b.
Wang et al. [2021] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Int. Conf. Comput. Vis., pages 1905–1914, 2021.
Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004.
Wei et al. [2020] Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qixiang Ye, Wangmeng Zuo, and Liang Lin. Component divide-and-conquer for real-world image super-resolution. In Eur. Conf. Comput. Vis., pages 101–117. Springer, 2020.
Wu et al. [2024] Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. Seesr: Towards semantics-aware real-world image super-resolution. In IEEE Conf. Comput. Vis. Pattern Recog., pages 25456–25467, 2024.
Xie et al. [2024] Rui Xie, Ying Tai, Chen Zhao, Kai Zhang, Zhenyu Zhang, Jun Zhou, Xiaoqian Ye, Qian Wang, and Jian Yang. Addsr: Accelerating diffusion-based blind super-resolution with adversarial diffusion distillation. arXiv preprint arXiv:2404.01717, 2024.
Yang et al. [2022] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1191–1200, 2022.
Yang et al. [2025] Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. In Eur. Conf. Comput. Vis., pages 74–91. Springer, 2025.
Yu et al. [2024] Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In IEEE Conf. Comput. Vis. Pattern Recog., pages 25669–25680, 2024.
Yue et al. [2024] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super-resolution by residual shifting. Adv. Neural Inform. Process. Syst., 36, 2024.
Zhang et al. [2021] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Int. Conf. Comput. Vis., pages 4791–4800, 2021.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Int. Conf. Comput. Vis., pages 3836–3847, 2023.
Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conf. Comput. Vis. Pattern Recog., pages 586–595, 2018.
Zhang et al. [2019] Wenlong Zhang, Yihao Liu, Chao Dong, and Yu Qiao. Ranksrgan: Generative adversarial networks with ranker for image super-resolution. In Int. Conf. Comput. Vis., pages 3096–3105, 2019.
Zhang et al. [2024] Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize anything: A strong image tagging model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1724–1732, 2024.