Feature Denoising Diffusion Model For Blind Image Quality Assessment
Xudong Li¹, Jingyuan Zheng², Runze Hu³, Yan Zhang¹,*, Ke Li⁴, Yunhang Shen⁴, Xiawu Zheng¹, Yutao Liu⁵, ShengChuan Zhang¹, Pingyang Dai¹, Rongrong Ji¹
¹ Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
² School of Medicine, Xiamen University
³ School of Information and Electronics, Beijing Institute of Technology
⁴ Tencent Youtu Lab
⁵ School of Computer Science and Technology, Ocean University of China
{lxd761050753, jyzheng0606, bzhy986, hrzlpk2015, shenyunhang01, tristanli.sh}@gmail.com,
liuyutao@ouc.edu.cn,{zhengxiawu, zsc 2016, pydai, rrji}@xmu.edu.cn
arXiv:2401.11949v1 [cs.CV] 22 Jan 2024
[Figure 2 panels: (a) Perceptual Prior Discovery and Aggregation; (b) Perceptual Prior-based Diffusion Refinement]
Figure 2: The overview of PFD-IQA, which consists of a teacher model used for creating pseudo-labels and a student model equipped with PDA and PDR modules. Specifically, we first learn perceptual priors (Sec. 3.2) through a random mask reconstruction process. We then use these priors to aggregate text information as the condition that guides the feature-denoising process of the diffusion model and refines the features (Sec. 3.3).
images. To further extend diffusion models (DMs) into mainstream computer vision tasks, latent representation learning methods based on DMs have been proposed, including DiffusionDet for object detection [Chen et al., 2023] and SegDiff for segmentation [Amit et al., 2021]. However, diffusion models are seldom used for specific feature denoising. In this study, we treat the feature optimization process in IQA as an inverse denoising approximation and iteratively use diffusion models to enhance representations for accurate quality awareness. To the best of our knowledge, ours is the first work to introduce diffusion models into IQA for feature denoising.

3 Methodology
In the context of BIQA, we introduce common notations. Bold formatting is used to denote vectors (e.g., x, y), matrices (e.g., X, Y), and tensors. The training data consists of $\mathcal{D} = \{x, y_g, y_d, y_q\}$, where $x$ is the labeled image with ground-truth score $y_g$, and $y_d$ and $y_q$ are the distortion type and quality level pseudo-labels associated with the input image, respectively. Additionally, image embeddings are denoted $F$ and textual embeddings $G$. The probability distribution of the network logits is denoted $p$.

3.1 Overview
This paper introduces the Perceptual Feature Diffusion model for Image Quality Assessment (PFD-IQA), which progressively refines quality-aware features. As depicted in Fig. 2, PFD-IQA integrates two main components: a Perceptual Prior Discovery and Aggregation module (PDA) and a Perceptual Prior-based Diffusion Refinement module (PDR). Initially, PFD-IQA feeds the given image $x$ into a Vision Transformer (ViT) encoder [Dosovitskiy et al., 2021] to obtain a feature representation $F^s$. Under the supervision of pseudo-labels for distortion types and quality levels, we use the PDA module to discover potential distortion priors $\hat{F}^d$ and perceptual priors $\hat{F}^q$, which are then used to adaptively aggregate perceptual text embeddings as conditions for the diffusion process (Sec. 3.2). Next, in the PDR module, these prior features are used to modulate $F^s$ for feature enhancement, yielding $\hat{F}^h$. An adaptive noise matching module $\epsilon$ then matches $\hat{F}^h$ to a predefined noise level, giving $\hat{F}_t$, and a lightweight feature denoising module progressively denoises it under the guidance of the perceptual text embeddings (Sec. 3.3). After the PDR module, a single transformer decoder layer further interprets the denoised features to predict the final quality score [Qin et al., 2023]. It is important to emphasize that pseudo-labels are only used for training.

3.2 Perceptual Prior Discovery and Aggregation
Considering the intricate nature of image distortions in the real world, the evaluation of image quality necessitates discriminative representations that can distinguish different types of distortions [Zhang et al., 2022] as well as the degrees of degradation. To achieve this, we introduce an auxiliary distortion-type classification task, which is designed to refine the differentiation among diverse distortion types and thereby provide nuanced information. Additionally, a quality-level classification task is employed to offer a generalized classification that compensates for the uncertainty and error inherent in predicting absolute image quality scores.
Perceptual Prior Discovery. In this context, two feature reconstructors, denoted $R(\cdot)$, are trained to reconstruct the two prior features mentioned above, respectively. Each reconstructor consists of two components: (1) a stochastic channel mask module and (2) a feature reconstruction module. Specifically, consider an image $x$ and its feature $F^s$ generated by the ViT encoder. The first step applies a channel-wise random mask $M_c$ to the channel dimension of this feature to obtain $F^m$:
$$M_c = \begin{cases} 0, & \text{if } R_c < \beta \\ 1, & \text{otherwise} \end{cases}, \qquad F^m = f_{align}(F^s) \cdot M_c, \tag{1}$$

where $R_c$ is a random number in $(0, 1)$ and $c$ indexes the channels of the feature. $\beta$ is a hyper-parameter that denotes the mask ratio, and $f_{align}$ is an adaptation layer with a 1×1 convolution. The random mask helps to train a more robust feature reconstructor [Yang et al., 2022]. Subsequently, we use the two feature reconstruction modules $R(\cdot)$ to generate the prior features. Each $R(\cdot)$ consists of a sequence of operations including a 1×1 convolution $W_{l1}$, a Batch Normalization (BN) layer, and another 1×1 convolutional layer $W_{l2}$:

$$\hat{F}^j = R(F^m) = W_{l2,j} \cdot \left(\mathrm{ReLU}\left(W_{l1,j}(F^m)\right)\right), \tag{2}$$

where $j \in \{d, q\}$, and $\hat{F}^d$ and $\hat{F}^q$ correspond to the distortion and quality-level classification auxiliary tasks. These tasks are linked to the original image feature $F^s$ and capture different aspects of information.

Figure 3: The predefined denoising trajectory starts with a teacher pseudo-feature label for forward diffusion. During each reverse denoising phase, image and text information are fused to accurately predict the noise in the features. For student denoising, the noise level matched by the noise alignment mechanism is used as the input for noise prediction.
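The masking and reconstruction steps of Eqs. (1) and (2) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the feature is stored as a small list of channels, each holding a few spatial positions, the weights are random, and the helper names (`conv1x1`, `channel_mask`, `reconstruct`, `random_matrix`) are hypothetical. The sketch follows Eq. (2) literally (convolution, ReLU, convolution) and leaves out the BN layer mentioned in the text.

```python
import random

def conv1x1(feat, weight):
    """Apply a 1x1 convolution to a feature stored as C_in rows of N pixels.

    A 1x1 convolution is a per-pixel linear map over channels, so `weight`
    is a C_out x C_in matrix.
    """
    n_pix = len(feat[0])
    return [[sum(w_c * feat[c][n] for c, w_c in enumerate(row)) for n in range(n_pix)]
            for row in weight]

def channel_mask(num_channels, beta, rng):
    """Eq. (1): M_c = 0 if R_c < beta, else 1, with R_c drawn uniformly per channel."""
    return [0.0 if rng.random() < beta else 1.0 for _ in range(num_channels)]

def reconstruct(feat_s, w_align, w_l1, w_l2, beta, rng):
    """Eqs. (1)-(2): F^m = f_align(F^s) * M_c, then F_hat = W_l2 ReLU(W_l1 F^m)."""
    aligned = conv1x1(feat_s, w_align)
    mask = channel_mask(len(aligned), beta, rng)
    # Zero out every masked channel of the aligned feature.
    feat_m = [[m * v for v in row] for m, row in zip(mask, aligned)]
    hidden = [[max(v, 0.0) for v in row] for row in conv1x1(feat_m, w_l1)]
    return conv1x1(hidden, w_l2), mask

def random_matrix(rows, cols, rng):
    """Toy Gaussian weights; a trained model would learn these instead."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
C, N = 4, 6  # assumed toy sizes: 4 channels, 6 spatial positions
feat_s = random_matrix(C, N, rng)
w_align, w_l1, w_l2 = (random_matrix(C, C, rng) for _ in range(3))
feat_hat, mask = reconstruct(feat_s, w_align, w_l1, w_l2, beta=0.5, rng=rng)
print(len(feat_hat), len(feat_hat[0]), mask)
```

In the method, one such reconstructor would be instantiated per auxiliary task ($j = d$ and $j = q$), each with its own weights.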
To effectively supervise the auxiliary tasks of quality level classification $Q$ and distortion classification $D$ for the discovery of potential prior features, we divide the tasks into five quality levels and eleven types of distortions, following a previous study [Zhang et al., 2023]. As illustrated in Fig. 2, let $D$ denote the set of image distortions, i.e., $D = \{d_1, d_2, \ldots, d_K\}$, where $d_i$ is an image distortion type, e.g., "noise". Let $Q$ denote the set of image quality levels, i.e., $Q = \{q_1, q_2, \ldots, q_K\}$, where $q_i$ is a quality level, e.g., "bad", and $K$ is the number of distortions or quality levels we consider. The textual description prompt sets are $T_d = \{T_d \mid T_d = \text{"a photo of with \{d\} artifact."}, d \in D\}$ and $T_q = \{T_q \mid T_q = \text{"a photo of with \{q\} quality."}, q \in Q\}$. Given an image $x$, we compute the cosine similarity between the image prior embedding $\hat{F}^j$ and each prompt embedding $G_j = E(T_j) \in \mathbb{R}^{K \times C}$ from the text encoder $E$, whose parameters are frozen. This yields the logits for the auxiliary tasks, namely $\hat{p}_d$ and $\hat{p}_q$:

$$\hat{p}_j(x) = \mathrm{logit}(j \mid x) = \frac{\hat{F}^j \cdot G_j^{T}}{\|\hat{F}^j\|_2 \, \|G_j\|_2}. \tag{3}$$

To supervise the feature reconstruction module, we use the soft pseudo-labels $p_d$ for distortion and $p_q$ for quality, which are generated by the pre-trained teacher model. This guidance is applied through the KL divergence:

$$L_{KL} = \sum_{j \in \{d,q\}} L^{j}_{KL}(p_j, \hat{p}_j) = \sum_{j \in \{d,q\}} p_j(x) \log \frac{p_j(x)}{\hat{p}_j(x)}. \tag{4}$$

Perceptual Prompt Aggregation (PPA). Psychological research suggests that humans prefer using natural language for qualitative rather than quantitative evaluations [Hou et al., 2014]. In practice, this means qualitative descriptors like 'excellent' or 'bad' are often used to assess image quality. Building on this, we develop an approach to automatically aggregate natural language prompts that qualitatively represent image quality perception. Specifically, we compute the logits for the distortion and quality levels and, for each prompt, normalize its logit $\mathrm{logit}(j_i \mid x)$ by:

$$\hat{p}(j_i \mid x) = \frac{\exp(\mathrm{logit}(j_i \mid x))}{\sum_{k=1}^{K} \exp(\mathrm{logit}(j_k \mid x))}, \tag{5}$$

where $j_i$ is the $i$-th element of $T_j$. Next, we obtain the adaptive perceptual text embedding $\hat{e}_{ada}$ via the following weighted aggregation:

$$\hat{e}_{ada} = \sum_{i=1}^{K} \hat{p}(j_i \mid x) \, G_i^{j}, \quad j \in \{d, q\}. \tag{6}$$

It is worth noting that $\hat{e}_{ada}$ can represent the multi-distortion mixture information in real distorted images as soft label weightings. This approach is more informative than the hard label method, which relies solely on a single image-text pair.

3.3 Perceptual Prior-based Diffusion Refinement
In this section, we introduce our Perceptual Prior Fusion (PPF) module and Perceptual Prior-based Diffusion Refinement module (PDR), and discuss how to automatically aggregate perceptual text embeddings that can be used as conditions to guide feature denoising.
Perceptual Prior Enhancement. Because pre-trained models primarily emphasize global semantic features, there is a gap in capturing quality-aware information across different granularities. To address this, we propose integrating perceptual prior information to enhance the feature representations. Specifically, we introduce the Perceptual Prior Fusion module (PPF), which is designed to merge both distortion perception and quality degradation perception into the framework. The proposed PPF module operates sequentially on normalized features, incorporating additional convolutions and SiLU layers [Elfwing et al., 2018] to facilitate the fusion of features across different granularities. In the implementation, we first apply a two-dimensional scaling modulation to
the normalized feature $\mathrm{norm}(F^s)$ and then employ two convolutional transformations that modulate the normalized feature with scaling and shifting parameters from the additive features $\hat{F}^{dq}$, resulting in the feature representation $\hat{F}^h$:

$$\hat{F}^h = \left(\mathrm{conv}(\hat{F}^{dq}) \times \mathrm{norm}(F^s) + \mathrm{conv}(\hat{F}^{dq})\right) + F^s. \tag{7}$$

Predefined Conditional Denoising Trajectories. The proposed PFD-IQA iteratively optimizes the feature $\hat{F}^h$ to attain accurate and quality-aware representations. This process can be conceptualized as an approximation of the inverse feature denoising procedure. However, the features representing the ground truth are often unknown. Therefore, we introduce features $F^{tea}$ generated by a pre-trained teacher as pseudo-ground truth to pre-construct a denoising trajectory of quality-aware features. As depicted in Fig. 3, for the forward diffusion process, $F^{tea}_t$ is a linear combination of the initial data $F^{tea}$ and the noise variable $\epsilon_t$:

$$F^{tea}_t = \sqrt{\bar{\alpha}_t} \, F^{tea} + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t. \tag{8}$$

The parameter $\bar{\alpha}_t$ is defined as $\bar{\alpha}_t := \prod_{s=0}^{t} \alpha_s = \prod_{s=0}^{t} (1 - \beta_s)$, which makes it possible to sample $F^{tea}_t$ directly at any time step using a noise variance schedule denoted by $\beta$ [Ho et al., 2020]. During training, a neural network $\epsilon_\theta(F^{tea}_t, \hat{e}_{ada}, t)$ conditioned on the perceptual text embedding $\hat{e}_{ada}$ is trained to predict the noise $\epsilon_t \sim \mathcal{N}(0, I)$ by minimizing the $\ell_2$ loss, i.e.,

$$L_{ldm} = \left\| \epsilon_t - \epsilon_\theta(F^{tea}_t, \hat{e}_{ada}, t) \right\|_2^2. \tag{9}$$

Adaptive Noise-Level Alignment (ANA). We treat the feature representations extracted by the student under the fine-tuning paradigm as noisy versions of the teacher's quality-aware features. However, the amount of noise that separates the teacher and student features is unknown and may vary across training instances. As a result, identifying the optimal initial time step for the diffusion process is challenging. To overcome this, we introduce an Adaptive Noise Matching Module to match the noise level of the student features with a predefined noise level.
As depicted in Fig. 2, we develop a noise-level predictor using a straightforward convolutional module that learns a weight $\gamma$ to combine the fusion feature $\hat{F}^h$ of the student with Gaussian noise, resulting in $\hat{F}_t$ that aligns with $F_t$. This weight ensures that the student's outputs are harmonized with the noise level corresponding to the initial time step $t$. Consequently, the initial noisy feature involved in the denoising process becomes:

$$\hat{F}_t = \gamma \odot \hat{F}^h + (1 - \gamma) \odot \mathcal{N}(0, 1). \tag{10}$$

Lightweight Architecture. Considering the high dimensionality of transformer features, performing the denoising process on features during training requires many iterations, which may result in a heavy computational load. To address this issue, this paper proposes a lightweight diffusion model $\epsilon_\theta(\cdot)$ as an alternative to the U-Net architecture, as shown in Fig. 3. It consists of two bottleneck blocks from ResNet [He et al., 2016] and a 1×1 convolution. In addition, a cross-attention layer [Rombach et al., 2022] is added after each bottleneck block to aggregate the text and image features. We empirically find that this lightweight network is capable of effective noise removal with fewer than 5 sampling iterations, which is more than 200× faster than DDPM sampling.
During the sampling process, starting from the initial noisy feature $\hat{F}_t$ obtained in Eq. 10, the trained network is employed for iterative denoising to reconstruct the feature $\hat{F}_0$:

$$p_\theta\left(\hat{F}_{t-1} \mid \hat{F}_t\right) := \mathcal{N}\left(\hat{F}_{t-1}; \, \epsilon_\theta(\hat{F}_t, \hat{e}_{ada}, t), \, \sigma_t^2 I\right). \tag{11}$$

Subsequently, we employ the features $F^{tea}$ derived from the pseudo-labels generated by the pre-trained teacher to supervise the denoising procedure with an MSE loss, which ensures the stability of the feature denoising process:

$$L_{fea} = \left\| \hat{F}_0 - F^{tea} \right\|_2^2. \tag{12}$$

To sum up, the overall loss at the training stage is:

$$L = \lambda_1 L_{KL} + \lambda_2 L_{ldm} + \lambda_3 L_{fea} + \left\| \hat{y} - y_g \right\|_1. \tag{13}$$

Here, $\hat{y}$ represents the predicted score of image $x$ based on the denoised feature obtained from the transformer decoder, and $y_g$ stands for the ground truth corresponding to the image $x$. The notation $\|\cdot\|_1$ denotes the $\ell_1$ regression loss. In this paper, we simply set $\lambda_1 = 0.5$, $\lambda_2 = 1$, and $\lambda_3 = 0.01$ in all experiments.

4 Experiments

4.1 Benchmark Datasets and Evaluation Protocols
We evaluate the performance of the proposed PFD-IQA on eight typical BIQA datasets, including four synthetic datasets, LIVE [Sheikh et al., 2006], CSIQ [Larson and Chandler, 2010], TID2013 [Ponomarenko et al., 2015], and KADID [Lin et al., 2019], and four authentic datasets, LIVEC [Ghadiyaram and Bovik, 2015], KONIQ [Hosu et al., 2020], LIVEFB [Ying et al., 2020], and SPAQ [Fang et al., 2020]. Specifically, among the authentic datasets, LIVEC contains 1162 images from different mobile devices and photographers. SPAQ comprises 11,125 photos from 66 smartphones. KonIQ-10k includes 10,073 images from public sources, while LIVEFB is the largest real-world dataset to date, with 39,810 images. The synthetic datasets contain original images distorted artificially using methods like JPEG compression and Gaussian blur. LIVE and CSIQ have 779 and 866 synthetically distorted images, respectively, with five and six distortion types each. TID2013 and KADID include 3000 and 10,125 synthetically distorted images, respectively, spanning 24 and 25 distortion types.
In our experiments, we employ two widely used metrics: Spearman's Rank Correlation Coefficient (SRCC) and Pearson's Linear Correlation Coefficient (PLCC). These metrics evaluate prediction monotonicity and accuracy, respectively.
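As a minimal sketch of how the two metrics differ, the following plain-Python code computes PLCC and SRCC from their textbook definitions. The function names and the toy scores are illustrative assumptions, not part of the paper's evaluation code, and any preprocessing an IQA benchmark may apply before computing PLCC is not modeled here.

```python
import math

def pearson(x, y):
    """Pearson's linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predictions that order the images correctly but not linearly.
predicted = [0.2, 0.4, 0.5, 0.9]
mos = [1.0, 2.0, 3.0, 10.0]
print("SRCC:", spearman(predicted, mos))
print("PLCC:", pearson(predicted, mos))
```

Because the toy predictions preserve the ordering of the subjective scores, SRCC reaches its maximum while PLCC stays below it, which is the sense in which SRCC measures monotonicity and PLCC measures linear accuracy.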
| Method | LIVE PLCC | LIVE SRCC | CSIQ PLCC | CSIQ SRCC | TID2013 PLCC | TID2013 SRCC | KADID PLCC | KADID SRCC | LIVEC PLCC | LIVEC SRCC | KonIQ PLCC | KonIQ SRCC | LIVEFB PLCC | LIVEFB SRCC | SPAQ PLCC | SPAQ SRCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DIIVINE [Saad et al., 2012] | 0.908 | 0.892 | 0.776 | 0.804 | 0.567 | 0.643 | 0.435 | 0.413 | 0.591 | 0.588 | 0.558 | 0.546 | 0.187 | 0.092 | 0.600 | 0.599 |
| BRISQUE [Mittal et al., 2012] | 0.944 | 0.929 | 0.748 | 0.812 | 0.571 | 0.626 | 0.567 | 0.528 | 0.629 | 0.629 | 0.685 | 0.681 | 0.341 | 0.303 | 0.817 | 0.809 |
| ILNIQE [Zhang et al., 2015] | 0.906 | 0.902 | 0.865 | 0.822 | 0.648 | 0.521 | 0.558 | 0.534 | 0.508 | 0.508 | 0.537 | 0.523 | 0.332 | 0.294 | 0.712 | 0.713 |
| BIECON [Kim and Lee, 2016] | 0.961 | 0.958 | 0.823 | 0.815 | 0.762 | 0.717 | 0.648 | 0.623 | 0.613 | 0.613 | 0.654 | 0.651 | 0.428 | 0.407 | - | - |
| MEON [Ma et al., 2017] | 0.955 | 0.951 | 0.864 | 0.852 | 0.824 | 0.808 | 0.691 | 0.604 | 0.710 | 0.697 | 0.628 | 0.611 | 0.394 | 0.365 | - | - |
| WaDIQaM [Bosse et al., 2017] | 0.955 | 0.960 | 0.844 | 0.852 | 0.855 | 0.835 | 0.752 | 0.739 | 0.671 | 0.682 | 0.807 | 0.804 | 0.467 | 0.455 | - | - |
| DBCNN [Zhang et al., 2018] | 0.971 | 0.968 | 0.959 | 0.946 | 0.865 | 0.816 | 0.856 | 0.851 | 0.869 | 0.851 | 0.884 | 0.875 | 0.551 | 0.545 | 0.915 | 0.911 |
| TIQA [You and Korhonen, 2021] | 0.965 | 0.949 | 0.838 | 0.825 | 0.858 | 0.846 | 0.855 | 0.85 | 0.861 | 0.845 | 0.903 | 0.892 | 0.581 | 0.541 | - | - |
| MetaIQA [Zhu et al., 2020] | 0.959 | 0.960 | 0.908 | 0.899 | 0.868 | 0.856 | 0.775 | 0.762 | 0.802 | 0.835 | 0.856 | 0.887 | 0.507 | 0.54 | - | - |
| P2P-BM [Ying et al., 2020] | 0.958 | 0.959 | 0.902 | 0.899 | 0.856 | 0.862 | 0.849 | 0.84 | 0.842 | 0.844 | 0.885 | 0.872 | 0.598 | 0.526 | - | - |
| HyperIQA [Su et al., 2020] | 0.966 | 0.962 | 0.942 | 0.923 | 0.858 | 0.840 | 0.845 | 0.852 | 0.882 | 0.859 | 0.917 | 0.906 | 0.602 | 0.544 | 0.915 | 0.911 |
| TReS [Golestaneh et al., 2022] | 0.968 | 0.969 | 0.942 | 0.922 | 0.883 | 0.863 | 0.858 | 0.859 | 0.877 | 0.846 | 0.928 | 0.915 | 0.625 | 0.554 | - | - |
| MUSIQ [Ke et al., 2021] | 0.911 | 0.940 | 0.893 | 0.871 | 0.815 | 0.773 | 0.872 | 0.875 | 0.746 | 0.702 | 0.928 | 0.916 | 0.661 | 0.566 | 0.921 | 0.918 |
| DACNN [Pan et al., 2022] | 0.980 | 0.978 | 0.957 | 0.943 | 0.889 | 0.871 | 0.905 | 0.905 | 0.884 | 0.866 | 0.912 | 0.901 | - | - | 0.921 | 0.915 |
| DEIQT [Qin et al., 2023] | 0.982 | 0.980 | 0.963 | 0.946 | 0.908 | 0.892 | 0.887 | 0.889 | 0.894 | 0.875 | 0.934 | 0.921 | 0.663 | 0.571 | 0.923 | 0.919 |
| PFD-IQA (ours) | 0.985 | 0.985 | 0.972 | 0.962 | 0.937 | 0.924 | 0.935 | 0.931 | 0.922 | 0.906 | 0.945 | 0.930 | 0.667 | 0.572 | 0.925 | 0.922 |
Table 1: Performance comparison measured by averages of SRCC and PLCC, where bold entries indicate the best results, underlines indicate
the second-best.
| Training | LIVEFB | LIVEFB | LIVEC | KonIQ | LIVE | CSIQ |
|---|---|---|---|---|---|---|
| Testing | KonIQ | LIVEC | KonIQ | LIVEC | CSIQ | LIVE |
| DBCNN | 0.716 | 0.724 | 0.754 | 0.755 | 0.758 | 0.877 |
| P2P-BM | 0.755 | 0.738 | 0.740 | 0.770 | 0.712 | - |
| HyperIQA | 0.758 | 0.735 | 0.772 | 0.785 | 0.744 | 0.926 |
| TReS | 0.713 | 0.740 | 0.733 | 0.786 | 0.761 | - |
| DEIQT | 0.733 | 0.781 | 0.744 | 0.794 | 0.781 | 0.932 |
| PFD-IQA | 0.775 | 0.783 | 0.796 | 0.818 | 0.817 | 0.942 |

Table 2: SRCC on the cross datasets validation. The best results are highlighted in bold, second-best is underlined.

4.2 Implementation Details
For the student network, we follow the typical training strategy of randomly cropping the input image into 10 image patches with a resolution of 224 × 224. Each image patch is then reshaped as a sequence of patches with patch size $p = 16$ and input token dimension $D = 384$. We create the Transformer encoder based on the ViT-B proposed in DeiT III [Touvron et al., 2022]. The encoder depth is set to 12 and the number of heads to $h = 12$. For the decoder, the depth is set to one. Our model is trained for 9 epochs. The learning rate is set to $2 \times 10^{-4}$ with a decay factor of 10 every 3 epochs. The batch size depends on the size of the dataset and is 16 and 128 for LIVEC and KonIQ, respectively. For each dataset, 80% of the images are used for training and the remaining 20% are used for testing. We repeat this process 10 times to mitigate performance bias and report the average SRCC and PLCC. For the pre-trained teacher network, we adopt ViT-B/16 [Radford et al., 2021] as the visual encoder and text encoder. The re-training hyperparameter settings are consistent with [Zhang et al., 2023].

4.3 Overall Prediction Performance Comparison
For competing models, we either directly adopt the publicly available implementations or re-train them on our datasets with the training codes provided by the respective authors. Tab. 1 reports the comparison results between the proposed PFD-IQA and 14 state-of-the-art BIQA methods, including hand-crafted feature-based BIQA methods, such as ILNIQE [Zhang et al., 2015] and BRISQUE [Mittal et al., 2012], and deep learning-based methods, e.g., MUSIQ [Ke et al., 2021] and MetaIQA [Zhu et al., 2020]. PFD-IQA outperforms all other methods across the eight datasets. Since the images in these eight datasets cover various image content and distortion types, it is very challenging to consistently achieve leading performance on all of them. Accordingly, these observations confirm the effectiveness and superiority of PFD-IQA in characterizing image quality.

4.4 Generalization Capability Validation
We further evaluate the generalization ability of PFD-IQA with a cross-dataset validation approach, where the BIQA model is trained on one dataset and then tested on the others without any fine-tuning or parameter adaptation. Tab. 2 reports the experimental results of SRCC averages on the five datasets. As observed, PFD-IQA achieves the best performance on six cross-datasets, with clear performance gains on the LIVEC dataset and competitive performance on the KonIQ dataset. These results strongly verify the generalization ability of PFD-IQA.

4.5 Qualitative Analysis
Visualization of activation map. We employ GradCAM [Selvaraju et al., 2017] to visualize the feature attention map, as shown in Fig. 4. Our findings indicate that PFD-IQA effectively focuses on the quality degradation areas, while DEIQT [Qin et al., 2023] (the second best in Tab. 1) overly
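| Index | PDA / PDR | LIVE PLCC | LIVE SRCC | CSIQ PLCC | CSIQ SRCC | TID2013 PLCC | TID2013 SRCC | KADID PLCC | KADID SRCC | LIVEC PLCC | LIVEC SRCC | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a) | | 0.966 | 0.964 | 0.952 | 0.935 | 0.899 | 0.888 | 0.878 | 0.884 | 0.881 | 0.863 | 0.911 |
| b) | ✓ | 0.984 | 0.981 | 0.963 | 0.954 | 0.927 | 0.915 | 0.925 | 0.925 | 0.911 | 0.895 | 0.938 |
| c) | ✓ | 0.983 | 0.982 | 0.968 | 0.959 | 0.910 | 0.890 | 0.918 | 0.919 | 0.916 | 0.897 | 0.934 |
| d) | ✓ ✓ | 0.985 (+1.9%) | 0.985 (+2.1%) | 0.972 (+2.0%) | 0.962 (+2.7%) | 0.937 (+3.8%) | 0.924 (+3.6%) | 0.935 (+6.0%) | 0.931 (+4.7%) | 0.922 (+4.1%) | 0.906 (+4.3%) | 0.946 (+3.5%) |

Table 3: Ablation experiments on the LIVE, CSIQ, TID2013, KADID and LIVEC datasets. Here, PDA and PDR refer to the Perceptual Prior Discovery and Aggregation module and the Perceptual Prior-based Diffusion Refinement module; bold entries indicate the best result.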