
Convolutional Neural Networks for No-Reference Image Quality Assessment

Le Kang¹, Peng Ye¹, Yi Li², and David Doermann¹

¹ University of Maryland, College Park, MD, USA
² NICTA and ANU, Canberra, Australia
{lekang,pengye,doermann}@umiacs.umd.edu, yi.li@cecs.anu.edu.au

Abstract

In this work we describe a Convolutional Neural Network (CNN) that accurately predicts image quality without a reference image. Taking image patches as input, the CNN works in the spatial domain without using the hand-crafted features employed by most previous methods. The network consists of one convolutional layer with max and min pooling, two fully connected layers and an output node. Within the network structure, feature learning and regression are integrated into one optimization process, which leads to a more effective model for estimating image quality. This approach achieves state of the art performance on the LIVE dataset and shows excellent generalization ability in cross dataset experiments. Further experiments on images with local distortions demonstrate the local quality estimation ability of our CNN, which is rarely reported in previous literature.

(The partial support of this research by DARPA through BBN/DARPA Award HR0011-08-C-0004 under subcontract 9500009235, and by the US Government through NSF Awards IIS-0812111 and IIS-1262122, is gratefully acknowledged.)

1. Introduction

This paper presents a Convolutional Neural Network (CNN) that can accurately predict the quality of distorted images with respect to human perception. The work focuses on the most challenging category of objective image quality assessment (IQA) tasks: general-purpose No-Reference IQA (NR-IQA), which evaluates the visual quality of digital images without access to reference images and without prior knowledge of the types of distortions present.

Visual quality is a complex yet inherent characteristic of an image. In principle, it is the measure of the distortion compared with an ideal imaging model or perfect reference image. When reference images are available, Full Reference (FR) IQA methods [14, 22, 16, 17, 19] can be applied to directly quantify the differences between distorted images and their corresponding ideal versions. State of the art FR measures, such as VIF [14] and FSIM [22], achieve very high correlation with human perception.

However, in many practical computer vision applications perfect versions of the distorted images do not exist, so NR-IQA is required. NR-IQA measures quantify image degradations directly by exploiting features that are discriminant for those degradations. The most successful approaches use Natural Scene Statistics (NSS) based features, which typically characterize the distributions of certain filter responses. Traditional NSS based features are extracted in image transformation domains using, for example, the wavelet transform [10] or the DCT transform [13]. These methods are usually very slow due to the computationally expensive image transformations involved. Recent NR-IQA methods, CORNIA [20, 21] and BRISQUE [9], extract features in the spatial domain, which leads to a significant reduction in computation time. CORNIA demonstrates that it is possible to learn discriminant image features directly from raw image pixels, instead of using handcrafted features.

Based on these observations, we explore using a Convolutional Neural Network (CNN) to learn discriminant features for the NR-IQA task. Recently, deep neural networks have gained researchers' attention and achieved great success on various computer vision tasks. In particular, CNNs have shown superior performance on many standard object recognition benchmarks [6, 7, 4]. One of the CNN's advantages is that it can take raw images as input and incorporate feature learning into the training process. With a deep structure, the CNN can effectively learn complicated mappings while requiring minimal domain knowledge.

To the best of our knowledge, CNNs have not been applied to general-purpose NR-IQA. The primary reason is that the original CNN is not designed for capturing image quality features. In the object recognition domain, good features generally encode local invariant parts; for the NR-IQA task, however, good features should be able to capture NSS properties.
The difference between NR-IQA and object recognition makes the application of CNNs nonintuitive. One of our contributions is that we modified the network structure so that it learns image quality features more effectively and estimates image quality more accurately.

Another contribution of this paper is a novel framework that allows learning and prediction of image quality on local regions. Previous approaches typically accumulate features over the entire image to obtain statistics for estimating overall quality, and have rarely shown the ability to estimate local quality, except for a simple example in [18]. By contrast, our method can estimate quality on small patches (such as 32 × 32). Local quality estimation is important for image denoising or reconstruction problems, where enhancement should be applied only where required.

We show experimentally that the proposed method advances the state of the art. On the LIVE dataset our CNN outperforms CORNIA and BRISQUE, and achieves results comparable with state of the art FR measures such as FSIM [22]. In addition to the superior overall performance, we also show qualitative results that demonstrate the local quality estimation ability of our method.

2. Related Work

Previously, researchers have attempted to use neural networks for NR-IQA. Li et al. [8] applied a general regression neural network that takes as input perceptual features including phase congruency, entropy and image gradients. Chetouani et al. [3] used a neural network to combine multiple distortion-specific NR-IQA measures. These methods require pre-extracted handcrafted features and only use neural networks to learn the regression function. Thus they do not have the advantage of learning features and regression models in a holistic way, and these approaches are inferior to the state of the art. In contrast, our method does not require any handcrafted features and directly learns discriminant features from normalized raw image pixels to achieve much better performance.

The use of convolutional neural networks is partly motivated by the feature learning framework introduced in CORNIA [20, 21]. First, the CORNIA features are learned directly from normalized raw image patches. This implies that it is possible to extract discriminative features from the spatial domain without complicated image transformations. Second, supervised CORNIA [21] employs a two-layer structure which learns the filters and the weights of the regression model simultaneously based on an EM-like approach. This structure can be viewed as an empirical implementation of a two-layer neural network. However, it has not utilized the full power of neural networks.

Our approach integrates feature learning and regression into the general CNN framework. The advantages are twofold. First, making the network deeper raises the learning capacity significantly [1]. In the following sections we will see that with fewer filters/features than CORNIA, we are able to achieve state of the art results. Second, in the CNN framework, training the network as a whole using a simple method like backpropagation makes it convenient to incorporate recent techniques designed to improve learning, such as dropout [5] and the rectified linear unit [7]. Furthermore, once the bridge between NR-IQA and CNNs is made, the rapidly developing deep learning community will be a significant source of novel techniques for advancing NR-IQA performance.

3. CNN for NR-IQA

The proposed framework of using a CNN for image quality estimation is as follows. Given a grayscale image, we first perform a contrast normalization, then sample non-overlapping patches from it. We use a CNN to estimate the quality score for each patch and average the patch scores to obtain a quality estimate for the image.

3.1. Network Architecture

The proposed network consists of five layers. Figure 1 shows the architecture of our network, which is a 32×32 − 26×26×50 − 2×50 − 800 − 800 − 1 structure. The input consists of locally normalized 32 × 32 image patches. The first layer is a convolutional layer which filters the input with 50 kernels, each of size 7 × 7, with a stride of 1 pixel. The convolutional layer produces 50 feature maps, each of size 26 × 26, followed by a pooling operation that reduces each feature map to one max and one min value. Two fully connected layers of 800 nodes each come after the pooling. The last layer is a simple linear regression with a one dimensional output that gives the score.

(Figure 1: The architecture of our CNN.)

3.2. Local Normalization

Previous NR-IQA methods, such as BRISQUE and CORNIA, typically apply a contrast normalization. In this work, we employ a simple local contrast normalization method similar to [9]. Suppose the intensity value of a pixel at location (i, j) is I(i, j); we compute its normalized value Î(i, j) as follows:

\hat{I}(i,j) = \frac{I(i,j) - \mu(i,j)}{\sigma(i,j) + C}

\mu(i,j) = \sum_{p=-P}^{P} \sum_{q=-Q}^{Q} I(i+p, j+q)

\sigma(i,j) = \sqrt{ \sum_{p=-P}^{P} \sum_{q=-Q}^{Q} \left( I(i+p, j+q) - \mu(i,j) \right)^2 }    (1)

where C is a positive constant that prevents division by zero, and P and Q set the normalization window size.
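As a concrete illustration, the following is a minimal NumPy sketch of this normalization (not the authors' code; the function name and the default C = 1 are our choices). Following the usual reading of [9], μ and σ are computed here as the local window mean and standard deviation:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_normalize(img, P=3, Q=3, C=1.0):
    """Local contrast normalization in the spirit of Eq. (1).

    img:  2-D float array (grayscale image or patch).
    P, Q: half-sizes of the (2P+1) x (2Q+1) normalization window.
    C:    small positive constant preventing division by zero.
    """
    img = img.astype(np.float64)
    win = (2 * P + 1, 2 * Q + 1)
    # Per-pixel local mean mu(i, j) over the window.
    mu = uniform_filter(img, size=win, mode='nearest')
    # Per-pixel local standard deviation sigma(i, j),
    # computed via E[x^2] - E[x]^2 (clamped at 0 for safety).
    var = uniform_filter(img ** 2, size=win, mode='nearest') - mu ** 2
    sigma = np.sqrt(np.maximum(var, 0.0))
    return (img - mu) / (sigma + C)
```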

In [9] it was shown that a smaller normalization window size improves performance. In practice we pick P = Q = 3, so the window is much smaller than the input image patch. Note that with this local normalization each pixel may have a different local mean and variance.

Local normalization is important. We observe that using larger normalization windows leads to worse performance. Specifically, a uniform normalization, which applies the mean and variance of the entire image patch to each pixel, causes a drop in performance of about 3%.

It is worth noting that when a CNN is used for object recognition, a global contrast normalization is usually applied to the entire image. The normalization not only alleviates the saturation problem common in early work that used sigmoid neurons, but also makes the network robust to illumination and contrast variation. For the NR-IQA problem, contrast normalization should be applied locally. Additionally, although luminance and contrast changes can be considered distortions in some applications, we mainly focus on distortions arising from image degradations, such as blur, compression and additive noise.

3.3. Pooling

In the convolution layer, the locally normalized image patches are convolved with 50 filters and each filter generates a feature map. We then apply pooling to each feature map to reduce the filter responses to a lower dimension. Specifically, each feature map is pooled into one max value and one min value, which is similar to CORNIA. Let R^k_{i,j} denote the response at location (i, j) of the feature map obtained by the k-th filter; then the max and min values u_k and v_k are given by

u_k = \max_{i,j} R^k_{i,j}

v_k = \min_{i,j} R^k_{i,j}    (2)

where k = 1, 2, ..., K and K is the number of kernels. The pooling procedure reduces each feature map to a 2-dimensional feature vector, so each node of the next fully connected layer takes an input of size 2 × K. It is worth noting that although max pooling alone already works well, introducing min pooling boosts performance by about 2%.

In the object recognition setting, pooling is typically performed on every 2 × 2 cell. In that case, selecting a representative filter response from each small cell may keep some location information while achieving robustness to translation. This operation is particularly helpful for object recognition, since objects can typically be modeled as multiple parts organized in a certain spatial order. However, for the NR-IQA task we observe that image distortions are often locally (if not globally) homogeneous, i.e. the same level of distortion occurs at all locations of a 32 × 32 patch, for example. The lack of obvious global spatial structure in image distortions allows pooling without keeping locations, which reduces the cost of computation.

3.4. ReLU Nonlinearity

Instead of traditional sigmoid or tanh neurons, we use Rectified Linear Units (ReLUs) [11] in the two fully connected layers. [7] demonstrated in a deep CNN that ReLUs enable the network to train several times faster than with tanh units. Here we give a brief description of ReLUs. ReLUs take a simple form of nonlinearity, applying a thresholding function to the input in place of the sigmoid or tanh transform. Let g, w_i and a_i denote the output of the ReLU, the weights of the ReLU and the outputs of the previous layer, respectively; then the ReLU can be mathematically described as g = \max(0, \sum_i w_i a_i).

Note that ReLUs only allow nonnegative signals to pass through. Because of this property, we do not use ReLUs but rather linear neurons (identity transform) on the convolutional and pooling layers. The reason is that min pooling typically produces negative values, and we do not want to block the information in these negative pooling outputs.
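To make the structure of Sections 3.1 to 3.4 concrete, here is a minimal sketch of the network in PyTorch (the paper's own implementation uses Theano [2]; the class name and defaults are ours, not the authors'). The convolution and pooling stages are linear, min/max pooling reduces each feature map to two values, and ReLUs appear only in the fully connected layers:

```python
import torch
import torch.nn as nn

class IQACNN(nn.Module):
    """Sketch of the 32x32 - 26x26x50 - 2x50 - 800 - 800 - 1 structure."""

    def __init__(self, num_kernels=50):
        super().__init__()
        # Conv layer: 50 kernels of size 7x7, stride 1, identity
        # activation so negative responses reach the min pooling.
        self.conv = nn.Conv2d(1, num_kernels, kernel_size=7, stride=1)
        # Two fully connected ReLU layers of 800 nodes each, then a
        # linear regression node that outputs the quality score.
        self.fc = nn.Sequential(
            nn.Linear(2 * num_kernels, 800), nn.ReLU(),
            nn.Linear(800, 800), nn.ReLU(),
            nn.Linear(800, 1),
        )

    def forward(self, x):
        # x: (batch, 1, 32, 32) locally normalized patches.
        fmaps = self.conv(x)            # (batch, 50, 26, 26)
        # Min/max pooling (Section 3.3): each feature map is reduced
        # to one max and one min value, giving a 2K-dim feature vector.
        u = fmaps.amax(dim=(2, 3))      # (batch, 50)
        v = fmaps.amin(dim=(2, 3))      # (batch, 50)
        return self.fc(torch.cat([u, v], dim=1)).squeeze(1)
```

Because the pooling collapses each feature map to a single max and min regardless of spatial size, the fully connected layers always see a 2K-dimensional vector, which is what allows the same weights to score patches of sizes other than 32 × 32 (as used later in Sections 4.3 and 4.5).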
3.5. Learning

We train our network on non-overlapping 32 × 32 patches taken from large images. For training, we assign each patch a quality score equal to its source image's ground truth score. We can do this because the training images in our experiments have homogeneous distortions. During the test stage, we average the predicted patch scores for each image to obtain the image-level quality score. By taking small patches as input, we obtain a much larger number of training samples than if we used whole images from a given dataset, which particularly meets the needs of CNNs.

Let x_n and y_n denote the input patch and its ground truth score respectively, and let f(x_n; w) be the predicted score of x_n with network weights w. Support Vector Regression (SVR) with an ε-insensitive loss has been successfully applied to learn the regression function for NR-IQA in previous work [21, 9]. We adopt a similar objective function:

L = \frac{1}{N} \sum_{n=1}^{N} \lVert f(x_n; w) - y_n \rVert_{\ell_1}

w' = \min_w L    (3)

Note that the above loss function is equivalent to the loss function used in ε-SVR with ε = 0. Stochastic gradient descent (SGD) and backpropagation are used to solve this problem. A validation set is used to select the parameters of the trained model and prevent overfitting. In our experiments we run SGD for 40 epochs during training and keep the model parameters that yield the highest Linear Correlation Coefficient (LCC) on the validation set.

Recently successful neural network methods [7, 5] report that dropout and momentum improve learning. In our experiments we also find that these two techniques boost performance.

Dropout is a technique that prevents overfitting when training neural networks. Typically, the outputs of neurons are set to zero with a probability of 0.5 in the training stage and divided by 2 in the test stage. By randomly masking out neurons, dropout efficiently approximates training many different networks with shared weights. In our experiments, since applying dropout to all layers significantly increases the time to reach convergence, we only apply dropout at the second fully connected layer.

Updating the network weights with momentum is a widely adopted strategy. We update the weights as follows:

\Delta w_t = r_t \Delta w_{t-1} - (1 - r_t)\, \epsilon_t \langle \nabla_w L \rangle

w_t = w_{t-1} + \Delta w_t

\epsilon_t = \epsilon_0 d^t

r_t = \begin{cases} (t/T)\, r_e + (1 - t/T)\, r_s, & t < T \\ r_e, & t \ge T \end{cases}    (4)

where w_t is the weight vector at epoch t, ε_0 = 0.1 is the learning rate, d = 0.9 is the decay of the learning rate, r_s = 0.9 and r_e = 0.5 are the starting and ending momentums respectively, and T = 10 is a threshold controlling how the momentum changes with the number of epochs. Note that unlike [5], where momentum starts at 0.5 and is later raised to 0.99, we use a large momentum at the beginning and reduce it as training progresses. We found through experiments that this setting achieves better performance.

4. Experiment

4.1. Experimental Protocol

Datasets: The following two datasets are used in our experiments.

(1) LIVE [15]: A total of 779 distorted images with five different distortions, JP2k compression (JP2K), JPEG compression (JPEG), white Gaussian noise (WN), Gaussian blur (BLUR) and fast fading (FF), at 7-8 degradation levels derived from 29 reference images. Differential Mean Opinion Scores (DMOS) are provided for each image, roughly in the range [0, 100]. Higher DMOS indicates lower quality.

(2) TID2008 [12]: 1700 distorted images with 17 different distortions derived from 25 reference images at 4 degradation levels. In our experiments, we consider only the four common distortions that are shared with the LIVE dataset, i.e. JP2K, JPEG, WN and BLUR. Each image is associated with a Mean Opinion Score (MOS) in the range [0, 9]. Contrary to DMOS, higher MOS indicates higher quality.

Evaluation: Two measures are used to evaluate the performance of IQA algorithms: 1) the Linear Correlation Coefficient (LCC) and 2) the Spearman Rank Order Correlation Coefficient (SROCC). LCC measures the linear dependence between two quantities, and SROCC measures how well one quantity can be described as a monotonic function of another. We report results obtained from 100 train-test iterations, where in each iteration we randomly select 60% of the reference images and their distorted versions as the training set, 20% as the validation set, and the remaining 20% as the test set.
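Putting Sections 3.5 and 4.1 together, here is a compact sketch (ours, not the authors' Theano code) of the SGD update of Eq. (4), the ℓ1 loss of Eq. (3), and the two evaluation measures, assuming a model like the PyTorch sketch above and a loader yielding (patches, scores) batches:

```python
import torch
from scipy import stats

def momentum(t, T=10, r_s=0.9, r_e=0.5):
    # Eq. (4): momentum starts high (r_s) and decays linearly to r_e.
    return (t / T) * r_e + (1 - t / T) * r_s if t < T else r_e

def train(model, loader, epochs=40, eps0=0.1, d=0.9):
    velocity = [torch.zeros_like(p) for p in model.parameters()]
    for t in range(epochs):
        eps_t, r_t = eps0 * d ** t, momentum(t)  # decayed lr, Eq. (4)
        for patches, scores in loader:
            loss = (model(patches) - scores).abs().mean()  # Eq. (3), L1
            model.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p, v in zip(model.parameters(), velocity):
                    # dw_t = r_t * dw_{t-1} - (1 - r_t) * eps_t * grad
                    v.mul_(r_t).add_(p.grad, alpha=-(1 - r_t) * eps_t)
                    p.add_(v)

def lcc_srocc(pred, gt):
    # Section 4.1: Pearson (linear) and Spearman (rank) correlations.
    return stats.pearsonr(pred, gt)[0], stats.spearmanr(pred, gt)[0]
```

Under the paper's protocol, lcc_srocc would be evaluated on the validation split after each epoch to select the stored model, and dropout (p = 0.5 on the second fully connected layer) would be enabled inside the model during training.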
4.2. Evaluation on LIVE

On the LIVE dataset, for the distortion-specific experiments we train and test on each of the five distortions: JP2K, JPEG, WN, BLUR and FF. For the non-distortion-specific experiments, images of all five distortions are trained and tested together without providing a distortion type.

Table 1 shows the results of the two experiments compared with previous state of the art NR-IQA methods as well as FR-IQA methods. Results of the best performing NR-IQA systems are in bold. The FR-IQA measures are evaluated by using 80% of the data for fitting a non-linear logistic function, then testing on 20% of the data.

Table 1: SROCC and LCC on LIVE. Italicized are FR-IQA methods, included for reference.

SROCC       JP2K   JPEG   WN     BLUR   FF     ALL
PSNR        0.870  0.885  0.942  0.763  0.874  0.866
SSIM        0.939  0.946  0.964  0.907  0.941  0.913
FSIM        0.970  0.981  0.967  0.972  0.949  0.964
DIIVINE     0.913  0.910  0.984  0.921  0.863  0.916
BLIINDS-II  0.929  0.942  0.969  0.923  0.889  0.931
BRISQUE     0.914  0.965  0.979  0.951  0.877  0.940
CORNIA      0.943  0.955  0.976  0.969  0.906  0.942
CNN         0.952  0.977  0.978  0.962  0.908  0.956

LCC         JP2K   JPEG   WN     BLUR   FF     ALL
PSNR        0.873  0.876  0.926  0.779  0.870  0.856
SSIM        0.921  0.955  0.982  0.893  0.939  0.906
FSIM        0.910  0.985  0.976  0.978  0.912  0.960
DIIVINE     0.922  0.921  0.988  0.923  0.888  0.917
BLIINDS-II  0.935  0.968  0.980  0.938  0.896  0.930
BRISQUE     0.922  0.973  0.985  0.951  0.903  0.942
CORNIA      0.951  0.965  0.987  0.968  0.917  0.935
CNN         0.953  0.981  0.984  0.953  0.933  0.953

We can see from Table 1 that our approach works well on each of the five distortions, especially on JPEG, JP2K and FF. In the overall evaluation, our CNN outperforms all previous state of the art NR-IQA methods and approaches the state of the art FR-IQA method FSIM.

We visually examined the learned convolution kernels and found that only a few kernels present obvious structures related to the type of distortion. Figure 2 shows the kernels learned on JPEG and on all distortions combined, respectively. We can see that blockiness patterns are learned from JPEG, and a few blur-like patterns exist among the kernels learned from all distortions. It is not surprising that the kernels learned by the CNN tend to be noisy patterns rather than presenting strong structure related to certain distortions, as is the case for CORNIA [20]: CORNIA's feature learning is unsupervised and generative, while our CNN is trained in a supervised manner and learns discriminative features.

(Figure 2: Learned convolution kernels on (a) JPEG and (b) ALL on the LIVE dataset.)

4.3. Effects of Parameters

Several parameters are involved in the CNN design. In this section, we examine how these parameters affect the performance of the network on the LIVE dataset.

Number of kernels: Figure 3 shows how the performance varies with the number of convolution kernels. It is not surprising that the number of filters significantly affects performance. In general, more kernels lead to better performance, but little is gained once the number of kernels exceeds 40.

(Figure 3: SROCC and LCC with respect to the number of convolution kernels.)

Kernel size: We train and test the network with different kernel sizes while fixing the rest of the structure. Table 2 shows how the performance changes with the kernel size. We can see from Table 2 that all tested kernel sizes yield similar performance; the proposed network is not sensitive to kernel size.

Table 2: SROCC and LCC under different kernel sizes.

size   5×5    7×7    9×9
SROCC  0.953  0.956  0.955
LCC    0.951  0.953  0.955

Patch size: Since in our experiments the whole-image score is simply the average score of all sampled patches, we examine how the patch sampling strategy affects performance. This involves two aspects: patch size and the number of patches per image. It is worth noting that if we keep sampling patches in a non-overlapping way, a larger patch size leads to fewer patches; for example, doubling the patch size reduces the number of patches per image to one fourth of the original number. To avoid this, we allow overlapping sampling and use a fixed sampling stride (32) for all patch sizes, so the number of patches per image remains roughly the same (ignoring border effects) as the patch size varies. Table 3 shows the change in performance with respect to patch size. From Table 3 we see that larger patches give better performance, increasing slightly as the patch size grows from 16 to 48. However, larger patches not only require more processing time but also reduce the spatial resolution of the quality estimates. We therefore prefer the smallest patch size that yields state of the art performance.

Table 3: SROCC and LCC for different patch sizes.

size   48     40     32     24     16
SROCC  0.959  0.958  0.956  0.950  0.946
LCC    0.957  0.955  0.953  0.947  0.946
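The sampling strategy itself is simple; a sketch (the helper name is ours) that covers both the non-overlapping training case (stride equal to patch size) and the fixed-stride overlapping sampling described above:

```python
import numpy as np

def sample_patches(img, patch_size=32, stride=32):
    """Sample patches on a regular grid. stride == patch_size gives the
    non-overlapping sampling used for training; other strides give the
    overlapping / sparse sampling discussed in Section 4.3."""
    H, W = img.shape
    patches = [img[i:i + patch_size, j:j + patch_size]
               for i in range(0, H - patch_size + 1, stride)
               for j in range(0, W - patch_size + 1, stride)]
    return np.stack(patches)
```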
Sampling stride: To observe how the number of patches affects overall performance, we fix the patch size and vary the stride. Changing the stride does not change the structure of the network, so for simplicity, at each of the 100 train-test iterations we use the same model trained at stride 32 and test with different stride values. Figure 4 shows the change in performance with respect to the stride. A larger stride generally leads to lower performance, since less image information is used for the overall estimate. However, it is worth noting that state of the art performance is still maintained even when the stride increases to 128, which corresponds to roughly 1/16 of the original number of patches. This result is consistent with the fact that the distortions in the LIVE data are roughly homogeneous across the entire image, and it also indicates that our CNN can accurately predict quality scores on small image patches.

(Figure 4: SROCC and LCC with respect to the sampling stride.)

4.4. Cross Dataset Test

This set of experiments is designed to test the generalization ability of our method. We follow the protocol of previous work [9, 20] and investigate cross dataset performance by training our CNN on LIVE and testing on TID2008. (We have observed that some images in TID2008 share the same content as images in LIVE; however, their resolutions are different.) Only the four types of distortions that are shared by LIVE and TID2008 are examined in this experiment. The DMOS scores in LIVE range from 0 to 100, while the MOS scores in TID2008 fall in the range [0, 9]. To make a fair comparison, we adopt the same method as [20] and perform a nonlinear mapping on the predicted scores produced by the model trained on LIVE; a nonlinear mapping based on a logistic function is usually applied to FR measures to transform the quality measure into a certain range. We randomly split TID2008 into two parts of 80% and 20%, 100 times. Each time, the 80% split is used to estimate the parameters of the logistic function and the 20% split is used for testing, i.e. for evaluating the transformed prediction scores. Results of the cross dataset test are shown in Table 4. We can see that our CNN outperforms the previous state of the art methods.

Table 4: SROCC and LCC obtained by training on LIVE and testing on TID2008.

       CORNIA  BRISQUE  CNN
SROCC  0.890   0.882    0.920
LCC    0.880   0.892    0.903

4.5. Local Quality Estimation

Our CNN measures quality on small image patches, so it can be used to detect low/high quality local regions as well as to give a global score for the entire image.

We select an undistorted reference image from TID2008 (one not included in LIVE) and divide it into four vertical parts. We then replace the second to fourth parts with distorted versions at three different degradation levels. Four synthetic images are generated in this way, one for each of the distortion types WN, BLUR, JPEG and JP2K. We then perform local quality estimation on these synthetic images using our model trained on LIVE. We scan 16 × 16 patches with a stride of 8 and normalize the predicted scores into the range [0, 255] for visualization. Figure 5 shows the estimated quality maps on the synthetic images. We can see that our model properly distinguishes the clean and distorted parts of each synthetic image.

To further examine the local quality estimation power of our model, we consider several types of distortions in TID2008 that are not used in the previous experiments, and find three types that affect only local regions: JPEG transmission errors, JPEG2000 transmission errors and blockwise distortion. Again from TID2008 we pick several images that are not shared with LIVE and test on their distorted versions under the above three distortions. Figure 6 shows the local quality estimation results. We find that our model locates the distorted regions with reasonable accuracy, and the results generally fit human judgement. It is worth noting that our model locates the blockwise distortion very well, although this type of distortion is not contained in the LIVE training data. In the images of the third row of Figure 6, the stripes on the window are mistaken for a low quality region. We speculate that this is because the local patterns on the stripes resemble blockiness distortion. Contextual information may be needed to overcome such problems.
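A quality map of this kind can be produced by scanning the image as just described; a sketch (ours, not the authors' code) reusing the local_normalize helper from Section 3.2 and a predict function wrapping the trained network:

```python
import numpy as np

def quality_map(img, predict, patch_size=16, stride=8):
    """Scan patches and build a local quality map (Section 4.5).

    predict: callable returning a scalar quality score for one
    locally normalized patch, e.g. a wrapper around the trained CNN."""
    H, W = img.shape
    rows = (H - patch_size) // stride + 1
    cols = (W - patch_size) // stride + 1
    scores = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            i, j = r * stride, c * stride
            patch = local_normalize(img[i:i + patch_size, j:j + patch_size])
            scores[r, c] = predict(patch)
    # Normalize predicted scores into [0, 255] for visualization; with
    # DMOS-trained scores, brighter pixels indicate lower quality.
    lo, hi = scores.min(), scores.max()
    return ((scores - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
```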
(Figure 5: Synthetic examples and local quality estimation results. The first row contains images distorted by (a) WN, (b) BLUR, (c) JPEG and (d) JP2K. Each image is divided into four parts, three of which are distorted at different degradation levels. The second row shows the local quality estimation results, where brighter pixels indicate lower quality.)

(Figure 6: Local quality estimation results on examples of non-global distortion from TID2008. Columns 1, 3 and 5 show (a) JPEG transmission errors, (b) JPEG2000 transmission errors and (c) local blockwise distortion. Columns 2, 4 and 6 show the local quality estimation results, where brighter pixels indicate lower quality.)

4.6. Computational Cost

Our CNN is implemented using the Python library Theano [2]. With Theano we can easily run our algorithm on a GPU to speed up processing without much optimization. Our experiments are performed on a PC with a 1.8 GHz CPU and a GTX660 GPU. We measure the processing time on images of size 512 × 768 using our model with 50 kernels and 32 × 32 input, and test the model using a subset of the strides that give state of the art performance in the experiments on LIVE.
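Measurements of this kind can be scripted directly; a sketch (ours) that averages wall-clock scoring time per image at a given stride, reusing the sample_patches helper from Section 4.3:

```python
import time

def time_per_image(img, predict_batch, stride, patch_size=32, repeats=10):
    """Average wall-clock time to score one image at a given stride.
    Times batch scoring only; the paper's figures also include the
    per-image normalization, done on the CPU in about 0.017 sec."""
    patches = sample_patches(img, patch_size=patch_size, stride=stride)
    start = time.perf_counter()
    for _ in range(repeats):
        score = predict_batch(patches).mean()
    return (time.perf_counter() - start) / repeats
```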
Table 5 shows the average processing time per image under different strides. Note that our implementation is not fully optimized; for example, the normalization of each image is performed on the CPU in about 0.017 sec, which takes a significant portion of the total time. From Table 5 we can see that with a sparser sampling pattern (stride greater than 64), real-time processing can be achieved while maintaining state of the art performance.

Table 5: Time cost under different strides.

stride      32     64     96     128
time (sec)  0.114  0.041  0.029  0.023

5. Conclusion

We have developed a CNN for no-reference image quality assessment. Our algorithm combines feature learning and regression in a single optimization process, which enables us to employ modern training techniques to boost performance. The algorithm generates image quality predictions that correlate well with human perception, and achieves state of the art performance on standard IQA datasets. Furthermore, we demonstrated that our algorithm can estimate quality in local regions, which is rarely reported in previous literature and has many potential applications in image reconstruction and enhancement.

References

[1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
[2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.
[3] A. Chetouani, A. Beghdadi, S. Chen, and G. Mostafaoui. A novel free reference image quality metric using neural network approach. In Int. Workshop Video Process. Qual. Metrics Cons. Electron., pages 1-4, Jan. 2010.
[4] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition, pages 3642-3649, 2012.
[5] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[6] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In Advances in Neural Information Processing Systems (NIPS), 2010.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 1, page 4, 2012.
[8] C. Li, A. Bovik, and X. Wu. Blind image quality assessment using a general regression neural network. IEEE Transactions on Neural Networks, 22(5):793-799, 2011.
[9] A. Mittal, A. Moorthy, and A. Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695-4708, 2012.
[10] A. K. Moorthy and A. C. Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE Transactions on Image Processing, 20(12):3350-3364, Dec. 2011.
[11] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807-814, 2010.
[12] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti. TID2008 - a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radio Electronics, 10:30-45, 2009.
[13] M. Saad, A. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing, 21(8):3339-3352, Aug. 2012.
[14] H. R. Sheikh, A. C. Bovik, and G. de Veciana. An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Transactions on Image Processing, 14(12):2117-2128, Dec. 2005.
[15] H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik. LIVE image quality assessment database release 2. Online, http://live.ece.utexas.edu/research/quality.
[16] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, 2004.
[17] Z. Wang and Q. Li. Information content weighting for perceptual image quality assessment. IEEE Transactions on Image Processing, 20(5):1185-1198, 2011.
[18] W. Xue, L. Zhang, and X. Mou. Learning without human scores for blind image quality assessment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 995-1002, 2013.
[19] W. Xue, L. Zhang, X. Mou, and A. Bovik. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing, 23(2):684-695, Feb. 2014.
[20] P. Ye, J. Kumar, L. Kang, and D. Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1098-1105, 2012.
[21] P. Ye, J. Kumar, L. Kang, and D. Doermann. Real-time no-reference image quality assessment based on filter learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 987-994, 2013.
[22] L. Zhang, L. Zhang, X. Mou, and D. Zhang. FSIM: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20(8):2378-2386, 2011.