1. Introduction
Image quality assessment is a fundamental research direction in computer vision, with wide application in both scientific research and daily life. With the advent of the mobile internet era, online platforms such as TikTok, Instagram, and YouTube receive billions of user-generated images and videos every day. For these platforms, assessing image quality in a manner consistent with human visual perception and using it to guide content recommendation is crucial for enhancing user experience. As a core technology for these platforms, image quality assessment is also extensively employed in recommendation, data filtering, compression, and storage. In scientific research, image quality assessment algorithms can likewise serve as evaluation tools for other computer vision tasks, contributing to the optimization of various image processing techniques. Consequently, accurately predicting the perceptual quality of diverse images with automated methods remains a pressing challenge, one that significantly impacts both daily life and scientific inquiry.
Image quality assessment can be divided into three categories: full-reference image quality assessment (FR-IQA), reduced-reference image quality assessment (RR-IQA), and no-reference image quality assessment (NR-IQA). Full-reference algorithms such as SSIM [1], FSIM [2], and LPIPS [3] require both the original and the distorted version of an image for evaluation. This requirement prevents them from predicting the quality of real-world images for which no reference is available, greatly limiting their applicability. No-reference algorithms such as BRISQUE [4], PaQ-2-PiQ [5], and CONTRIQUE [6], by contrast, require neither the original reference image nor any information about the distortions present; they directly quantify the perceived quality of the target image. This makes them the only viable solution for in-the-wild image quality evaluation and gives them broad application prospects.
Over the past decade, NR-IQA has been a popular research topic in computer vision, leading to the development of numerous strong evaluation models and image quality assessment datasets. Traditional datasets such as TID-2013 [7], LIVE [8], and CSIQ [9] consist of synthetically distorted images, created by artificially adding common distortions (Gaussian blur, JPEG compression, white noise, etc.) to originally high-quality images. These datasets played a crucial role in advancing the field, and model performance on them is still used as an important measure of model capability. In real life, however, images are affected by many factors at different stages, including generation, transmission, and compression for storage, which introduce all kinds of distortions. Moreover, these distortions may concentrate in, or even overlap within, certain parts of the image, which is difficult to simulate with synthetic distortions. To address this issue, several datasets composed of real images, such as KonIQ-10K [10], CLIVE [11], and SPAQ [12], have been introduced recently. Performance on these datasets better reflects a model's predictive capability in practical applications.
As semiconductor technology, parallel computing, and deep neural network models continue to evolve, deep models with large numbers of tunable parameters have achieved widespread success in computer vision, leading to the development of numerous no-reference image quality assessment algorithms based on deep models. Compared with traditional algorithms that rely on natural scene statistics, these deep approaches have significantly improved performance. However, they still fall short of simulating human visual perception, and their application to image quality assessment faces two main limitations: (1) Training deep models requires a substantial amount of labeled data, but annotating image quality assessment datasets is expensive and difficult, and existing labeled datasets are too small to support full deep model training. (2) Most deep models built for computer vision tasks focus solely on image content, whereas perceived image quality depends on several factors, including image content, the introduced distortions, and the intricate relationships between them; designing an end-to-end deep quality assessment solution that considers all these factors simultaneously remains challenging. To address these issues, we propose a framework that leverages contrastive learning to train Multiple Collaborative Modules for Image Quality Assessment (McmIQA), in which the individual modules are trained on multiple large-scale datasets for dedicated subtasks. The proposed approach consists of three parts, as follows:
- (a)
We further divide the image quality assessment task across three modules: content recognition, distortion perception, and correlation mapping, responsible for extracting content features, extracting distortion features, and predicting the final quality score, respectively.
- (b)
Based on contrastive learning, we have designed two distinct self-supervised training frameworks for the content recognition module and the distortion perception module. Training these modules on large datasets for their respective customized tasks ensures accurate feature recognition in both cases.
- (c)
When training the correlation mapping module on image quality assessment datasets, we froze the parameters of the content recognition and distortion perception modules and trained the correlation mapping module independently with a larger batch size. During this stage, we employed a composite loss based on ranking and mean squared error (MSE); a minimal sketch of such a loss follows this list. This approach not only helps the module fit real quality scores but also enables the model to recognize relative quality differences between images.
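For concreteness, the following is a minimal PyTorch sketch of a composite loss of this kind, assuming a pairwise hinge formulation over all pairs in a batch; the margin and the weighting factor are illustrative hyperparameters, not the exact values used in our training.

```python
import torch
import torch.nn.functional as F

def composite_loss(pred, mos, rank_weight=1.0, margin=0.0):
    """MSE term fits absolute MOS values; a pairwise hinge term
    penalizes pairs whose predicted order contradicts the MOS order.
    pred, mos: 1-D tensors of shape (batch,)."""
    mse = F.mse_loss(pred, mos)
    # All pairwise differences within the batch: entry (i, j) = x[j] - x[i].
    diff_pred = pred.unsqueeze(0) - pred.unsqueeze(1)
    diff_mos = mos.unsqueeze(0) - mos.unsqueeze(1)
    order = torch.sign(diff_mos)  # ground-truth relative order per pair
    # Hinge on mis-ordered pairs: max(0, margin - order * predicted difference).
    hinge = F.relu(margin - order * diff_pred)
    mask = order != 0  # ignore the diagonal and exact MOS ties
    rank_loss = hinge[mask].mean() if mask.any() else pred.new_zeros(())
    return mse + rank_weight * rank_loss
```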
2. Related Work
In the past decade, significant efforts have been devoted to developing no-reference image quality assessment algorithms. Before the rise of deep learning, constructing models based on Natural Scene Statistics (NSS) theory was the mainstream approach. NSS assumes that original natural images follow certain statistical distributions and that various distortions disrupt these statistical regularities [13]. Based on this assumption, researchers have built distortion feature extractors in different domains, including the spatial domain [14,15] and the frequency domain [16,17], as well as gradient-based methods [18]. Methods like CORNIA [19] and HOSA [20], which utilize local patches to construct dictionaries of quality-aware features, also rely on NSS. In practice, NSS-based methods generally yield acceptable results on synthetically distorted images, but their performance deteriorates significantly on real-world images. They model distortions as statistical deviations from natural distributions, but overlook the combined effects of multiple distortions and the influence of image content on perceived quality.
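As a concrete illustration of NSS-style feature extraction, the sketch below computes the mean-subtracted contrast-normalized (MSCN) coefficients that underlie methods like BRISQUE [4]; the Gaussian window parameter and stabilizing constant are common choices, not values taken from the cited works.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(gray, sigma=7/6, c=1.0):
    """Mean-subtracted contrast-normalized coefficients of a grayscale image.
    For pristine natural images, their histogram is close to a generalized
    Gaussian; distortions measurably deform that distribution."""
    gray = gray.astype(np.float64)
    mu = gaussian_filter(gray, sigma)                    # local mean
    var = gaussian_filter(gray * gray, sigma) - mu * mu  # local variance
    local_std = np.sqrt(np.maximum(var, 0.0))
    return (gray - mu) / (local_std + c)
```

NSS-based methods typically fit a parametric distribution (e.g., a generalized Gaussian) to such coefficients and use the fitted parameters as quality-aware features.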
In recent years, deep learning has provided a new solution for computer vision tasks, and algorithms built upon it have achieved unprecedented breakthroughs and successes. Among these, several deep neural networks have been proposed for image quality assessment. Most deep-learning-based no-reference approaches follow a pre-training–fine-tuning paradigm to mitigate the small size of quality assessment datasets. In RAPIQUE [21], the authors proposed a two-step approach, pre-training on ImageNet followed by fine-tuning for image quality assessment, which improved the evaluation performance of a ResNet-50 [22] model. The PQR [23] approach introduced the strategy of using the statistical distribution of subjective opinion scores as auxiliary labels during training, yielding additional gains. The BIECON [24] method introduced a pretext task of fitting FR-IQA prediction scores during pre-training. There are also quality assessment algorithms based on multi-module perception: in DBCNN [25], the authors employed a dual-path design that perceives distortion features and content features separately and combines them for prediction, while another approach [26] introduced an adaptive hypernetwork architecture that incorporates content understanding into perceptual quality prediction.
In recent years, many image quality assessment algorithms have been based on self-supervised learning. As an essential approach to the small-dataset problem, self-supervised learning is often used to construct upstream pretext tasks that provide better data representations for downstream tasks [27]. In simple terms, self-supervised learning trains the model on tasks that require no manual annotation, such as reconstructing input pixels [28] or predicting predefined image categories [29,30]. Inspired by the success of masked language modeling in natural language processing, masked image modeling has become a hot trend in computer vision [31,32]. Another form of self-supervised learning is contrastive learning, which trains a model to create a mapping under which similar data points are pulled together while dissimilar samples are pushed apart [33]. CONTRIQUE [6] proposed a contrastive learning scheme that pre-trains image quality assessment models by predicting distortion types and severity. In [34], researchers generated synthetic distorted images by randomly overlaying various distortions and used contrastive learning to train models to perceive approximate quality features on this dataset. Re-IQA [35] trained a hybrid perceptual model for image quality assessment with contrastive learning. That work is the closest to ours: using two different pre-training methods, contrastive learning and ImageNet classification, the authors separately trained extraction modules for distortion and content features, then fine-tuned regression layers on image quality assessment datasets. In contrast, focusing on distortion perception and content recognition, we designed two distinct image cropping and contrastive learning schemes to train our distortion perception and content recognition modules. Additionally, by incorporating a ranking-based loss when training the correlation mapping module, we trained the model to judge relative quality differences between images, achieving more stable and accurate quality assessment results.
4. Experiments
4.1. Datasets and Evaluation Criteria
In this work, we primarily utilized three types of datasets: (1) the dataset constructed with the method proposed in [34], used for training the distortion perception module; (2) the ImageNet dataset, used for training the content recognition module; and (3) image quality assessment datasets, used to train the correlation mapping module and evaluate the model's performance.
Using the distorted image generation method proposed in [34], we randomly applied distortion operations to images from the Waterloo-4744 [39] and COCO-330K [40] datasets, eventually generating approximately 1 million images for training the distortion perception module. To train the correlation mapping module and evaluate the model's performance under various conditions, we selected four synthetic distortion datasets, LIVE [8], CSIQ [9], KADID-10K [43], and TID-2013 [7], and three real-world distortion datasets, KonIQ-10K [10], CLIVE [11], and SPAQ [12], in our experiments. KonIQ-10K comprises 10 k images selected from the publicly available YFCC100M database, CLIVE contains 1162 real distorted images captured with various mobile devices, and SPAQ consists of 11 k images captured with 66 different mobile devices. Summary information about the image quality assessment datasets used in our experiments is presented in Table 1.
Evaluation criteria: We chose the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SRCC) to measure the model's evaluation performance. A higher PLCC indicates that the model's scores better fit the images' Mean Opinion Score (MOS) annotations, while a higher SRCC indicates that the model more accurately ranks the quality of images within the dataset.
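For reference, both criteria can be computed directly with SciPy, as in the short sketch below; note that some works additionally fit a nonlinear (e.g., logistic) mapping between predictions and MOS before computing PLCC, which is omitted here.

```python
from scipy import stats

def evaluate(pred, mos):
    """PLCC measures linear agreement with MOS; SRCC measures monotonic
    (rank) agreement. Both lie in [-1, 1]; higher is better."""
    plcc, _ = stats.pearsonr(pred, mos)
    srcc, _ = stats.spearmanr(pred, mos)
    return plcc, srcc
```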
4.2. Experiment Details
Our experiments were conducted with the PyTorch [44] deep learning framework on a GeForce RTX 3080 Laptop GPU (16 GB). The detailed configuration for each step is as follows:
Training for the distortion perception module: This training process inherits most of its settings from MoCo-V2 while modifying the pretext task and the decay schedule to achieve quality perception. Specifically, we used ResNet-50 [22] as the encoder and trained it on the generated dataset with the Adam [45] optimizer and an initial learning rate of 3 × 10⁻². Following the approach in [46], we performed a two-epoch learning rate warm-up followed by cosine annealing. The momentum for the key encoder update is set to 0.99, the batch size is 64, and the hyperparameter τ in InfoNCE is empirically set to 0.2. Due to time constraints, we trained ResNet-50 for approximately 10 days, corresponding to 100 epochs, to obtain the distortion perception module.
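For reference, the sketch below shows the generic MoCo-style InfoNCE objective and momentum update that this setup builds on, with τ = 0.2 and momentum 0.99 as reported above; the encoder definitions and queue maintenance are omitted, and the sketch does not reflect our modified pretext task itself.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, tau=0.2):
    """q, k_pos: L2-normalized embeddings (N, C) of two views of the same
    image; queue: (K, C) normalized negative keys from earlier batches."""
    l_pos = torch.einsum('nc,nc->n', q, k_pos).unsqueeze(1)  # (N, 1)
    l_neg = torch.einsum('nc,kc->nk', q, queue)              # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)  # the positive key is class 0

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.99):
    """The key encoder tracks the query encoder as an exponential
    moving average with momentum m."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```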
Training for the content recognition module: We trained ResNet-50 for content recognition with the MoCo-V3 framework on the ImageNet dataset, keeping most configurations consistent with the distortion perception training. However, due to GPU memory limitations, the batch size was set to 32. We trained ResNet-50 for approximately 12 days, corresponding to 75 epochs, to obtain the content recognition module.
Training for the correlation mapping module: We trained the correlation mapping module, composed of three linear layers (2024→512→64→1), on the various image quality assessment datasets using the Adam optimizer with an initial learning rate of 5 × 10⁻³ and a weight decay of 3 × 10⁻⁴. The batch size was 128, and each dataset was randomly split into an 80% training set and a 20% test set. We trained for 100 epochs on each dataset and kept, as the correlation mapping module, the version that achieved the highest sum of SRCC and PLCC on the test set.
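The following is a minimal sketch of this stage under the settings stated above; the module and variable names are placeholders, and merging the two feature vectors by simple concatenation into the 2024-dimensional input is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class CorrelationMapping(nn.Module):
    """Three linear layers (2024 -> 512 -> 64 -> 1), as described above."""
    def __init__(self, in_dim=2024):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, f_distortion, f_content):
        # Assumed fusion: concatenate the two frozen modules' features.
        return self.head(torch.cat([f_distortion, f_content], dim=1)).squeeze(1)

def build_optimizer(distortion_net, content_net, head):
    """Freeze the two pre-trained feature modules; optimize only the head."""
    for module in (distortion_net, content_net):
        for p in module.parameters():
            p.requires_grad_(False)
    return torch.optim.Adam(head.parameters(), lr=5e-3, weight_decay=3e-4)
```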
4.3. Comparative Experiments
4.3.1. Models Selected for Comparative Experiments
To evaluate the performance of the McmIQA model on image quality assessment tasks, we compared it with 16 state-of-the-art (SOTA) models. These models can be roughly categorized into five groups: (1) Traditional handcrafted features: BRISQUE [4], NIQE [15]. (2) Codebook-based features: CORNIA [19], HOSA [20]. (3) CNN-based models with supervised pre-training: PQR [23], BIECON [24], DBCNN [25], PaQ-2-PiQ [5], HyperIQA [26]. (4) Attention-mechanism-based models: TIQA [47], TRES [48], MUSIQ [49]. (5) Contrastive learning pre-trained models: CONTRIQUE [6], Re-IQA [35], QPT-ResNet50 [34]. Additionally, to validate the effectiveness of the proposed contrastive learning approach, we included a ResNet-50 [22] model pre-trained on ImageNet for comparison.
4.3.2. Performance Comparison Experiments
In Table 2 and Table 3, we present the validation results of various models, including the proposed McmIQA method, on both synthetic and natural distortion datasets. The proposed model surpasses the four quality assessment algorithms based on traditional feature extraction followed by a support vector regressor (SVR): BRISQUE [4], NIQE [15], CORNIA [19], and HOSA [20]. Compared with supervised pre-trained CNN schemes, such as PQR [23], BIECON [24], DBCNN [25], PaQ-2-PiQ [5], and HyperIQA [26], McmIQA also achieved superior predictive performance, benefiting from the feature extraction capabilities obtained during pre-training on large-scale datasets. When compared with models that perform feature extraction with attention mechanisms, such as TIQA [47], TRES [48], and MUSIQ [49], our model likewise exhibits superior performance, owing to large-scale contrastive pre-training and the collaborative mechanism. Finally, in comparison with recently proposed contrastive learning models, such as CONTRIQUE [6], Re-IQA [35], and QPT-ResNet50 [34], the McmIQA model, trained on a single 16 GB RTX 3080, still achieves highly competitive results. This indicates that the contrastive learning framework proposed in this paper and the ranking-based training process for the correlation mapping module further enhance the predictive performance of the resulting model. Figure 5 illustrates the model's predictive results on the various datasets.
4.3.3. Cross-Dataset SRCC Evaluations
To validate the generalization ability of the proposed image quality assessment algorithm, we conducted cross-dataset evaluation experiments on two synthetic and two real distortion datasets, comparing the proposed model with three other quality assessment models. As shown in Table 4, McmIQA exhibits superior performance on the real distortion datasets and achieves highly competitive results on the synthetic distortion datasets. During these experiments, we kept the parameters of the distortion perception and content recognition modules frozen and optimized only the correlation mapping module.
4.3.4. Efficiency Comparison
In this section, to assess the practical efficiency of the multi-module collaborative model, we selected three existing image quality assessment models, MetaIQA, MANIQA, and TRES, and compared their inference speeds with that of our model. All models evaluated the test images on an RTX 3080 Laptop GPU. Table 5 records the processing time for 10,000 images (including cropping, scaling, and inference). As shown in Table 5, MetaIQA, which uses ResNet-50 as its backbone network, has the fewest parameters and finishes first. Our model's processing speed is comparable to that of TRES, and its image processing efficiency is significantly higher than that of MANIQA.
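Timings of this kind can be reproduced with a simple loop such as the sketch below, where the model and data loader are placeholders; the explicit torch.cuda.synchronize calls ensure that queued GPU work is included in the wall-clock measurement.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, loader, device='cuda'):
    """Wall-clock seconds to score every image yielded by `loader`,
    including any preprocessing the loader performs."""
    model.eval().to(device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for batch in loader:
        model(batch.to(device))
    torch.cuda.synchronize()  # wait for all queued GPU kernels to finish
    return time.perf_counter() - start
```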
4.4. Ablation Study
In this section, we present ablation results comparing the performance of various model versions on three datasets: TID-2013, KonIQ-10K, and SPAQ. This comparison validates the effectiveness of the individual modules included in the proposed model.
4.4.1. Ablation Experiments on the Distortion Perception Module Training
Table 6 presents the ablation results for the contrastive learning framework used to train the distortion perception module. We compared three alternatives: (1) removing the distortion perception module; (2) supervised training of ResNet-50 on ImageNet to obtain the distortion perception module; (3) training ResNet-50 as a distortion perception model to regress full-reference method scores. Training the distortion perception module with the proposed approach leads to the best SRCC performance across all datasets.
4.4.2. Ablation Experiments for the Content Recognition Module Training
Table 7 presents the results of ablation experiments on the training framework for the content recognition module. We compared our framework with three other versions: (1) removing the content recognition module; (2) supervised training of the content recognition module; (3) training the content recognition module using the colorization strategy proposed in [50].
4.4.3. Ablation Experiments for the Correlation Mapping Module Training
Table 8 presents the results of ablation experiments on the training framework for the correlation mapping module. We started from a baseline that neither employed the ranking loss nor froze the parameters of the other modules, instead fine-tuning the entire model directly, and then incrementally introduced the individual mechanisms for comparison.
As shown in Table 8, introducing the ranking loss and the weight freezing mechanism during the training of the correlation mapping module each improved the model's final performance to varying degrees. The ranking mechanism allows the model to assess relative quality between different images, enhancing its overall perception of image quality, while weight freezing enables a significantly larger batch size, reducing the impact of noise during training.
5. Conclusions
In this paper, we divide the image quality assessment task into three components, distortion perception, content recognition, and correlation mapping, to address the challenge of real-world image quality assessment. By enhancing the MoCo-V2 and MoCo-V3 contrastive learning frameworks, tailoring the image patch generation process to the different modules, and introducing training mechanisms such as ranking loss and parameter freezing, our McmIQA method achieves state-of-the-art predictive performance across seven image quality assessment datasets covering both synthetic and real distortions. This indicates that the distortion perception and content recognition modules trained with the two contrastive learning schemes effectively extract the distortion features and content features relevant to image quality, and that, given these two sets of quality-related features, the correlation mapping module accurately predicts the perceptual quality of the corresponding image. Moreover, our approach is not restricted to a specific model; where memory constraints are not an issue, it can easily switch to other feature extraction networks, including Transformer-based ones, potentially yielding even more advanced performance.