\ArticleType

REVIEW \Year2024 \Month \Vol \No \DOI \ArtNo \ReceiveDate \ReviseDate \AcceptDate \OnlineDate

Perceptual Video Quality Assessment: A Survey

zhaiguangtao@sjtu.edu.cn

\AuthorMark

Min X K, et al.

\AuthorCitation

Min X K, Duan H Y, Sun W, Zhu Y C, Zhai G T, et al

Perceptual Video Quality Assessment: A Survey

Xiongkuo Min, Huiyu Duan, Wei Sun, Yucheng Zhu, Guangtao Zhai

Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

Perceptual video quality assessment plays a vital role in the field of video processing because quality degradations are introduced at various stages of video signal acquisition, compression, transmission, and display. With the advancement of internet communication and cloud service technology, video content and traffic are growing exponentially, which further emphasizes the need for accurate and rapid assessment of video quality. Therefore, numerous subjective and objective video quality assessment studies have been conducted over the past two decades for both generic videos and specific videos such as streaming, user-generated content (UGC), 3D, virtual and augmented reality (VR and AR), high frame rate (HFR), and audio-visual videos. This survey provides an up-to-date and comprehensive review of these video quality assessment studies. Specifically, we first review the subjective video quality assessment methodologies and databases, which are necessary for validating the performance of video quality metrics. Second, the objective video quality assessment algorithms for general purposes are surveyed and summarized according to the methodologies utilized in the quality measures. Third, we give an overview of the objective video quality assessment measures for specific applications and emerging topics. Finally, the performance of state-of-the-art video quality assessment measures is compared and analyzed. This survey provides a systematic overview of both classical works and recent progress in the realm of video quality assessment, which can help other researchers quickly access the field and conduct relevant research.

keywords:

Video quality assessment, human visual system, subjective quality assessment, objective quality assessment, survey

1 Introduction

Video is an electronic medium for recording, copying, playing back, broadcasting, and displaying moving visual information, and it is one of the most important forms of media. It is estimated that video accounts for 65% of internet traffic [1, 2], owing to the evolution of internet communication and cloud services and the popularization of video-sharing platforms. Recommending and delivering high-quality video content is important for retaining the interest of users. However, given the uneven quality of video capture, the huge volume of user uploads, and congested networks, the video quality on the user side may not be satisfactory, which may degrade the quality of experience (QoE) and reduce engagement. Therefore, video quality assessment (VQA) is crucial in video communication systems to ensure and improve the quality of the video content delivered to end users.

Video quality assessment can be performed subjectively or objectively [3]. Subjective video quality assessment is usually considered the most reliable and accurate evaluation method. However, subjective evaluation is time-consuming and expensive, which makes it hard to use in visual communication systems. Thus, subjective video quality assessment is generally used to evaluate objective video quality assessment methods, which aim to predict the quality of videos as perceived by the human visual system (HVS). Video quality can be affected by many factors such as spatial and temporal resolution, frame rate, blur, noise, and compression artifacts, which makes designing objective video quality evaluation algorithms challenging. Moreover, video categories are diverse and include user-generated content (UGC), virtual and augmented reality (VR and AR), high frame rate (HFR), audio-visual, and gaming videos. The HVS has different perceptual characteristics for different types of videos, so the influence of degradation on the perceived quality of these videos also differs, which further increases the difficulty of devising objective VQA measures. Therefore, numerous works have studied subjective and objective video quality assessment under different conditions and factors [4, 5, 6, 7, 8, 9, 10].

Figure 1: Scope of this survey.

1.1 Related Surveys

Since a large number of image quality assessment (IQA) and VQA studies have been carried out, several papers have surveyed these works. Some of them have reviewed research on IQA. Wang and Bovik [11] provided an initial analysis of full-reference (FR) image fidelity assessment from the aspect of signal fidelity. They further gave a comprehensive introduction to reduced-reference (RR) and no-reference (NR) IQA [12] in 2011. Lin and Kuo [13] discussed several factors that influence perceptual visual quality measures, including signal decomposition [14, 15, 16], just-noticeable distortions [17, 18], visual attention [19, 20], feature and artifact extraction [21], viewing conditions [22, 23, 20, 24], etc. Moorthy and Bovik [25] presented their perspectives on the future trends of the visual quality assessment field. Chandler [26] summarized seven challenges in image quality assessment. Zhai et al. [3] gave a comprehensive survey of classical algorithms and recent progress in the realm of perceptual image quality assessment.

Some researchers have also surveyed the studies on perceptual video quality assessment. Chikkerur et al. [27] classified and reviewed the objective video quality assessment methods, and compared the performance of these metrics. Shahid et al. [28] discussed classical and well-known NR video quality assessment algorithms. Chen et al. [29] wrote a tutorial on video quality assessment that discusses the relationship between quality of service (QoS) and quality of experience (QoE). Fan et al. [30] briefly reviewed the metrics and methods of video quality assessment. Li et al. [31] summarized recent advances and challenges in video quality assessment. Zhou et al. [32] presented a brief survey on adaptive video streaming quality assessment. Saha et al. [2] surveyed the recent progress of video quality assessment and discussed future research trends.

Most of these existing surveys of VQA research discuss classical VQA techniques, or only provide an overview of specific VQA topics. With the advancement of deep learning, many state-of-the-art VQA models have adopted deep neural networks (DNNs) to predict perceptual quality, which is rarely addressed in previous reviews. Moreover, with the recent progress in multimedia, many works have conducted VQA research for specific applications, such as VR/AR, HFR, UGC, and audio-visual content, which most surveys have not reviewed. Therefore, a systematic, comprehensive, and up-to-date survey is still needed.

1.2 Scope and Organization of This Survey

This survey provides an up-to-date and comprehensive review of VQA studies. Since VQA has varied application-specific use cases, we provide a thorough overview of these studies but refrain from benchmarking them. The scope of this survey is shown in Figure 1. The survey is organized as follows. Section 2 summarizes subjective VQA methodologies, and reviews subjective VQA databases for general purposes and specific applications. In Section 3, we review objective VQA measures for traditional topics (i.e., general purposes), including FR, RR, and NR metrics. Section 4 surveys objective VQA measures for emerging topics (i.e., specific applications), including compression VQA, streaming VQA, stereoscopic VQA, virtual reality VQA, high frame rate VQA, audio-visual VQA, high dynamic range (HDR) VQA, and screen and game VQA. In Section 5, the evaluation process of VQA measures is discussed, with a comparison of their performance. Section 6 discusses future trends in VQA, and Section 7 summarizes the whole paper.

2 Subjective Video Quality Assessment

Subjective quality assessment is the most reliable way to evaluate the perceptual quality of images or videos, since human eyes are usually the ultimate receiver of these contents [3, 33, 34]. Different application systems may require different subjective assessment methods [35, 36, 37, 38, 39, 40, 41]. In this section, we first review the general methodology of subjective quality assessment suggested by ITU-R BT.500 [42], including subjective viewing environment setup, subject recruitment, subject grading, and subjective result processing. Then, 20 subjective VQA databases for general contents are reviewed.

2.1 Subjective VQA Methodology

Subjective quality assessment usually requires a large number of subjects to rate the quality of the target stimuli according to certain standards, and takes the mean opinion score (MOS) or difference mean opinion score (DMOS) as the subjective quality result. The MOS is the average score from all subjects for a specific stimulus, while the DMOS is the average difference between the scores of the reference stimulus and the scores of the corresponding distorted stimulus. For DMOS, the influence of the video content can be effectively reduced by subtracting the scores of the reference material. Measuring the perceived quality of videos requires subjective scaling methods, and several test methods are commonly adopted, including absolute category rating (ACR), comparison category rating (CCR, also known as double stimulus comparison scale (DSCS)), and degradation category rating (DCR, also known as double stimulus impairment scale (DSIS)). Whereas ACR is better suited to obtaining a general, unbiased judgment of the overall quality, DCR and CCR may be better suited to smaller, subtle differences. The subjective quality assessment process generally includes five steps according to the recommendations given by the ITU [42]:

(1) Build evaluation environment. Experimental instructors need to set up and calibrate the test environment and test equipment to meet the corresponding viewing requirements.

(2) Prepare test stimuli. Generally, experimental instructors prepare test stimuli according to the problem to be evaluated, such as raw videos and distorted videos.

(3) Invite or recruit subjects to give opinion scores. The subjects can be experts or non-experts according to the requirements of the experiment, but in any case the subjects should not know the purpose of the experiment. All subjects should have normal or corrected-to-normal vision. Generally, the number of subjects should not be less than 15.

(4) Conduct subjective experiments. The subjects should give their subjective quality ratings according to predetermined test methods and evaluation scales. There are many types of test methods, such as the single-stimulus method and the double-stimulus method, which are generally selected according to the needs of the experiment. The evaluation scale generally adopts a five-grade quality and impairment scale, and the detailed scale can be continuous or discrete according to the requirements of the experiment.

(5) Process subjective data. First, the subjective data needs to be screened to remove abnormal subjects and abnormal scores; the screening criteria can follow the recommendations provided by the ITU. Then the MOS or DMOS values can be calculated as the final subjective quality results.
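The MOS and DMOS definitions above can be illustrated with a minimal Python sketch. The rating values, the function names, and the layout `ratings[stimulus][subject]` are hypothetical and chosen only for illustration. The screening rule (dropping scores more than `k` standard deviations from the stimulus mean) is a simplified stand-in and is not the screening procedure recommended by the ITU.

```python
from statistics import mean, pstdev

# Hypothetical toy data on a five-grade scale: ratings[stimulus][subject] = score.
ratings = {
    "ref":   {"s1": 5, "s2": 5, "s3": 4, "s4": 5},
    "dist1": {"s1": 3, "s2": 4, "s3": 3, "s4": 2},
    "dist2": {"s1": 2, "s2": 1, "s3": 2, "s4": 1},
}


def screen_scores(scores, k=2.0):
    """Drop scores further than k standard deviations from the stimulus mean.

    Simplified stand-in for score screening, NOT the ITU procedure.
    """
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    return {s: v for s, v in scores.items() if abs(v - mu) <= k * sigma}


def mos(ratings, stimulus):
    """MOS: average score over all subjects for one stimulus."""
    return mean(list(ratings[stimulus].values()))


def dmos(ratings, reference, distorted):
    """DMOS: average per-subject difference (reference score - distorted score)."""
    common = ratings[reference].keys() & ratings[distorted].keys()
    return mean([ratings[reference][s] - ratings[distorted][s] for s in common])


for name in ("dist1", "dist2"):
    print(name, "MOS =", mos(ratings, name), "DMOS =", dmos(ratings, "ref", name))
```

Because each difference is taken per subject against the same reference content, a larger DMOS indicates a stronger perceived degradation, independent of how appealing the content itself is.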

Since it is hard to recruit numerous subjects for a lab experiment, many recent studies have adopted crowdsourcing to conduct subjective video quality assessment experiments. Crowdsourcing offers a fast, low-cost, and scalable approach by outsourcing tasks to a large number of participants. In addition, crowdsourcing provides a large and diverse pool of participants, and a practical environment for quality assessment of multimedia services and applications. Nevertheless, crowdsourced subjective quality assessment may not yield data of the same quality as laboratory testing, due to factors inherent in the nature of crowdsourcing. Therefore, crowdsourcing experiments should be designed differently and carefully. The ITU has given some recommendations for crowdsourcing subjective quality assessment [43]. The subjective crowdsourcing video quality assessment method generally includes the following steps:

(1) Choose crowdsourcing platforms and crowdworkers. Many crowdsourcing platforms can be adopted to conduct subjective VQA experiments, such as Amazon Mechanical Turk (AMT) [44], Microworkers [45], CrowdFlower [46], Crowdee [47], etc. Moreover, most platforms allow experimenters to choose target subjects by conditions such as household size, educational level, and yearly income.

(2) Prepare subjective experiment. Given the crowdsourcing condition, several aspects should be considered when preparing the experimental stimuli, including overall experiment enjoyability, task duration and complexity, user interface (UI) logistics, compensation relative to the test duration and complexity, test clarity, and the user’s ability to understand the task.

(3) Conduct subjective experiment. The experiment procedure includes a qualification step, a training step, and a rating step. It should be noted that experimenters should incorporate different validity checks in these steps to ensure the worker’s full attentiveness to the task, and the adequacy of the environment and system used by the worker.

(4) Screen subjective data. It is important to screen the data obtained from crowdsourcing, including examining all votes for each video stimulus to remove score outliers, and calculating the correlation coefficient between the MOS from one worker and the global MOS to remove subject outliers.
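The subject-outlier screening described in step (4) can be sketched as follows. This is a minimal illustration under stated assumptions: every worker is assumed to have rated every video, the Pearson correlation is used as the correlation coefficient, and the threshold `min_corr=0.5`, the function names, and the toy scores are hypothetical rather than values taken from any recommendation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant rater carries no ranking information
    return cov / sqrt(vx * vy)


def screen_workers(scores, min_corr=0.5):
    """Keep workers whose scores correlate with the global MOS.

    scores[worker][video] = rating; min_corr is an assumed threshold.
    """
    videos = sorted(next(iter(scores.values())))
    global_mos = [sum(r[v] for r in scores.values()) / len(scores) for v in videos]
    kept = {}
    for worker, r in scores.items():
        if pearson([r[v] for v in videos], global_mos) >= min_corr:
            kept[worker] = r
    return kept


# Hypothetical toy data: w4 rates in the opposite order to everyone else.
scores = {
    "w1": {"a": 5, "b": 3, "c": 1},
    "w2": {"a": 4, "b": 3, "c": 2},
    "w3": {"a": 5, "b": 4, "c": 1},
    "w4": {"a": 1, "b": 3, "c": 5},
}
kept = screen_workers(scores)
print(sorted(kept))
```

In this toy example the global MOS still includes the outlier when correlations are computed; a stricter variant could recompute the global MOS after each removal, at the cost of an iterative procedure.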

Table 1: An overview of popular public video quality assessment databases for general contents, including lab-introduced video datasets with synthetic distortions and large-scale crowdsourced user-generated content (UGC) video datasets with authentic distortions.

Database	Type	Year	#Cont.	#Total	Resolution	FR	Dur.	Format	Distortions	#Subj.	#Ratings	Data	Env.
LIVE-VQA [48]	Syn.	2008	10	160	768 $\times$ 432	25/50	10	YUV $+$ 264	Compression, transmission	38	29	DMOS $+$ $\sigma$	In-lab
EPFL-PoliMI [49]	Syn.	2009	12	156	CIF/4CIF	25/30	10	YUV $+$ 264	Compression, transmission	40	34	MOS	In-lab
VQEG-HDTV [50]	Syn.	2010	49	740	1080i/p	25/30	10	AVI	Compression, transmission	120	24	RAW	In-lab
IVP [51]	Syn.	2011	10	138	1080p	25	10	YUV	Compression, transmission	42	35	DMOS $+$ $\sigma$	In-lab
TUM 1080p50 [52]	Syn.	2012	5	25	1080p	50	10	YUV	Compression	21	21	MOS	In-lab
LIVE Mobile [53]	Syn.	2012	10	200	720p	30	15	YUV	Compression, transmission	30 $+$ 17	30 $+$ 17	DMOS $+$ $\sigma$	In-lab
CSIQ [54]	Syn.	2014	12	228	832 $\times$ 480	24-60	10	YUV	Compression, transmission	35	N/A	DMOS $+$ $\sigma$	In-lab
MCL-V [55]	Syn.	2015	12	108	1080p	24-30	6	YUV	Compression, scaling	45	32	MOS	In-lab
MCL-JCV [56]	Syn.	2016	30	1560	1080p	24-30	5	MP4	Compression	150	50	RAW-JND	In-lab
CVD2014 [57]	Aut.	2014	5	234	720p, 480p	9-30	10-25	AVI	In-capture	210	30	MOS	In-lab
LIVE-Qualcomm [58]	Aut.	2016	54	208	1080p	30	15	YUV	In-capture	39	39	MOS	In-lab
KoNViD-1k [59]	Aut.	2017	1200	1200	540p	24-30	8	MP4	In-the-wild	642	114	MOS $+$ $\sigma$	Crowd
LIVE-VQC [60]	Aut.	2018	585	585	1080p-240p	19-30	10	MP4	In-the-wild	4776	240	MOS	Crowd
YouTube-UGC [61]	Aut.	2019	1380	1380	4k-360p	15-60	20	MKV	In-the-wild	$>$ 8k	123	MOS $+$ $\sigma$	Crowd
LSVQ [62]	Aut.	2021	39075	39075	Diverse	Diverse	5-12	MP4	In-the-wild	6284	35	MOS	Crowd
UGC-VIDEO [63]	Syn. $+$ Aut.	2019	50	550	720p	30	10	N/A	UGC $+$ compression	30	30	DMOS	In-lab
LIVE-WC [64]	Syn. $+$ Aut.	2020	55	275	1080p	30	10	MP4	UGC $+$ compression	40	40	MOS	In-lab
YT-UGC $+$ (Subset) [65]	Syn. $+$ Aut.	2021	189	567	1080p, 720p	Diverse	20	RAW $+$ 264	UGC $+$ compression	N/A	30	DMOS	In-lab
ICME2021 [66]	Syn. $+$ Aut.	2021	1000	8000	N/A	N/A	N/A	N/A	UGC $+$ compression	N/A	N/A	MOS	In-lab
TaoLive [67]	Syn. $+$ Aut.	2022	418	3762	1080p, 720p	20	8	MP4	UGC $+$ compression	44	44	MOS	In-lab

Note: #Cont.: The number of unique video contents. #Total: Total number of test video sequences. FR: Framerate (in fps). Dur.: Video duration/length (in seconds).

#Subj.: Total number of subjects in the study. #Ratings: Average number of subjective ratings per video. Env.: Subjective experiment environment.

In-lab: Experiment was conducted in a laboratory. Crowd: Experiment was conducted by crowdsourcing. Syn.: Synthetic. Aut.: Authentic.

2.2 Subjective VQA Databases for General Purpose

Table 1 [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67] gives an overview of 20 databases that are widely used in the research of visual quality assessment for general video contents. Information including the type of databases, years, the number of unique video contents, total numbers of test video sequences, video resolutions, video frame rates, video durations, video formats, distortion types, subject numbers, average numbers of subjective ratings per video, subjective score types and subjective experiment environments is summarized.

2.2.1 General VQA Databases with Synthetic Distortions

Many early VQA studies have only considered synthetic distortions in their databases. We first review 9 popular VQA databases with synthetic distortions as follows [48, 49, 50, 51, 52, 53, 54, 55, 56].

•

LIVE video quality assessment database (LIVE-VQA) [48]. LIVE-VQA is a synthetic database, which includes 10 pristine videos and 160 distorted videos corrupted by compression and transmission distortions. All videos have a resolution of 768 $\times$ 432, a frame rate of 25 or 50 fps, and a duration of 10 seconds. The video formats are YUV and H.264. The subjective experiment was conducted in a lab environment, and the subjective data contains DMOS and $\sigma$ .
•

EPFL-PoliMI database [49]. EPFL-PoliMI is a synthetic database, which includes 12 pristine videos and 156 distorted videos corrupted by compression and transmission distortions. All videos have a resolution of CIF (352 $\times$ 288) or 4CIF (704 $\times$ 576), a frame rate of 25 or 30 fps, and a duration of 10 seconds. The video formats are YUV and H.264. The subjective experiment was conducted in a lab environment, and the subjective data is MOS.
•

VQEG-HDTV database [50]. VQEG-HDTV is a synthetic database, which includes 49 pristine videos and 740 distorted videos degraded by compression and transmission distortions. All videos have a resolution of 1080i or 1080p, a frame rate of 25 or 30 fps, and a duration of 10 seconds. The video format is AVI. The subjective experiment was conducted in a lab environment, and the raw subjective data is available.
•

IVP subjective quality video database [51]. IVP is a synthetic database, which includes 10 pristine videos and 138 distorted videos generated by compression and transmission distortions. All videos have a resolution of 1080p, a frame rate of 25 fps, and a duration of 10 seconds. The video format is YUV. The subjective experiment was conducted in a lab environment, and the subjective data contains DMOS and $\sigma$ .
•

TUM high definition video datasets (TUM 1080p50) [52]. It is a synthetic database, which includes 5 pristine videos and 25 distorted videos degraded by compression. All videos have a resolution of 1080p, a frame rate of 50 fps, and a duration of 10 seconds. The video format is YUV. The subjective experiment was conducted in a lab environment, and the subjective data available is MOS.
•

LIVE mobile video quality assessment database (LIVE Mobile) [53]. LIVE Mobile is a synthetic database, which includes 10 pristine videos and 200 distorted videos corrupted by compression and transmission distortions. All videos have a resolution of 720p, a frame rate of 30 fps, and a duration of 15 seconds. The video format is YUV. The subjective experiment was conducted in a lab environment, and the subjective data contains DMOS and $\sigma$ .
•

CSIQ video database [54]. CSIQ is a synthetic database, which includes 12 pristine videos and 228 distorted videos corrupted by compression and transmission distortions. All videos have a resolution of 832 $\times$ 480, and a duration of 10 seconds. The frame rate ranges from 24 to 60 fps. The video format is YUV. The subjective experiment was conducted in a lab environment, and, as listed in Table 1, the subjective data contains DMOS and $\sigma$ .
•

MCL-V: a streaming video quality assessment database [55]. MCL-V is a synthetic database, which includes 12 pristine videos and 108 distorted videos corrupted by compression and scaling distortions. All videos have a resolution of 1080p and a duration of 6 seconds. The frame rate ranges from 24 to 30 fps. The video format is YUV. The subjective experiment was conducted in a lab environment, and the subjective data available is MOS.
•

MCL-JCV: a JND-based H.264/AVC video quality assessment dataset [56]. MCL-JCV is a synthetic database, which includes 30 pristine videos and 1560 distorted videos degraded by compression. All videos have a resolution of 1080p and a duration of 5 seconds. The frame rate ranges from 24 to 30 fps. As listed in Table 1, the video format is MP4. The subjective experiment was conducted in a lab environment, and the subjective data format is RAW-JND.

2.2.2 General VQA Databases with Authentic Distortions

With the popularity of UGC, many recent VQA studies have focused on authentic distortions [57, 58, 59, 60, 61, 62], i.e., in-capture or in-the-wild distortions, and some studies have also considered both synthetic and authentic distortions [63, 64, 65, 66, 67]. These databases are summarized as follows.

•

CVD2014 [57]. It is an authentic database, which includes 5 scenes and 234 test video sequences with camera in-capture distortions. The resolution of the videos in CVD2014 is 720p or 480p. The frame rate ranges from 9 to 30 fps. The video length ranges from 10 to 25 seconds. The video format is AVI. The subjective experiment was conducted in a lab environment, and the subjective data format is MOS.
•

LIVE-Qualcomm [58]. LIVE-Qualcomm is an authentic database, which includes 54 scenes and 208 test video sequences with in-capture distortions. All videos have a resolution of 1080p, a frame rate of 30 fps, and a duration of 15 seconds. The video format is YUV. The subjective experiment was conducted in a lab environment, and the subjective data format is MOS.
•

The Konstanz natural video database (KoNViD-1k) [59]. KoNViD-1k is an authentic database, which includes 1200 unique test video sequences with diverse authentic distortions. All videos were sampled from YFCC100M (Flickr) via a feature space of blur, colorfulness, contrast, SI, TI, and NIQE; the content was clipped from the original videos and resized to 540p in landscape layout. The frame rates of these videos are 24, 25, and 30 fps, and the duration is 8 seconds. All videos are in MP4 format. The subjective experiment was conducted by crowdsourcing using CrowdFlower. A total of 642 subjects took part in the experiment, and 136800 subjective quality ratings were collected with about 114 votes per video. The MOS and $\sigma$ values are available in the database.
•

LIVE-VQC [60]. LIVE-VQC is an authentic database, which includes 585 unique test video sequences with diverse authentic distortions. All videos were captured manually and include many camera motion distortions and some night scene distortions. The resolution is not uniformly distributed and ranges from 240p to 1080p in landscape or portrait layout. The frame rates of these videos are 20, 24, 25, and 30 fps, and the duration is 10 seconds. All videos are in MP4 format. The subjective experiment was conducted by crowdsourcing using AMT. A total of 4776 subjects took part in the experiment, and 205000 subjective quality ratings were collected with about 240 votes per video. The MOS values are available in the database.
•

YouTube UGC dataset for video compression research (YouTube-UGC) [61]. YouTube-UGC is an authentic database, which includes 1380 unique test video sequences with diverse authentic distortions. All videos were sampled from YouTube via a feature space of spatial, color, temporal, and chunk variation, with diverse video contents including HDR, screen content, animation, gaming videos, etc. The videos have resolutions of 4k, 1080p, 720p, 480p, and 360p in landscape and portrait layouts. The frame rates of these videos are 15, 20, 24, 25, 30, 50, and 60 fps, and the duration is 20 seconds. All videos are in MKV format. The subjective experiment was conducted by crowdsourcing using AMT. More than 8000 subjects participated in the experiment, and 170159 subjective quality ratings were collected with about 123 votes per video. The MOS and $\sigma$ values are available in the database.
•

LSVQ [62]. LSVQ is a large-scale authentic video quality assessment database, which includes 39075 unique video sequences with diverse authentic distortions. The videos in LSVQ have diverse resolutions and frame rates. The duration ranges from 5 seconds to 12 seconds. All videos are in MP4 format. The subjective experiment was conducted by crowdsourcing using AMT. A total of 6284 subjects were included in the experiment, and each video was evaluated by 35 subjects. The subjective data format is MOS.
•

UGC-VIDEO [63]. UGC-VIDEO is a video quality assessment database with both synthetic and authentic distortions. It contains 50 UGC videos and 550 distorted videos corrupted by compression. All videos have a resolution of 720p, a frame rate of 30 fps, and a duration of 10 seconds. The subjective experiment was conducted in a lab environment, and each video was assessed by 30 subjects. The subjective data format is DMOS.
•

LIVE-WC [64]. LIVE-WC is a video quality assessment database with both synthetic and authentic distortions. It contains 55 UGC videos and 275 distorted videos corrupted by compression. All videos have a resolution of 1080p, a frame rate of 30 fps, and a duration of 10 seconds. All videos are in MP4 format. The subjective experiment was conducted in a lab environment, and each video was assessed by 40 subjects. The subjective data format is MOS.
•

YT-UGC+(Subset) [65]. YT-UGC+(Subset) is a video quality assessment database with both synthetic and authentic distortions. It contains 189 UGC videos and 567 distorted videos corrupted by compression. All videos are in the resolutions of 1080p or 720p. The videos have diverse frame rates and a fixed duration of 20 seconds. The subjective experiment was conducted in a lab environment, and each video was assessed by 30 subjects. The subjective data format is DMOS.
•

ICME2021 [66]. It is a video quality assessment database with both synthetic and authentic distortions. It contains 1000 UGC videos and 8000 distorted videos corrupted by compression. The subjective experiment was conducted in a lab environment, and the data format is MOS.
•

TaoLive [67]. TaoLive is a video quality assessment database with both synthetic and authentic distortions, which contains 418 UGC videos and 3762 distorted videos corrupted by compression. All videos are in the resolutions of 1080p or 720p. The frame rate is 20 fps and the video length is 8 seconds. The subjective experiment was conducted in a lab environment, and each video was assessed by 44 subjects. The subjective data format is MOS.
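As a quick consistency check on the crowdsourcing figures quoted above for KoNViD-1k and YouTube-UGC, the average number of ratings per video is the total number of collected ratings divided by the number of videos. The symbols $\bar{n}$, $N_{\text{ratings}}$, and $N_{\text{videos}}$ are introduced here only for illustration, and `\text` assumes the amsmath package.

```latex
\[
\bar{n} = \frac{N_{\text{ratings}}}{N_{\text{videos}}}, \qquad
\bar{n}_{\text{KoNViD-1k}} = \frac{136800}{1200} = 114, \qquad
\bar{n}_{\text{YouTube-UGC}} = \frac{170159}{1380} \approx 123.3 .
\]
```

Both values agree with the #Ratings column of Table 1.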

Table 2: An overview of popular public video quality assessment databases for specific applications.

Category	Database	Year	#Ref.	#Total	Resolution	Dur.	#Dist. Type	#Subj.	Data
Streaming	LIVE Mobile [53]	2012	10	200	720p	15	H.264 compression, switching, stalling	47	DMOS
	LIVE-TVSQ [68]	2014	3	15	720p	300	H.264 compression, switching	25	RDMOS
	LIVE-AMV [69]	2014	24	180	720p, 360p	29-134	stalling	27	DMOS
	LIVE-NFLX-I [70]	2017	14	112	1080p	$>$ 60	H.264, initial buffering, stalling, switching	55	MOS
	LIVE-NFLX-II [71]	2018	15	420	1080p	diverse	Video encoding, network simulation, etc.	65	MOS
	WaterlooSQoE-I [72]	2016	20	180	1080p	10	H.264, initial buffering, stalling	25	MOS
	WaterlooSQoE-II [73]	2017	12	588	1080p	diverse	H.264, switching	35	MOS
	WaterlooSQoE-III [74]	2018	20	450	diverse	diverse	H.264, initial buffering, stalling, switching	34	MOS
	WaterlooSQoE-IV [75]	2019	5	1450	N/A	N/A	video encoders, network traces, ABR algorithms, etc.	N/A	MOS
3D	LIVE 3D [5]	2012	6	54	480p	10, 15	H.264	27	DMOS
	StSD 3D [76]	2013	14	116	1080p	8	H.264, HEVC	16	DMOS
	Tampere 3D [77]	2011	4	60	1080p	10	H.264, Depth level	30	N/A
	MMSPG 3D [78]	2010	6	30	1080p	10	Camera distance	17	MOS
	NAMA3DS1-COSPAD1 [79]	2012	10	110	1080p	13, 16	H.264, JPEG2000	29	MOS
	UBC DML 3D [80]	2014	5	64	1080p	10	HEVC, frame rates	16	MOS
	3DVCL $@$ FER [81, 82]	2015	6	184	1080p	10	H.264, JPEG2000, geometric distortion, etc.	35	MOS
	WATERLOO-IVC 3D [83]	2017	10	704	1080p, 1024 $\times$ 768	6, 10	HEVC, Gaussian low-pass filtering	54	MOS
VR	IVQAD2017 [4]	2017	10	160	4096 $\times$ 2048	15	MPEG-4, difference resolutions, different frame rates	13	MOS
	Zhang et al. [84]	2018	10	60	3600 $\times$ 1800	10	H.265 with different QPs	30	DMOS, HM
	Zhang et al. [85]	2017	16	400	4096 $\times$ 2048	N/A	VP9, H.264, H.265, Gaussian noise, box blur	23	MOS, DMOS
	Lopes et al. [86]	2018	6	85	8192 $\times$ 4096	10	H.265, different resolutions, different frame rates	37	MOS, DMOS
	Singla et al. [87]	2017	6	66	1080p, 4k	10	H.265 with different bitrates	30	MOS, HM
	VR-VQA48 [88]	2018	12	48	4096 $\times$ 2048	12	H.265 with different QPs	48	MOS, DMOS
	Tran et al. [89]	2018	6	126	N/A	30	H.265 with different QPs, different resolutions	37	MOS
	VQA-ODV [90]	2018	60	600	7680 $\times$ 3840-3840 $\times$ 1920	10-23	H.265 with different QPs; different projections	221	MOS, DMOS
	LIVE-FBT-FCVR 2D & 3D [91]	2021	20	360	7680 $\times$ 3840, 5376 $\times$ 5376	10	3 foveated samples, 4 radii, 5 VP9 QPs	N/A	MOS
Frame rate, frame interpolation	Waterloo-IVC-HFR [6]	2015	7	336	1080p, 480p	10	Frame rate, QP, resolution	25	MOS
	BVI-HFR [92]	2018	22	88	1080p	10	Frame rate	51	MOS
	LIVE-YouTube-HFR [93]	2021	16	480	UHD-1, HD	6-8, 10	Frame rate, VP9	85	MOS
	ETRI-LIVE STSVQ [94]	2021	15	437	3840 $\times$ 2160	4.5-7	Spatial subsampling, temporal subsampling, HEVC	34	DMOS
	KosMo-1k [95]	2020	30	1350	480 $\times$ 540	8	Slow motion	N/A	MOS
	BVI-VFI [96]	2022	108	540	UHD-1, HD, 960 $\times$ 540	N/A	Frame repeating, averaging, interpolation	189	DMOS
Audio-Visual	Winkler et al. [97]	2006	6	48	QCIF (176 $\times$ 144)	$\sim$ 8	H.264, MPEG-4	24	MOS
	VQEGMM2 [98, 99]	2012	10	60	640 $\times$ 480	10	H.264 (AVC), AAC	10	MOS
	Demirbilek et al. [100]	2016	6	144	1080p, 720p	N/A	Resolution, bit rate, bandwidth, etc.	24	MOS
	Martinez et al. [101, 102]	2014	6	132	720p	N/A	H.264, MPEG-1 layer-3	17	MOS
	LIVE-SJTU A/V-QA [7]	2020	14	336	1080p	8	HEVC, scaling, AAC	35	MOS
	Fela et al. [103]	2022	12	576	6144 $\times$ 3072	20	H.265/HEVC, AAC-LC, resolution	20	MOS
	OAVQAD [8]	2023	15	375	7680 $\times$ 3840	6	HEVC, AAC, resolution, noise, blur, stalling	22	MOS
HDR/ WCG/ iTMO/ TMO	DML-HDR [104]	2014	4	32	1080p	10, 17	HEVC, H.264 with different QPs	17	MOS
	Narwaria et al. [105]	2015	9	153	1080p	N/A	TMO, iTMO, compression and decompression	25	MOS
	Mukherjee et al. [106]	2016	39	429	1080p	5	H.264	64	Rank
	Yeganeh et al. [107]	2016	10	40	N/A	N/A	TMO	30	MOS
	DML-HDR 2 [108]	2018	5	N/A	2048 $\times$ 1080	10	AWGN, mean intensity shift, low-pass filter, etc.	18	MOS
	Waterloo UHD-HDR-WCG [109]	2019	14	140	3840 $\times$ 2160	10	H.264, HEVC	51	MOS
	LIVE-HDR [9]	2022	31	310	4k, 1080p, 720p, 540p	1-10	HEVC with different bitrates, resolution	66	DMOS
Screen/game	SCVD [110]	2020	16	800	1080p	10	GN, GB, MB, CC, CSC, CQD, H.264, etc.	32	MOS
	CSCVQ [111]	2020	11	165	720p	10	H.264, HEVC, HEVC-SCC	20	MOS
	GamingVideoSET [112]	2018	24	600	480p, 720p, 1080p	30	H.264 compression	25	MOS
	KUGVD [113]	2019	6	150	480p, 720p, 1080p	30	H.264 compression	17	MOS
	CGVDS [114]	2020	15	255	480p, 720p, 1080p	30	H.264 compression	$>$ 100	MOS
	TGV [115]	2022	150	1293	480p, 720p, 1080p	5	H.264, H.265, Tencent codec	19	N/A
	LIVE-YOUTUBE GVQA [10]	2023	600	600	360p, 480p, 720p, 1080p	8-9	PGC, UGC	61	MOS

2.3 Subjective VQA Databases for Specific Applications

With the advancement of multimedia video services, video categories have become increasingly diverse, and many studies have therefore investigated VQA for specific applications. In this section, we discuss the VQA databases for specific applications summarized in Table 2.

2.3.1 Streaming VQA Databases

Many databases have considered the temporal degradations of the videos during streaming services.

•

LIVE mobile video quality assessment database (LIVE Mobile) [53]. LIVE mobile consists of 10 reference videos and 200 distorted videos with 5 distortion types including H.264 compression, stalling, frame drop, rate adaptation, and wireless channel packet-loss. The resolution of the videos is 720p, and the duration is 15 seconds. A total of 47 subjects were included in the subjective experiment.
•

LIVE time-varying subjective quality database (LIVE-TVSQ) [68]. LIVE-TVSQ consists of 3 reference videos constructed by concatenating 8 high-quality high-definition video clips of different content, and 15 distorted videos corrupted by adjusting the encoding bitrate of H.264 video encoder with 5 bitrate-varying levels. The resolution is 720p. Each video is 5 minutes long and is viewed by 25 subjects. The subjective data format is Reversed DMOS (RDMOS).
•

LIVE-Avvasi Mobile Video database (LIVE-AMV) [69]. LIVE-AMV consists of 24 reference videos and 180 distorted videos generated with 26 hand-crafted stalling events. 17 videos have a resolution of 720p, and 7 videos have a resolution of 360p. The video lengths range from 29 to 134 seconds. The single stimulus continuous quality evaluation procedure was adopted, where the reference videos were also evaluated to obtain a DMOS for each distorted video sequence.
•

LIVE-Netflix Video QoE Database I (LIVE-NFLX-I) [70]. LIVE-NFLX-I consists of 112 distorted videos derived from 14 source contents with 8 handcrafted playout patterns including dynamically changing H.264 compression rates, rebuffering events, and mixtures of both. The resolution of the videos is 1080p. The video sequences were displayed on a small mobile screen at low bitrates, and were viewed for at least one minute by 55 subjects. MOS values were obtained for the videos.
•

LIVE-Netflix Video QoE Database II (LIVE-NFLX-II) [71]. LIVE-NFLX-II consists of 420 streaming videos derived from 15 source contents with various streaming degradations including content-adaptive encoding profiles, bitrate adaptation algorithms, and various network conditions. The videos have a resolution of 1080p and diverse video lengths. A total of 65 subjects were included in the experiment and MOS values were collected.
•

Waterloo Streaming QoE Database I (WaterlooSQoE-I) [72]. WaterlooSQoE-I contains 20 pristine videos and 180 distorted videos including 60 compressed videos, 60 initial buffering videos, and 60 mid-stalling videos. The resolution of the videos is 1080p, and the duration is 10 seconds. Each video was evaluated by 25 subjects, and MOS values were obtained.
•

Waterloo Streaming QoE Database II (WaterlooSQoE-II) [73]. WaterlooSQoE-II contains 12 pristine videos and 588 distorted videos corrupted by various compression levels, spatial resolutions, and frame rates. The resolution is 1080p. The videos have diverse lengths. The subjective data format is MOS.
•

Waterloo Streaming QoE Database III (WaterlooSQoE-III) [74]. WaterlooSQoE-III contains 20 source videos and 450 streaming videos corrupted by various encoding configurations, bandwidth shaping, and ABR algorithms. The videos have diverse resolutions and lengths. The subjective data format is MOS.
•

Waterloo Streaming QoE Database IV (WaterlooSQoE-IV) [75]. WaterlooSQoE-IV dataset contains 1350 highly realistic streaming videos generated from 5 pristine videos with the combinations of 2 video encoders, 9 real-world network traces, 5 ABR algorithms, and 3 viewing devices. The 5 ABR algorithms include RB, BB, FastMPC, Pensieve, and RDOS.
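Several of the databases above report DMOS rather than MOS, which requires that each subject also rates the reference video. A minimal sketch of that computation is given below, assuming one rating per subject for the reference and one for the distorted video on the same scale; the function name and the sample ratings are illustrative, and the z-score normalization and subject rejection that real studies usually apply are omitted.

```python
from statistics import mean


def dmos(ref_scores, dist_scores):
    """Difference mean opinion score for one distorted video.

    ref_scores[i] and dist_scores[i] are the ratings that subject i gave to
    the reference and to the distorted video, on the same rating scale.
    """
    if len(ref_scores) != len(dist_scores):
        raise ValueError("each subject must rate both videos")
    # Per-subject difference score: larger means more perceived degradation.
    diffs = [r - d for r, d in zip(ref_scores, dist_scores)]
    return mean(diffs)


# Hypothetical ratings from four subjects on a 0-100 scale.
reference = [90, 85, 80, 95]
distorted = [60, 70, 50, 75]
print(dmos(reference, distorted))  # 23.75
```

Under this convention a higher DMOS indicates worse quality, which is the opposite direction to MOS.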

2.3.2 3D VQA Databases

Traditional videos are planar (2D) videos without stereoscopic depth cues. With the advancement of display techniques, many 3D videos have emerged, and many 3D VQA databases have been established.

•

LIVE 3D Video Database (LIVE 3D) [5]. LIVE 3D contains 6 pristine videos and 54 distorted videos corrupted by 9 different quantization parameter (QP) levels. The resolution of the videos is 480p. The length of two source videos is 15 seconds, while the length of the remaining four source videos is 10 seconds. A total of 27 subjects were recruited and divided into two groups. In group A, 13 subjects were asked to evaluate the spatial video quality (SVQ), depth quality (DQ), and visual comfort (VC), while in group B, 14 subjects were asked to rate the overall 3D video quality (3DVQ).
•

StSD 3D Video Database (StSD 3D) [76]. StSD 3D contains 14 pristine videos and 116 distorted videos corrupted by H.264 and HEVC compressions. The resolution of the videos is 1080p. The duration is 8 seconds. A total of 16 subjects were included in the experiment and DMOS values were collected.
•

Tampere 3D Video Database (Tampere 3D) [77]. Tampere 3D contains 4 pristine videos and 60 distorted videos corrupted by the H.264 compression and various depth levels. All videos have a resolution of 1080p and a duration of 10 seconds. A total of 30 subjects were included in the experiment.
•

MMSPG 3D Video Quality Assessment Database (MMSPG 3D) [78]. MMSPG 3D contains 6 pristine scenes, and 5 different stimuli were generated for each scene with different camera distances including 10, 20, 30, 40, 50 cm. All videos have a resolution of 1080p and a duration of 10 seconds. MOS values were calculated by subjective quality ratings collected from 17 qualified subjects.
•

NAMA3DS1-COSPAD1 [79]. NAMA3DS1-COSPAD1 contains 10 pristine scenes and 110 distorted videos corrupted by H.264 and JPEG 2000 compressions. The resolution of the videos is 1080p. The video lengths are 13 seconds or 16 seconds. A total of 29 subjects were included in the experiment and MOS values were collected.
•

UBC Digital Multimedia Lab 3D Video Database (UBC DML 3D) [80]. UBC DML 3D contains 5 pristine videos and 64 distorted videos corrupted by HEVC compression and different frame rates. All videos have a resolution of 1080p and a duration of 10 seconds. A total of 16 subjects were included in the experiment and MOS values were collected.
•

3DVCL $@$ FER Video Database (3DVCL $@$ FER) [81, 82]. 3DVCL $@$ FER contains 6 pristine videos and 184 distorted videos corrupted by H.264 compression, JPEG 2000 compression, geometric distortion, packet losses, different frame rates, and frame freeze. All videos have a resolution of 1080p and a duration of 10 seconds. A total of 35 subjects were included in the experiment and MOS values were collected.
•

WATERLOO-IVC 3D Video Quality Database (WATERLOO-IVC 3D) [83]. WATERLOO-IVC 3D contains two sub-databases. Waterloo-IVC 3D Video Database Phase I contains 4 pristine multi-view 3D videos and 176 distorted videos corrupted by symmetric and asymmetric transform-domain quantization coding followed by different levels of low-pass filtering. Waterloo-IVC 3D Video Database Phase II includes 6 pristine 3D videos and various distorted stereoscopic 3D videos obtained from mixed-resolution coding, asymmetric transform-domain quantization coding, their combinations, and different levels of low-pass filtering. The videos in database Phase I have a resolution of 1024 $\times$ 768, and the videos in database Phase II have a resolution of 1080p. 22 subjects were recruited in the Phase I experiment, while 32 subjects were recruited in the Phase II experiment. MOS values were obtained in the experiments.

2.3.3 VR VQA Databases

Virtual Reality (VR) allows users to perceive 360 ${}^{\circ}$ digital content immersively via head-mounted displays (HMDs), which are an increasingly popular display medium. Omnidirectional videos are an important type of digital content in VR, and thus many omnidirectional VQA databases have been established [116].

•

Immersive Video Quality Assessment Database 2017 (IVQAD2017) [4]. IVQAD2017 is a large-scale immersive video quality assessment database, which contains 10 pristine videos and 160 distorted videos corrupted by MPEG-4 compression, different resolutions, and different frame rates. All videos in IVQAD2017 have a resolution of 4096 $\times$ 2048 and a duration of 15 seconds. The VR device used in the subjective experiment was HTC VIVE. A total of 13 subjects participated in the experiment and MOS values were obtained.
•

Zhang et al. [84]. Zhang et al. [84] established a VR VQA database, which contains 10 pristine videos and 60 distorted videos corrupted by H.265 compression with different QPs. All videos in the database have a resolution of 3600 $\times$ 1800 and a duration of 10 seconds. The VR device used in the subjective experiment was HTC VIVE. A total of 30 subjects participated in the experiment, and DMOS values and head movement data were obtained.
•

Zhang et al. [85]. Zhang et al. [85] established a VR VQA database, which contains 16 pristine videos and 400 distorted videos corrupted by VP9, H.264, H.265 compressions with different bitrates, and different levels of Gaussian noise and box blur. All videos in the database have a resolution of 4096 $\times$ 2048. The VR device used in the subjective experiment was HTC VIVE. A total of 23 subjects participated in the experiment, and MOS as well as DMOS values were obtained.
•

Lopes et al. [86]. Lopes et al. [86] established a VR VQA database, which contains 6 pristine videos and 85 distorted videos corrupted by H.265 compression with different QPs, different resolutions, and different frame rates. All videos in the database have a resolution of 8192 $\times$ 4096 and a duration of 10 seconds. The VR device used in the subjective experiment was Oculus Rift. A total of 37 subjects participated in the experiment, and MOS as well as DMOS values were obtained.
•

Singla et al. [87]. Singla et al. [87] established a VR VQA database, which contains 6 pristine videos and 66 distorted videos corrupted by H.265 compression with different bitrates. All videos in the database have the resolutions of 4096 $\times$ 2048 and 2048 $\times$ 1024, and a duration of 10 seconds. The VR device used in the subjective experiment was Oculus Rift CV1. A total of 30 subjects participated in the experiment, and MOS values and head movement data were obtained.
•

Virtual Reality Video Quality Assessment Database 48 (VR-VQA48) [88]. VR-VQA48 contains 12 pristine videos and 48 distorted videos corrupted by H.265 compression with different QPs. All videos in the database have a resolution of 4096 $\times$ 2048, and a duration of 12 seconds. The VR device used in the subjective experiment was HTC Vive. A total of 48 subjects participated in the experiment, and MOS values, DMOS values and head movement data were obtained.
•

Tran et al. [89]. Tran et al. [89] established a VR VQA database, which contains 6 pristine videos and 126 distorted videos corrupted by H.265 with different QPs and different resolutions. All videos in the database have a duration of 30 seconds. The VR devices used in the subjective experiment were Samsung Gear VR and Samsung Galaxy S6. A total of 37 subjects participated in the experiment, and MOS values were obtained.
•

Video Quality Assessment - Omnidirectional Videos (VQA-ODV) [90]. VQA-ODV contains 60 pristine videos and 600 distorted videos corrupted by H.265 compression with different QPs and different projections. All videos in the database have the resolutions of 7680 $\times$ 3840 and 3840 $\times$ 1920. The video lengths range from 10 seconds to 23 seconds. The VR device used in the subjective experiment was HTC Vive. A total of 221 subjects participated in the experiment, and MOS values, DMOS values, head movement data and eye movement data were obtained.
•

LIVE-FBT-FCVR 2D & 3D [91]. LIVE-FBT-FCVR contains 10 pristine 2D omnidirectional videos and 10 pristine 3D omnidirectional videos, from which 180 distorted 2D omnidirectional videos and 180 distorted 3D omnidirectional videos were generated. The resolution of the 2D omnidirectional videos is 7680 $\times$ 3840, and the resolution of the 3D omnidirectional videos is 5376 $\times$ 5376. The duration is 10 seconds. The display device used in the subjective experiment was HTC VIVE.

2.3.4 High Frame Rate $\&$ Frame interpolation VQA Databases

Users are pursuing higher frame-rate videos. With the improvement of video communication technologies, high frame rate (HFR) videos can be displayed at 50 fps or more, rather than traditional videos which are typically displayed at 30 fps or 24 fps. Therefore, some HFR VQA databases have also been constructed.

•

Waterloo-IVC High Frame Rate Video Quality Database (Waterloo-IVC-HFR) [6]. Waterloo-IVC-HFR contains 7 pristine 60fps source videos and their generated 336 test video sequences corrupted by the combination of 6 frame rate levels, 4 QP levels, and 2 resolution levels. The videos in the database have two different resolutions including 1080p and 480p. The video length is 10 seconds. A total of 25 subjects participated in the experiment, and MOS values were obtained.
•

Bristol Vision Institute High Frame Rate Video Database (BVI-HFR) [92]. BVI-HFR contains 22 source sequences captured at 120 fps and 88 distorted videos with 4 different frame rates varying from 15 fps to 120 fps, obtained by subsampling the source videos via frame averaging. All videos in the database have a resolution of 1080p and a duration of 10 seconds. A total of 51 subjects participated in the experiment, and MOS values were obtained.
•

LIVE-YouTube-HFR Database (LIVE-YouTube-HFR) [93]. LIVE-YouTube-HFR contains 16 source videos and 480 test sequences with 6 levels of frame rate and 5 levels (lossless+4 CRF) of VP9 compression. 11 sequences were borrowed from the BVI-HFR video database [92], which have a resolution of 1920 $\times$ 1080 (HD), and a duration of 10 seconds. 5 other sequences were high-motion sports content captured by the Fox Media Group, which have a resolution of 3840 $\times$ 2160 (UHD-1), and video lengths of 6-8 seconds. A total of 85 subjects participated in the experiment, and MOS values were obtained.
•

ETRI-LIVE STSVQ [94]. ETRI-LIVE STSVQ contains 15 high-quality 4K 10-bit source contents, which have a resolution of 3840 $\times$ 2160, and a chroma format of YUV420p. The video lengths range from 4.5 seconds to 7 seconds. 437 distorted videos were generated by spatial subsampling, temporal subsampling, and HEVC compression with different QPs. A total of 34 subjects participated in the experiment, and DMOS values were obtained.
•

KosMo-1k [95]. KosMo-1k contains 30 source videos and 1350 distorted videos corrupted by slow motion. All videos in KosMo-1k have a resolution of 480 $\times$ 540, and a duration of 8 seconds. MOS values were calculated as the subjective data.
•

BVI-VFI [96]. BVI-VFI contains 108 source videos and 540 distorted videos corrupted by dropping every second frame, then reconstructing the dropped frames using five VFI algorithms: frame repeating, frame averaging (where the middle frame is generated by averaging every two frames), DVF [117], QVI [118] and ST-MFNet [119]. The resolutions of the videos include UHD-1, HD, and 960 $\times$ 540. A total of 189 subjects participated in the experiment, and DMOS values were obtained.
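Two of the temporal operations mentioned in this list, frame-rate reduction by frame averaging (as in BVI-HFR) and rebuilding a dropped frame by averaging its two neighbours (one of the simple VFI baselines in BVI-VFI), can be sketched as follows. This is only an illustration under simplifying assumptions: each frame is a flat list of pixel values, the function names are hypothetical, and the actual processing pipelines of the databases are not reproduced.

```python
def average_frames(frames):
    """Pixel-wise mean of equally sized frames (each a flat list of values)."""
    n = len(frames)
    return [sum(pixels) / n for pixels in zip(*frames)]


def reduce_frame_rate(frames, factor):
    """Lower the frame rate by `factor` by averaging groups of `factor` frames."""
    usable = len(frames) - len(frames) % factor  # drop an incomplete last group
    return [average_frames(frames[i:i + factor]) for i in range(0, usable, factor)]


def interpolate_dropped(frames):
    """Drop every second frame, then rebuild it as the mean of its neighbours."""
    if not frames:
        return []
    kept = frames[::2]
    out = []
    for prev, nxt in zip(kept, kept[1:]):
        out.append(prev)
        out.append(average_frames([prev, nxt]))
    out.append(kept[-1])
    return out


# Toy "video": 5 frames of 2 pixels whose values grow linearly over time.
video = [[float(t), float(10 * t)] for t in range(5)]
print(reduce_frame_rate(video, 2))  # [[0.5, 5.0], [2.5, 25.0]]
print(interpolate_dropped(video) == video)  # True: linear motion is recovered exactly
```

Averaging recovers the toy video exactly only because its pixel values change linearly over time; for real motion it produces ghosting.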

2.3.5 Audio-Visual VQA Databases

Videos are generally accompanied by audios, and the degradation of audio can also affect the overall QoE. Thus, some works have also explored the audio-visual VQA.

•

Winkler et al. [97]. Winkler et al. [97] established an audio-visual VQA database, which contains 6 pristine videos and 48 distorted videos. The video track was corrupted by H.264 compression with different bitrates, and the audio track was corrupted by MPEG-4 AAC-LC with different sampling rates and bitrates. All videos in the database have a resolution of 176 $\times$ 144 (QCIF), and a duration of about 8 seconds. A total of 24 subjects participated in the experiment, and MOS values were obtained.
•

Video Quality Experts Group Multimedia Phase II (VQEGMM2) [98, 99]. Pinson et al. [98] have established an audio-visual VQA database, which contains 10 pristine videos and 60 test videos (10 pristine + 50 distorted). The video track was corrupted by H.264 compression (advanced video coding (AVC)), and the audio track was corrupted by advanced audio coding (AAC). All videos in the database have a resolution of 640 $\times$ 480, and a duration of 10 seconds. A total of 10 subjects participated in the experiment, and MOS values were obtained.
•

Demirbilek et al. [100]. Demirbilek et al. [100] established an audio-visual VQA database, which contains 6 pristine videos and 144 test videos corrupted by different resolutions, bitrates, bandwidths, packet loss rates, and jitter cases. The resolutions of the videos include 1080p and 720p. A total of 24 subjects participated in the experiment, and MOS values were obtained.
•

Martinez et al. [101, 102]. Martinez et al. [101] established an audio-visual VQA database, which contains 6 pristine videos and 132 test videos. All videos in the database have a resolution of 720p. The experiment includes three sessions. For session I, each of the original video test sequences (no audio) was compressed using the H.264 codec with four different bitrate values including 30, 2, 1, and 0.8 Mbps, resulting in 30 test conditions (6 pristine $\times$ 5 levels (4 bitrate levels + 1 pristine)), and a total of 16 subjects participated in this session. For session II, only the audio components of the videos were compressed using the MPEG-1 layer-3 coding standard with three different bitrate values including 128, 96, and 48 kbps, resulting in 24 test conditions (6 pristine $\times$ 4 levels (3 bitrate levels + 1 pristine)), and a total of 16 subjects participated in this session. For session III, both audio and video components of the test sequences were compressed, where the video components were compressed using H.264 and the audio components were compressed using the MPEG-1 layer-3 coding standard, resulting in 78 test conditions (6 pristine $\times$ 13 levels (3 audio bitrates $\times$ 4 video bitrates + 1 pristine)), and a total of 16 subjects participated in this session. The MOS values were obtained.
•

LIVE-SJTU A/V-QA [7]. LIVE-SJTU A/V-QA contains 14 pristine videos and 336 distorted videos corrupted by HEVC compression with 4 different constant rate factor (CRF) levels, video compression plus scaling, and AAC compression with 3 different constant bit rate (CBR) levels. The videos have a resolution of 1080p, a duration of 8 seconds, and are provided in the raw YUV 4:2:0 format. A total of 35 subjects participated in the experiment, and MOS values were obtained.
•

Fela et al. [103]. Fela et al. [103] established an audio-visual VQA database for 360 ${}^{\circ}$ videos, which contains 12 pristine videos and 576 test videos corrupted by 3 different resolutions, 4 QP levels, and 4 AAC-LC levels. The resolution of the pristine videos is 6144 $\times$ 3072. The duration is 20 seconds. A total of 20 subjects participated in the experiment, and MOS values were obtained.
•

Omnidirectional Audio-visual Quality Assessment Database (OAVQAD) [8]. OAVQAD contains 15 pristine videos and 375 distorted videos corrupted by HEVC compression, AAC compression, different resolutions, noise, blur, and stalling. The resolution of the pristine videos is 7680 $\times$ 3840. The duration is 6 seconds. A total of 22 subjects participated in the experiment, and MOS values were obtained.
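The stimulus counts quoted for the fully crossed designs in this list follow directly from the numbers of sources and distortion levels. The short check below reproduces the totals for the databases of Fela et al. and Martinez et al. from the figures given above; the variable names are illustrative only.

```python
from math import prod

# Fela et al.: 12 sources x 3 resolutions x 4 QP levels x 4 AAC-LC levels.
fela_total = prod([12, 3, 4, 4])

# Martinez et al.: 6 sources per session; each count includes the pristine level.
martinez_sessions = {
    "video only": 6 * (4 + 1),         # 4 video bitrates + pristine
    "audio only": 6 * (3 + 1),         # 3 audio bitrates + pristine
    "audio + video": 6 * (3 * 4 + 1),  # 3 x 4 bitrate pairs + pristine
}
martinez_total = sum(martinez_sessions.values())

print(fela_total)      # 576
print(martinez_total)  # 132
```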

2.3.6 HDR, WCG, iTMO, and TMO VQA Databases

With the increasing requirements for video experience, high dynamic range (HDR) and wide color gamut (WCG) video technologies have been gradually developed. Many studies have investigated the VQA problem for HDR and WCG videos, as well as tone mapping operation and inverse tone mapping operation.

•

Digital Multimedia Lab HDR (DML-HDR) [104]. DML-HDR contains 4 pristine videos and 32 distorted videos corrupted by HEVC and H.264 compression with different QPs. The resolution of the pristine videos is 1080p. The video lengths are 10 and 17 seconds. A total of 17 subjects participated in the experiment, and MOS values were obtained.
•

Narwaria et al. [105]. Narwaria et al. [105] established an HDR VQA database, which contains 9 pristine videos and 153 test videos corrupted by different TMO, iTMO, compression and decompression methods. All videos have a resolution of 1080p. A total of 25 subjects participated in the experiment, and MOS values were obtained.
•

Mukherjee et al. [106]. Mukherjee et al. [106] established an HDR VQA database, which contains 39 pristine videos and 429 test videos corrupted by different H.264 compressions. All videos have a resolution of 1080p. A total of 64 subjects participated in the experiment, and rank values were obtained.
•

Yeganeh et al. [107]. Yeganeh et al. [107] established an HDR VQA database, which contains 10 pristine videos and 40 test videos corrupted by different TMO methods. A total of 30 subjects participated in the experiment, and MOS values were obtained.
•

Digital Multimedia Lab HDR 2 (DML-HDR 2) [108]. DML-HDR 2 contains 5 pristine videos and various distorted videos corrupted by AWGN, mean intensity shift, salt-and-pepper noise, low-pass filtering, and compression. All videos have a resolution of 2048 $\times$ 1080 and a duration of 10 seconds. A total of 18 subjects participated in the experiment, and MOS values were obtained.
•

Waterloo UHD-HDR-WCG [109]. Waterloo UHD-HDR-WCG contains 14 pristine videos and 140 distorted videos corrupted by H.264 compression and HEVC compression. All videos have a resolution of 3840 $\times$ 2160 and a duration of 10 seconds. A total of 51 subjects participated in the experiment, and MOS values were obtained.
•

LIVE-HDR [9]. LIVE-HDR contains 310 test video sequences including 31 pristine videos and 279 distorted videos corrupted by different resolutions and the HEVC compression with different bitrates. The resolutions include 4k, 1080p, 720p, 540p. The video lengths range from 1 to 10 seconds. A total of 66 subjects participated in the experiment, and DMOS values were obtained.

2.3.7 Screen and Game VQA Databases

Screen content sharing and cloud gaming are other popular video applications in which users expect a high-quality video experience. Therefore, many screen content and game content VQA databases have been constructed.

•

Screen Content Video Database (SCVD) [110]. SCVD contains 16 pristine videos and 800 distorted videos corrupted by 10 different distortions including Gaussian noise (GN), Gaussian blur (GB), motion blur (MB), contrast change (CC), color saturation change (CSC), color quantization with dithering (CQD), H.264, high efficiency video coding (HEVC), screen content coding (SCC), and packet loss (PL). All videos have a resolution of 1080p and a duration of 10 seconds. A total of 32 subjects participated in the experiment, and MOS values were obtained.
•

Compressed Screen Content Video Quality (CSCVQ) Database [111]. CSCVQ contains 11 pristine videos and 165 distorted videos corrupted by H.264 compression, HEVC compression, and HEVC Screen Content Coding (HEVC-SCC). All videos have a resolution of 720p and a duration of 10 seconds. A total of 20 subjects participated in the experiment, and MOS values were obtained.
•

GamingVideoSET [112]. GamingVideoSET contains 24 pristine videos and 600 distorted videos corrupted by H.264 compression. The resolutions of the videos include 480p, 720p, and 1080p. All videos have a duration of 30 seconds. A total of 25 subjects participated in the experiment, and MOS values were obtained.
•

Kingston University Gaming Video Dataset (KUGVD) [113]. KUGVD contains 6 pristine videos and 150 distorted videos corrupted by H.264 compression. The resolutions of the videos include 480p, 720p, and 1080p. All videos have a duration of 30 seconds. A total of 17 subjects participated in the experiment, and MOS values were obtained.
•

CGVDS [114]. CGVDS contains 15 pristine videos and 255 distorted videos corrupted by H.264 compression. The resolutions of the videos include 480p, 720p, and 1080p. All videos have a duration of 30 seconds. Over 100 subjects participated in the experiment, and MOS values were obtained.
•

Tencent Gaming Video (TGV) [115]. TGV contains 150 pristine videos and 1293 distorted videos corrupted by H.264 compression, H.265 compression, and Tencent codec. The resolutions of the videos include 480p, 720p, and 1080p. All videos have a duration of 5 seconds. A total of 19 subjects participated in the experiment.
•

LIVE-YOUTUBE Gaming Video Quality Database (LIVE-YOUTUBE GVQA) [10]. LIVE-YOUTUBE GVQA contains 600 Professionally-Generated-Content (PGC) or User-Generated-Content (UGC) gaming videos. The resolutions of the videos include 360p, 480p, 720p, and 1080p. The videos were clipped to 8-9 seconds. A total of 61 subjects participated in the experiment, and MOS values were obtained.

3 Objective Video Quality Assessment: General-purpose Models

In this section, we review the general-purpose objective VQA models that are designed to handle various video distortions. As illustrated in Figure 2, depending on whether reference video information is required, we categorize the objective VQA models into three types: full-reference (FR) VQA, reduced-reference (RR) VQA, and no-reference (NR) VQA.

3.1 Full-reference Video Quality Assessment

Full-reference video quality assessment models evaluate the quality of a video signal by comparing it to its reference (original or pristine) video. They are commonly employed in domains such as video broadcasting, video streaming, video compression, video enhancement, and quality control in video production. Generally, FR VQA models measure the fidelity between distorted and reference videos. As shown in Figure 3, one prevalent approach involves applying FR image quality assessment (IQA) methods to individual or sampled video frames and subsequently aggregating the frame-level quality scores into the video-level quality score. Well-known FR IQA methods include PSNR, SSIM [120], MS-SSIM [121], VIF [122], LPIPS [123], etc., and a comprehensive survey of FR IQA methods can be found in [3]. However, video quality is closely related to temporal distortions such as jitter and flicker, which are not effectively captured by these IQA-based methods. Therefore, to address temporal distortions in videos and achieve better evaluation ability, many FR VQA models have been proposed in the literature. These models can be roughly classified into knowledge-driven and data-driven methods according to how their features are extracted. For knowledge-driven FR VQA methods, quality-aware features are extracted based on the characteristics of the human visual system, whereas data-driven FR VQA methods employ machine learning techniques to directly learn quality-aware features from video data.
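The frame-wise approach described above can be sketched in a few lines. The example below applies PSNR to each pair of reference and distorted frames and averages the results over time; it assumes 8-bit frames given as flat lists of pixel values, and the function names are illustrative rather than taken from any particular toolkit. Any other FR IQA metric could be passed in place of `psnr`.

```python
import math


def psnr(ref, dist, peak=255.0):
    """PSNR in dB between two equally sized frames given as flat pixel lists."""
    mse = sum((r - d) ** 2 for r, d in zip(ref, dist)) / len(ref)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def video_score(ref_frames, dist_frames, frame_metric=psnr):
    """Apply an FR image metric to each frame pair, then average over time."""
    scores = [frame_metric(r, d) for r, d in zip(ref_frames, dist_frames)]
    return sum(scores) / len(scores)


# Toy two-frame video: the second frame is distorted more than the first.
reference = [[100.0] * 4, [100.0] * 4]
distorted = [[105.0] * 4, [110.0] * 4]
print(round(video_score(reference, distorted), 2))
```

Because every frame is scored independently, this baseline cannot detect purely temporal artifacts such as flicker, which motivates the dedicated FR VQA models reviewed next.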

3.1.1 Knowledge-driven FR VQA

(1) SSIM-based FR VQA: Structural Similarity (SSIM) [120] has been one of the most popular FR IQA methods over the last two decades. It calculates the luminance, contrast, and structure similarities between distorted and reference images. Due to its simplicity and effectiveness, numerous efforts have been made to extend SSIM to the video domain. Wang et al. [124] investigated two pooling strategies for applying SSIM to video quality assessment. Specifically, they calculated frame-level SSIM values and subsequently aggregated them into the video-level score based on the luminance intensity and motion degree of the frames. Wang and Li [125] introduced a motion-based pooling strategy for well-known FR IQA methods (e.g., PSNR and SSIM). They incorporated a human visual speed perception model into an information framework and estimated motion information and perceptual uncertainty as the weighting factors. Moorthy and Bovik proposed a motion-compensated SSIM (MC-SSIM) [126, 127] to assess both spatial and temporal quality, where the spatial quality scores are calculated using SSIM and the temporal quality scores are obtained by evaluating structural retention between motion-compensated regions. Seshadrinathan et al. [128] observed a hysteresis effect in subjective VQA studies and proposed a hysteresis temporal pooling strategy to aggregate frame-level quality scores into the video-level quality score. Park et al. [129] introduced a content-adaptive spatial and temporal pooling strategy named Video Quality Pooling (VQPooling), which emphasizes the influence of the “worst” quality scores along both the spatial and temporal dimensions of a video sequence on the overall video quality. Manasa and Channappayya [130] employed MS-SSIM [121] for spatial quality estimation and utilized local flow statistics, defined by the mean, the standard deviation, the coefficient of variation, and the minimum eigenvalue of the local flow patches, to represent temporal distortions.
Instead of calculating the quality scores frame by frame and then merging them into the video-level quality score, Zeng et al. [131] treated the video as 3D volume data and directly calculated SSIM values on it. Unlike 2D SSIM, they utilized weighting methods based on local information content and local distortion to pool the quality map into the quality score.
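The difference between plain averaging and pooling strategies that emphasize the worst scores can be illustrated with a small sketch. The code below is not VQPooling [129] or any other published method; it is a deliberately simplified worst-fraction pooling, with a hypothetical `fraction` parameter and made-up frame-level SSIM values, that only shows why a short burst of severe distortion affects the two pooled scores differently.

```python
def mean_pool(scores):
    """Plain temporal average of frame-level quality scores."""
    return sum(scores) / len(scores)


def worst_fraction_pool(scores, fraction=0.2):
    """Average only the lowest `fraction` of frame scores (higher = better)."""
    k = max(1, round(fraction * len(scores)))
    return mean_pool(sorted(scores)[:k])


# Hypothetical frame-level SSIM values with a short burst of severe distortion.
frame_ssim = [0.95, 0.96, 0.94, 0.60, 0.62, 0.95, 0.97, 0.96, 0.95, 0.94]
print(round(mean_pool(frame_ssim), 3))            # 0.884
print(round(worst_fraction_pool(frame_ssim), 3))  # 0.61
```

The mean stays high because most frames are clean, whereas the worst-fraction score is dominated by the two badly distorted frames.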

(2) Low-level feature-based FR VQA: Some FR VQA models leverage low-level features such as optical flow and gradients to represent video quality. For example, Seshadrinathan and Bovik introduced the motion-based video integrity evaluation (MOVIE) index [132, 133], which uses Gabor filters to decompose the video and calculate the corresponding spatial, temporal, and motion features. In particular, motion estimation is computed in the optical flow field. The framework of MOVIE is illustrated in Figure 4. To handle local flicker distortions, Choi et al. [134, 135] developed flicker sensitive MOVIE (FS-MOVIE) by integrating a perceptual flicker masking index into the MOVIE index, where the flicker masking mechanism is derived from the responses of neurons in the primary visual cortex to video flicker. Wang et al. [136] proposed a VQA model that leverages structural features in localized spacetime regions to jointly represent spatial edge features and temporal motion characteristics, thus having a relatively low computational complexity. Vu et al. [137] developed spatial-temporal most apparent distortion (ST-MAD) by applying MAD [138] to each frame to obtain spatial MAD and utilizing an optical-flow-derived weighting scheme to emphasize the appearance component of spatial MAD in fast-moving regions to derive temporal MAD. Yan and Mou [139] partitioned the spatiotemporal slice images into regions with simple motion and complex motion. They then utilized the gradient magnitude standard deviation (GMSD) [140] index to evaluate distortions within these distinct segments.

(3) HVS-based FR VQA: The human visual system (HVS) plays a crucial role in guiding the design of FR VQA models. Zhang and Bull [141] introduced a perception-based FR VQA model, which utilizes an enhanced nonlinear model to combine noticeable distortion and blurring artifacts, thereby simulating the HVS perception process. The visual attention mechanism [142, 143] reflects how humans allocate their attention to regions of the video, and several works utilize visual attention or saliency mechanisms to develop FR VQA models. Since the human visual system is sensitive to moving objects, Wu et al. [144] proposed a full-reference assessor along salient trajectories (FAST) model, which computes the trajectories of moving objects in the optical flow domain, employs the motion velocity to represent temporal quality, and applies 3D filters to motion content to represent spatio-temporal quality. Additionally, spatial quality is represented by calculating GMSD [140] for each frame. Finally, the three quality metrics are combined to obtain an overall video quality score. You et al. [145] proposed an attention-driven foveated FR VQA model by integrating the attention-driven contrast sensitivity function into a wavelet-based distortion visibility measure. Peng et al. [146] developed an attention-guided and motion-tuned temporal distortion metric based on spacetime texture, which serves as a uniform and distributive descriptor of a wide set of spacetime structures. Zhang and Liu [147] conducted a video saliency experiment to gather reliable eye-tracking data for distorted videos and integrated the eye-tracking data into FR VQA models to improve their performance.

(4) Features fusion based FR VQA: Recently, some studies have attempted to extract various types of features and subsequently employ a learning-based regressor to map these features to video quality scores, thereby capitalizing on the strengths of the different feature types. Freitas et al. [148] extracted a set of features including multiscale salient local binary patterns, MS-SSIM, GMSD, Riesz pyramids similarity deviation, spatial activity, and temporal distortion measures, and then employed a random forest regression algorithm to derive the video quality score. Video Multi-method Assessment Fusion (VMAF) [149] extracts two kinds of FR IQA features, namely Visual Information Fidelity (VIF) [122] and Detail Loss Metric (DLM) [150], along with motion features quantified by the temporal difference between consecutive frames. It then learns a Support Vector Regressor (SVR) to map these features into the video quality score. Bampis et al. [151, 152] further enhanced VMAF in two ways, known as spatiotemporal VMAF (ST-VMAF) and ensemble VMAF (E-VMAF), by incorporating space–time features at multiple scales. To reduce the computational complexity of VMAF, Venkataramanan et al. [153] proposed a VQA model named fusion of unified quality evaluators (FUNQUE), which calculates the features in VMAF, including VIF, DLM, and motion features, together with SSIM, on a common transform domain that accounts for the human visual system. Liu et al. [154] introduced a serial dependence modeling framework for FR VQA, which first extracts static appearance features and two kinds of motion information features (represented by an explicit content-based 3D structure and an implicit feature-based 2D structure) and subsequently utilizes an LSTM and an attention-based quality pooling strategy to obtain the video quality score.
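The motion feature described above for VMAF, a temporal difference between consecutive frames, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the VMAF implementation: the function name is hypothetical, frames are plain 2D lists of luma values, and any pre-filtering that a production implementation may apply before differencing is omitted.

```python
def motion_feature(frames):
    """Mean absolute luma difference for each pair of consecutive frames.

    frames: list of 2D lists with identical shape.
    Returns one value per frame transition (len(frames) - 1 values).
    """
    scores = []
    for prev, cur in zip(frames, frames[1:]):
        diffs = [
            abs(c - p)
            for row_p, row_c in zip(prev, cur)
            for p, c in zip(row_p, row_c)
        ]
        scores.append(sum(diffs) / len(diffs))
    return scores


# Toy clip with three 2x2 frames.
clip = [
    [[0, 0], [0, 0]],
    [[2, 2], [2, 2]],
    [[2, 2], [2, 6]],
]
print(motion_feature(clip))
```

In a fusion-based model, such per-frame motion values would be concatenated with the spatial features (for example VIF and DLM scores) and passed to a trained regressor, which is not reproduced here.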

3.1.2 Data-driven FR VQA

Data-driven FR VQA models rely on large-scale video datasets to automatically learn quality-aware features for video quality evaluation. In recent years, with the popularity of deep neural networks, convolutional neural networks (CNNs) and Vision Transformers (ViTs) have become the two dominant approaches for data-driven FR VQA models. For instance, Kim et al. [155] proposed a deep video quality assessor named DeepVQA, which employs CNNs to generate spatio-temporal sensitivity maps for tackling temporal motion artifacts and introduces a convolutional neural aggregation network to capture temporal memory effects for quality judgment. Xu et al. [156] introduced C3DVQA, which leverages CNNs with 3D kernels for FR VQA. In particular, C3DVQA utilizes 2D CNNs to extract spatial features from both distorted frames and residual frames (i.e., the difference between distorted and reference frames), and it utilizes a 3D CNN to learn spatio-temporal features from the extracted spatial features for video quality evaluation. Zhang et al. [157] presented a transfer learning framework for FR VQA to address the challenges posed by imbalanced and limited samples in VQA datasets. They utilized distorted images as the related domain to enrich the distorted samples and trained a six-layer CNN to extract high-level spatiotemporal features from distorted image blocks and video blocks annotated by classic FR IQA metrics. Zhang et al. [158] proposed an FR VQA model that integrates DenseNet with the spatial pyramid pooling strategy and RankNet, where the former is used to extract high-level distortion representations and the latter acts as a temporal pooling method to characterize the high-level relevance among frames. Wu et al. [159] developed a quality aggregation network for FR VQA. It employs a 3D CNN to extract spatiotemporal features and utilizes an LSTM-based temporal quality pooling network to capture the nonlinearities and temporal dependencies inherent in the video quality evaluation process.

Table 3: Overview of the FR and RR Video Quality Assessment Models.

Type	Algorithm	Methodology	Extracted quality features	Quality fusion
FR	Wang et al. [124]	Structural similarity	Structure similarity, motion vector	Weighted sum
	Wang and Li [125]	Structural similarity	Structure similarity, motion vector	Weighted sum
	MC-SSIM [126, 127]	Structural similarity	Structure similarity, motion vector	Weighted sum
	Seshadrinathan et al. [128]	Structural similarity	Structure similarity or MOVIE	Hysteresis temporal pooling
	Park et al. [129]	Structural similarity	Structure similarity or MOVIE	VQPooling
	Manasa and Channappayya [130]	Structural similarity	Multi-scale Structure similarity, optical flow	Weighted sum
	Zeng et al. [131]	Structural similarity	3D structural similarity	Weighted sum
	MOVIE [132, 133]	Low feature extraction	Gabor filter, optical flow	Weighted sum
	FS-MOVIE [134, 135]	Low feature extraction	Gabor filter, optical flow	Weighted sum
	Wang et al. [136]	Low feature extraction	Sobel gradient features, eigenvalue of 3D structure tensor	Averaged sum
	ST-MAD [137]	Low feature extraction	MAD, optical flow	Weighted sum
	Yan and Mou [139]	Low feature extraction	GMSD, motion features	Weighted sum
	Zhang and Bull [141]	HVS perception behaviour	DT-CWT features, motion vector	Weighted sum
	Wu et al. [144]	HVS perception behaviour	GMSD, saliency trajectory features, optical flow	Weighted sum
	You et al. [145]	HVS perception behaviour	Visual saliency, CSF, DWT	Weighted sum
	Peng et al. [146]	HVS perception behaviour	G3 filter, spacetime texture, visual attention	Weighted sum
	Zhang and Liu [147]	HVS perception behaviour	Visual saliency	Weighted sum
	Freitas et al. [148]	Features fusion	MSLBP, MS-SSIM, GMSD, RPSD, SA, TDM	Weighted sum
	VMAF [149]	Features fusion	VIF, DLM, TI	SVR
	ST-VMAF [151, 152]	Features fusion	VIF, DLM, T-SpEED	SVR
	FUNQUE [153]	Features fusion	DLM, SSIM, VIF, TI	SVR
	Liu et al. [154]	Features fusion	3D Prewitt operators, optical flow	2D-CNN LSTM
	DeepVQA [155]	Deep learning	2D-CNN	CNAN
	C3DVQA [156]	Deep learning	2D-CNN, 3D-CNN	MLP
	Zhang et al. [157]	Deep learning	2D-CNN, 3D-CNN	MLP
	Zhang et al. [158]	Deep learning	DenseNet with SPP	RankNet
	Wu et al. [159]	Deep learning	3D-CNN	LSTM
	Li et al. [160]	Deep learning	2D-CNN	Transformer encoder
	Sun et al. [161]	Deep learning	2D-CNN	MLP
	Li et al. [162]	Deep learning	2D-CNN	MLP
RR	VQM [163]	Low feature extraction	SI, TI, edge features, chroma features	Weighted sum
	Masry et al. [164]	Low feature extraction	Color, DWT, contrast, visual masking	Weighted sum
	Callet et al. [165]	Low feature extraction	GHV, GHVP, TI, blockness	CNN, MLP
	Gunawan and Ghanbari [166]	Low feature extraction	Edge, blockness, motion vector	Weighted sum
	Zeng and Wang [167]	Low feature extraction	Complex wavelet transform, circular variance	Weighted sum
	Ma et al. [168]	NSS	Energy variation descriptor, GGD, City-block distance	Averaged sum
	Zhu et al. [169]	Low feature extraction	Energy, entropy, kurtosis, Jensen–Shannon divergence, SSIM, smoothness	Weighted sum
	STRRED [170]	NSS	GSM, entropies, wavelet coefficients	Weighted sum
	SpEED-QA [171]	NSS	GSM, entropies	Weighted sum
	Wang et al. [172]	Structural similarity	Structure similarity, CSF	Weighted sum

With the popularity of user-generated content (UGC) videos in recent years, the focus of FR VQA research has shifted to UGC videos. For example, Li et al. [160] utilized a Siamese CNN to extract features of the distorted and reference videos and subsequently employed a Transformer encoder to map these features into the video quality score. Sun et al. [161] computed the structure and texture similarities of the feature maps extracted from all intermediate layers of a CNN model as the quality-aware feature representation and then used a fully connected layer to map the quality-aware features into the video quality score. Li et al. [162] first used a learned neural network to estimate the quality maps of the pristine and distorted UGC videos. They then assessed the quality of the distorted UGC videos based on the estimated quality maps, considering the influence of both the pristine and the distorted video quality on the overall quality assessment.

3.2 Reduced-reference Video Quality Assessment

Reduced-reference (RR) VQA is a special type of FR VQA that requires only partial reference video information to evaluate the quality of distorted videos. Compared with FR VQA models, it can therefore significantly reduce the transmission bandwidth needed to assess the quality of transmitted videos. The National Telecommunications and Information Administration (NTIA) General Model [163], also known as the Video Quality Model (VQM), is an RR VQA method that first calibrates the reference and distorted videos and subsequently extracts low-bandwidth spatial and temporal features for video quality evaluation. The General Model requires an ancillary data channel bandwidth that accounts for 9.3% of the uncompressed video sequence, while the associated calibration techniques demand an extra 4.7%. Masry et al. [164] utilized wavelet transforms and separable filters to decompose the video into multiple channels, adjusting the bit rate of the reference video decomposition using a coefficient selection strategy. With a reference bit rate of 10 kbit/s, the proposed RR VQA model shows impressive performance while maintaining real-time processing capability. Zeng and Wang [167] first utilized the complex wavelet transform to decompose the reference and distorted videos and subsequently calculated the conditional histogram and the circular variance (CV) curve. They employed a fourth-order polynomial to model the CV curve of the reference video and used the 5 parameters of the fitted polynomial as the RR features. This method measures temporal motion smoothness and achieves good performance on video distortions such as frame jittering and frame dropping. Gunawan and Ghanbari [166] developed an RR VQA method that measures the quality of encoded videos using harmonic analysis of spatial gradients. It extracts local harmonic strength features from images as the reduced-reference data and, through discriminative analysis, generates harmonic gain and loss to represent blocking/tiling and blurring/smearing distortions, respectively.

Some works employ learning-based methods, such as linear regression and neural networks, for reduced-reference feature fusion. Le Callet et al. [165] first extracted three types of features from both reference and distorted videos, including frequency content measures [173], temporal content measures [174, 175], and blocking measures [176], and subsequently employed a time-delay neural network consisting of several CNNs and multi-layer perceptron networks to regress the features into video quality scores. Zhu et al. [169] initially employed an NR VQA model [177] to extract three intra-subband features, including energy, entropy, and kurtosis, as well as three inter-subband features, encompassing the Jensen-Shannon divergence, the structural similarity index between two subbands, and the smoothness, for both distorted and reference videos. Subsequently, they introduced a feature pooling approach consisting of three components: a global linear model for aggregating the extracted features, a simple linear model for achieving local alignment wherein the local factors are influenced by the source videos, and a non-linear model for quality calibration.

With the popularity of natural scene statistics (NSS) in the IQA field, many works have tried to incorporate NSS as part of the features for RR VQA. For example, Ma et al. [168] extracted the spatial information loss and the temporal statistical characteristics of the interframe histogram as the reduced features. Specifically, they introduced an energy variation descriptor to assess the energy difference of each individual encoded frame for spatial information loss and employed the generalized Gaussian density (GGD) function to capture the natural statistics of the interframe histogram distribution. Soundararajan and Bovik [170] proposed the spatiotemporal RR entropic differences (STRRED) metric, which calculates the wavelet coefficients of frame differences modeled by the Gaussian scale mixture (GSM) distribution to capture temporal information and leverages their previously developed RR IQA method (SRRED) [178] to model spatial information. To mitigate the computational complexity of STRRED, Bampis et al. [171] introduced the spatial efficient entropic differencing for quality assessment (SpEED-QA) model, which computes local entropic differences between reference and distorted videos in the spatial domain, resulting in efficient feature calculation. Wang et al. [172] proposed an RR VQA metric that leverages the contrast and motion sensitivity characteristics of the human visual system to select the reference data. Specifically, the proposed metric first decomposes the reference video into different frequency subbands using the Discrete Wavelet Transform (DWT) and then selects the image blocks of each frame according to the energy of the wavelet coefficients in the subbands of interest and the spatio-temporal information of the frame difference. Finally, it calculates the SSIM values between the selected reference image blocks and their corresponding distorted ones as the quality score.
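To make the idea of temporal entropic differencing more concrete, the Python sketch below compares local entropies of frame differences between a reference and a distorted video. It is a heavily simplified sketch and not STRRED [170] or SpEED-QA [171] themselves: the wavelet decomposition and GSM modeling are replaced by block variances under a Gaussian assumption, and the block size, the noise variance `noise_var`, and all function names are illustrative assumptions.

```python
import math


def frame_difference(prev, cur):
    """Pixel-wise difference between two frames given as 2D lists."""
    return [[c - p for p, c in zip(rp, rc)] for rp, rc in zip(prev, cur)]


def block_entropies(img, block=2, noise_var=0.1):
    """Gaussian differential entropy of each non-overlapping block.

    noise_var models additive neural noise and keeps the log finite for
    flat blocks; its value here is an assumption.
    """
    h, w = len(img), len(img[0])
    ents = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            vals = [img[y + i][x + j] for i in range(block) for j in range(block)]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            ents.append(0.5 * math.log(2 * math.pi * math.e * (var + noise_var)))
    return ents


def rr_entropic_difference(ref_prev, ref_cur, dist_prev, dist_cur,
                           block=2, noise_var=0.1):
    """Mean absolute difference of block entropies of the frame differences.

    Only the reference block entropies need to be transmitted as
    reduced-reference side information.
    """
    e_ref = block_entropies(frame_difference(ref_prev, ref_cur), block, noise_var)
    e_dist = block_entropies(frame_difference(dist_prev, dist_cur), block, noise_var)
    return sum(abs(a - b) for a, b in zip(e_ref, e_dist)) / len(e_ref)
```

The amount of side information can be traded against accuracy by transmitting fewer block entropies, down to a single averaged scalar per frame.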

Table 4: Overview of the NR Video Quality Assessment Models.

Type	Algorithm	Methodology	Extracted quality features	Quality fusion
NR	Wang et al. [124]	Hand-crafted feature	Structure similarity, motion vector	Weighted sum
	V-CORNIA [180]	Hand-crafted feature	CORNIA, temporal pooling	Weighted sum
	Video BLIINDS [179]	Hand-crafted feature	Spatial-temporal natural video statistics (NVS), motion vector	SVR
	VIIDEO [181]	Hand-crafted feature	Natural video statistics, inter sub-band statistics	Weighted sum
	TLVQM [182]	Hand-crafted feature	45 low complexity & 30 high complexity features	SVR
	VIDEVAL [183]	Hand-crafted feature	BRISQUE, GM-LOG, HIGRADE-GRAD, FRIQUEE, TLVQM	SVR
	Chip-QA [184]	Hand-crafted feature	space-time chip, luma, color, gradient	SVR
	VSFA [185]	Pre-trained DNN model	ResNet-50	GRU
	MDVSFA [186]	Pre-trained DNN model	ResNet-50	GRU
	Tang et al. [187]	Pre-trained DNN model	VGG-16	MLP, temporal memory-based pooling
	RIRNet [188]	Pre-trained DNN model	ResNet-50 with SPP	GRU
	Chen et al. [189]	Pre-trained DNN model	VGG-16 with attention module	GRU
	You [190]	Pre-trained DNN model	ResNet-50 with FPN and attention module	Transformer encoder
	You and Lin [191]	Pre-trained DNN model	ResNet-50 with FPN	Transformer encoder
	Wu et al. [192]	Pre-trained DNN model	Swin-T	STDE, TCT
	STDAM [193]	Pre-trained DNN model	ResNet-18 with graph convolution module and attention module	Bi-directional LSTM
	PatchVQ [62]	Pre-trained DNN model	PaQ-2-PiQ, 3D ResNet-18	InceptionTime
	Ying et al. [194]	Pre-trained DNN model	MobileNetV3, 3D ResNet-18, MobileNetV1	GRU-FCN
	Li et al. [195, 196]	Pre-trained DNN model	UNIQUE, SlowFast	GRU
	Liu et al. [197]	Pre-trained DNN model	KonCept512, SlowFast	Progressively Residual Aggregation
	UVQ [65]	Pre-trained DNN model	EfficientNet-B0, D3D	MLP
	UVQ-lite [198]	Pre-trained DNN model	MobileNet, MoViNet	MLP
	Telili et al. [199]	Pre-trained DNN model	ResNet-50	Bi-LSTM
	Lu et al. [200]	Pre-trained DNN model	ResNet-50	GRU
	MD-VQA [67]	Pre-trained DNN model	EfficientNetV2, ResNet3D-18, blur, noise, block effect, exposure, colorfulness	MLP
	Zhu et al. [201]	Pre-trained DNN model	ResNeXt-101	Transformer encoder
	Zhang et al. [202]	Pre-trained DNN model	ConvNeXt, SAMNet, SlowFast	GRU
	Chen et al. [203]	Pre-trained DNN model	ResNet-50, C3D, RIRNet, PVQ, LSCT-PHIQNet	MLP
	Kwong et al. [204]	Pre-trained DNN model	Multi-channel CNN	GRU
	Wu et al. [205, 206]	Pre-trained DNN model	NIQE, TPQI, CLIP	Weighted sum
	Wu et al. [205]	Pre-trained DNN model	FAST-VQA, CLIP-visual, CLIP-textual	MLP, cosine similarity
	Liu et al. [207]	Pre-trained DNN model	EfficientNet-b7, ir-CSN-152, CLIP, Swin-B, TimeSformer, Video Swin-B, SlowFast	MLP
	Liu et al. [208]	End-to-end training	3D-CNN	MLP
	You and Korhonen [209]	End-to-end training	3D-CNN	LSTM
	Yi et al. [210]	End-to-end training	VGG-16 with non-local module	MLP
	Wen and Wang [211]	End-to-end training	ResNet-18	MLP
	SimpleVQA [212]	End-to-end training	ResNet-50	MLP
	Minimalistic VQA [213]	End-to-end training	ResNet-50 or Swin-B	MLP
	StarVQA [214]	End-to-end training	Transformer	MLP
	Lin et al. [215]	End-to-end training	HED, I3D	Transformer encoder, MLP
	Shen et al. [216]	End-to-end training	2D-CNN, 3D-CNN	MLP
	Xian et al. [217]	End-to-end training	DeblurGAN-v2, 3D-CNN	MLP
	Guan et al. [218]	End-to-end training	ResNet-50	ConvLSTM, MLP
	Lu et al. [219]	End-to-end training	ResNet-18	MLP
	FAST-VQA [220]	End-to-end training	Video Swin-T	MLP
	DOVER [221]	End-to-end training	Video Swin-T, ConvNeXt-T	MLP
	Kou et al. [222]	End-to-end training	Swin-T, 3D ResNet, blur encoder	MLP
	Yuan et al. [223]	End-to-end training	Visual quality transformer	MLP
	Ke et al. [224]	End-to-end training	Spatial and temporal Transformer encoder	MLP
	Liu et al. [225, 226]	Self-supervised learning	R(2+1)D	MLP
	Chen et al. [227]	Self-supervised learning	C3D	MLP
	Chen et al. [228]	Self-supervised learning	VSFA or RIRNet	GRU, MLP
	Madhusudana et al. [229]	Self-supervised learning	ResNet-50	GRU, regularized linear regressor
	Mitra and Soundararajan [230]	Self-supervised learning	ResNet-50	Weighted sum
	Jiang et al. [231]	Self-supervised learning	2D-CNN, 3D-CNN	Transformer encoder, MLP

3.3 No-reference Video Quality Assessment

In practical video-enabled applications, reference videos are often unavailable, so only NR VQA models can be used to assess video quality. Similar to FR VQA, we categorize NR VQA models into two groups based on their feature extraction modules: knowledge-driven NR VQA models and data-driven NR VQA models.

3.3.1 Knowledge-driven NR VQA

Classical VQA methods generally adopt the knowledge-driven approach and manually extract hand-crafted features to perform evaluation. Some early works extended no-reference image quality assessment (NR-IQA) methods, such as NIQE [232] and BRISQUE [233], to perform video quality assessment [234]. To better predict video quality, some classical VQA methods leverage the temporal information in videos. Xu et al. [180] proposed the V-CORNIA metric, which extracts CORNIA features as spatial features and utilizes the hysteresis temporal pooling method to predict video quality. As illustrated in Figure 6, Saad et al. [179] presented the Video BLIINDS model, which combines spatial-temporal natural video statistics (NVS) features and motion-related features to perform VQA. Mittal et al. [181] introduced the VIIDEO algorithm for VQA, which incorporates natural video statistics and inter sub-band statistics via a weighted sum. Korhonen et al. [182] developed the TLVQM measure, which extracts 45 low-complexity features and 30 high-complexity features and utilizes SVR to integrate them. Tu et al. [183] devised the VIDEVAL method, which extracts BRISQUE, GM-LOG, HIGRADE-GRAD, FRIQUEE, and TLVQM features and uses SVR to combine these features. Ebenezer et al. [184] proposed the Chip-QA metric, which extracts luma and color features as spatial features and exploits space-time chips to capture temporal motion features. This method achieves better performance than existing knowledge-driven NR VQA models while still keeping a low computational complexity.

3.3.2 Data-driven NR VQA

In comparison to knowledge-driven NR VQA methods, data-driven NR VQA models automatically extract quality-aware features of distorted videos with purpose-designed neural networks, which is simpler but more powerful. Based on the training methods of the feature extraction network, we divide data-driven BVQA methods into three categories: pre-trained model based methods, end-to-end training based methods, and unsupervised learning based methods.

(1) Pre-trained model based methods: These methods employ pre-trained quality-aware or semantic-related models as the feature extraction module, so that only the quality regressor needs to be trained to map the extracted features into video quality scores. VSFA [185] extracts semantic features using a ResNet-50 model pre-trained on ImageNet and subsequently utilizes a gated recurrent unit (GRU) network as the regressor to capture the temporal relationship. It also introduces a differentiable subjectively-inspired temporal pooling strategy to address the temporal hysteresis effect of the human visual system. We present the framework of VSFA in Figure 8. The authors of VSFA further proposed MDVSFA [186], which enhances the performance and generalization of VSFA by training it on four VQA databases including CVD2014 [57], KoNViD-1k [59], LIVE-Qualcomm [58], and LIVE-VQC [60]. Tang et al. [187] utilized VGG-16 [235] to extract the content features from frame patches and then employed a patch quality regression network and a patch weight estimation network to derive frame-level quality scores. Finally, they introduced a temporal memory-based pooling method to aggregate the frame-level quality scores into the video quality scores. Chen et al. [188] presented an NR VQA framework called Recurrent-In-Recurrent Network (RIRNet), which employs a ResNet-50 followed by a spatial pyramid pooling (SPP) layer to capture the content features of each video frame. These extracted features are then divided into multiple groups based on different temporal resolutions, and RIRNet utilizes multiple GRUs to aggregate the features at different temporal resolutions into the video quality score in a deeply supervised manner. Chen et al. [189] proposed a generalized spatial-temporal deep feature representation by imposing Gaussian distribution constraints and a pyramid temporal aggregation module on the spatial-temporal features extracted by the multi-stage layers of VGG-16 and enhanced by a GRU network.
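The subjectively-inspired temporal pooling mentioned above for VSFA [185] can be sketched as follows. This is a minimal Python sketch in the spirit of hysteresis pooling, not the authors' code: each frame score is replaced by a blend of a memory term (the worst recent score) and a current term (a softmin-weighted average of upcoming scores), and the blended scores are averaged. The window length `tau`, the blending weight `gamma`, and the function name are assumptions chosen for illustration.

```python
import math


def temporal_hysteresis_pool(q, tau=12, gamma=0.5):
    """Pool frame-level quality scores q into one video-level score.

    tau   : number of past/future frames considered (assumed value).
    gamma : weight of the memory term (assumed value).
    """
    n = len(q)
    pooled = []
    for t in range(n):
        # Memory term: viewers remember the worst quality seen recently.
        prev = q[max(0, t - tau):t]
        memory = min(prev) if prev else q[t]
        # Current term: softmin weights emphasise low-quality upcoming frames.
        nxt = q[t:min(n, t + tau + 1)]
        weights = [math.exp(-s) for s in nxt]
        current = sum(w * s for w, s in zip(weights, nxt)) / sum(weights)
        pooled.append(gamma * memory + (1 - gamma) * current)
    return sum(pooled) / n


# A short quality drop lowers the pooled score more than plain averaging does.
scores = [0.8] * 5 + [0.2] + [0.8] * 5
print(temporal_hysteresis_pool(scores, tau=3), sum(scores) / len(scores))
```

Because only `min`, `exp`, and weighted sums are involved, the same computation can be expressed with differentiable tensor operations and trained jointly with the regressor.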

Besides the GRU, the Transformer has been exploited as a superior feature aggregation model for NR VQA. You [190] applied a basic Transformer encoder to NR VQA. He first utilized a perceptual hierarchical network with an integrated attention module to extract quality-aware features of each frame and then employed a time-distributed 1D CNN consisting of Conv1D, MaxPool, and Dropout layers to reduce the dimensions of the extracted features. Finally, he used a standard Transformer encoder with a mask strategy to derive the video quality scores. You and Lin [191] further replaced the time-distributed 1D CNN module with a shared multi-head attention module. Wu et al. [192] first utilized video Swin-T [236] to extract the spatial-temporal features of the video and then introduced a temporal distorted-content Transformer for aggregating the content features and obtaining the video quality score. More specifically, the temporal distorted-content Transformer consists of a transformer-based spatial-temporal distortion extraction (STDE) module for discerning various kinds of temporal variations and extracting the temporal distortion features, and an encoder-decoder-like temporal content transformer (TCT) for addressing temporal quality attention issues.

Since image classification models cannot capture quality-aware and motion-aware features, some studies leverage NR IQA models to represent quality-aware spatial features and action recognition or optical flow models to extract motion-aware features. Xu et al. [193] developed a spatiotemporal distortion-aware model (STDAM) for NR VQA. Specifically, they employed ResNet-18 to extract content features from video frames at two spatial resolutions and then utilized a graph convolution module and an attention module to aggregate these content features into frame-level features. Note that the ResNet-18, along with the graph convolution module and the attention module, is pre-trained on KonIQ-10k [237], a large-scale IQA dataset. Besides, they computed the optical flow maps of videos and used ResNet-18 to extract the motion features from these optical flow maps. Finally, they utilized a bi-directional LSTM network to map the frame-level features and motion features into the video quality score. Ying et al. [62] introduced PatchVQ, which extracts spatial features using the PaQ-2-PiQ [238] backbone pre-trained on the LIVE-FB dataset [238] and extracts spatio-temporal features using a 3D ResNet-18 backbone [239] pre-trained on the Kinetics dataset [240]. Furthermore, PatchVQ utilizes a region-of-interest pooling (RoIPool) layer and a segment-of-interest pooling (SoIPool) layer to capture local regions of interest in space and time. Finally, InceptionTime is employed to regress the pooled features into the video quality score. Ying et al. [194] further proposed a multi-modal NR VQA model designed for live streaming telepresence content. It consists of three branches, which correspond to the feature extraction networks of the audio, image, and video modalities, respectively. Specifically, the frame-level and patch-level features are extracted by MobileNetV3 [241] pre-trained on the LIVE-FB dataset [238], the video-level features are extracted by the R(2+1)D model [242], and the audio-level features are extracted by MobileNetV1 pre-trained on the Google AudioSet dataset [243].

Li et al. [195, 196] employed UNIQUE [244], an IQA model pre-trained on four IQA datasets including BID [245], LIVE Challenge [246], SPAQ [247], and KonIQ-10k [237], to extract quality-aware spatial features and utilized SlowFast [248] to extract temporal features. Subsequently, a GRU network is used to model the spatial and temporal features and regress them into the video quality scores. Similar to the methods of Li et al. [195, 196], Liu et al. [197] utilized an IQA model named KonCept512 [237] to capture the static appearance degradation and the action recognition model SlowFast [248] to represent dynamic motion degradation. They then introduced a progressively residual aggregation module to hierarchically merge these two kinds of features and derive the video quality scores. Zhu et al. [201] proposed a spatiotemporal interaction strategy for assessing the quality of user-generated videos. Specifically, they extracted feature maps of video frames using a ResNeXt-101 [249] pre-trained on KonIQ-10k [237], calculated the mean and standard deviation of these feature maps as the spatial features, and computed the difference between the spatial features of two consecutive frames to derive the motion features. Finally, a Transformer model was utilized to aggregate the spatial and motion features into the video quality scores. Kwong et al. [204] first used a self-supervised learning method to train a multi-channel CNN on the IQA task without using quality labels and subsequently fine-tuned the multi-channel CNN model with motion-aware features followed by a GRU network for the NR VQA task. Lu et al. [200] calculated deep structural similarities between the feature maps of consecutive frames extracted by a quality-aware pre-trained DNN model to capture temporal distortions arising from frame rate variations, object movement, and camera motion.

A video may exhibit various kinds of distortions, so employing a broader range of feature descriptors can help address complex video distortions. For example, Wang et al. [65] proposed a feature-rich NR VQA model named UVQ, which incorporates features extracted from three models pre-trained for compression level classification, action recognition, and distortion type classification, respectively. In [198], Wang et al. further replaced the EfficientNet-b0 [250] and D3D [251] backbones with MobileNet [252] and MoViNet [253] to achieve a light-weight UVQ. Telili et al. [199] introduced a double Bi-LSTM network for video quality assessment, where the first Bi-LSTM is employed to spatially pool the features extracted by a ResNet-50 pre-trained on KonIQ-10k [237] and the second Bi-LSTM is used to temporally pool the spatial features into the video quality scores. Zhang et al. [67] developed a multi-dimensional VQA (MD-VQA) model, which leverages EfficientNetV2 [254] to extract the semantic features, utilizes five distortion descriptors including blur [255], noise [256], block effect [257], exposure [182], and colorfulness [258] to measure the distortion level, and employs ResNet3D-18 [259] to capture the motion information. Zhang et al. [202] considered five characteristics of the HVS for video quality assessment: visual saliency, edge masking, content dependency, motion perception, and temporal hysteresis. For content dependency and edge masking, they used ConvNeXt [260] to extract the content features and edge feature maps from the original RGB frames and the corresponding Canny edge maps, respectively. For visual saliency, they applied the saliency detection model SAMNet [261] to extract saliency maps and then used the saliency maps to weight the content and edge feature maps. For motion perception, they utilized SlowFast to extract the motion features. Finally, for temporal hysteresis, they combined the saliency-weighted content and edge features with the motion features and employed the GRU module with the subjectively-inspired temporal pooling model in [185] to regress the combined features into video quality scores. Chen et al. [203] developed a dynamic expert-knowledge ensemble strategy for generalizable video quality assessment, which relies on one image classification model, ResNet-50 [262], one action recognition model, C3D [263], and three trained NR VQA models, RIRNet [188], PVQ [62], and LSCT-PHIQNet [190], as the experts. They then trained an ensemble model to make full use of the complementary information from these experts using a contrastive learning method.

Recently, vision-language pre-training methods have been exploited for NR VQA. For example, Wu et al. [205, 206] proposed to combine the spatial naturalness index NIQE [232], the temporal naturalness index TPQI [264], and the contrastive language-image pre-training (CLIP) model [265] with a quality-guided text prompt to achieve zero-shot NR VQA. Wu et al. [205] further introduced a multi-dimensional language-prompted NR VQA model that employs FAST-VQA [220] to capture low-level-aware features, CLIP-visual to extract local CLIP features, and CLIP-textual to extract dimension-oriented quality-guided text features. They calculated the cosine similarity between the visual features (fused from the low-level-aware features and the local CLIP features) and the text features as the video quality score. Liu et al. [207] extracted a range of quality-aware features from the image modality, the video modality, and the text-to-image modality. Seven pre-trained models were employed to extract diverse features from these three modalities, and a quality-aware acquisition module was designed to adaptively capture the diverse and complementary information among them. They further utilized a knowledge distillation method to transfer the knowledge from these modalities to a lightweight VQA model.
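The prompt-based scoring idea behind these CLIP-based methods can be illustrated with a small Python sketch. It is a toy example under explicit assumptions, not the method of [205, 206]: the embeddings are hypothetical hand-written vectors standing in for the outputs of CLIP's visual and textual encoders, the antonym prompt pair ("good" versus "bad" quality) and the temperature value are assumed, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_quality(visual_emb, good_text_emb, bad_text_emb, temperature=0.01):
    """Softmax over the similarities to a 'good' and a 'bad' quality prompt.

    Returns a score in (0, 1); higher means closer to the 'good' prompt.
    The temperature is an assumed value.
    """
    s_good = cosine(visual_emb, good_text_emb)
    s_bad = cosine(visual_emb, bad_text_emb)
    # Two-way softmax written as a logistic function of the similarity gap.
    return 1.0 / (1.0 + math.exp((s_bad - s_good) / temperature))


# Hypothetical 2-D embeddings for illustration only.
good_prompt = [1.0, 0.0]
bad_prompt = [0.0, 1.0]
print(zero_shot_quality([0.9, 0.1], good_prompt, bad_prompt))
```

A video-level score would typically be obtained by averaging such frame-level scores, possibly after fusing the visual embedding with other features as described above.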

(2) End-to-end training based methods: The end-to-end training approach enables the BVQA model to directly learn the quality-aware feature representation from the raw pixels of a video. Liu et al. [208] proposed a multi-task BVQA model, V-MEON, by jointly optimizing a 3D-CNN model for quality assessment and compression distortion classification. You and Korhonen [209] also employed a 3D-CNN model to extract features from a video clip and subsequently employed an LSTM network to regress the 3D-CNN features into the video quality scores. Note that the two networks are trained independently. Yi et al. [210] introduced an attention-based NR VQA model that tackles the problem of uneven spatial distortion by training the VGG network with a non-local operator in an end-to-end manner. Wen and Wang [211] developed an IQA-based VQA method, which uses a ResNet-18 to compute the frame-level quality scores and then average-pools the frame-level quality scores into the video quality scores. They used the L1 loss and the rank loss to optimize the proposed VQA model.
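A combination of an L1 loss and a rank loss, as mentioned above for [211], can be sketched in plain Python. The exact formulation used in [211] is not given here, so the pairwise hinge form below is only one common choice and should be read as an assumption, as are the `margin` and `weight` parameters and the function names.

```python
def l1_loss(pred, mos):
    """Mean absolute error between predictions and subjective scores."""
    return sum(abs(p - m) for p, m in zip(pred, mos)) / len(pred)


def pairwise_rank_loss(pred, mos, margin=0.0):
    """Hinge penalty for every pair whose predicted order contradicts the MOS order."""
    total, count = 0.0, 0
    n = len(pred)
    for i in range(n):
        for j in range(i + 1, n):
            if mos[i] == mos[j]:
                continue  # ties carry no ranking information
            sign = 1.0 if mos[i] > mos[j] else -1.0
            total += max(0.0, margin - sign * (pred[i] - pred[j]))
            count += 1
    return total / count if count else 0.0


def combined_loss(pred, mos, weight=1.0):
    """L1 loss plus a weighted rank loss (the weight is an assumed hyperparameter)."""
    return l1_loss(pred, mos) + weight * pairwise_rank_loss(pred, mos)


print(combined_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

The L1 term anchors predictions to the absolute score scale, while the rank term rewards correct ordering within a batch, which relates to the rank correlation criteria commonly used to evaluate VQA models.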

To better handle temporal-related distortions in videos, some NR VQA studies leverage motion features or spatial-temporal modeling methods to improve the performance. Sun et al. [212] proposed SimpleVQA, a simple NR VQA framework consisting of an end-to-end trained multi-scale spatial feature extraction module and a pre-trained motion extraction module. Sun et al. [213] further proposed a minimalistic VQA model, which includes four basic blocks: a video preprocessor (for aggressive spatiotemporal downsampling), a spatial quality analyzer, an optional temporal quality analyzer, and a quality regressor, all with the simplest possible instantiations. Shen et al. [216] presented an end-to-end NR VQA model that incorporates spatiotemporal feature fusion and hierarchical information integration. It includes a feature extraction model using 2D and 3D convolutional layers for gradual extraction of spatiotemporal features from raw video clips and a hierarchical branching network for fusing multiframe features. Xian et al. [217] proposed to generate a simulated video using a generative adversarial network (GAN)-based image restoration model as a pseudo reference video and then developed a pyramidal spatiotemporal feature hierarchy (PSFH) network to extract the multi-stage spatiotemporal features of the distorted videos and the differences between the distorted videos and the pseudo reference videos. Guan et al. [218] developed a visual and memory attention-based NR VQA model. They proposed a visual attention module to derive spatial-temporal attention-guided representation for frame-level quality-aware features and a memory attention module to map the frame-level quality-aware features into the video-level quality scores. Lu et al. [219] proposed a grey-level co-occurrence matrix based texture measure to select representative patches from high-resolution video content and subsequently employed a 2D-CNN backbone (i.e., ResNet-18) to extract quality-aware features from these selected patches.

Recently, Vision Transformers have demonstrated outstanding performance in various computer vision tasks. Hence, an increasing number of NR VQA methods are adopting Transformer-based architectures. Xing et al. [214] introduced StarVQA, which constructs a Transformer model for NR VQA by combining divided space-time attention and then devises a vectorized regression loss that encodes the mean opinion score into a probability vector. They further developed StarVQA+ [266] by co-training StarVQA on both images and videos across different kinds of datasets. Lin et al. [215] took into account the visual saliency mechanism and employed holistically-nested edge detection [267] to choose the salient regions within the video. The selected salient video clips are subsequently fed into Inflated 3D ConvNet (I3D) [240] to extract the features, and a Transformer encoder is employed to regress the features into the video quality scores. Wu et al. [220] introduced FAST-VQA, a fragment attention network consisting of a video Swin Transformer and the gated relative position bias module, which is specifically designed to take mini-patches sampled from the video as the input. Wu et al. [221] further considered the video quality from two aspects, the technical and aesthetic perspectives, and proposed the disentangled objective video quality evaluator (DOVER) to independently learn two quality assessment models, each focusing on one of these perspectives. Kou et al. [222] introduced StableVQA, a NR VQA model designed to evaluate video stability. This model leverages Swin Transformer to capture video content, utilizes 3D ResNet to extract the motion information from the optical flow modality, and incorporates a blur encoder [268] to measure the blur distortion. Yuan et al. 
[223] developed the Visual Quality Transformer, which utilizes a multi-pathway temporal network consisting of multiple sparse temporal attention modules to sample keyframes and measure the degree of coexisting distortions of a video. Ke et al. [224] presented a multi-resolution transformer for NR VQA, which first samples spatially aligned patches from the multi-resolution frames input to preserve high-resolution details and global content and then performs a factorized spatial-temporal transformer to derive the video quality scores.
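The idea of feeding mini-patches sampled from the video to the network, as in FAST-VQA [220], can be illustrated on a single frame. The sketch below assumes a uniform grid with one randomly placed patch per cell, stitched into a small image; the grid size, patch size, and function name are hypothetical, and details of the actual method (for example how patches are aligned across frames) are not reproduced.

```python
import random


def sample_fragment(frame, grid=2, patch=2, rng=None):
    """Stitch one random patch per grid cell into a (grid*patch)^2 image.

    frame is a 2D list (height x width) of pixel values.
    """
    rng = rng or random.Random(0)
    height, width = len(frame), len(frame[0])
    cell_h, cell_w = height // grid, width // grid
    if patch > cell_h or patch > cell_w:
        raise ValueError("patch must fit inside a grid cell")
    out = [[0] * (grid * patch) for _ in range(grid * patch)]
    for gy in range(grid):
        for gx in range(grid):
            # Random top-left corner of the patch inside the current cell.
            top = gy * cell_h + rng.randint(0, cell_h - patch)
            left = gx * cell_w + rng.randint(0, cell_w - patch)
            for dy in range(patch):
                for dx in range(patch):
                    out[gy * patch + dy][gx * patch + dx] = frame[top + dy][left + dx]
    return out


# 8x8 toy frame whose pixel value encodes its position (row * 8 + column).
toy_frame = [[r * 8 + c for c in range(8)] for r in range(8)]
fragment = sample_fragment(toy_frame, grid=2, patch=2, rng=random.Random(0))
```

Each patch keeps its original resolution, so local distortions remain visible, while the grid layout retains a coarse view of the whole frame at a fraction of the input size.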

(3) Self-supervised learning based methods: Both pre-trained model based methods and end-to-end learning based methods require a large-scale VQA dataset to train a robust NR VQA model. However, obtaining high-quality labels for VQA datasets, typically acquired through subjective VQA experiments, is a time-consuming and expensive process. Therefore, some studies attempt to use self-supervised or unsupervised learning methods for NR VQA, which aim to learn quality-aware feature representation from large-scale unlabelled video data. Liu et al. [225, 226] proposed a weakly supervised learning method for NR VQA. They first constructed a large-scale VQA dataset by degrading high-quality video clips with video compression and transmission algorithms and calculating the quality scores of the degraded videos with multiple FR VQA methods. Subsequently, they introduced a NR VQA model with a heterogeneous knowledge ensemble to learn representation from the weakly labeled data. Chen et al. [227] proposed a curriculum-style unsupervised domain adaptation method to tackle the cross-domain VQA challenge. The approach consists of two main stages. First, they performed domain adaptation between the source and target domains to predict the rating distribution for target samples, which provides a more accurate understanding of the subjective aspects of VQA. Second, they treated the samples in the confident subset as the easier tasks in the curriculum, and conducted a fine-grained adaptation between these two subsets to refine the prediction model. Chen et al. [228] presented a self-supervised pre-training method for video quality assessment using the contrastive learning approach. Specifically, they first generated a range of distorted video samples with diverse distortions and visual content through a carefully designed distortion augmentation strategy. 
Then, they applied contrastive learning to enhance feature representations by maximizing agreement between future frames and their corresponding predictions in the embedding space. Moreover, they introduced a distortion prediction task as an extra learning objective, encouraging the model to differentiate between various distortion categories in the input video.
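Maximizing agreement between a predicted embedding and the embedding of the true future frame is commonly implemented with an InfoNCE-style objective. The survey does not state which contrastive loss [228] uses, so the sketch below is an assumption: it shows the generic InfoNCE loss with cosine similarity, with a hypothetical temperature and function name.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-probability of the positive among all candidates."""

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    # The positive logit comes first, followed by one logit per negative.
    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp with the maximum subtracted for stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy embeddings: the prediction matches the true future frame closely.
prediction = [1.0, 0.0]
future_frame = [1.0, 0.0]
other_frames = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(prediction, future_frame, other_frames)
loss_misaligned = info_nce(prediction, [0.0, 1.0], [[1.0, 0.0]])
```

The loss is close to zero when the prediction is most similar to its own future frame and grows when a negative sample is a better match.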

Madhusudana et al. [229] proposed to utilize distortion type identification and degradation level determination as the auxiliary tasks to train a NR VQA model consisting of a CNN for extracting spatial features and a GRU for extracting temporal information through the contrastive learning method. Mitra and Soundararajan [230] developed a self-supervised multi-view contrastive learning framework to learn quality-aware spatio-temporal representation by comparing features between frame differences and frames by treating them as a pair of views. The learned features were subsequently compared with a dataset of unaltered, high-quality natural video patches to derive the quality of the distorted video. Jiang et al. [231] introduced a multi-task self-supervised representation learning framework for NR VQA. Three tasks including the distortion type classification, frame rate classification, and bitrate evaluation were used to train a Siamese network to capture spatiotemporal differences between the original video and the corresponding distorted ones. This model contains 3D-CNN and 2D-CNN to model short-term spatio-temporal dependencies and a Transformer to model the long-term spatio-temporal dependencies.

4 Objective Video Quality Assessment: Specific-purpose Models

The following section provides an overview of emerging topics in the field of video quality assessment that have gained attention in recent years. These topics include compressed VQA, streaming VQA, stereoscopic VQA, VR VQA, framerate and frame interpolation VQA, audio-visual VQA, HDR or WCG VQA, screen or game VQA, and various other emerging topics. To ensure a clear organization, we have classified these surveyed algorithms based on their respective topics or applications.

4.1 Compressed VQA

In addition to the previously mentioned video quality assessment approaches, there exists a set of specialized methods tailored for evaluating compressed videos, which are the focus of this section. Compressed video assessment involves unique challenges and considerations due to the data reduction techniques applied during compression. To address these aspects, researchers have developed various methodologies dedicated to this domain.

4.1.1 FR and RR Methods

Full reference and reduced reference methods compare the compressed video with its original, uncompressed version (or with features extracted from it), allowing for a thorough and accurate analysis of the compression quality. In [269], Xu et al. proposed the FR free-energy principle inspired video quality metric (FePVQ), which is applied to optimize perceptual video coding. FePVQ separates videos into orderly and disorderly regions based on the free-energy principle, where fixation or visual attention is associated with objects exhibiting significant motion according to human visual speed perception, extending the principle into the spatio-temporal domain for VQA. VMAF, developed by Netflix [270], is a full-reference, perceptual video quality metric designed to closely align with subjective Mean Opinion Score ratings. It employs a support vector machine to combine scores from multiple quality assessment algorithms, aiming to estimate the perceived quality of video content by considering degradation caused by compression and rescaling. In [161, 212], Sun et al. proposed a FR deep learning-based VQA framework for evaluating the quality of compressed User-Generated Content videos. The proposed framework consists of three modules: a feature extraction module that fuses features from intermediate layers of a CNN to create quality-aware feature representation, a quality regression module that uses FC layers to regress the features into frame-level scores, and a subjectively-inspired temporal pooling strategy to aggregate frame-level scores into video-level scores. In [168], Ma et al. proposed a RR VQA method for compressed videos. In the model, the spatial aspect is measured using an energy variation descriptor that captures the energy change and texture masking property of the human visual system, while the temporal aspect is captured using the generalized Gaussian density function to model the interframe histogram distribution. 
The city-block distance is then used to calculate the histogram distance between the original video sequence and the encoded one.
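The temporal comparison in the RR method of [168] can be illustrated with a simplified sketch. As stated above, the original method models the interframe histogram with a generalized Gaussian density; the sketch below skips that fitting step and compares raw empirical histograms directly, which is a simplifying assumption. The bin count and function names are hypothetical.

```python
def frame_difference_histogram(frame_a, frame_b, bins=8, max_abs_diff=255):
    """Normalized histogram of absolute pixel differences between two frames.

    Frames are flat lists of integer pixel values in [0, max_abs_diff].
    """
    counts = [0] * bins
    for a, b in zip(frame_a, frame_b):
        diff = abs(a - b)
        index = min(diff * bins // (max_abs_diff + 1), bins - 1)
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts]


def city_block_distance(hist_a, hist_b):
    """City-block (L1) distance between two histograms."""
    return sum(abs(p - q) for p, q in zip(hist_a, hist_b))


# Toy example: the reference changes little between frames, the encoded one a lot.
ref_hist = frame_difference_histogram([10, 20, 30, 40], [12, 22, 32, 42])
enc_hist = frame_difference_histogram([10, 20, 30, 40], [110, 120, 130, 140])
temporal_distance = city_block_distance(ref_hist, enc_hist)
```

In a reduced-reference setting only the compact histogram description of the reference needs to be transmitted, not the reference frames themselves.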

4.1.2 NR Methods

No-reference video compression quality assessment methods are more generalized, as they do not require any reference information, such as the original uncompressed video, for evaluation. In [271], Lee et al. proposed a NR video quality assessment method for scalable video coding that quantifies video quality using decoding parameters from compressed bitstreams, including both the base layer and enhancement layer. The proposed approach assesses the quality of the enhancement layers based on statistics of coding parameters and their relationship with the quality of the base layer, providing the assessment of the overall video quality. In [272], Lin et al. introduced a NR VQA algorithm, which operates in the compressed domain and considers three key factors: quantization parameter, motion, and bit allocation factor, extracted from the compressed bitstream. The algorithm also takes into account the characteristics of the human visual system for improved quality estimation. In [21], Zhu et al. proposed a NR compressed video quality prediction model based on discrete cosine transform (DCT). The model has two stages: distortion measurement, where efficient frame-level features are extracted from DCT coefficients of decoded frames to quantify distortion, and nonlinear mapping, where a trained multilayer neural network takes video-level features obtained through temporal pooling as inputs and predicts the quality score of the video sequence. In [273], Huang et al. presented a NR VQA method for videos compressed using HEVC, without access to the bitstream. The proposed method estimates quantization levels based on transform coefficients extracted from the decoded video pixels, and models HEVC transform coefficients using a joint-Cauchy probability density function. These features are then used to predict Mean Opinion Scores for subjective video quality assessment using Elastic Net regression. Liu et al. [208] proposed a NR VQA model named V-MEON. 
The proposed model uses a multi-task deep neural network framework to jointly estimate perceptual quality and codec type, leveraging complementary sets of labels obtained at low cost. The training process involves pre-training early convolutional layers with codec classification subtask, and jointly optimizing the entire network with the two subtasks together, while incorporating 3D convolutional layers for improved spatiotemporal feature extraction and performance enhancement. In [65], Wang et al. introduced a NR VQA framework based on deep neural networks that comprehensively analyzes the significance of content, technical quality, and compression level in perceptual quality assessment. In [274], Lin et al. addressed the issue of Perceivable Encoding Artifacts (PEAs) in compressed videos, which significantly reduce video quality. The study investigates four spatial PEAs (blurring, blocking, bleeding, and ringing) and two temporal PEAs (flickering and floating) and proposes a compressed video quality index based on saliency-aware spatio-temporal artifact detection.

4.2 Streaming VQA

The variability of streaming environments and the intricacies of human QoE responses have presented significant challenges for delivering optimal content distribution services. In the past decade, there has been significant effort invested in the development of objective QoE models.

4.2.1 QoS-driven User QoE Assessment

The QoS-driven user QoE assessment exploits the causal relationship between QoS and QoE problems. Liu et al. [275] conducted an analysis of the effects of client-side, video coding, and CDN factors on QoE and proposed a video control plane capable of dynamically optimizing video delivery by considering a comprehensive perspective of the above-mentioned client and network conditions. Rodríguez et al. [276] addressed the impact of frequent video quality level (VQL) switching on QoE through subjective testing, objective modeling, and computer/network configurations. Their contributions include identifying the strong impact of frequent VQL switching on users’ attention, showing that spatial and temporal resolution switching affect QoE differently, identifying the key factors that characterize the impact of VQL switching, and developing a switching degradation factor model to account for changes in QoE. In [277], Nightingale et al. evaluated HEVC video streaming under network impairments, quantifying their impact on perceptual quality and offering insights into influencing factors, thus informing QoE-oriented HEVC streaming development.

4.2.2 QoS and QA-driven User QoE Assessment

Despite the diversity in the implementations of QoE models, recent studies have increasingly converged towards utilizing the QoS driven user QoE assessment and the visual quality measurement at the same time. In [278], Bentaleb et al. conducted an analysis of the effects of chunk quality, startup delay, number of stalls, average video quality, and video quality switches on QoE and introduced an architecture for dynamic resource allocation and management in DASH systems. In [72], Duanmu et al. developed a unified QoE prediction model called Streaming QoE Index (SQI), which considers video presentation quality, initial buffering, and stalling events as combined factors in determining QoE. The SQI model takes into account the overall experience of video quality, stalling events, and their interaction for a more comprehensive QoE assessment. In [279], Bampis et al. introduced a machine learning framework called Video Assessment of Temporal Artifacts and Stalls (Video ATLAS) for accurately predicting user QoE. The framework combines multiple QoE-related features, including objective quality features, rebuffering-aware features, and memory-driven features, to make reliable QoE predictions. In [280], Bampis et al. proposed a machine learning-based Nonlinear Autoregressive Network with Exogenous Inputs (NARX) model, which utilizes objective metrics, rebuffering-related information, and memory-related features for predicting QoE in video streaming. In [281], Ghadiyaram et al. developed a QoE evaluation tool, called the time-varying QoE Indexer, which considers interactions between stalling events, analyzes the spatial and temporal content of a video, predicts perceptual video quality, models the state of the client-side data buffer, and provides continuous-time quality scores that are in good agreement with human opinion scores. In [74], Duanmu et al. proposed the ECT-QoE. 
The proposed framework is based on the expectation confirmation theory (ECT) to construct an ECT-based QoE measure (ECT-QoE) that considers spatial and temporal expectation confirmations separately. The effects of adaptation intensity, adaptation type, intrinsic quality and content type on the end user QoE are considered in the method. Eswara et al. [282] introduced LSTM-QoE, a new dynamic model that utilizes LSTM networks for predicting continuous QoE. The model incorporates a network of LSTMs optimized for QoE prediction performance using advanced QoE features, and has the potential for real-time QoE computation. Rao et al. [283] proposed a bitstream-based video quality model that utilizes both metadata such as codec type, framerate, resolution and bitrate as well as the video pixel information. Duanmu et al. [284] proposed a Bayesian streaming quality index (BSQI) model that integrates prior knowledge on the human visual system and human annotated data in a principled manner to predict objective QoE. Through analysis of subjective characteristics in streaming videos from subjective studies, the authors demonstrated that a family of QoE functions forms a convex set, and they optimized the BSQI model using a variant of projected gradient descent over a training video database.
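For readers unfamiliar with NARX models such as the one used in [280], the generic form can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the cited work; conventions also differ on whether the current input $\mathbf{x}_t$ is included.

\[
\hat{y}_t = f\left(y_{t-1}, \ldots, y_{t-d_y},\; \mathbf{x}_t, \mathbf{x}_{t-1}, \ldots, \mathbf{x}_{t-d_x}\right),
\]

where $\hat{y}_t$ is the predicted QoE at time $t$, $y_{t-1}, \ldots, y_{t-d_y}$ are past QoE values (past predictions when the model runs in closed loop), $\mathbf{x}_t$ collects the exogenous inputs at time $t$ (for example the objective quality, rebuffering-related, and memory-related features mentioned above), $f$ is a nonlinear mapping such as a neural network, and $d_y$ and $d_x$ are the numbers of output and input lags.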

4.2.3 Data-driven Approaches

Another type of model utilizes data-driven approaches, employing machine learning models such as random forests and neural networks, which impose noninformative priors on the model parameters to achieve effective results. In [286], Singh et al. developed a no-reference QoE monitoring module for HTTP/TCP video streaming using H.264/AVC video codec in the context of IPTV. The proposed approach utilizes pseudo-subjective quality assessment (PSQA) methodology based on random neural network (RNN), considering the quantization parameter (QP) used in video compression and playout interruptions as metrics impacting QoE, as these factors are directly related to perceived quality in adaptive HTTP streaming. In [287], Li et al. proposed a novel weakly-supervised domain adaptation approach for continuous-time QoE evaluation, utilizing a small amount of labeled data in the source domain and weakly-labeled data (retrospective QoE labels only) in the target domain. The approach involves learning effective spatiotemporal segment-level feature representations using a combination of 2D and 3D convolutional networks, and developing a multi-task prediction framework that simultaneously predicts continuous-time and retrospective QoE.

4.3 Stereoscopic VQA

The advancement of 3D movies and TV programs has popularized stereoscopic or 3D Video Quality Assessment. Research in this area holds both theoretical and practical significance, as the current state of 3D content, capture, and display devices still has considerable room for improvement in terms of delivering optimal visual experiences.

4.3.1 2D Extension Methods

The quality of stereoscopic 3D videos can be assessed by utilizing established algorithms for image quality assessment and video quality assessment that are traditionally used for 2D content. These models employ IQA and VQA algorithms on the distinct views, including the disparity view, of stereoscopic 3D videos. Typically, these IQA and VQA models are applied at the level of individual frames or views to estimate the perceptual quality of a stereoscopic 3D video.

In [288], Yasakethu et al. explored the correlation between subjective quality measures and various objective quality measures, such as PSNR, SSIM, in the context of 3D video content. In [289], Nur et al. utilized classical 2D algorithms by directly applying them to each frame of the stereoscopic video and obtaining an average predicted quality as the global quality score for the stereoscopic video. In order to address the systematic deviation in quality prediction for asymmetric distortion in stereo videos using weighted average 2D evaluation methods, Wang et al. [83] proposed a dynamic weight method based on binocular rivalry theory. The weighting strategy combines the local energy information of image patches and integrates the prediction quality of left and right videos, resulting in improved performance for existing FR quality evaluation methods. In [290], Hong et al. proposed the 3-D-PQI metric to quantify video compression distortion in stereoscopic videos. The proposed model incorporates the measurement of local video compression distortions in both the spatial and temporal domains for both the left and right views, taking into consideration the contrast and motion masking effects. To accumulate these local spatial and temporal distortions, a stereo saliency-based pooling strategy is employed. Finally, the 3-D-PQI is derived through a texture energy-based fusion of the distortion measurements obtained from the left and right views.
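The difference between plain averaging of the two views and an energy-based dynamic weighting can be sketched as follows. This is a toy illustration in the spirit of the weighting strategy described for [83], not its actual formulation: using pixel variance as the energy measure, applying it to whole views rather than local patches, and the function names are all assumptions made here.

```python
def view_energy(pixels):
    """Energy of a view, approximated here by the variance of its pixel values."""
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def fuse_view_quality(q_left, q_right, left_pixels, right_pixels, eps=1e-12):
    """Weight per-view quality scores by the relative energy of each view.

    With equal energies this reduces to the plain average of the two scores.
    """
    e_left = view_energy(left_pixels) + eps
    e_right = view_energy(right_pixels) + eps
    w_left = e_left / (e_left + e_right)
    return w_left * q_left + (1.0 - w_left) * q_right


# The left view is textured, the right view is flat (e.g., heavily blurred).
textured = [0, 10, 0, 10]
flat = [5, 5, 5, 5]
fused_asymmetric = fuse_view_quality(30.0, 40.0, textured, flat)
fused_symmetric = fuse_view_quality(30.0, 40.0, textured, textured)
```

Under asymmetric distortion the fused score follows the higher-energy view instead of the fixed midpoint that equal weighting would give.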

4.3.2 Stereo Vision Perception Methods

In addition to the 2D extension methods, researchers have explored stereo vision perception methods. Several full-reference models for 3D VQA have also been developed. In [291], Galkandage et al. proposed a FR metric for stereoscopic video quality assessment. The model considers binocular suppression and recurrent excitation, proposes a novel image quality metric based on the HVS, and extends this metric to the video domain by introducing an optimized temporal pooling strategy. Appina et al. [292] proposed the DeMo3D model, where the Bivariate Generalized Gaussian Distribution is employed to measure the correlation between motion and disparity maps at three different scales and six directions. Subsequently, 2D evaluation methods are applied to extract spatial features to predict the quality of stereo videos. In [293], Zhang et al. proposed a FR VQA method for synthesized 3D videos. The method involves decomposing the synthesized video into spatially neighboring temporal layers, using gradient features and strong edges of depth maps to detect flicker distortions, and applying dictionary learning and sparse representation to effectively represent temporal flicker distortion. A rank pooling method is then used to combine the temporal flicker distortion measurement with conventional spatial distortion measurement for overall quality assessment of synthesized 3D videos. In [294], Galkandage et al. proposed a FR VQA model. The model considers the motion sensitivity of the HVS and extracts both non-motion sensitive and motion sensitive energy terms to mimic the response of the HVS.

Researchers have also made progress in the development of reduced-reference models for 3D VQA. These models aim to efficiently assess the visual quality of 3D content while using only a limited set of reference information. In [295], Hewage et al. proposed a RR quality metric for color plus depth 3D video transmission. The metric utilizes edge information from depth maps and corresponding color images in the areas near edges to assess video quality. Yu et al. [296] proposed a RR VQA model. The proposed method uses motion intensity to extract RR frames for temporal characteristics in stereo video. Binocular fusion and rivalry portions are modeled based on the internal generative mechanism of human visual perception. RR frame quality indicators are computed for these portions, and then compared between the original and distorted frames. A temporal pooling strategy, with motion intensity influencing pooling parameters, is applied to obtain the final stereo video quality score.

No-reference models for 3D VQA enable the assessment of quality without relying on any reference information, making them particularly valuable for real-world scenarios. In [297], Chen et al. proposed a NR VQA model. In the proposed method, auto-regressive prediction-based disparity entropy (ARDE) and energy weighted video content measurement features are introduced, inspired by the free-energy principle and binocular vision mechanism. Binocular summation and difference operations are combined with natural scene statistic measurement and ARDE measurement to assess the impact of texture and disparity in video quality evaluation. Yang et al. [298] proposed a NR VQA model in which a sum map that retains the basic information of the 3D video is calculated. A saliency map and sparse coefficients are then computed on the sum map to predict the video quality. In [299], a NR VQA model was proposed. The model is built on a 3D CNN to extract local spatiotemporal information and global temporal information, and the global temporal clues are considered in the quality fusion. In [300], Yang et al. proposed a NR VQA model. In the model, key frame sequences are extracted. The binocular summation and difference are calculated on the extracted sequences, and texture statistic measurements are then conducted to predict the 3D video quality.

Statistical dependencies between motion and disparity information are employed in some methods. In [301], Appina et al. proposed a NR VQA model called MoDi3D. The BGGD parameters of the joint statistical dependencies between motion and disparity subband coefficients are estimated as the features, which are pooled to predict the 3D video quality. In [302], Biswas et al. proposed a NR VQA model of stereoscopic 3D videos. In the model, the correlation between the motion and depth components is computed and represented as a correlation map. The correlation maps are then subjected to steerable pyramid decomposition at various scales and orientations. The resulting subband decompositions of the correlation map are modeled using the UGGD models. The parameters of the UGGD model are estimated to predict the quality of the video content.

4.4 VR VQA

The metrics for omnidirectional videos (ODV) need to consider the unique aspects of ODV, such as the spherical nature and viewing characteristics, and often address projection distortions through distortion weights or resampling techniques. Distortion weights in ODV metrics are determined based on the level of projection distortion at a specific location, while resampling techniques may involve extracting viewports with low projection distortions, converting ODV into a projection format with low distortions, or extracting uniformly distributed points on the sphere. To this end, many traditional technique-based and deep learning-based methods have been proposed for the VR VQA problem.

4.4.1 Traditional Visual Computing Techniques

Traditional visual computing techniques have been developed and applied for the purpose of assessing the quality of VR videos. In [303], Sun et al. proposed a FR VQA method. The method involves multiplying the error of each pixel on projection planes by a weight to ensure equivalent spherical area in observation space, thereby avoiding error propagation caused by conversion from resampling representation space to observation space and improving the accuracy and reliability of quality evaluation results. In [304], Zakharchenko et al. introduced new location-invariant quality metrics for spherical panoramic images/videos. Two methods are proposed: using PSNR in the Craster parabolic projection format or weighted PSNR calculation for ERP contents. In [305], Yu et al. proposed a FR method named S-PSNR. The spherical PSNR metric (S-PSNR) estimates the PSNR for uniformly sampled points on the sphere, but the number of sampled points in the official implementation is too small compared to the resolution of ODVs, resulting in massive information loss. S-PSNR has two variants, S-PSNR-NN and S-PSNR-I, which use nearest neighbor or bicubic interpolation for pixel sampling, respectively. In [306], Zhou et al. proposed a FR method called Weighted-to-Spherically-Uniform SSIM for evaluating the objective quality of panoramic video and images, where the structural similarity index is multiplied by different weights in different regions to ensure that spherical distortion corresponds linearly to plane distortion as observed by the user. In [307], Chen et al. presented a FR quality assessment method for omnidirectional video based on structural similarity in the spherical domain, taking into account the relationship between the structural similarity in the 2D plane and the sphere, which helps to handle the interference caused by projection in the assessment process. In [308], Ozcinar et al. 
proposed a FR quality metric based on PSNR that takes into consideration visual attention and projection distortions, with the objective of optimizing streaming of omnidirectional video. In [309], Meng et al. proposed a RR analytical model to connect the perceptual quality of compressed viewport videos with their spatial, temporal, and amplitude resolutions variables, using linearly weighted content features. Additionally, the model is extended to infer the overall video quality by weighing the saliency-aggregated qualities of salient viewports and the quality of non-salient areas.
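The per-pixel weighting used to compensate for projection distortion can be made concrete for the equirectangular projection (ERP). The sketch below uses the cosine-of-latitude row weight commonly associated with weighted-to-spherically-uniform PSNR; it is a minimal illustration rather than a reference implementation of [303], and the function names and the single-channel frame layout are assumptions.

```python
import math


def erp_row_weights(height):
    """Cosine-of-latitude weight for each row of an ERP frame."""
    return [math.cos((j + 0.5 - height / 2) * math.pi / height) for j in range(height)]


def ws_psnr(reference, distorted, max_value=255.0):
    """Weighted PSNR for single-channel ERP frames given as 2D lists."""
    weights = erp_row_weights(len(reference))
    weighted_error = 0.0
    weight_sum = 0.0
    for j, (ref_row, dist_row) in enumerate(zip(reference, distorted)):
        for r, d in zip(ref_row, dist_row):
            weighted_error += weights[j] * (r - d) ** 2
            weight_sum += weights[j]
    wmse = weighted_error / weight_sum
    if wmse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / wmse)


# 4x2 toy frames: the same error placed near a pole versus near the equator.
reference = [[100, 100], [100, 100], [100, 100], [100, 100]]
pole_error = [[110, 110], [100, 100], [100, 100], [100, 100]]
equator_error = [[100, 100], [110, 110], [100, 100], [100, 100]]
score_pole = ws_psnr(reference, pole_error)
score_equator = ws_psnr(reference, equator_error)
```

Rows near the poles are oversampled in ERP and therefore receive small weights, so an error there lowers the score less than the same error near the equator.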

In certain methods, human visual perceptual regularities and natural scene statistics are effectively utilized to enhance video quality assessment techniques. In [88], Xu et al. proposed two FR VQA methods for encoded omnidirectional video, taking into consideration human perception characteristics. One method weighs pixel distortion based on the distance of each pixel to the center of the front region, accounting for human preference in panoramic viewing, while the other method predicts viewing directions from the video content and allocates weights to pixel distortion accordingly. In [310], Gao et al. proposed a FR spatiotemporal modeling approach for evaluating the quality of omnidirectional videos. The approach involves constructing a spatiotemporal quality assessment unit that evaluates distortion at the eye fixation level, incorporating temporal variations to obtain smoothed distortion values. The paper also presents a solution for integrating existing spatial video quality metrics, as well as investigating cross-format omnidirectional video distortion measurement. In [311], Azevedo et al. proposed a FR approach for assessing the quality of omnidirectional videos. This approach uses viewports regularly sampled from ODV frames with low projection distortions to better capture the user experience, supports different ODV projection formats, and applies different spatio-temporal metrics combined with a model of human visual system’s temporal quality perception for computing the final quality score using a random forest regression trained on the VQA-ODV dataset. In [312], Zhou et al. proposed a NR algorithm called MultiFrequency Information and Local-Global Naturalness (MFILGN). The approach decomposes the projected equirectangular projection maps into wavelet subbands using discrete Haar wavelet transform, and measures multifrequency information using entropy intensities of low-frequency and high-frequency subbands. 
The natural scene statistics features are extracted from each viewport image to measure local naturalness. The support vector regression is used to train the quality evaluation model.
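The subband-entropy idea behind multifrequency features such as those in MFILGN [312] can be illustrated with a minimal sketch. The code below is not the authors' implementation: the function names, the single decomposition level, and the histogram-based Shannon entropy with a hypothetical `bin_width` are all assumptions made for illustration, and the entropy intensity actually used in [312] may be defined differently.

```python
import math
from collections import Counter


def haar2d_level1(img):
    """One level of a 2D Haar transform on a list-of-lists image.

    The image height and width are assumed to be even. Returns four
    subbands: approximation (low-frequency) and three detail
    (high-frequency) bands.
    """
    approx, top_bottom, left_right, diagonal = [], [], [], []
    for r in range(0, len(img), 2):
        rows = ([], [], [], [])
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2.0)  # local average
            rows[1].append((a + b - d - e) / 2.0)  # top minus bottom
            rows[2].append((a - b + d - e) / 2.0)  # left minus right
            rows[3].append((a - b - d + e) / 2.0)  # diagonal difference
        approx.append(rows[0])
        top_bottom.append(rows[1])
        left_right.append(rows[2])
        diagonal.append(rows[3])
    return approx, top_bottom, left_right, diagonal


def shannon_entropy(band, bin_width=1.0):
    """Shannon entropy (bits) of the coefficients quantized to bin_width."""
    values = [math.floor(v / bin_width) for row in band for v in row]
    counts = Counter(values)
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def subband_entropies(img):
    """Entropy of each level-1 Haar subband, usable as simple features."""
    return [shannon_entropy(band) for band in haar2d_level1(img)]
```

In a complete NR pipeline, such per-subband entropies would be concatenated with other features (for example, viewport-level natural scene statistics) and passed to a regressor; that stage is omitted here.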

4.4.2 Deep Learning-based Computing Techniques

In recent years, in addition to traditional visual computing techniques, metrics based on deep learning have emerged and demonstrated state-of-the-art performance in video quality assessment. These deep learning-based methods encompass both FR and NR approaches. In [313], Li et al. proposed an FR VQA approach that incorporates viewport proposal and saliency prediction as auxiliary tasks. The approach consists of two stages: the first uses a viewport proposal network to generate potential viewports, and the second uses a viewport quality network that rates the VQA score of each proposed viewport using predicted saliency maps. In [314], Xu et al. presented an FR approach using a viewport-based convolutional neural network (V-CNN) for VQA on $360^{\circ}$ videos. The V-CNN has a multi-task architecture with a viewport proposal network that handles camera motion detection and viewport proposal, and a viewport quality network that handles viewport saliency prediction and the main VQA task. In [315], Duan et al. developed an FR metric that predicts the distortion caused by stitching in VR content. The method incorporates a subnetwork for spatial attention and introduces a spatial regularization component. Utilizing saliency information is a widespread practice in VR VQA, and many VR saliency methods have been introduced [316, 317, 318, 319, 320, 321, 322, 323] to aid in predicting visual attention. These methods help identify the most relevant and visually significant regions within virtual environments, enhancing the overall VR experience and improving the accuracy of VQA tasks.
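The role of saliency in viewport-based VR VQA can be illustrated with a small pooling sketch. It assumes that each viewport already has a quality score and a non-negative saliency weight; the function name and the fallback to a plain mean are hypothetical choices, and the networks in [313, 314] learn this aggregation rather than applying a fixed formula.

```python
def saliency_weighted_score(viewport_scores, saliency_weights):
    """Pool per-viewport quality scores using saliency as weights.

    Falls back to the unweighted mean when all weights are zero.
    """
    if len(viewport_scores) != len(saliency_weights):
        raise ValueError("scores and weights must have the same length")
    if not viewport_scores:
        raise ValueError("at least one viewport is required")
    if any(w < 0 for w in saliency_weights):
        raise ValueError("saliency weights must be non-negative")
    total = sum(saliency_weights)
    if total == 0:
        return sum(viewport_scores) / len(viewport_scores)
    weighted = sum(s * w for s, w in zip(viewport_scores, saliency_weights))
    return weighted / total
```

With this pooling, a distortion in a viewport that attracts little attention contributes less to the final score than the same distortion in a highly salient viewport.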

There are also VR video quality assessment metrics developed in the NR manner. In [324], Yang et al. proposed an NR approach for predicting VR video quality, using an end-to-end 3D convolutional neural network that extracts spatiotemporal features. The score fusion strategy is designed based on the characteristics of VR video projection: local spatiotemporal features are captured from pre-processed VR video patches and combined to obtain the final quality score. In [325], Fei et al. proposed an NR two-step neural network model that leverages features from physiological psychology and cognitive neurology to capture the relationship between network parameters and perception in VR transmission for objective evaluation. In [326], Yang et al. proposed an NR approach for predicting VR video quality that combines spherical convolutional neural networks and non-local neural networks to extract spatiotemporal information from panoramic videos. In [327], Xu et al. proposed an NR model that includes a viewpoint detector to select viewports based on human visual system sensitivity, a viewport descriptor for feature extraction, and a spatial viewport graph to model mutual dependency among viewports. Graph convolutional networks are used for reasoning on the graph to obtain the global quality of the omnidirectional image, omitting the pseudo reconstruction step for simplicity and performance enhancement. In [328], Guo et al. proposed an NR omnidirectional video quality assessment approach based on generative adversarial networks (GANs), consisting of a reference video generator and a quality score predictor. To address the issue of varying reference image/video quality levels in existing GAN-based methods, a level loss is introduced, and the viewing direction of the omnidirectional video is incorporated in the quality and weight regression process. In [329], Zhu et al. proposed an NR approach called EyeQoE, which is inspired by advanced techniques in deep neural networks and uses a graph-based approach to model eye-based cues for video quality assessment. The method organizes fixations and saccades into a graph, where edges represent temporal relations and additional edges are added for content-dependent features. A graph convolution network is used to learn useful feature representations from the graph, which are then used to compute the quality of the video clip. In [330], Yang et al. proposed an NR approach called ProVQA for quality assessment of $360^{\circ}$ videos, taking into account the progressive paradigm of human perception. Three sub-nets are designed in ProVQA: the spherical perception aware quality prediction sub-net models spatial quality degradation based on the human spherical perception mechanism, the motion perception aware quality prediction sub-net incorporates motion contextual information for quality assessment, and the multi-frame temporal non-local sub-net aggregates multi-frame quality degradation to yield the final quality score. In [331], An et al. proposed a method that uses both a 2D-CNN and a 3D-CNN to extract video features in the temporal and spatial domains. The input video is divided into patches and processed through convolutional, excitation, pooling, and fully connected layers to obtain a score for the video.

4.5 Framerate & Frame Interpolation VQA

Altering the frame rate of a video can significantly influence its visual quality. Lower frame rates might introduce choppiness and reduced motion smoothness, while higher frame rates can enhance the viewing experience with improved clarity and realism. As a result, specific framerate VQA methods become essential to evaluate and ensure the perceptual quality of videos across different frame rates, helping content creators, streaming platforms, and viewers make informed decisions about frame rate selection to achieve the best visual experience. In [332, 333], Ma et al. proposed an FR rate and quality model that is analytically tractable and relies on content-dependent parameters. The quality model combines a spatial quality factor, which assesses the quality of decoded frames, with a temporal correction factor, which adjusts for the frame rate. In [334], Ou et al. explored the impact of spatial, temporal, and amplitude resolution on the perceptual quality of video and related reductions in frame rate to perceptual quality through subjective and objective analyses. In [335], Zhang et al. proposed the FR method FRQM, which evaluates the relationship between frame rate variations and perceptual video quality. FRQM utilizes temporal wavelet decomposition, subband combination, and spatiotemporal pooling to estimate the relative quality of low frame rate videos compared to higher frame rate versions. In [336], Madhusudana et al. proposed an objective VQA model called Space-Time GeneRalized Entropic Difference (GREED), which analyzes the statistics of spatial and temporal band-pass video coefficients using a generalized Gaussian distribution. GREED captures quality variations due to frame rate changes by calculating entropic differences across multiple temporal and spatial subbands. In [337], Madhusudana et al. focused on VQA for high frame rate videos with different frame rates and compression factors. They proposed an FR model that combines features from VMAF and GREED, offering improved efficiency in predicting frame rate dependent video quality. Lee et al. [338] developed an FR video quality predictor sensitive to spatial, temporal, or space-time subsampling combined with compression. The predictor utilizes space-time natural video statistics models to capture regularities in motion trajectories and disturbances caused by space-time distortions. In [339], Zheng et al. introduced FAVER, an NR VQA model tailored for high frame rate videos, which utilizes the temporal natural video statistics of bandpass filtered videos to capture and represent aspects of temporal video quality.
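The structure of a quality model built from a spatial quality factor and a temporal correction factor can be sketched as follows. The exact functional form used in [332, 333] is not reproduced here: the inverted-exponential correction and the parameter `c` (with its default value) are assumptions, chosen only because the factor equals 1 at the maximum frame rate and decreases smoothly as the frame rate drops. In a content-dependent model, `c` would be fitted per video.

```python
import math


def temporal_correction(frame_rate, max_frame_rate, c=7.0):
    """Assumed inverted-exponential correction in [0, 1].

    c is a hypothetical content-dependent parameter; larger values make
    quality saturate faster as the frame rate increases.
    """
    if not 0 <= frame_rate <= max_frame_rate:
        raise ValueError("frame_rate must lie in [0, max_frame_rate]")
    if c <= 0:
        raise ValueError("c must be positive")
    ratio = frame_rate / max_frame_rate
    return (1.0 - math.exp(-c * ratio)) / (1.0 - math.exp(-c))


def frame_rate_adjusted_quality(spatial_quality, frame_rate,
                                max_frame_rate, c=7.0):
    """Spatial quality of decoded frames scaled by the temporal factor."""
    return spatial_quality * temporal_correction(frame_rate, max_frame_rate, c)
```

Under this assumed form, halving the frame rate from 30 to 15 fps reduces the predicted quality only slightly for large `c`, while small `c` values model content (for example, fast motion) that is more sensitive to frame rate reduction.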

Video frame interpolation (VFI) results often show unique artifacts, which can lead to inconsistencies between existing quality metrics and human perception when assessing the interpolation outcomes. In [340], Yang et al. proposed an FR metric that quantifies interpolation artifacts, incorporates human visual factors, and provides a global quality measurement. The metric takes into account blocking artifacts and potential areas of quality degradation to overcome the limitations of other commonly used metrics. In [341], Danier et al. proposed FloLPIPS, an FR video quality metric for VFI based on LPIPS, which incorporates temporal distortion through optical flow comparison to enhance performance. In [342], Hou et al. proposed a dedicated FR perceptual quality metric that learns features directly from videos and considers spatio-temporal information using Swin Transformer blocks.

4.6 Audio-Visual VQA

With the increasing prevalence of mobile Internet, audio and video (A/V) are essential for everyday entertainment and social interactions. However, compression of A/V signals by service providers to reduce storage and transmission costs can result in distortions, negatively impacting end-users’ QoE. Therefore, AVQA is a significant and attention-worthy area of research.

Most previous research has primarily focused on single-modality signals, overlooking the joint impact of audio and video on consumers’ QoE. Some studies have started to address the objective AVQA problem, recognizing the importance of jointly assessing the audio-visual aspects of multimedia content to enhance user satisfaction. In [344, 345], researchers emphasized the significance of audiovisual quality and suggested that the overall audiovisual quality can be represented as a product of audio and video quality. In [97], Winkler et al. conducted subjective experiments to assess audiovisual, audio-only, and video-only quality. The study analyzed the impact of video and audio coding parameters on quality, explored the optimal balance between audio and video bit allocation under global bitrate constraints, and investigated models for the interactions between audio and video in terms of perceived audiovisual quality. In [101], Martinez et al. proposed an FR audio-visual quality metric. The framework introduced three models based on the findings of psychophysical experiments: the linear model, the weighted Minkowski model, and the power model. These models offer different approaches to quantify the overall audio-visual quality based on the audio and video components. In [346], Martinez et al. used the same three perceptual audio-visual models to combine video and audio no-reference metrics, and tested and evaluated the combined metrics. In [102], Martinez et al. explored combination models to predict overall audio-visual quality by integrating audio and video quality estimates. The study considers 7 video quality metrics (3 Full-Reference and 4 No-Reference) and 4 audio quality metrics (2 Full-Reference and 2 No-Reference), resulting in 18 Full-Reference and 24 No-Reference audio-visual combination metrics. In [7], four families of objective A/V quality prediction models were designed using a multimodal fusion strategy: the product of video and audio quality predictors, the fusion of video and audio quality predictors by SVR, A/V-QA models defined using 1D and 2D visual quality predictors, and deep neural families of A/V quality predictors. In [343], Cao et al. proposed an NR model that extracts audio features with a separable convolution network and visual features with a quality-aware ResNet-50, learns temporal information through a Bi-LSTM, and fuses the features using FC layers. In [347], Cao et al. proposed an objective model architecture based on attentional neural networks to consider both audio and video signals. The extended FR and NR models extract salient regions from video frames using an attention prediction model, utilize convolutional neural networks to extract short-time features, and employ gated recurrent unit (GRU) networks to model temporal relationships.
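The combination models named above (product, linear, weighted Minkowski, and power) can be written down in generic form. The sketch below uses textbook parameterizations; the parameter names and the exact forms fitted in [101, 102, 344, 345, 346] are assumptions here, and in practice all coefficients would be fitted to subjective audio-visual scores.

```python
def product_model(q_audio, q_video, scale=1.0, offset=0.0):
    """Overall quality as a scaled product of audio and video quality."""
    return offset + scale * q_audio * q_video


def linear_model(q_audio, q_video, w_audio, w_video, bias=0.0):
    """Weighted sum of audio and video quality."""
    return w_audio * q_audio + w_video * q_video + bias


def minkowski_model(q_audio, q_video, w_audio, w_video, p):
    """Weighted Minkowski combination with exponent p (p must be nonzero).

    Assumes non-negative quality scores and weights.
    """
    if p == 0:
        raise ValueError("p must be nonzero")
    return (w_audio * q_audio ** p + w_video * q_video ** p) ** (1.0 / p)


def power_model(q_audio, q_video, scale, p_audio, p_video, offset=0.0):
    """Product of audio and video quality raised to separate exponents."""
    return offset + scale * (q_audio ** p_audio) * (q_video ** p_video)
```

With `p = 1` the Minkowski model reduces to the linear model without bias, and with both exponents equal to 1 the power model reduces to the product model, so the four forms can be compared within one fitting procedure.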

Table 5: Overview of the objective video quality assessment for emerging topics

Applicable content	Type	Algorithm	Methodology	Extracted quality features	Quality fusion
Compressed VQA	FR	FePVQ [269]	spatio-temporal similarity	Motion, structure, and texture strength	Weighted feature similarity
	FR	Sun et al. [161]	spatio-temporal similarity	CNN features	CNN
	RR	Ma et al. [168]	spatio-temporal similarity	Energy and motion features	Histogram distance
	NR	Lin et al. [272]	spatio-temporal factors	QP, motion, bit allocation factors	Weighted average
		Zhu et al. [21]	spatio-temporal statistics	Frequency band features	CNN
		V-MEON [208]	DNN	CNN features	CNN
		SSTAM [274]	spatio-temporal features	Perceivable encoding artifacts	SVR
Streaming VQA	FR	Liu et al. [275]	quality of service factors	Client-side, video coding and CDN factors	Weighted average
		Rodríguez et al. [276]	quality of service factors	Video quality levels switching degradation factor	Weighted average
		Bentaleb et al. [278]	content and quality of service factors	Delay, stall, video quality and quality switch	Weighted average
		SQI [72]	content and quality of service factors	Video presentation quality, buffering, stalling	Weighted average
		Video ATLAS [279]	content and quality of service factors	Video presentation quality, buffering, memory	Weighted average
		Ghadiyaram et al. [281]	content and QoS factors, continuous time	Stalling, client buffering, video presentation	Wiener model and SVR
		ECT-QoE [74]	Expectation confirmation theory	Video quality, adaptation type and intensity	Random forest regression
		LSTM-QoE [282]	content and QoS factors, continuous time	Video quality, playback indicator, rebuffering	CNN
		BSQI [284]	content and quality of service factors	Video presentation quality, buffering, adaptation	Piecewise linear
	NR	Singh et al. [286]	DNN	CNN features	CNN
	NR	Li et al. [287]	DNN, continuous time	CNN features	CNN
3D VQA	FR	Yasakethu et al. [288]	monocular spatial similarity	PSNR, SSIM, VQM	Weighted average
		Wang et al. [83]	monocular spatial similarity	2D metrics, energy estimation	Binocular rivalry
		Galkandage et al. [291]	binocular spatial, frequency similarity	HVS features	Designed fusion function
		DeMo3D [292]	binocular spatio-temporal similarity	motion, depth, and spatial features	Designed fusion function
		SR-3DVQA [293]	binocular spatio-temporal similarity	gradient, edges of depth maps	Weighted layer pooling
		Galkandage et al. [294]	binocular spatio-temporal similarity	HVS features	Two-stage regression
	RR	Hewage et al. [295]	binocular spatial similarity	Edges of depth map	Weighted average
	RR	Yu et al. [296]	binocular spatio-temporal similarity	Statistical features	Designed fusion function
	NR	Chen et al. [297]	binocular spatio-temporal statistics	Texture and disparity statistical features	SVR
		Yang et al. [300]	binocular spatio-temporal statistics	Texture statistical features	SVR
		Biswas et al. [302]	binocular spatio-temporal statistics	Statistical features	Designed fusion function
VR VQA	FR	Sun et al. [303]	spatial similarity	PSNR weighted to sphere	Weighted average
		Zakharchenko et al. [304]	spatial similarity	Pixel errors weighted to sphere	Weighted average
		Ozcinar et al. [308]	spatio-temporal similarity	PSNR, VMAF weighted to sphere and saliency	Weighted average
		Xu et al. [88]	spatial similarity	PSNR, SSIM weighted to sphere and saliency	Weighted average
		Gao et al. [310]	spatio-temporal similarity	PSNR weighted to sphere and fixation	Weighted average
		V-CNN [314]	DNN	CNN features	CNN
		Duan et al. [315]	DNN	CNN features	CNN
	RR	Meng et al. [309]	spatio-temporal similarity	spatial, temporal and amplitude resolutions	Weighted average
	NR	MFILGN [312]	spatial statistics	Statistical features	SVR
		Fei et al. [325]	DNN	CNN features	CNN
		Yang et al. [326]	DNN	CNN features	CNN
		Xu et al. [327]	DNN	CNN features	CNN
		Zhu et al. [329]	DNN	CNN features	CNN
Framerate VQA	FR	Ou et al. [334]	spatio-temporal similarity	spatial, temporal and amplitude resolutions	Weighted average
		FRQM [335]	spatio-temporal similarity	Temporal wavelet decomposition	Designed fusion function
		GREED [336]	spatio-temporal statistics similarity	Statistical features	SVR
		Lee et al. [338]	spatio-temporal statistics similarity	Statistical features	SVR
	NR	FAVER [339]	spatio-temporal statistics	Statistical features	SVR
Audio-Visual VQA	FR	Martinez et al. [101]	combination model	SESQA and VQM	Multiple fusion methods
		Martinez et al. [102]	combination model	Audio and video quality estimates	Multiple fusion methods
		Min et al. [7]	combination model	Audio and video quality estimates	Multiple fusion methods
	NR	Cao et al. [343]	DNN	CNN features	CNN
	NR	Cao et al. [347]	DNN	CNN features	CNN
HDR VQA	FR	HDR-VQM [348]	spatio-temporal similarity	Subband errors	Weighted average
	FR	HDRMAX [349]	spatio-temporal similarity	Nonlinear features	SVR
	NR	HDR-ChipQA [350]	spatio-temporal statistics	Extended BRISQUE and ChipQA	SVR
Screen and Game VQA	FR	MS-RSDS [111]	spatio-temporal similarity	Structural features	Designed fusion function
		HSFM [351]	spatio-temporal similarity	Screen and natural statistical features	Designed fusion function
		SGFTM [110]	spatio-temporal similarity	Gabor features	Designed fusion function
	NR	GAMIVAL [352]	spatio-temporal statistics	Statistical and CNN features	SVR
	NR	GAME-VQP [353]	spatio-temporal statistics	Statistical and CNN features	SVR

4.7 HDR, WCG, iTMO and TMO VQA

Due to rapid advancements in video acquisition, computational imaging, and display technologies, there is a growing interest in high dynamic range videos. HDR videos exhibit differences from SDR videos, which in turn pose new challenges for HDR VQA models.

Several studies have explored the factors that can influence the quality of HDR content. In [105], Narwaria et al. addressed key challenges in HDR video quality measurement, discussed the practical aspects that make it challenging, and presented recent efforts in developing HDR video datasets subjectively annotated for visual quality. Shang et al. [354] explored the impact of live streaming challenges, such as resolution and frame rate crossover, intra-frame pulsing defects, and complex rate-control modes, on the quality of HDR content. In [109], Athar et al. analyzed the effects of compression on UHD-HDR-WCG videos, aiming to understand how various compression techniques and settings influence the visual quality and overall user experience when viewing such videos.

HDR videos possess unique characteristics that differ from SDR videos, necessitating specialized techniques for HDR VQA models. In [348], Narwaria et al. proposed an FR HDR video quality measure that converts input luminance to perceived luminance, analyzes the impact of distortions using frequency and orientation subbands, and pools errors through spatio-temporal processing of the subband errors. In [349], Ebenezer et al. proposed the HDRMAX feature set, which enhances VQA algorithms designed for SDR videos by making them more sensitive to distortions in HDR videos and better able to capture distortions in the brightest and darkest parts of videos. A nonlinear processing step is designed to derive a set of nonlinear HDRMAX features for both FR and NR VQA models. In [350], Ebenezer et al. proposed an NR HDR VQA model. The approach applies a local expansive nonlinearity as a preprocessing step to emphasize distortions at the higher and lower ends of the luma range, which allows additional quality-aware features to be computed and improves the prediction of HDR content quality using distortion-sensitive natural video statistics features. In [355], Ebenezer et al. designed an HDR NR VQA algorithm. The method utilizes features that are relevant to both SDR and HDR video quality, as well as features related to motion perception, namely NIQE features, PatchMAX features, HDRMAX features, and space-time features.
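The effect of a local expansive nonlinearity can be illustrated with a short sketch. The specific function below (a sign-preserving exponential applied to luma values rescaled to $[-1, 1]$ within a local window) and the parameter `delta` are assumptions made for illustration; the exact nonlinearity and parameter values used in [349, 350] should be taken from those papers.

```python
import math


def rescale_to_unit_range(window):
    """Linearly map the luma values of a local window to [-1, 1].

    A constant window carries no local contrast and maps to zeros.
    """
    lo, hi = min(window), max(window)
    if hi == lo:
        return [0.0 for _ in window]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in window]


def expansive_nonlinearity(x, delta=4.0):
    """Assumed expansive mapping: stretches values near -1 and +1.

    delta is a hypothetical strength parameter; the function is odd and
    its slope grows with |x|, so differences at the extremes of the local
    luma range are amplified relative to mid-range differences.
    """
    return math.copysign(math.exp(delta * abs(x)) - 1.0, x)


def expand_window(window, delta=4.0):
    """Rescale a window and apply the nonlinearity to every sample."""
    return [expansive_nonlinearity(v, delta) for v in rescale_to_unit_range(window)]
```

Quality-aware statistics computed on the expanded values then respond more strongly to distortions in the brightest and darkest regions than the same statistics computed on the original luma.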

In the realm of HDR VQA, several works focus on VQA for Tone Mapping Operators (TMOs), Inverse Tone Mapping Operators (ITMOs), and wide color gamut (WCG). In [356], tone-mapped HDR video shown on a tablet and on an LCD display was compared with the same HDR video shown simultaneously on an HDR display. In [357], Eilertsen et al. provided an overview of various approaches for evaluating tone mapping operators for HDR video, including experimental setups, input data selection, tone mapping operator choices, and the significance of parameter adjustment for fair comparisons. In [107], Yeganeh et al. proposed a perceptual quality measure to compare different tone mapping operators. They presented an FR quality assessment model for tone-mapped videos that considers structural fidelity, statistical naturalness, and memory effect. Mantiuk et al. [358] developed a model to measure the visual color difference between test and reference HDR images. The model was designed to mimic the anatomy of the visual system to improve accuracy in assessing HDR color differences.

4.8 Screen and Game VQA

The growing adoption of remote office and cloud collaboration scenarios has led to an increased interest in screen content videos (SCVs) and their processing. SCVs exhibit distinct characteristics from natural scene videos and have become a focus of attention among researchers. In [111], Li et al. proposed an FR screen content VQA model. The approach measures the relative standard deviation similarity between reference and distorted contents using frame differences to capture accurate spatiotemporal distortions, and incorporates a multiscale strategy to enhance its performance. In [359], Li et al. proposed an NR VQA model that utilizes a multi-scale approach to extract several intra-frame features and temporal features and employs a support vector regressor for quality score prediction. In [351], Zeng et al. proposed an FR screen content VQA model. The model utilizes 3D-LOG and 3D-NSS filters to extract spatiotemporal features separately from reference and distorted SCV sequences, then computes similarities and generates quality scores for both screen and natural scenes. An adaptive fusion scheme that combines the screen and natural quality scores through local video activity is developed to arrive at the final VQA score for the distorted video. In [110], Cheng et al. proposed an FR VQA model for screen content videos based on spatiotemporal Gabor features, which leverages a 3D-Gabor filter to simulate the human visual system’s perception of videos and is particularly sensitive to edge and motion information. In [360], Motamednia et al. devised an FR objective quality assessment metric for screen content videos. The model utilizes the horizontal and vertical subbands of the wavelet transform to characterize the structures present in the video. In [194], Ying et al. proposed an NR VQA model for telepresence videos. The model uses a multi-modal learning approach with separate pathways for visual and audio quality predictions. Frame-level, patch-level, clip-level, and audio-level features are extracted and fused to predict the quality of telepresence videos.

In recent years, the video game industry has experienced significant growth, leading to a substantial increase in gaming videos on major platforms. Despite this surge, there has been limited research on automatically predicting the quality of gaming videos. In [361], Xian et al. proposed an NR VQA method for computer graphics (CG) animation videos. The method extracts spatiotemporal features and visual perception information from the videos, which are then fed into an artificial neural network-based VQA model. Additionally, a convolutional neural network is applied to the VQA model to generate adaptive weight factors for the input features based on the different types of CG content in the videos. In [362], Barman et al. investigated the performance of VQA metrics on gaming videos. The study considers eight widely used VQA metrics and evaluates their performance on a dataset of reference and compressed gaming videos. In [352], Saha et al. presented the outcomes and benchmark results of various FR and NR VQA methods on a large-scale subjective study of mobile cloud gaming. In [363], Chen et al. proposed an NR VQA model for gaming content. The model combines spatial and temporal gaming distorted scene statistics models, a neural noise model, and deep semantic features. In [10, 353], Yu et al. proposed an NR VQA model for UGC gaming content. The model includes feature extraction, regression modeling, and score fusion modules. The feature extraction module computes low-level NSS features and high-level features from a pre-trained CNN model using training and test videos. Two separate SVR models are trained on the NSS and CNN features, respectively, as they represent different processing stages. The final video quality predictions are obtained by fusing the responses of these two models.

5 Objective Video Quality Assessment Model Evaluation

5.1 Evaluation Criteria

With the advancement of information technology research, many objective quality assessment models have been proposed in recent years, so it is important to consider how to evaluate the performance of an objective model. Because subjective quality assessment is reliable and accurate, its results are generally used as the verification criteria and optimization targets for objective quality evaluation methods. As suggested by the Video Quality Experts Group (VQEG) [364], the performance of an objective model can be evaluated in terms of accuracy, monotonicity, and consistency. Using $o_{i}$ and $s_{i}$ to represent a subjective opinion score and an objective predicted score, respectively, where $i=1,\ldots,N$ indicates the video index and $N$ denotes the number of videos, we first use a five-parameter logistic function to fit the quality scores:

$$q(s)=\beta_{1}\left(\frac{1}{2}-\frac{1}{1+e^{\beta_{2}(s-\beta_{3})}}\right)+\beta_{4}s+\beta_{5}, \tag{1}$$

where $s$ and $q(s)$ are the objective score and the best-fitting quality, respectively, and $\beta_{j}$ $(j=1,2,3,4,5)$ are the parameters to be fitted during the evaluation. Five traditional evaluation metrics are then usually adopted to measure the consistency between the ground-truth (GT) subjective ratings and the fitted quality scores:

• Spearman Rank-order Correlation Coefficient (SRCC)

$$\text{SRCC}=1-\frac{6\sum_{i=1}^{N}d_{i}^{2}}{N(N^{2}-1)}, \tag{2}$$

where $d_{i}$ indicates the difference between the ranks of the $i$-th video in the subjective and objective scores, and $N$ denotes the number of test videos.

• Kendall Rank-order Correlation Coefficient (KRCC)

$$\text{KRCC}=\frac{N_{c}-N_{d}}{\frac{1}{2}N(N-1)}, \tag{3}$$

where $N_{c}$ indicates the number of concordant pairs and $N_{d}$ denotes the number of discordant pairs.

• Pearson Linear Correlation Coefficient (PLCC)

$$\text{PLCC}=\frac{\sum_{i=1}^{N}(q_{i}-\bar{q})(o_{i}-\bar{o})}{\sqrt{\sum_{i=1}^{N}(q_{i}-\bar{q})^{2}\cdot\sum_{i=1}^{N}(o_{i}-\bar{o})^{2}}}, \tag{4}$$

where $o_{i}$ and $q_{i}$ represent the subjective opinion score and the nonlinear-fitted objective score for the $i$-th video, and $\bar{o}$ and $\bar{q}$ indicate the mean values of all $o_{i}$ and $q_{i}$ scores.

• Root Mean Square Error (RMSE)

$$\text{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(q_{i}-o_{i})^{2}}. \tag{5}$$

• Mean Absolute Error (MAE)

$$\text{MAE}=\frac{1}{N}\sum_{i=1}^{N}|q_{i}-o_{i}|. \tag{6}$$

Different statistical indexes reflect different aspects of the performance of a VQA model. Among these traditional evaluation metrics, SRCC and KRCC are rank correlations between the subjective quality ratings and the objective predicted scores and indicate prediction monotonicity, PLCC measures the linear correlation between the subjective ratings and the fitted objective scores, and RMSE and MAE compute the error between them, which indicates prediction accuracy. Higher SRCC, KRCC, and PLCC values (closer to 1) and lower RMSE and MAE values (closer to 0) indicate better performance.

Table 6: Performance comparison of full-reference and reduced-reference video quality assessment algorithms on the LIVE VQA [48] database.

Type	Metrics	SRCC					PLCC
Type	Metrics	Wireless	IP	H.264	MPEG-2	All	Wireless	IP	H.264	MPEG-2	All
FR	PSNR	0.7381	0.6000	0.7143	0.6327	0.6958	0.7274	0.6395	0.7359	0.6545	0.7499
	SSIM [120]	0.7381	0.7751	0.6905	0.7846	0.7211	0.7969	0.8269	0.7110	0.7849	0.7883
	VIF [122]	0.7143	0.6000	0.5476	0.7319	0.6861	0.7473	0.6925	0.6983	0.7504	0.7601
	STMAD [137]	0.8257	0.7721	0.9323	0.8733	0.8301	0.8887	0.8956	0.9209	0.8992	0.8774
	ViS3 [54]	0.8257	0.7712	0.7657	0.7962	0.8156	0.8597	0.8576	0.7809	0.7650	0.8251
	MOVIE [132]	0.8113	0.7154	0.7644	0.7821	0.7895	0.8392	0.7612	0.7902	0.7578	0.8112
	V-BLIINDS [179]	0.8462	0.7829	0.8590	0.9371	0.8323	0.9357	0.9291	0.9032	0.8757	0.8433
	SACONVA [365]	0.8504	0.8018	0.9168	0.8614	0.8569	0.8455	0.8280	0.9116	0.8778	0.8714
	DeepQA [366]	0.8290	0.7120	0.8600	0.8940	0.8678	0.8070	0.8790	0.8820	0.8830	0.8692
	DeepVQA [155]	0.8674	0.8820	0.9200	0.9729	0.9152	0.8979	0.8937	0.9421	0.9443	0.8952
RR	TRRED [170]	0.7765	0.7513	0.8189	0.5879	0.7802	0.7726	0.7619	0.8324	0.5998	0.7743
	SRRED [170]	0.7925	0.7624	0.7542	0.7249	0.7592	0.8067	0.8033	0.7462	0.7281	0.7764
	STRRED [170]	0.7857	0.7722	0.8193	0.7193	0.8007	0.8039	0.8020	0.8228	0.7467	0.8062

Table 7: Performance comparison of no-reference video quality assessment algorithms on the KoNViD-1k [59], LIVE-VQC [60], and YouTube-UGC [61] databases. The top half: performances of IQA metrics; the bottom half: performances of VQA metrics.

Metrics	KoNViD-1k [59]			LIVE-VQC [60]			YouTube-UGC [61]			All-Combined
Metrics	SRCC	PLCC	RMSE	SRCC	PLCC	RMSE	SRCC	PLCC	RMSE	SRCC	PLCC	RMSE
NIQE [233]	0.5417	0.5530	0.5336	0.5957	0.6286	13.110	0.2379	0.2776	0.6174	0.4622	0.4773	0.6112
BRISQUE [232]	0.6567	0.6576	0.4813	0.5925	0.6380	13.100	0.3820	0.3952	0.5919	0.5695	0.5861	0.5617
GM-LOG [367]	0.6578	0.6636	0.4818	0.5881	0.6212	13.223	0.3678	0.3920	0.5896	0.5650	0.5942	0.5588
HIGRADE [368]	0.7206	0.7269	0.4391	0.6103	0.6332	13.027	0.7376	0.7216	0.4471	0.7398	0.7368	0.4674
FRIQUEE [369]	0.7472	0.7482	0.4252	0.6579	0.7000	12.198	0.7652	0.7571	0.4169	0.7568	0.7550	0.4549
CORNIA [370]	0.7169	0.7135	0.4486	0.6719	0.7183	11.832	0.5972	0.6057	0.5136	0.6764	0.6974	0.4946
HOSA [371]	0.7654	0.7664	0.4142	0.6873	0.7414	11.353	0.6025	0.6047	0.5132	0.6957	0.7082	0.4893
KonCept512 [237]	0.7349	0.7489	0.4260	0.6645	0.7278	11.626	0.5872	0.5940	0.5135	0.6608	0.6763	0.5091
PaQ-2-PiQ [372]	0.6130	0.6014	0.5148	0.6436	0.6683	12.619	0.2658	0.2935	0.6153	0.4727	0.4828	0.6081
V-BLIINDS [179]	0.7101	0.7037	0.4595	0.6939	0.7178	11.765	0.5590	0.5551	0.5356	0.6545	0.6599	0.5200
TLVQM [182]	0.7729	0.7688	0.4102	0.7988	0.8025	10.145	0.6693	0.6590	0.4849	0.7271	0.7342	0.4705
V-MEON [208]	0.1118	0.1958	0.6322	0.4024	0.4088	15.524	0.0634	0.1100	0.6304	0.2578	0.2594	0.6657
VSFA [185]	0.7728	0.7754	0.4205	0.6978	0.7426	11.649	-	-	-	-	-	-
MDVSFA [373]	0.7812	0.7856	-	0.7382	0.7728	-	-	-	-	-	-	-
VIDEVAL [183]	0.7832	0.7803	0.4026	0.7522	0.7514	11.100	0.7787	0.7733	0.4049	0.7960	0.7939	0.4268
RAPIQUE [374]	0.8031	0.8175	0.3623	0.7548	0.7863	10.518	0.7591	0.7684	0.4060	0.8070	0.8229	0.3968
PVQ [62]	0.791	0.795	-	0.770	0.807	-	-	-	-	-	-	-
Li et al. [196]	0.836	0.834	-	-	-	-	0.831	0.819	-	-	-	-
SimpleVQA [212]	0.856	0.860	-	-	-	-	0.847	0.856	-	-	-	-
FastVQA [220]	0.891	0.892	-	0.849	0.865	-	0.855	0.852	-	0.865	0.869	-

5.2 Performance Comparison

We compare the performance of the surveyed VQA methods in this subsection. Since not all reviewed algorithms are publicly available, for a fair comparison, we take the performance reported in the original papers.

Table 6 demonstrates the performance of FR and RR video quality assessment algorithms on the LIVE VQA [48] database. It can be observed that general-purpose FR-IQA measures such as PSNR, SSIM [120], VIF [122], etc., perform worse than general-purpose FR-VQA metrics such as STMAD [137], ViS3 [54], MOVIE [132], which demonstrates that hand-crafted temporal features are useful for the VQA task. Moreover, deep learning-based VQA methods such as DeepVQA [155] achieve better performance compared to traditional models, which demonstrates the effectiveness of using DNN in the VQA task.

Table 7 reports the performance of NR video quality assessment algorithms on three databases: KoNViD-1k [59], LIVE-VQC [60], and YouTube-UGC [61]. It can be observed that traditional NR-IQA measures such as NIQE [233] and BRISQUE [232] perform worse than deep NR-IQA models such as KonCept512 [237] and PaQ-2-PiQ [372]. Moreover, VQA models perform better than IQA models on the NR VQA task, which further demonstrates the importance and necessity of developing dedicated VQA algorithms.
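As a rough illustration of how such comparisons are typically computed, the sketch below evaluates predicted quality scores against subjective scores using three criteria that are common in the VQA literature: Spearman rank-order correlation (SRCC), Pearson linear correlation (PLCC), and root mean squared error (RMSE). The function names and the toy scores are hypothetical and are not taken from any surveyed paper. Many papers also fit a nonlinear mapping between predictions and subjective scores before computing PLCC and RMSE; that step is omitted here for brevity.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def srcc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def rmse(x, y):
    """Root mean squared error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical subjective scores (MOS) and model predictions for five videos.
mos = [1.8, 2.5, 3.1, 3.9, 4.6]
predicted = [2.0, 2.4, 3.5, 3.6, 4.9]

print("SRCC:", srcc(mos, predicted))
print("PLCC:", plcc(mos, predicted))
print("RMSE:", rmse(mos, predicted))
```

Higher SRCC and PLCC and lower RMSE indicate closer agreement with subjective judgments. Note that RMSE depends on the rating scale of each database, so it is not directly comparable across databases.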

6 Future Research Directions

Though significant advancements have been achieved in the field of video quality assessment in recent years, there remain unresolved challenges and promising research directions. In this section, we present an overview of these promising research directions.

6.1 Human Perception Mechanism of Video Quality Assessment

Human perception is a complex system that has been studied extensively [375, 376, 377, 378, 379, 380, 381]. Due to the evolution of multimedia systems, many new capture, compression, transmission, and display techniques have been developed, which may have different influences on human visual perception [382, 20, 383, 16, 384]. It is necessary to study human perception in these new media systems and to conduct corresponding subjective quality assessment research. Moreover, vision science-based models have been dominant methods in IQA and VQA for many years, even with the revolution of deep learning [120, 149, 352, 385], and will continue to play a crucial role in future video quality assessment research since they can provide robust and reliable results through a comprehensive understanding of human visual perception. Studying and integrating perception-based VQA models can enhance the interpretability and robustness of current VQA systems.

6.2 Large Multi-modality Models for Video Quality Assessment

Large multi-modality models (LMMs) [386, 387] have demonstrated excellent performance in various vision-language tasks, including image captioning, visual question answering, and cross-modality grounding, as well as pure vision tasks such as image classification and object detection. Some studies [388, 389, 390, 391] have applied LMMs in the field of image quality assessment. For instance, Wu et al. created Q-Bench [391], a benchmark for evaluating the low-level visual perception ability of LMMs. They further introduced two LMMs, Q-Instruct [390] and Q-Align [388], which were respectively fine-tuned on Q-Pathway, a low-level visual instruction dataset, and on existing I/VQA datasets. While these models achieved remarkable performance on image quality assessment and image quality description, their performance in VQA still lags behind state-of-the-art methods. The primary reason is that current LMM-based quality assessment methods are tailored for images and overlook the distinctive characteristics of video content, such as various temporal distortions. Therefore, it is necessary to design an LMM-based video quality assessment model.

6.3 Quality Assessment of Emerging Video Media

Emerging media, such as VR/AR/MR, HFR, HDR, and gaming, are becoming increasingly important in multimedia, which brings new challenges and opportunities for VQA research [4, 392, 393, 394, 395, 396, 20, 397, 398, 383]. Moreover, the advancement of communication systems, such as 5G/6G and semantic communication [399], also promotes new video applications. Thus, corresponding specific VQA systems are also required. Specifically, multi-modal VQA is an important emerging topic, especially for immersive media. For extended reality (XR), more multi-modal [400] quality assessment datasets and models are needed, which are not limited to the visual and auditory modalities but also include other modalities, such as olfactory, gustatory, and tactile perception. It is necessary to consider the immersive and interactive nature of XR content in VQA [8, 401], for example spatial audio formats such as higher-order ambisonics.

6.4 Quality Assessment of AIGC Videos

AI-generated content (AIGC) has achieved significant progress recently [402, 403, 404]. With the advancement of text-to-image [405, 403] and text-to-video [406, 407] techniques, AI-based image and video generation has been applied to various fields. Some studies have investigated the unique distortions in AI-generated images (AIGIs) and conducted IQA research [408, 409, 410]. Further exploration of quality assessment for AI-generated videos (AIGVs) can help control and improve the quality of AIGVs, which is a new trend for future VQA research.

6.5 Quality Assessment of Volumetric Videos

Volumetric video has garnered increasing research interest since it can provide users with immersive and realistic experiences by representing the complete volume of 3D content. Though there are numerous quality assessment studies for static 3D content [411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421], there is a scarcity of studies [422, 423, 424, 425] on volumetric video, a format representing dynamic 3D content. Zerman et al. [422] collected eight volumetric video sequences and investigated the subjective perception differences among different compression methods, including Draco, geometry-based point cloud compression (G-PCC) [426], and video-based point cloud compression (V-PCC) [426]. Based on this database, Fan et al. [425] introduced a multi-view learning method for NR volumetric video quality assessment, utilizing a 3D-CNN to extract features from multi-view projected video sequences.

6.6 Green Learning for Video Quality Assessment

Green learning for VQA is a promising study area because of its low carbon footprint, lightweight models, low computational complexity, and logical transparency [427]. Existing VQA models are typically based on DNNs, characterized by large model sizes and high computational complexity. As a result, they are difficult to deploy on edge devices or in real-time processing systems. Mei et al. [428] have developed a lightweight NR VQA model called GreenBVQA, comprising four processing modules: video data cropping, unsupervised representation generation, supervised feature selection, and MOS regression and ensembles. With the increasing demand for lightweight VQA models, green learning for VQA is becoming more and more important.

7 Summary

In this survey, we perform an extensive review of perceptual video quality assessment research. Subjective video quality assessment methodologies and databases are first reviewed. Then full-reference, reduced-reference and no-reference objective video quality assessment metrics are summarized and analyzed in sequence. Emerging topics in the realm of objective video quality assessment for compressed videos, streaming videos, stereoscopic videos, VR/AR videos, HFR videos, audio-visual videos, HDR videos, screen and game videos, are also reviewed. Finally, we evaluate and compare the performance of many objective video quality assessment models. This survey provides a systematic overview of classical and recent progress in the VQA realm, which helps researchers in related areas quickly access the progress, and find solutions and trends in their study.

\Supplements

Appendix A.

References

[1] Sandvine’s 2023 global internet phenomena report shows 24% jump in video traffic, with Netflix volume overtaking YouTube. https://www.sandvine.com/press-releases/sandvines-2023-global-internet-phenomena-report-shows-24-jump-in-video-traffic-with-netflix-volume-overtaking-youtube, 2023. [Accessed 2023-09-23]. [Online].
[2] Saha A, Pentapati S K, Shang Z, Pahwa R, Chen B, Gedik H E, Mishra S, Bovik A C. Perceptual video quality assessment: The journey continues! Frontiers in Signal Processing, 2023. 3: 1193523
[3] Zhai G, Min X. Perceptual image quality assessment: a survey. Science China Information Sciences, 2020. 63: 1–52
[4] Duan H, Zhai G, Yang X, Li D, Zhu W. Ivqad 2017: An immersive video quality assessment database. In: Proceedings of the IEEE International Conference on Systems, Signals and Image Processing (IWSSIP), 2017 1–5
[5] Chen M J, Kwon D K, Bovik A C. Study of subject agreement on stereoscopic video quality. In: Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation, 2012 173–176
[6] Nasiri R M, Wang J, Rehman A, Wang S, Wang Z. Perceptual quality assessment of high frame rate video. In: Proceedings of the IEEE International Workshop on Multimedia Signal Processing (MMSP), 2015 1–6
[7] Min X, Zhai G, Zhou J, Farias M C, Bovik A C. Study of subjective and objective quality assessment of audio-visual signals. IEEE Transactions on Image Processing (TIP), 2020. 29: 6054–6068
[8] Zhu X, Duan H, Cao Y, Zhu Y, Zhu Y, Liu J, Chen L, Min X, Zhai G. Perceptual quality assessment of omnidirectional audio-visual signals. arXiv preprint arXiv:2307.10813, 2023
[9] Shang Z, Ebenezer J P, Bovik A C, Wu Y, Wei H, Sethuraman S. Subjective assessment of high dynamic range videos under different ambient conditions. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2022 786–790
[10] Yu X, Ying Z, Birkbeck N, Wang Y, Adsumilli B, Bovik A C. Subjective and objective analysis of streamed gaming videos. IEEE Transactions on Games, 2023
[11] Wang Z, Bovik A C. Mean squared error: Love it or leave it? a new look at signal fidelity measures. IEEE signal processing magazine, 2009. 26: 98–117
[12] Wang Z, Bovik A C. Reduced-and no-reference image quality assessment. IEEE Signal Processing Magazine, 2011. 28: 29–40
[13] Lin W, Kuo C C J. Perceptual visual quality metrics: A survey. Journal of Visual Communication and Image Representation, 2011. 22: 297–312
[14] Mallat S G. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1989. 11: 674–693
[15] Hyvärinen A, Oja E. Independent component analysis: algorithms and applications. Neural Networks, 2000. 13: 411–430
[16] Duan H, Shen W, Min X, Tian Y, Jung J H, Yang X, Zhai G. Develop then rival: A human vision-inspired framework for superimposed image decomposition. IEEE Transactions on Multimedia (TMM), 2022
[17] Wu J, Shi G, Lin W, Liu A, Qi F. Just noticeable difference estimation for images with free-energy principle. IEEE Transactions on Multimedia (TMM), 2013. 15: 1705–1710
[18] Wu J, Li L, Dong W, Shi G, Lin W, Kuo C C J. Enhanced just noticeable difference model for images with pattern complexity. IEEE Transactions on Image Processing (TIP), 2017. 26: 2682–2693
[19] Liu H, Heynderickx I. Visual attention in objective image quality assessment: Based on eye-tracking data. IEEE transactions on Circuits and Systems for Video Technology (TCSVT), 2011. 21: 971–982
[20] Duan H, Min X, Zhu Y, Zhai G, Yang X, Le Callet P. Confusing image quality assessment: Toward better augmented reality experience. IEEE Transactions on Image Processing (TIP), 2022. 31: 7206–7221
[21] Zhu K, Li C, Asari V, Saupe D. No-reference video quality assessment based on artifact measurement and statistical analysis. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2014. 25: 533–546
[22] Gu K, Liu M, Zhai G, Yang X, Zhang W. Quality assessment considering viewing distance and image resolution. IEEE Transactions on Broadcasting (TBC), 2015. 61: 520–531
[23] Zhu Y, Zhai G, Gu K, Che Z. Closing the gap: Visual quality assessment considering viewing conditions. In: Proceedings of the International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2016 1–6
[24] Duan H, Guo L, Sun W, Min X, Chen L, Zhai G. Augmented reality image quality assessment based on visual confusion theory. In: Proceedings of the IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB). IEEE, 2022 1–6
[25] Moorthy A K, Bovik A C. Visual quality assessment algorithms: what does the future hold? Multimedia Tools and Applications, 2011. 51: 675–696
[26] Chandler D M. Seven challenges in image quality assessment: past, present, and future research. International Scholarly Research Notices, 2013. 2013
[27] Chikkerur S, Sundaram V, Reisslein M, Karam L J. Objective video quality assessment methods: A classification, review, and performance comparison. IEEE Transactions on Broadcasting (TBC), 2011. 57: 165–182
[28] Shahid M, Rossholm A, Lövström B, Zepernick H J. No-reference image and video quality assessment: a classification and review of recent approaches. EURASIP Journal on image and Video Processing, 2014. 2014: 1–32
[29] Chen Y, Wu K, Zhang Q. From qos to qoe: A tutorial on video quality assessment. IEEE Communications Surveys & Tutorials, 2014. 17: 1126–1165
[30] Fan Q, Luo W, Xia Y, Li G, He D. Metrics and methods of video quality assessment: a brief review. Multimedia Tools and Applications, 2019. 78: 31019–31033
[31] Li D, Jiang T, Jiang M. Recent advances and challenges in video quality assessment. ZTE Communications, 2019. 17: 3–11
[32] Zhou W, Min X, Li H, Jiang Q. A brief survey on adaptive video streaming quality assessment. Journal of Visual Communication and Image Representation, 2022. 86: 103526
[33] Quan Y, Duan H, Zhan Z, Shen Y, Lin R, Liu T, Zhang T, Wu J, Huang J, Zhai G, et al. Evaluation of the glaucomatous macular damage by chromatic pupillometry. Ophthalmology and Therapy, 2023. 1–24
[34] Quan Y, Duan H, Zhan Z, Shen Y, Lin R, Liu T, Zhang T, Wu J, Huang J, Zhai G, et al. Binocular head-mounted chromatic pupillometry can detect structural and functional loss in glaucoma. Frontiers in Neuroscience, 2023. 17
[35] Chen Q, Min X, Duan H, Zhu Y, Zhai G. Muiqa: Image quality assessment database and algorithm for medical ultrasound images. In: Proceedings of the IEEE International Conference on Image Processing (ICIP). IEEE, 2021 2958–2962
[36] Chen Q, Liu F, Duan H, Wang Y, Min X, Zhou Y, Zhai G. Mriqa: Subjective method and objective model for magnetic resonance image quality assessment. In: Proceedings of the IEEE International Conference on Visual Communications and Image Processing (VCIP). IEEE, 2022 1–5
[37] Wu S, Duan H, Min X, Tu D, Zhai G. Accurate compensation makes the world more clear for the visually impaired. In: Proceedings of the IEEE International Conference on Image Processing (ICIP). IEEE, 2021 604–608
[38] Duan H, Shen W, Min X, Tu D, Teng L, Wang J, Zhai G. Masked autoencoders as image processors. arXiv preprint arXiv:2303.17316, 2023
[39] Li Y, Qian Q, Duan H, Min X, Xu Y, Jiang X. Boosting power line inspection in bad weather: Removing weather noise with channel-spatial attention-based unet. Multimedia Tools and Applications, 2023. 1–17
[40] Hu M, Zhai G, Li D, Fan Y, Duan H, Zhu W, Yang X. Combination of near-infrared and thermal imaging techniques for the remote and simultaneous measurements of breathing and heart rates under sleep situation. PloS one, 2018. 13: e0190466
[41] Hu M, Zhai G, Li D, Fan Y, Duan H, Zhu W, Yang X. Dual-mode imaging system for non-contact heart rate estimation during night. In: Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2017 97–102
[42] Series B. Methodology for the subjective assessment of the quality of television pictures. Recommendation ITU-R BT, 2012. 500–13
[43] ITU. Subjective evaluation of media quality using a crowdsourcing approach. Recommendation ITU-T, 2018
[44] Amazon mechanical turk. https://www.mturk.com/, 2023. [Accessed 2023-09-23]. [Online].
[45] Microworkers. https://microworkers.com, 2023. [Accessed 2023-09-23]. [Online].
[46] Crowdflower. https://www.crowdflower.com/, 2023. [Accessed 2023-09-23]. [Online].
[47] Crowdee. https://crowdee.de/, 2023. [Accessed 2023-09-23]. [Online].
[48] Seshadrinathan K, Soundararajan R, Bovik A C, Cormack L K. Study of subjective and objective quality assessment of video. IEEE Transactions on Image Processing (TIP), 2010. 19: 1427–1441
[49] De Simone F, Tagliasacchi M, Naccari M, Tubaro S, Ebrahimi T. A H.264/AVC video database for the evaluation of quality metrics. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010 2430–2433
[50] Vqeg hdtv phase i database. https://www.its.bldrdoc.gov/vqeg/projects/hdtv/hdtv.aspx, 2010. [Accessed 2023-06-28]. [Online].
[51] Fan Z, Songnan L, Lin M, Yuk C W, King N N. Ivp subjective quality video database. http://ivp.ee.cuhk.edu.hk/research/database/subjective/, 2011. [Accessed 2023-06-28]. [Online].
[52] Keimel C, Redl A, Diepold K. The tum high definition video datasets. In: Proceedings of the International Workshop on Quality of Multimedia Experience, 2012 97–102
[53] Moorthy A K, Choi L K, Bovik A C, De Veciana G. Video quality assessment on mobile devices: Subjective, behavioral and objective studies. IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2012. 6: 652–671
[54] Vu P V, Chandler D M. ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices. Journal of Electronic Imaging, 2014. 23: 013016–013016
[55] Lin J Y, Song R, Wu C H, Liu T, Wang H, Kuo C C J. Mcl-v: A streaming video quality assessment database. Journal of Visual Communication and Image Representation, 2015. 30: 1–9
[56] Wang H, Gan W, Hu S, Lin J Y, Jin L, Song L, Wang P, Katsavounidis I, Aaron A, Kuo C C J. Mcl-jcv: a jnd-based H.264/AVC video quality assessment dataset. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2016 1509–1513
[57] Nuutinen M, Virtanen T, Vaahteranoksa M, Vuori T, Oittinen P, Häkkinen J. Cvd2014—a database for evaluating no-reference video quality assessment algorithms. IEEE Transactions on Image Processing (TIP), 2016. 25: 3073–3086
[58] Ghadiyaram D, Pan J, Bovik A C, Moorthy A K, Panda P, Yang K C. In-capture mobile video distortions: A study of subjective behavior and objective algorithms. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2017. 28: 2061–2077
[59] Hosu V, Hahn F, Jenadeleh M, Lin H, Men H, Szirányi T, Li S, Saupe D. The konstanz natural video database (konvid-1k). In: Proceedings of the International Conference on Quality of Multimedia Experience (QoMEX), 2017 1–6
[60] Sinno Z, Bovik A C. Large-scale study of perceptual video quality. IEEE Transactions on Image Processing (TIP), 2018. 28: 612–627
[61] Wang Y, Inguva S, Adsumilli B. Youtube ugc dataset for video compression research. In: Proceedings of the International Workshop on Multimedia Signal Processing (MMSP), 2019 1–5
[62] Ying Z, Mandal M, Ghadiyaram D, Bovik A. Patch-vq: ‘patching up’ the video quality problem. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021 14019–14029
[63] Li Y, Meng S, Zhang X, Wang S, Wang Y, Ma S. Ugc-video: perceptual quality assessment of user-generated videos. In: Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2020 35–38
[64] Yu X, Birkbeck N, Wang Y, Bampis C G, Adsumilli B, Bovik A C. Predicting the quality of compressed videos with pre-existing distortions. IEEE Transactions on Image Processing (TIP), 2021. 30: 7511–7526
[65] Wang Y, Ke J, Talebi H, Yim J G, Birkbeck N, Adsumilli B, Milanfar P, Yang F. Rich features for perceptual quality assessment of ugc videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021 13435–13444
[66] Haiqiang W, Gary L, Shan L, C-C Jay K. Icme 2021 ugc-vqa challenge. http://ugcvqa.com/, 2021. [Accessed 2023-06-28]. [Online].
[67] Zhang Z, Wu W, Sun W, Tu D, Lu W, Min X, Chen Y, Zhai G. Md-vqa: Multi-dimensional quality assessment for ugc live videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023 1746–1755
[68] Chen C, Choi L K, De Veciana G, Caramanis C, Heath R W, Bovik A C. Modeling the time-varying subjective quality of http video streams with rate adaptations. IEEE Transactions on Image Processing (TIP), 2014. 23: 2206–2221
[69] Ghadiyaram D, Bovik A C, Yeganeh H, Kordasiewicz R, Gallant M. Study of the effects of stalling events on the quality of experience of mobile streaming videos. In: Proceedings of the IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2014 989–993
[70] Bampis C G, Li Z, Moorthy A K, Katsavounidis I, Aaron A, Bovik A C. Study of temporal effects on subjective video quality of experience. IEEE Transactions on Image Processing (TIP), 2017. 26: 5217–5231
[71] Bampis C G, Li Z, Katsavounidis I, Huang T Y, Ekanadham C, Bovik A C. Towards perceptually optimized adaptive video streaming-a realistic quality of experience database. IEEE Transactions on Image Processing (TIP), 2021. 30: 5182–5197
[72] Duanmu Z, Zeng K, Ma K, Rehman A, Wang Z. A quality-of-experience index for streaming video. IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2016. 11: 154–166
[73] Duanmu Z, Ma K, Wang Z. Quality-of-experience of adaptive video streaming: Exploring the space of adaptations. In: Proceedings of the ACM international conference on Multimedia (ACM MM), 2017 1752–1760
[74] Duanmu Z, Rehman A, Wang Z. A quality-of-experience database for adaptive video streaming. IEEE Transactions on Broadcasting (TBC), 2018. 64: 474–487
[75] Duanmu Z, Liu W, Li Z, Chen D, Wang Z, Wang Y, Gao W. Waterloo streaming quality-of-experience database iv. https://ieee-dataport.org/open-access/waterloo-streaming-quality-experience-database-iv, 2019. [Accessed 2023-06-28]. [Online].
[76] De Silva V, Arachchi H K, Ekmekcioglu E, Kondoz A. Toward an impairment metric for stereoscopic video: A full-reference video quality metric to assess compressed stereoscopic video. IEEE Transactions on Image Processing (TIP), 2013. 22: 3392–3404
[77] Jumisko-Pyykkö S, Haustola T, Boev A, Gotchev A. Subjective evaluation of mobile 3d video content: depth range versus compression artifacts. In: Proceedings of the SPIE. SPIE, volume 7881, 2011 126–137
[78] Goldmann L, De Simone F, Ebrahimi T. A comprehensive database and subjective evaluation methodology for quality of experience in stereoscopic video. In: Proceedings of the Three-Dimensional Image Processing (3DIP) and Applications. SPIE, volume 7526, 2010 242–252
[79] Urvoy M, Barkowsky M, Cousseau R, Koudota Y, Ricorde V, Le Callet P, Gutierrez J, Garcia N. Nama3ds1-cospad1: Subjective video quality assessment database on coding conditions introducing freely available high quality 3d stereoscopic sequences. In: Proceedings of the International Workshop on Quality of Multimedia Experience, 2012 109–114
[80] Banitalebi-Dehkordi A, Pourazad M T, Nasiopoulos P. Effect of high frame rates on 3d video quality of experience. In: Proceedings of the IEEE International Conference on Consumer Electronics (ICCE), 2014 416–417
[81] Banitalebi-Dehkordi A, Pourazad M T, Nasiopoulos P. The effect of frame rate on 3d video quality and bitrate. 3D Research, 2015. 6: 1–13
[82] Dumić E, Grgić S, Šakić K, Rocha P M R, da Silva Cruz L A. 3d video subjective quality: a new database and grade comparison study. Multimedia Tools and Applications (MTAP), 2017. 76: 2087–2109
[83] Wang J, Wang S, Wang Z. Asymmetrically compressed stereoscopic 3d videos: Quality assessment and rate-distortion performance evaluation. IEEE Transactions on Image Processing (TIP), 2017. 26: 1330–1343
[84] Zhang Y, Wang Y, Liu F, Liu Z, Li Y, Yang D, Chen Z. Subjective panoramic video quality assessment database for coding applications. IEEE Transactions on Broadcasting (TBC), 2018. 64: 461–473
[85] Zhang B, Zhao J, Yang S, Zhang Y, Wang J, Fei Z. Subjective and objective quality assessment of panoramic videos in virtual reality environments. In: Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2017 163–168
[86] Lopes F, Ascenso J, Rodrigues A, Queluz M P. Subjective and objective quality assessment of omnidirectional video. In: Applications of Digital Image Processing XLI. SPIE, volume 10752, 2018 249–265
[87] Singla A, Fremerey S, Robitza W, Lebreton P, Raake A. Comparison of subjective quality evaluation for hevc encoded omnidirectional videos at different bit-rates for uhd and fhd resolution. In: Proceedings of the Thematic Workshops of ACM Multimedia, 2017 511–519
[88] Xu M, Li C, Chen Z, Wang Z, Guan Z. Assessing visual quality of omnidirectional videos. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2018. 29: 3516–3530
[89] Tran H T, Pham C T, Ngoc N P, Pham A T, Thang T C. A study on quality metrics for 360 video communications. IEICE TRANSACTIONS on Information and Systems, 2018. 101: 28–36
[90] Li C, Xu M, Du X, Wang Z. Bridge the gap between vqa and human behavior on omnidirectional video: A large-scale dataset and a deep learning model. In: Proceedings of the ACM international conference on Multimedia, 2018 932–940
[91] Jin Y, Chen M, Goodall T, Patney A, Bovik A C. Subjective and objective quality assessment of 2d and 3d foveated video compression in virtual reality. IEEE Transactions on Image Processing (TIP), 2021. PP: 1–1
[92] Mackin A, Zhang F, Bull D R. A study of high frame rate video formats. IEEE Transactions on Multimedia (TMM), 2018. 21: 1499–1512
[93] Madhusudana P C, Yu X, Birkbeck N, Wang Y, Adsumilli B, Bovik A C. Subjective and objective quality assessment of high frame rate videos. IEEE Access, 2021. 9: 108069–108082
[94] Lee D Y, Paul S, Bampis C G, Ko H, Kim J, Jeong S Y, Homan B, Bovik A C. A subjective and objective study of space-time subsampled video quality. IEEE Transactions on Image Processing (TIP), 2021. 31: 934–948
[95] Men H, Hosu V, Lin H, Bruhn A, Saupe D. Visual quality assessment for interpolated slow-motion videos based on a novel database. In: Proceedings of the IEEE International Conference on Quality of Multimedia Experience (QoMEX), 2020 1–6
[96] Danier D, Zhang F, Bull D. Bvi-vfi: A video quality database for video frame interpolation. arXiv preprint arXiv:2210.00823, 2022
[97] Winkler S, Faller C. Perceived audiovisual quality of low-bitrate multimedia content. IEEE Transactions on Multimedia (TMM), 2006. 8: 973–980
[98] Pinson M H, Janowski L, Pépion R, Huynh-Thu Q, Schmidmer C, Corriveau P, Younkin A, Le Callet P, Barkowsky M, Ingram W. The influence of subjects and environment on audiovisual subjective tests: An international study. IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2012. 6: 640–651
[99] Pinson M H, Schmidmer C, Janowski L, Pépion R, Huynh-Thu Q, Corriveau P, Younkin A, Le Callet P, Barkowsky M, Ingram W. Subjective and objective evaluation of an audiovisual subjective dataset for research and development. In: Proceedings of the International Workshop on Quality of Multimedia Experience (QoMEX), 2013 30–31
[100] Demirbilek E, Grégoire J C. Towards reduced reference parametric models for estimating audiovisual quality in multimedia services. In: Proceedings of the IEEE International Conference on Communications (ICC), 2016 1–6
[101] Martinez H B, Farias M C. Full-reference audio-visual video quality metric. Journal of Electronic Imaging, 2014. 23: 061108–061108
[102] Martinez H A B, Farias M C. Combining audio and video metrics to assess audio-visual quality. Multimedia Tools and Applications, 2018. 77: 23993–24012
[103] Fela R F, Pastor A, Le Callet P, Zacharov N, Vigier T, Forchhammer S. Perceptual evaluation on audio-visual dataset of 360 content. In: Proceedings of the IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2022 1–6
[104] Banitalebi-Dehkordi A, Azimi M, Pourazad M T, Nasiopoulos P. Compression of high dynamic range video using the hevc and H.264/AVC standards. In: Proceedings of the IEEE International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness, 2014 8–12
[105] Narwaria M, Da Silva M P, Le Callet P. Study of high dynamic range video quality assessment. In: Proceedings of the Applications of Digital Image Processing. SPIE, volume 9599, 2015 289–301
[106] Mukherjee R, Debattista K, Bashford-Rogers T, Vangorp P, Mantiuk R, Bessa M, Waterfield B, Chalmers A. Objective and subjective evaluation of high dynamic range video compression. Signal Processing: Image Communication, 2016. 47: 426–437
[107] Yeganeh H, Wang S, Zeng K, Eisapour M, Wang Z. Objective quality assessment of tone-mapped videos. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2016 899–903
[108] Azimi M, Banitalebi-Dehkordi A, Dong Y, Pourazad M T, Nasiopoulos P. Evaluating the performance of existing full-reference quality metrics on high dynamic range (hdr) video content. arXiv preprint arXiv:1803.04815, 2018
[109] Athar S, Costa T, Zeng K, Wang Z. Perceptual quality assessment of uhd-hdr-wcg videos. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2019 1740–1744
[110] Cheng S, Zeng H, Chen J, Hou J, Zhu J, Ma K K. Screen content video quality assessment: Subjective and objective study. IEEE Transactions on Image Processing (TIP), 2020. 29: 8636–8651
[111] Li T, Min X, Zhao H, Zhai G, Xu Y, Zhang W. Subjective and objective quality assessment of compressed screen content videos. IEEE Transactions on Broadcasting (TBC), 2020. 67: 438–449
[112] Barman N, Zadtootaghaj S, Schmidt S, Martini M G, Möller S. Gamingvideoset: a dataset for gaming video streaming applications. In: Proceedings of the Annual Workshop on Network and Systems Support for Games (NetGames). IEEE, 2018 1–6
[113] Barman N, Jammeh E, Ghorashi S A, Martini M G. No-reference video quality estimation based on machine learning for passive gaming video streaming applications. IEEE Access, 2019. 7: 74511–74527
[114] Zadtootaghaj S, Schmidt S, Sabet S S, Möller S, Griwodz C. Quality estimation models for gaming video streaming services using perceptual video quality dimensions. In: Proceedings of the ACM Multimedia Systems Conference (ACM MM), 2020 213–224
[115] Wen S, Ling S, Wang J, Chen X, Jing Y, Le Callet P. Subjective and objective quality assessment of mobile gaming video. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022 1810–1814
[116] Duan H, Zhu X, Zhu Y, Min X, Zhai G. A quick review of human perception in immersive media. IEEE Open Journal on Immersive Displays (OJ-ID), 2024
[117] Liu Z, Yeh R A, Tang X, Liu Y, Agarwala A. Video frame synthesis using deep voxel flow. In: Proceedings of the IEEE International Conference on Computer Vision, 2017 4463–4471
[118] Xu X, Siyao L, Sun W, Yin Q, Yang M H. Quadratic video interpolation. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019. 32
[119] Danier D, Zhang F, Bull D. St-mfnet: A spatio-temporal multi-flow network for frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 3521–3531
[120] Wang Z, Bovik A C, Sheikh H R, Simoncelli E P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 2004. 13: 600–612
[121] Wang Z, Simoncelli E P, Bovik A C. Multiscale structural similarity for image quality assessment. In: Proceedings of the Asilomar Conference on Signals, Systems & Computers. IEEE, volume 2, 2003 1398–1402
[122] Sheikh H R, Bovik A C. Image information and visual quality. IEEE Transactions on Image Processing (TIP), 2006. 15: 430–444
[123] Zhang R, Isola P, Efros A A, Shechtman E, Wang O. The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018 586–595
[124] Wang Z, Lu L, Bovik A C. Video quality assessment based on structural distortion measurement. Signal processing: Image communication, 2004. 19: 121–132
[125] Wang Z, Li Q. Video quality assessment using a statistical model of human visual speed perception. JOSA A, 2007. 24: B61–B69
[126] Moorthy A K, Bovik A C. Efficient video quality assessment along temporal trajectories. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2010. 20: 1653–1658
[127] Moorthy A K, Bovik A C. A motion compensated approach to video quality assessment. In: Proceedings of the Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers, 2009 872–875
[128] Seshadrinathan K, Bovik A C. Temporal hysteresis model of time varying subjective video quality. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011 1153–1156
[129] Park J, Seshadrinathan K, Lee S, Bovik A C. Video quality pooling adaptive to perceptual distortion severity. IEEE Transactions on Image Processing (TIP), 2012. 22: 610–620
[130] Manasa K, Channappayya S S. An optical flow-based full reference video quality assessment algorithm. IEEE Transactions on Image Processing (TIP), 2016. 25: 2480–2492
[131] Zeng K, Wang Z. 3d-ssim for video quality assessment. In: Proceedings of the 19th IEEE International Conference on Image Processing (ICIP), 2012 621–624
[132] Seshadrinathan K, Bovik A C. Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing (TIP), 2009. 19: 335–350
[133] Seshadrinathan K, Bovik A C. Motion-based perceptual quality assessment of video. In: Human Vision and Electronic Imaging XIV. SPIE, volume 7240, 2009 283–294
[134] Choi L K, Bovik A C. Video quality assessment accounting for temporal visual masking of local flicker. Signal Processing: Image Communication, 2018. 67: 182–198
[135] Choi L K, Bovik A C. Flicker sensitive motion tuned video quality assessment. In: Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI). IEEE, 2016 29–32
[136] Wang Y, Jiang T, Ma S, Gao W. Novel spatio-temporal structural information based video quality metric. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2012. 22: 989–998
[137] Vu P V, Vu C T, Chandler D M. A spatiotemporal most-apparent-distortion model for video quality assessment. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2011 2505–2508
[138] Larson E C, Chandler D M. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 2010. 19: 011006
[139] Yan P, Mou X. Video quality assessment based on motion structure partition similarity of spatiotemporal slice images. Journal of Electronic Imaging, 2018. 27: 033019
[140] Xue W, Zhang L, Mou X, Bovik A C. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing (TIP), 2013. 23: 684–695
[141] Zhang F, Bull D R. A perception-based hybrid model for video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2015. 26: 1017–1028
[142] Duan H, Zhai G, Min X, Fang Y, Che Z, Yang X, Zhi C, Yang H, Liu N. Learning to predict where the children with asd look. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2018 704–708
[143] Fang Y, Duan H, Shi F, Min X, Zhai G. Identifying children with autism spectrum disorder based on gaze-following. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2020 423–427
[144] Wu J, Liu Y, Dong W, Shi G, Lin W. Quality assessment for video with degradation along salient trajectories. IEEE Transactions on Multimedia (TMM), 2019. 21: 2738–2749
[145] You J, Ebrahimi T, Perkis A. Attention driven foveated video quality assessment. IEEE Transactions on Image Processing (TIP), 2013. 23: 200–213
[146] Peng P, Liao D, Li Z N. An efficient temporal distortion measure of videos based on spacetime texture. Pattern Recognition, 2017. 70: 1–11
[147] Zhang W, Liu H. Study of saliency in objective video quality assessment. IEEE Transactions on Image Processing (TIP), 2017. 26: 1275–1288
[148] Freitas P G, Akamine W Y, Farias M C. Using multiple spatio-temporal features to estimate video quality. Signal Processing: Image Communication, 2018. 64: 1–10
[149] Li Z, Aaron A, Katsavounidis I, Moorthy A, Manohara M, et al. Toward a practical perceptual video quality metric. The Netflix Tech Blog, 2016. 6: 2
[150] Li S, Zhang F, Ma L, Ngan K N. Image quality assessment by separately evaluating detail losses and additive impairments. IEEE Transactions on Multimedia (TMM), 2011. 13: 935–949
[151] Bampis C G, Li Z, Bovik A C. Spatiotemporal feature integration and model fusion for full reference video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2018. 29: 2256–2270
[152] Bampis C G, Bovik A C, Li Z. A simple prediction fusion improves data-driven full-reference video quality assessment models. In: Proceedings of the Picture Coding Symposium (PCS), 2018 298–302
[153] Venkataramanan A K, Stejerean C, Bovik A C. Funque: Fusion of unified quality evaluators. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2022 2147–2151
[154] Liu Y, Wu J, Li A, Li L, Dong W, Shi G, Lin W. Video quality assessment with serial dependence modeling. IEEE Transactions on Multimedia (TMM), 2021. 24: 3754–3768
[155] Kim W, Kim J, Ahn S, Kim J, Lee S. Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018 219–234
[156] Xu M, Chen J, Wang H, Liu S, Li G, Bai Z. C3dvqa: Full-reference video quality assessment with 3d convolutional neural network. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020 4447–4451
[157] Zhang Y, Gao X, He L, Lu W, He R. Objective video quality assessment combining transfer learning with cnn. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2019. 31: 2716–2730
[158] Zhang Y, He L, Lu W, Li J, Gao X. Video quality assessment with dense features and ranking pooling. Neurocomputing, 2021. 457: 242–253
[159] Wo W, Zhang Y, Hu Y, Chen Z, Liu S. Video quality assessment based on quality aggregation networks. In: Proceedings of the IEEE International Conference on Visual Communications and Image Processing (VCIP), 2022 1–5
[160] Li Y, Feng L, Xu J, Zhang T, Liao Y, Li J. Full-reference and no-reference quality assessment for compressed user-generated content videos. In: Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2021 1–6
[161] Sun W, Wang T, Min X, Yi F, Zhai G. Deep learning based full-reference and no-reference quality assessment models for compressed ugc videos. In: Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2021 1–6
[162] Li Y, Meng S, Zhang X, Wang M, Wang S, Wang Y, Ma S. User-generated video quality assessment: A subjective and objective study. IEEE Transactions on Multimedia (TMM), 2021
[163] Pinson M H, Wolf S. A new standardized method for objectively measuring video quality. IEEE Transactions on Broadcasting (TBC), 2004. 50: 312–322
[164] Masry M, Hemami S S, Sermadevi Y. A scalable wavelet-based video distortion metric and applications. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2006. 16: 260–273
[165] Le Callet P, Viard-Gaudin C, Barba D. A convolutional neural network approach for objective video quality assessment. IEEE Transactions on Neural Networks, 2006. 17: 1316–1327
[166] Gunawan I P, Ghanbari M. Reduced-reference video quality assessment using discriminative local harmonic strength with motion consideration. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2008. 18: 71–83
[167] Zeng K, Wang Z. Temporal motion smoothness measurement for reduced-reference video quality assessment. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010 1010–1013
[168] Ma L, Li S, Ngan K N. Reduced-reference video quality assessment of compressed video sequences. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2012. 22: 1441–1456
[169] Zhu K, Barkowsky M, Shen M, Le Callet P, Saupe D. Optimizing feature pooling and prediction models of vqa algorithms. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2014 541–545
[170] Soundararajan R, Bovik A C. Video quality assessment by reduced reference spatio-temporal entropic differencing. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2012. 23: 684–694
[171] Bampis C G, Gupta P, Soundararajan R, Bovik A C. Speed-qa: Spatial efficient entropic differencing for image and video quality. IEEE Signal Processing Letters, 2017. 24: 1333–1337
[172] Wang M, Zhang F, Agrafiotis D. A very low complexity reduced reference video quality metric based on spatio-temporal information selection. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2015 571–575
[173] Melcher D, Wolf S. Objective measures for detecting digital tiling. T1A1, 1995. 5: 95–104
[174] Webster A A, Jones C T, Pinson M H, Voran S D, Wolf S. Objective video quality assessment system based on human perception. In: Human Vision, Visual Processing, and Digital Display IV. SPIE, volume 1913, 1993 15–26
[175] Tetsuji Y, Kameda M, Miyahara M M. Objective picture quality scale for video images (pqsvideo): definition of distortion factors. In: Proceedings of the Visual Communications and Image Processing. SPIE, volume 4067, 2000 801–809
[176] Wang Z, Bovik A C, Evans B L. Blind measurement of blocking artifacts in images. In: Proceedings of the IEEE International Conference on Image Processing (ICIP). volume 3, 2000 981–984
[177] Zhu K, Hirakawa K, Asari V, Saupe D. A no-reference video quality assessment based on laplacian pyramids. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2013 49–53
[178] Soundararajan R, Bovik A C. Rred indices: Reduced reference entropic differencing for image quality assessment. IEEE Transactions on Image Processing (TIP), 2011. 21: 517–526
[179] Saad M A, Bovik A C, Charrier C. Blind prediction of natural video quality. IEEE Transactions on Image Processing (TIP), 2014. 23: 1352–1365
[180] Xu J, Ye P, Liu Y, Doermann D. No-reference video quality assessment via feature learning. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2014 491–495
[181] Mittal A, Saad M A, Bovik A C. A completely blind video integrity oracle. IEEE Transactions on Image Processing (TIP), 2015. 25: 289–300
[182] Korhonen J. Two-level approach for no-reference consumer video quality assessment. IEEE Transactions on Image Processing (TIP), 2019. 28: 5923–5938
[183] Tu Z, Wang Y, Birkbeck N, Adsumilli B, Bovik A C. UGC-VQA: Benchmarking blind video quality assessment for user generated content. IEEE Transactions on Image Processing (TIP), 2021. 30: 4449–4464
[184] Ebenezer J P, Shang Z, Wu Y, Wei H, Sethuraman S, Bovik A C. Chipqa: No-reference video quality prediction via space-time chips. IEEE Transactions on Image Processing (TIP), 2021. 30: 8059–8074
[185] Li D, Jiang T, Jiang M. Quality assessment of in-the-wild videos. In: Proceedings of the ACM Multimedia Conference (ACM MM), 2019 2351–2359
[186] Li D, Jiang T, Jiang M. Unified quality assessment of in-the-wild videos with mixed datasets training. International Journal of Computer Vision (IJCV), 2021. 129: 1238–1257
[187] Tang J, Dong Y, Xie R, Gu X, Song L, Li L, Zhou B. Deep blind video quality assessment for user generated videos. In: Proceedings of the IEEE International Conference on Visual Communications and Image Processing (VCIP), 2020 156–159
[188] Chen P, Li L, Ma L, Wu J, Shi G. Rirnet: Recurrent-in-recurrent network for video quality assessment. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), 2020 834–842
[189] Chen B, Zhu L, Li G, Lu F, Fan H, Wang S. Learning generalized spatial-temporal deep feature representation for no-reference video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2021. 32: 1903–1916
[190] You J. Long short-term convolutional transformer for no-reference video quality assessment. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), 2021 2112–2120
[191] You J, Lin Y. Efficient transformer with locally shared attention for video quality assessment. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2022 356–360
[192] Wu H, Chen C, Liao L, Hou J, Sun W, Yan Q, Lin W. Discovqa: Temporal distortion-content transformers for video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2023
[193] Xu J, Li J, Zhou X, Zhou W, Wang B, Chen Z. Perceptual quality assessment of internet videos. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), 2021 1248–1257
[194] Ying Z, Ghadiyaram D, Bovik A. Telepresence video quality assessment. In: Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2022 327–347
[195] Li B, Zhang W, Tian M, Jiang J, Zhai G, Wang X. Learning a blind quality evaluator for ugc videos in perceptually relevant domains. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2022 1–6
[196] Li B, Zhang W, Tian M, Zhai G, Wang X. Blindly assess quality of in-the-wild videos via quality-aware pre-training and motion perception. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2022. 32: 5944–5958
[197] Liu Y, Wu J, Li L, Dong W, Shi G. Quality assessment of ugc videos based on decomposition and recomposition. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2022. 33: 1043–1054
[198] Wang Y, Yim J G, Birkbeck N, Ke J, Talebi H, Chen X, Yang F, Adsumilli B. Revisiting the efficiency of ugc video quality assessment. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2022 3016–3020
[199] Telili A, Fezza S A, Hamidouche W, Meftah H F. 2bivqa: Double bi-lstm based video quality assessment of ugc videos. arXiv preprint arXiv:2208.14774, 2022
[200] Lu W, Sun W, Zhang Z, Tu D, Min X, Zhai G. Bh-vqa: Blind high frame rate video quality assessment. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2023 2501–2506
[201] Zhu H, Chen B, Zhu L, Wang S. Learning spatiotemporal interactions for user-generated video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2022. 33: 1031–1042
[202] Zhang A X, Wang Y G, Tang W, Li L, Kwong S. Hvs revisited: A comprehensive video quality assessment framework. arXiv preprint arXiv:2210.04158, 2022
[203] Chen P, Li L, Li H, Wu J, Dong W, Shi G. Dynamic expert-knowledge ensemble for generalizable video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2022
[204] Kwong N W, Chan Y L, Tsang S H, Lun D P K. Quality feature learning via multi-channel cnn and gru for no-reference video quality assessment. IEEE Access, 2023. 11: 28060–28075
[205] Wu H, Liao L, Wang A, Chen C, Hou J, Sun W, Yan Q, Lin W. Towards robust text-prompted semantic criterion for in-the-wild video quality assessment. arXiv preprint arXiv:2304.14672, 2023
[206] Wu H, Liao L, Hou J, Chen C, Zhang E, Wang A, Sun W, Yan Q, Lin W. Exploring opinion-unaware video quality assessment with semantic affinity criterion. arXiv preprint arXiv:2302.13269, 2023
[207] Liu H, Wu M, Yuan K, Sun M, Tang Y, Zheng C, Wen X, Li X. Ada-dqa: Adaptive diverse quality-aware feature acquisition for video quality assessment. arXiv preprint arXiv:2308.00729, 2023
[208] Liu W, Duanmu Z, Wang Z. End-to-end blind quality assessment of compressed videos using deep neural networks. In: Proceedings of the ACM Multimedia Conference (ACM MM), 2018 546–554
[209] You J, Korhonen J. Deep neural networks for no-reference video quality assessment. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2019 2349–2353
[210] Yi F, Chen M, Sun W, Min X, Tian Y, Zhai G. Attention based network for no-reference ugc video quality assessment. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2021 1414–1418
[211] Wen S, Wang J. A strong baseline for image and video quality assessment. arXiv preprint arXiv:2111.07104, 2021
[212] Sun W, Min X, Lu W, Zhai G. A deep learning based no-reference quality assessment model for ugc videos. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), 2022 856–865
[213] Sun W, Wen W, Min X, Lan L, Zhai G, Ma K. Analysis of video quality datasets via design of minimalistic video quality models. arXiv preprint arXiv:2307.13981, 2023
[214] Xing F, Wang Y G, Wang H, Li L, Zhu G. Starvqa: Space-time attention for video quality assessment. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2022 2326–2330
[215] Lin L, Wang Z, He J, Chen W, Xu Y, Zhao T. Deep quality assessment of compressed videos: A subjective and objective study. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2022
[216] Shen W, Zhou M, Liao X, Jia W, Xiang T, Fang B, Shang Z. An end-to-end no-reference video quality assessment method with hierarchical spatiotemporal feature representation. IEEE Transactions on Broadcasting (TBC), 2022. 68: 651–660
[217] Xian W, Zhou M, Fang B, Liao X, Ji C, Xiang T, Jia W. Spatiotemporal feature hierarchy-based blind prediction of natural video quality via transfer learning. IEEE Transactions on Broadcasting (TBC), 2022. 69: 130–143
[218] Guan X, Li F, Zhang Y, Cosman P C. End-to-end blind video quality assessment based on visual and memory attention modeling. IEEE Transactions on Multimedia (TMM), 2022
[219] Lu W, Sun W, Min X, Zhu W, Zhou Q, He J, Wang Q, Zhang Z, Wang T, Zhai G. Deep neural network for blind visual quality assessment of 4k content. arXiv preprint arXiv:2206.04363, 2022
[220] Wu H, Chen C, Hou J, Liao L, Wang A, Sun W, Yan Q, Lin W. Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling. In: Proceedings of the European Conference on Computer Vision (ECCV), 2022 538–554
[221] Wu H, Liao L, Chen C, Hou J, Wang A, Sun W, Yan Q, Lin W. Disentangling aesthetic and technical effects for video quality assessment of user generated content. arXiv preprint arXiv:2211.04894, 2022
[222] Kou T, Liu X, Sun W, Jia J, Min X, Zhai G, Liu N. Stablevqa: A deep no-reference quality assessment model for video stability. arXiv preprint arXiv:2308.04904, 2023
[223] Yuan K, Kong Z, Zheng C, Sun M, Wen X. Capturing co-existing distortions in user-generated content for no-reference video quality assessment. arXiv preprint arXiv:2307.16813, 2023
[224] Ke J, Zhang T, Wang Y, Milanfar P, Yang F. Mret: Multi-resolution transformer for video quality assessment. Frontiers in Signal Processing, 2023. 3: 1137006
[225] Wu J, Liu Y, Li L, Dong W, Shi G. No-reference video quality assessment with heterogeneous knowledge ensemble. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), 2021 4174–4182
[226] Liu Y, Wu J, Li L, Dong W, Zhang J, Shi G. Spatiotemporal representation learning for blind video quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2021. 32: 3500–3513
[227] Chen P, Li L, Wu J, Dong W, Shi G. Unsupervised curriculum domain adaptation for no-reference video quality assessment. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021 5178–5187
[228] Chen P, Li L, Wu J, Dong W, Shi G. Contrastive self-supervised pre-training for video quality assessment. IEEE Transactions on Image Processing (TIP), 2021. 31: 458–471
[229] Madhusudana P C, Birkbeck N, Wang Y, Adsumilli B, Bovik A C. Conviqt: Contrastive video quality estimator. arXiv preprint arXiv:2206.14713, 2022
[230] Mitra S, Soundararajan R. Multiview contrastive learning for completely blind video quality assessment of user generated content. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), 2022 1914–1924
[231] Jiang S, Sang Q, Hu Z, Liu L. Self-supervised representation learning for video quality assessment. IEEE Transactions on Broadcasting (TBC), 2022. 69: 118–129
[232] Mittal A, Moorthy A K, Bovik A C. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing (TIP), 2012. 21: 4695–4708
[233] Mittal A, Soundararajan R, Bovik A C. Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters, 2012. 20: 209–212
[234] Tu Z, Chen C J, Chen L H, Birkbeck N, Adsumilli B, Bovik A C. A comparative evaluation of temporal pooling methods for blind video quality assessment. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2020 141–145
[235] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014
[236] Liu Z, Ning J, Cao Y, Wei Y, Zhang Z, Lin S, Hu H. Video swin transformer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 3202–3211
[237] Hosu V, Lin H, Sziranyi T, Saupe D. Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing (TIP), 2020. 29: 4041–4056
[238] Ying Z, Niu H, Gupta P, Mahajan D, Ghadiyaram D, Bovik A. From patches to pictures (paq-2-piq): Mapping the perceptual space of picture quality. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020 3575–3585
[239] Hara K, Kataoka H, Satoh Y. Learning spatio-temporal features with 3d residual networks for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), 2017 3154–3160
[240] Carreira J, Zisserman A. Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017 6299–6308
[241] Howard A, Sandler M, Chu G, Chen L C, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V, et al. Searching for mobilenetv3. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019 1314–1324
[242] Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M. A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018 6450–6459
[243] Gemmeke J F, Ellis D P, Freedman D, Jansen A, Lawrence W, Moore R C, Plakal M, Ritter M. Audio set: An ontology and human-labeled dataset for audio events. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017 776–780
[244] Zhang W, Ma K, Zhai G, Yang X. Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Transactions on Image Processing (TIP), 2021. 30: 3474–3486
[245] Ciancio A, da Costa A L N T, da Silva E A B, Said A, Samadani R, Obrador P. No-reference blur assessment of digital pictures based on multifeature classifiers. IEEE Transactions on Image Processing (TIP), 2010. 20: 64–75
[246] Ghadiyaram D, Bovik A C. Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing (TIP), 2015. 25: 372–387
[247] Fang Y, Zhu H, Zeng Y, Ma K, Wang Z. Perceptual quality assessment of smartphone photography. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020 3677–3686
[248] Feichtenhofer C, Fan H, Malik J, He K. Slowfast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019 6202–6211
[249] Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017 1492–1500
[250] Tan M, Le Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In: Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2019 6105–6114
[251] Stroud J, Ross D, Sun C, Deng J, Sukthankar R. D3d: Distilled 3d networks for video action recognition. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2020 625–634
[252] Chu G, Arikan O, Bender G, Wang W, Brighton A, Kindermans P J, Liu H, Akin B, Gupta S, Howard A. Discovering multi-hardware mobile models via architecture search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021 3022–3031
[253] Kondratyuk D, Yuan L, Li Y, Zhang L, Tan M, Brown M, Gong B. Movinets: Mobile video networks for efficient video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021 16020–16030
[254] Tan M, Le Q. Efficientnetv2: Smaller models and faster training. In: Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2021 10096–10106
[255] Zhan Y, Zhang R. No-reference image sharpness assessment based on maximum gradient and variability of gradients. IEEE Transactions on Multimedia (TMM), 2017. 20: 1796–1808
[256] Chen G, Zhu F, Ann Heng P. An efficient statistical method for image noise level estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015 477–485
[257] Wang Z, Sheikh H R, Bovik A C. No-reference perceptual quality assessment of jpeg compressed images. In: Proceedings of the International Conference on Image Processing (ICIP). volume 1, 2002 I–I
[258] Panetta K, Gao C, Agaian S. No reference color image contrast and quality measures. IEEE Transactions on Consumer Electronics, 2013. 59: 643–651
[259] Hara K, Kataoka H, Satoh Y. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018 6546–6555
[260] Liu Z, Mao H, Wu C Y, Feichtenhofer C, Darrell T, Xie S. A convnet for the 2020s. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 11976–11986
[261] Liu Y, Zhang X Y, Bian J W, Zhang L, Cheng M M. Samnet: Stereoscopically attentive multi-scale network for lightweight salient object detection. IEEE Transactions on Image Processing (TIP), 2021. 30: 3804–3814
[262] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016 770–778
[263] Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015 4489–4497
[264] Liao L, Xu K, Wu H, Chen C, Sun W, Yan Q, Lin W. Exploring the effectiveness of video perceptual representation in blind video quality assessment. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), 2022 837–846
[265] Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al. Learning transferable visual models from natural language supervision. In: Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2021 8748–8763
[266] Xing F, Wang Y G, Tang W, Zhu G, Kwong S. Starvqa+: Co-training space-time attention for video quality assessment. arXiv preprint arXiv:2306.12298, 2023
[267] Xie S, Tu Z. Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015 1395–1403
[268] Tsai F J, Peng Y T, Lin Y Y, Tsai C C, Lin C W. Stripformer: Strip transformer for fast image deblurring. In: Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2022 146–162
[269] Xu L, Lin W, Ma L, Zhang Y, Fang Y, Ngan K N, Li S, Yan Y. Free-energy principle inspired video quality metric and its use in video coding. IEEE Transactions on Multimedia (TMM), 2016. 18: 590–602
[270] Rassool R. Vmaf reproducibility: Validating a perceptual practical video quality metric. In: Proceedings of the IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), 2017 1–2
[271] Lee S O, Sim D G. Hybrid bitstream-based video quality assessment method for scalable video coding. Optical Engineering, 2012. 51: 067403
[272] Lin X, Ma H, Luo L, Chen Y. No-reference video quality assessment in the compressed domain. IEEE Transactions on Consumer Electronics, 2012. 58: 505–512
[273] Huang X, Søgaard J, Forchhammer S. No-reference pixel based video quality assessment for hevc decoded video. Journal of Visual Communication and Image Representation, 2017. 43: 173–184
[274] Lin L, Zheng Y, Chen W, Lan C, Zhao T. Saliency-aware spatio-temporal artifact detection for compressed video quality assessment. IEEE Signal Processing Letters, 2023
[275] Liu X, Dobrian F, Milner H, Jiang J, Sekar V, Stoica I, Zhang H. A case for a coordinated internet video control plane. In: Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, 2012 359–370
[276] Rodríguez D Z, Wang Z, Rosa R L, Bressan G. The impact of video-quality-level switching on user quality of experience in dynamic adaptive streaming over http. EURASIP Journal on Wireless Communications and Networking, 2014. 2014: 1–15
[277] Nightingale J, Wang Q, Grecos C, Goma S. The impact of network impairment on quality of experience (QoE) in H.265/HEVC video streaming. IEEE Transactions on Consumer Electronics, 2014. 60: 242–250
[278] Bentaleb A, Begen A C, Zimmermann R. Sdndash: Improving qoe of http adaptive streaming using software defined networking. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), 2016 1296–1305
[279] Bampis C G, Bovik A C. Learning to predict streaming video qoe: Distortions, rebuffering and memory. arXiv preprint arXiv:1703.00633, 2017
[280] Bampis C G, Li Z, Bovik A C. Continuous prediction of streaming video qoe using dynamic networks. IEEE Signal Processing Letters, 2017. 24: 1083–1087
[281] Ghadiyaram D, Pan J, Bovik A C. Learning a continuous-time streaming video qoe model. IEEE Transactions on Image Processing (TIP), 2018. 27: 2257–2271
[282] Eswara N, Ashique S, Panchbhai A, Chakraborty S, Sethuram H P, Kuchi K, Kumar A, Channappayya S S. Streaming video qoe modeling and prediction: A long short-term memory approach. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2019. 30: 661–673
[283] Rao R R R, Göring S, Raake A. Avqbits—adaptive video quality model based on bitstream information for various video applications. IEEE Access, 2022. 10: 80321–80351
[284] Duanmu Z, Liu W, Chen D, Li Z, Wang Z, Wang Y, Gao W. A bayesian quality-of-experience model for adaptive streaming videos. ACM Transactions on Multimedia Computing, Communications and Applications (TOMM), 2023. 18: 1–24
[285] Shang Z, Ebenezer J P, Wu Y, Wei H, Sethuraman S, Bovik A C. Study of the subjective and objective quality of high motion live streaming videos. IEEE Transactions on Image Processing (TIP), 2021. 31: 1027–1041
[286] Singh K D, Hadjadj-Aoul Y, Rubino G. Quality of experience estimation for adaptive HTTP/TCP video streaming using H.264/AVC. In: Proceedings of the IEEE Consumer Communications and Networking Conference (CCNC). IEEE, 2012 127–131
[287] Li L, Chen P, Lin W, Xu M, Shi G. From whole video to frames: Weakly-supervised domain adaptive continuous-time qoe evaluation. IEEE Transactions on Image Processing (TIP), 2022. 31: 4937–4951
[288] Yasakethu S, Hewage C T, Fernando W A C, Kondoz A M. Quality analysis for 3d video using 2d video quality models. IEEE Transactions on Consumer Electronics, 2008. 54: 1969–1976
[289] Nur G, Arachchi H K, Dogan S, Kondoz A. Extended vqm model for predicting 3d video quality considering ambient illumination context. In: Proceedings of the 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON). IEEE, 2011 1–4
[290] Hong W, Yu L. A spatio-temporal perceptual quality index measuring compression distortions of three-dimensional video. IEEE Signal Processing Letters, 2017. 25: 214–218
[291] Galkandage C, Calic J, Dogan S, Guillemaut J Y. Stereoscopic video quality assessment using binocular energy. IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2016. 11: 102–112
[292] Appina B, Channappayya S S. Full-reference 3-d video quality assessment using scene component statistical dependencies. IEEE Signal Processing Letters, 2018. 25: 823–827
[293] Zhang Y, Zhang H, Yu M, Kwong S, Ho Y S. Sparse representation-based video quality assessment for synthesized 3d videos. IEEE Transactions on Image Processing (TIP), 2019. 29: 509–524
[294] Galkandage C, Calic J, Dogan S, Guillemaut J Y. Full-reference stereoscopic video quality assessment using a motion sensitive hvs model. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2020. 31: 452–466
[295] Hewage C T, Martini M G. Reduced-reference quality assessment for 3d video compression and transmission. IEEE Transactions on Consumer Electronics, 2011. 57: 1185–1193
[296] Yu M, Zheng K, Jiang G, Shao F, Peng Z. Binocular perception based reduced-reference stereo video quality assessment method. Journal of Visual Communication and Image Representation, 2016. 38: 246–255
[297] Chen Z, Zhou W, Li W. Blind stereoscopic video quality assessment: From depth perception to overall experience. IEEE Transactions on Image Processing (TIP), 2017. 27: 721–734
[298] Yang J, Ji C, Jiang B, Lu W, Meng Q. No reference quality assessment of stereo video based on saliency and sparsity. IEEE Transactions on Broadcasting (TBC), 2018. 64: 341–353
[299] Yang J, Zhu Y, Ma C, Lu W, Meng Q. Stereoscopic video quality assessment based on 3d convolutional neural networks. Neurocomputing, 2018. 309: 83–93
[300] Yang J, Zhao Y, Jiang B, Lu W, Gao X. No-reference quality evaluation of stereoscopic video based on spatio-temporal texture. IEEE Transactions on Multimedia (TMM), 2019. 22: 2635–2644
[301] Appina B, Dendi S V R, Manasa K, Channappayya S S, Bovik A C. Study of subjective quality and objective blind quality prediction of stereoscopic videos. IEEE Transactions on Image Processing (TIP), 2019. 28: 5027–5040
[302] Biswas S, Appina B, Kara P A, Simon A. Jomodevi: A joint motion and depth visibility prediction algorithm for perceived stereoscopic 3d quality. Signal Processing: Image Communication, 2022. 108: 116820
[303] Sun Y, Lu A, Yu L. Weighted-to-spherically-uniform quality evaluation for omnidirectional video. IEEE Signal Processing Letters, 2017. 24: 1408–1412
[304] Zakharchenko V, Choi K P, Park J H. Quality metric for spherical panoramic video. In: Optics and Photonics for Information Processing X. SPIE, volume 9970, 2016 57–65
[305] Yu M, Lakshman H, Girod B. A framework to evaluate omnidirectional video coding schemes. In: Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, 2015 31–36
[306] Zhou Y, Yu M, Ma H, Shao H, Jiang G. Weighted-to-spherically-uniform ssim objective quality evaluation for panoramic video. In: Proceedings of the IEEE International Conference on Signal Processing (ICSP), 2018 54–57
[307] Chen S, Zhang Y, Li Y, Chen Z, Wang Z. Spherical structural similarity index for objective omnidirectional video quality assessment. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2018 1–6
[308] Ozcinar C, Cabrera J, Smolic A. Visual attention-aware omnidirectional video streaming using optimal tiles for virtual reality. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019. 9: 217–230
[309] Meng Y, Ma Z. Viewport-based omnidirectional video quality assessment: Database, modeling and inference. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2021. 32: 120–134
[310] Gao P, Zhang P, Smolic A. Quality assessment for omnidirectional video: A spatio-temporal distortion modeling approach. IEEE Transactions on Multimedia (TMM), 2020. 24: 1–16
[311] Azevedo R G d A, Birkbeck N, Janatra I, Adsumilli B, Frossard P. Multi-feature 360 video quality estimation. IEEE Open Journal of Circuits and Systems, 2021. 2: 338–349
[312] Zhou W, Xu J, Jiang Q, Chen Z. No-reference quality assessment for 360-degree images by analysis of multifrequency information and local-global naturalness. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2021. 32: 1778–1791
[313] Li C, Xu M, Jiang L, Zhang S, Tao X. Viewport proposal cnn for $360^{\circ}$ video quality assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019 10169–10178
[314] Xu M, Jiang L, Li C, Wang Z, Tao X. Viewport-based cnn: A multi-task approach for assessing $360^{\circ}$ video quality. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020. 44: 2198–2215
[315] Duan H, Min X, Sun W, Zhu Y, Zhang X P, Zhai G. Attentive deep image quality assessment for omnidirectional stitching. IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2023
[316] Zhu Y, Min X, Zhu D, Zhai G, Yang X, Zhang W, Gu K, Zhou J. Toward visual behavior and attention understanding for augmented 360 degree videos. ACM Transactions on Multimedia Computing, Communications and Applications (TOMM), 2023. 19: 1–24
[317] Zhu Y, Zhai G, Yang Y, Duan H, Min X, Yang X. Viewing behavior supported visual saliency predictor for 360 degree videos. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2021. 32: 4188–4201
[318] Zhu Y, Zhai G, Min X, Zhou J. Learning a deep agent to predict head movement in 360-degree images. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2020. 16: 1–23
[319] Zhu Y, Zhai G, Min X, Zhou J. The prediction of saliency map for head and eye movements in 360 degree images. IEEE Transactions on Multimedia (TMM), 2019. 22: 2331–2344
[320] Zhu Y, Zhai G, Min X. The prediction of head and eye movement for 360 degree images. Signal Processing: Image Communication, 2018. 69: 15–25
[321] Li J, Zhai G, Zhu Y, Zhou J, Zhang X P. How sound affects visual attention in omnidirectional videos. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2022 3066–3070
[322] Yang Y, Zhu Y, Gao Z, Zhai G. Salgfcn: Graph based fully convolutional network for panoramic saliency prediction. In: Proceedings of the International Conference on Visual Communications and Image Processing (VCIP), 2021 1–5
[323] Ren X, Duan H, Min X, Zhu Y, Shen W, Wang L, Shi F, Fan L, Yang X, Zhai G. Where are the children with autism looking in reality? In: Proceedings of the CAAI International Conference on Artificial Intelligence (CICAI). Springer, 2023 588–600
[324] Yang J, Liu T, Jiang B, Song H, Lu W. 3d panoramic virtual reality video quality assessment based on 3d convolutional neural networks. IEEE Access, 2018. 6: 38669–38682
[325] Fei Z, Wang F, Wang J, Xie X. Qoe evaluation methods for 360-degree vr video transmission. IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2019. 14: 78–88
[326] Yang J, Liu T, Jiang B, Lu W, Meng Q. Panoramic video quality assessment based on non-local spherical cnn. IEEE Transactions on Multimedia (TMM), 2020. 23: 797–809
[327] Xu J, Zhou W, Chen Z. Blind omnidirectional image quality assessment with viewport oriented graph convolutional networks. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2020. 31: 1724–1737
[328] Guo J, Luo Y. No-reference omnidirectional video quality assessment based on generative adversarial networks. Multimedia Tools and Applications (MTAP), 2021. 80: 27531–27552
[329] Zhu H, Li T, Wang C, Jin W, Murali S, Xiao M, Ye D, Li M. Eyeqoe: A novel qoe assessment model for 360-degree videos using ocular behaviors. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2022. 6: 1–26
[330] Yang L, Xu M, Li S, Guo Y, Wang Z. Blind vqa on $360^{\circ}$ video via progressively learning from pixels, frames, and video. IEEE Transactions on Image Processing (TIP), 2022. 32: 128–143
[331] An T, Sun S, Liu R. Panoramic video quality assessment based on spatial-temporal convolutional neural networks. In: Proceedings of the International Conference on Signal and Information Processing, Networking and Computers (ICSINC). Springer, 2022 1348–1356
[332] Ou Y F, Ma Z, Liu T, Wang Y. Perceptual quality assessment of video considering both frame rate and quantization artifacts. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2010. 21: 286–298
[333] Ma Z, Xu M, Ou Y F, Wang Y. Modeling of rate and perceptual quality of compressed video as functions of frame rate and quantization stepsize and its applications. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2011. 22: 671–682
[334] Ou Y F, Xue Y, Wang Y. Q-star: A perceptual video quality model considering impact of spatial, temporal, and amplitude resolutions. IEEE Transactions on Image Processing (TIP), 2014. 23: 2473–2486
[335] Zhang F, Mackin A, Bull D R. A frame rate dependent video quality metric based on temporal wavelet decomposition and spatiotemporal pooling. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2017 300–304
[336] Madhusudana P C, Birkbeck N, Wang Y, Adsumilli B, Bovik A C. St-greed: Space-time generalized entropic differences for frame rate dependent video quality prediction. IEEE Transactions on Image Processing (TIP), 2021. 30: 7446–7457
[337] Madhusudana P C, Birkbeck N, Wang Y, Adsumilli B, Bovik A C. High frame rate video quality assessment using vmaf and entropic differences. In: Proceedings of the Picture Coding Symposium (PCS), 2021 1–5
[338] Lee D Y, Kim J, Ko H, Bovik A C. Video quality model of compression, resolution and frame rate adaptation based on space-time regularities. IEEE Transactions on Image Processing (TIP), 2022. 31: 3644–3656
[339] Zheng Q, Tu Z, Fan Y, Zeng X, Bovik A C. No-reference quality assessment of variable frame-rate videos using temporal bandpass statistics. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022 1795–1799
[340] Yang K C, Huang A M, Nguyen T Q, Guest C C, Das P K. A new objective quality metric for frame interpolation used in video compression. IEEE Transactions on Broadcasting (TBC), 2008. 54: 680–11
[341] Danier D, Zhang F, Bull D. Flolpips: A bespoke video quality metric for frame interpolation. In: Proceedings of the Picture Coding Symposium (PCS), 2022 283–287
[342] Hou Q, Ghildyal A, Liu F. A perceptual quality metric for video frame interpolation. In: Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2022 234–253
[343] Cao Y, Min X, Sun W, Zhai G. Subjective and objective audio-visual quality assessment for user generated content. IEEE Transactions on Image Processing (TIP), 2023
[344] Beerends J G, De Caluwe F E. The influence of video quality on perceived audio quality and vice versa. Journal of the Audio Engineering Society, 1999. 47: 355–362
[345] Hands D S. A basic multimedia quality model. IEEE Transactions on Multimedia (TMM), 2004. 6: 806–816
[346] Martinez H B, Farias M C. A no-reference audio-visual video quality metric. In: Proceedings of the European Signal Processing Conference (EUSIPCO). IEEE, 2014 2125–2129
[347] Cao Y, Min X, Sun W, Zhai G. Attention-guided neural networks for full-reference and no-reference audio-visual quality assessment. IEEE Transactions on Image Processing (TIP), 2023. 32: 1882–1896
[348] Narwaria M, Da Silva M P, Le Callet P. Hdr-vqm: An objective quality measure for high dynamic range video. Signal Processing: Image Communication, 2015. 35: 46–60
[349] Ebenezer J P, Shang Z, Wu Y, Wei H, Sethuraman S, Bovik A C. Making video quality assessment models robust to bit depth. IEEE Signal Processing Letters, 2023
[350] Ebenezer J P, Shang Z, Wu Y, Wei H, Sethuraman S, Bovik A C. Hdr-chipqa: No-reference quality assessment on high dynamic range videos. arXiv preprint arXiv:2304.13156, 2023
[351] Zeng H, Huang H, Hou J, Cao J, Wang Y, Ma K K. Screen content video quality assessment model using hybrid spatiotemporal features. IEEE Transactions on Image Processing, 2022. 31: 6175–6187
[352] Saha A, Chen Y C, Davis C, Qiu B, Wang X, Gowda R, Katsavounidis I, Bovik A C. Study of subjective and objective quality assessment of mobile cloud gaming videos. IEEE Transactions on Image Processing, 2023
[353] Yu X, Tu Z, Birkbeck N, Wang Y, Adsumilli B, Bovik A C. Perceptual quality assessment of ugc gaming videos. arXiv preprint arXiv:2204.00128, 2022
[354] Shang Z, Chen Y, Wu Y, Wei H, Sethuraman S. Subjective and objective video quality assessment of high dynamic range sports content. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023 556–564
[355] Ebenezer J P, Shang Z, Chen Y, Wu Y, Wei H, Sethuraman S, Bovik A C. Hdr or sdr? a subjective and objective study of scaled and compressed videos. arXiv preprint arXiv:2304.13162, 2023
[356] Melo M, Bessa M, Debattista K, Chalmers A. Evaluation of hdr video tone mapping for mobile devices. Signal Processing: Image Communication, 2014. 29: 247–256
[357] Eilertsen G, Unger J, Mantiuk R K. Evaluation of tone mapping operators for hdr video. In: High dynamic range video, Elsevier, 185–207, 2016
[358] Mantiuk R, Kim K J, Rempel A G, Heidrich W. Hdr-vdp-2: A calibrated visual metric for visibility and quality predictions in all luminance conditions. ACM Transactions on Graphics (TOG), 2011. 30: 1–14
[359] Li T, Min X, Zhu W, Xu Y, Zhang W. No-reference screen content video quality assessment. Displays, 2021. 69: 102030
[360] Motamednia H, Cheraaqee P, Mansouri A, Mahmoudi-Aznaveh A. Quality assessment of screen content videos. In: 2023 6th International Conference on Pattern Recognition and Image Analysis (IPRIA). IEEE, 2023 1–7
[361] Xian W, Zhou M, Fang B, Kwong S. A content-oriented no-reference perceptual video quality assessment method for computer graphics animation videos. Information Sciences, 2022. 608: 1731–1746
[362] Barman N, Schmidt S, Zadtootaghaj S, Martini M G, Möller S. An evaluation of video quality assessment metrics for passive gaming video streaming. In: Proceedings of the 23rd Packet Video Workshop, 2018 7–12
[363] Chen Y C, Saha A, Davis C, Qiu B, Wang X, Gowda R, Katsavounidis I, Bovik A C. Gamival: Video quality prediction on mobile cloud gaming content. IEEE Signal Processing Letters, 2023. 30: 324–328
[364] Final report from the video quality experts group on the validation of objective models of video quality assessment. http://www.vqeg.org/, 2000. [Accessed 2023-06-28]. [Online].
[365] Li Y, Po L M, Cheung C H, Xu X, Feng L, Yuan F, Cheung K W. No-reference video quality assessment with 3d shearlet transform and convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2015. 26: 1044–1057
[366] Kim J, Lee S. Deep learning of human visual sensitivity in image quality assessment framework. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017 1676–1684
[367] Xue W, Mou X, Zhang L, Bovik A C, Feng X. Blind image quality assessment using joint statistics of gradient magnitude and laplacian features. IEEE Transactions on Image Processing (TIP), 2014. 23: 4850–4862
[368] Kundu D, Ghadiyaram D, Bovik A C, Evans B L. No-reference quality assessment of tone-mapped HDR pictures. IEEE Transactions on Image Processing (TIP), 2017. 26: 2957–2971
[369] Ghadiyaram D, Bovik A C. Perceptual quality prediction on authentically distorted images using a bag of features approach. Journal of Vision, 2017. 17: 32–32
[370] Ye P, Kumar J, Kang L, Doermann D. Unsupervised feature learning framework for no-reference image quality assessment. In: Proceedings of the IEEE Conference Computer Vision Pattern Recognition (CVPR), 2012 1098–1105
[371] Xu J, Ye P, Li Q, Du H, Liu Y, Doermann D. Blind image quality assessment based on high order statistics aggregation. IEEE Transactions on Image Processing (TIP), 2016. 25: 4444–4457
[372] Ying Z, Niu H, Gupta P, Mahajan D, Ghadiyaram D, Bovik A. From patches to pictures (PaQ-2-PiQ): Mapping the perceptual space of picture quality. In: Proceedings of the IEEE Conference Computer Vision Pattern Recognition (CVPR), 2020
[373] Li D, Jiang T, Jiang M. Unified quality assessment of in-the-wild videos with mixed datasets training. International Journal of Computer Vision, 2021
[374] Tu Z, Yu X, Wang Y, Birkbeck N, Adsumilli B, Bovik A C. Rapique: Rapid and accurate video quality prediction of user generated content. IEEE Open Journal of Signal Processing, 2021. 2: 425–440
[375] Tong F, Meng M, Blake R. Neural bases of binocular rivalry. Trends in Cognitive Sciences, 2006. 10: 502–511
[376] Blake R, Logothetis N K. Visual competition. Nature Reviews Neuroscience, 2002. 3: 13–21
[377] Duan H, Zhai G, Min X, Che Z, Fang Y, Yang X, Gutiérrez J, Callet P L. A dataset of eye movements for the children with autism spectrum disorder. In: Proceedings of the ACM Multimedia Systems Conference (ACM MMSys), 2019 255–260
[378] Duan H, Min X, Fang Y, Fan L, Yang X, Zhai G. Visual attention analysis and prediction on human faces for children with autism spectrum disorder. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2019. 15: 1–23
[379] Tu D, Min X, Duan H, Guo G, Zhai G, Shen W. End-to-end human-gaze-target detection with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022 2192–2200
[380] Tu D, Min X, Duan H, Guo G, Zhai G, Shen W. Iwin: Human-object interaction detection via transformer with irregular windows. In: Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2022 87–103
[381] Shi F, Sun W, Duan H, Liu X, Hu M, Wang W, Zhai G. Drawing reveals hallmarks of children with autism. Displays, 2021. 67: 102000
[382] Shi G, Xiao Y, Li Y, Xie X. From semantic communication to semantic-aware networking: Model, architecture, and open problems. IEEE Communications Magazine, 2021. 59: 44–50
[383] Duan H, Shen W, Min X, Tu D, Li J, Zhai G. Saliency in augmented reality. In: Proceedings of the ACM International Conference on Multimedia (ACM MM), 2022 6549–6558
[384] Duan H, Min X, Shen W, Zhai G. A unified two-stage model for separating superimposed images. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022 2065–2069
[385] Mantiuk R K, Denes G, Chapiro A, Kaplanyan A, Rufo G, Bachy R, Lian T, Patney A. Fovvideovdp: A visible difference predictor for wide field-of-view video. ACM Transactions on Graphics (TOG), 2021. 40: 1–19
[386] Liu H, Li C, Li Y, Lee Y J. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023
[387] Ye Q, Xu H, Xu G, Ye J, Yan M, Zhou Y, Wang J, Hu A, Shi P, Shi Y, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023
[388] Wu H, Zhang Z, Zhang W, Chen C, Liao L, Li C, Gao Y, Wang A, Zhang E, Sun W, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023
[389] Zhang Z, Wu H, Ji Z, Li C, Zhang E, Sun W, Liu X, Min X, Sun F, Jui S, et al. Q-boost: On visual quality assessment ability of low-level multi-modality foundation models. arXiv preprint arXiv:2312.15300, 2023
[390] Wu H, Zhang Z, Zhang E, Chen C, Liao L, Wang A, Xu K, Li C, Hou J, Zhai G, et al. Q-instruct: Improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783, 2023
[391] Wu H, Zhang Z, Zhang E, Chen C, Liao L, Wang A, Li C, Sun W, Yan Q, Zhai G, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023
[392] Sun W, Gu K, Zhai G, Ma S, Lin W, Le Callet P. Cviqd: Subjective quality evaluation of compressed virtual reality images. In: 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017 3450–3454
[393] Sun W, Gu K, Ma S, Zhu W, Liu N, Zhai G. A large-scale compressed 360-degree spherical image database: From subjective quality evaluation to objective model comparison. In: 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2018 1–6
[394] Duan H, Zhai G, Min X, Zhu Y, Fang Y, Yang X. Perceptual quality assessment of omnidirectional images. In: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2018 1–5
[395] Duan H, Zhai G, Min X, Zhu Y, Fang Y, Yang X. Perceptual quality assessment of omnidirectional images: Subjective experiment and objective model evaluation. ZTE Communications, 2019. 17: 38–47
[396] Sun W, Min X, Zhai G, Gu K, Duan H, Ma S. Mc360iqa: a multi-channel cnn for blind 360-degree image quality assessment. IEEE Journal of Selected Topics in Signal Processing (JSTSP), 2019. 14: 64–77
[397] Duan H, Zhai G, Min X, Zhu Y, Sun W, Yang X. Assessment of visually induced motion sickness in immersive videos. In: Pacific Rim Conference on Multimedia. Springer, 2017 662–672
[398] Yang J, Zhai G, Duan H. Predicting the visual saliency of the people with vims. In: Proceedings of the IEEE Visual Communications and Image Processing (VCIP), 2019 1–4
[399] Dong C, Liang H, Xu X, Han S, Wang B, Zhang P. Semantic communication system based on semantic slice models propagation. IEEE Journal on Selected Areas in Communications, 2022. 41: 202–213
[400] Sun Y, Min X, Duan H, Zhai G. The influence of text-guidance on visual attention. In: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2023 1–5
[401] Zhu Y, Zhu X, Duan H, Li J, Zhang K, Zhu Y, Chen L, Min X, Zhai G. Audio-visual saliency for omnidirectional videos. In: Proceedings of the International Conference on Image and Graphics (ICIG). Springer, 2023 365–378
[402] Radford A, Narasimhan K, Salimans T, Sutskever I, et al. Improving language understanding by generative pre-training. OpenAI, 2018
[403] Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 10684–10695
[404] Zheng L, Chiang W L, Sheng Y, Zhuang S, Wu Z, Zhuang Y, Lin Z, Li Z, Li D, Xing E, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023
[405] Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022
[406] Singer U, Polyak A, Hayes T, Yin X, An J, Zhang S, Hu Q, Yang H, Ashual O, Gafni O, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022
[407] Wu J Z, Ge Y, Wang X, Lei S W, Gu Y, Shi Y, Hsu W, Shan Y, Qie X, Shou M Z. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023 7623–7633
[408] Wang J, Duan H, Liu J, Chen S, Min X, Zhai G. Aigciqa2023: A large-scale image quality assessment database for ai generated images: from the perspectives of quality, authenticity and correspondence. arXiv preprint arXiv:2307.00211, 2023
[409] Zhang Z, Wu W, Cheng Y, Min X, Zhai G, et al. Perceptual quality assessment for fine-grained compressed images. Journal of Visual Communication and Image Representation, 2023. 90: 103696
[410] Xu J, Liu X, Wu Y, Tong Y, Li Q, Ding M, Tang J, Dong Y. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023
[411] Zhang Z, Sun W, Min X, Wang T, Lu W, Zhu W, Zhai G. A no-reference visual quality metric for 3d color meshes. In: 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2021 1–6
[412] Zhang Z, Sun W, Min X, Wang T, Lu W, Zhai G. No-reference quality assessment for 3d colored point cloud and mesh models. IEEE Transactions on Circuits and Systems for Video Technology, 2022. 32: 7618–7631
[413] Zhang Z, Sun W, Zhu Y, Min X, Wu W, Chen Y, Zhai G. Treating point cloud as moving camera videos: A no-reference quality assessment metric. arXiv e-prints, 2022. arXiv–2208
[414] Zhang Z, Sun W, Min X, Zhou Q, He J, Wang Q, Zhai G. Mm-pcqa: Multi-modal learning for no-reference point cloud quality assessment. arXiv preprint arXiv:2209.00244, 2022
[415] Fan Y, Zhang Z, Sun W, Min X, Liu N, Zhou Q, He J, Wang Q, Zhai G. A no-reference quality assessment metric for point cloud based on captured video sequences. In: 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2022 1–5
[416] Zhang Z, Sun W, Zhou Y, Lu W, Zhu Y, Min X, Zhai G. Eep-3dqa: Efficient and effective projection-based 3d model quality assessment. arXiv preprint arXiv:2302.08715, 2023
[417] Zhang Z, Sun W, Wu H, Zhou Y, Li C, Min X, Zhai G, Lin W. Gms-3dqa: Projection-based grid mini-patch sampling for 3d model quality assessment. arXiv preprint arXiv:2306.05658, 2023
[418] Zhang Z, Sun W, Zhou Y, Wu H, Li C, Min X, Liu X, Zhai G, Lin W. Advancing zero-shot digital human quality assessment through text-prompted evaluation. arXiv preprint arXiv:2307.02808, 2023
[419] Zhou Y, Zhang Z, Sun W, Min X, Ma X, Zhai G. A no-reference quality assessment method for digital human head. In: 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023 36–40
[420] Zhang Z, Zhou Y, Sun W, Min X, Zhai G. Simple baselines for projection-based full-reference and no-reference point cloud quality assessment. arXiv preprint arXiv:2310.17147, 2023
[421] Zhang Z, Sun W, Zhu Y, Min X, Wu W, Chen Y, Zhai G. Evaluating point cloud from moving camera videos: A no-reference metric. IEEE Transactions on Multimedia, 2023
[422] Zerman E, Ozcinar C, Gao P, Smolic A. Textured mesh vs coloured point cloud: A subjective study for volumetric video compression. In: 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2020 1–6
[423] Zhang Z, Zhou Y, Sun W, Lu W, Min X, Wang Y, Zhai G. Ddh-qa: A dynamic digital humans quality assessment database. In: 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2023 2519–2524
[424] Zhang Z, Zhou Y, Sun W, Min X, Zhai G. Geometry-aware video quality assessment for dynamic digital human. In: 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023 1365–1369
[425] Fan Y, Zhang Z, Sun W, Min X, Lin J, Zhai G, Liu N. Mv-vvqa: Multi-view learning for no-reference volumetric video quality assessment. In: 2023 31st European Signal Processing Conference (EUSIPCO). IEEE, 2023 670–674
[426] Schwarz S, Preda M, Baroncini V, Budagavi M, Cesar P, Chou P A, Cohen R A, Krivokuća M, Lasserre S, Li Z, et al. Emerging mpeg standards for point cloud compression. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2018. 9: 133–148
[427] Kuo C C J, Madni A M. Green learning: Introduction, examples and outlook. Journal of Visual Communication and Image Representation, 2023. 90: 103685
[428] Mei Z, Wang Y C, Kuo C C J. Blind video quality assessment at the edge. arXiv preprint arXiv:2306.10386, 2023
