Abstract
With the increasing popularity of short videos on social media platforms, assessing the aesthetic quality of these videos has become a significant challenge. In this paper, we first construct a large-scale, properly annotated short video aesthetics (SVA) dataset of 6900 video shots. We then propose a cognitive multi-type feature fusion network (MVVA-Net) for video aesthetic quality assessment. MVVA-Net consists of two branches, an intra-frame aesthetics branch and an inter-frame aesthetics branch, which take different types of video frames as input. The inter-frame branch extracts inter-frame aesthetic features from sequential frames sampled at fixed intervals, while the intra-frame branch extracts intra-frame aesthetic features from key frames selected by the inter-frame difference method. By adaptively fusing these two types of features, the network can evaluate video aesthetic quality effectively. Moreover, MVVA-Net does not require a fixed number of input frames, which greatly enhances its generalization ability. We performed quantitative comparisons and ablation studies. The experimental results show that the two branches of MVVA-Net effectively extract intra-frame and inter-frame aesthetic features from different videos, and that their adaptive fusion enables MVVA-Net to achieve better classification performance and stronger generalization ability than other methods across different datasets.
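The frame-selection pipeline described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function names are hypothetical, the key-frame selector uses a common variant of the inter-frame difference method (ranking frames by mean absolute difference from their predecessor rather than the paper's exact criterion), and the fusion weight is a fixed scalar standing in for the learned adaptive fusion.

```python
import numpy as np

def sample_fixed_interval(frames, step):
    """Sequential frames at a fixed interval (input to the inter-frame branch)."""
    return frames[::step]

def key_frames_by_difference(frames, num_keys):
    """Select key frames with the largest mean absolute inter-frame difference
    (a simple variant of the inter-frame difference method; input to the
    intra-frame branch)."""
    diffs = [np.abs(frames[i].astype(np.int16) - frames[i - 1].astype(np.int16)).mean()
             for i in range(1, len(frames))]
    top = np.argsort(diffs)[::-1][:num_keys]          # indices of largest changes
    idx = sorted(int(i) + 1 for i in top)             # map diff index -> frame index
    return frames[idx]

def fuse(intra_feat, inter_feat, w):
    """Weighted fusion of intra- and inter-frame aesthetic features;
    in MVVA-Net the weighting is learned adaptively, here w is fixed."""
    return w * intra_feat + (1.0 - w) * inter_feat

# Toy video: 12 grayscale frames of 4x4 pixels.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(12, 4, 4), dtype=np.uint8)

seq = sample_fixed_interval(video, step=4)        # 3 frames, inter-frame branch
keys = key_frames_by_difference(video, num_keys=3)  # 3 key frames, intra-frame branch
fused = fuse(np.ones(8), np.zeros(8), w=0.7)
print(seq.shape, keys.shape, float(fused[0]))
```

Because both selectors work on a frame list of arbitrary length, this sampling scheme also reflects the paper's point that the model input is not tied to a fixed number of frames.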
Funding
The authors received support for this research from the Yunnan Provincial Major Science and Technology Special Plan Project "Digitization Research and Application Demonstration of Yunnan Characteristic Industry" under Grant 202002AD080001, and from the National Natural Science Foundation of China under Grants 61772360, 61876125, and 62076180.
Ethics declarations
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The dataset will be released at https://github.com/Lm0324/MVVA-Net.
Cite this article
Li, M., Wang, Z., Ren, J. et al. MVVA-Net: a Video Aesthetic Quality Assessment Network with Cognitive Fusion of Multi-type Feature–Based Strong Generalization. Cogn Comput 14, 1435–1445 (2022). https://doi.org/10.1007/s12559-021-09947-1