A compressed video quality enhancement algorithm based on CNN and transformer hybrid network

Li, Hao; He, Xiaohai; Xiong, Shuhua; He, Haibo; Chen, Honggang

doi:10.1007/s11227-024-06654-0

Research
Published: 07 November 2024

Volume 81, article number 144, (2025)

The Journal of Supercomputing

Hao Li¹,
Xiaohai He¹,
Shuhua Xiong¹,
Haibo He² &
…
Honggang Chen¹


Abstract

Convolutional neural network (CNN)-based algorithms perform well at enhancing the quality of compressed video by removing compression artifacts. Existing state-of-the-art approaches concentrate on exploiting spatiotemporal information from neighboring frames through deformable convolution. However, the offset fields in deformable convolution are difficult to train: their instability during training frequently results in offset overflow, which reduces the efficiency of correlation modeling. Moreover, convolution alone is insufficient for modeling long-range dependencies. We introduce a CNN and transformer-based compressed video quality enhancement (CTVE) method comprising three modules: the feature initial processing (FIP) module, the feature further processing (FFP) module, and the reconstruction module. The FIP module is built on deformable convolution (DCN), which lets it perform an initial extraction of spatiotemporal information from neighboring frames. The FFP module is based on the Swin Transformer V2, which can accurately model relevant contextual information and adapts well to image content. Extensive experiments on the JCT-VC test sequences demonstrate that our method achieves outstanding average performance in both subjective and objective quality assessments.
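
The deformable-convolution mechanism and the offset-overflow problem mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the 3x3 kernel, the zero padding outside the frame, and the reading of "offset overflow" as offsets that push sampling positions outside the frame are all assumptions made for illustration.

```python
import math


def bilinear_sample(frame, y, x):
    """Sample a 2D list at a fractional (y, x); pixels outside the frame count as 0."""
    height, width = len(frame), len(frame[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
            if 0 <= yi < height and 0 <= xi < width:
                value += weight * frame[yi][xi]
    return value


# Regular 3x3 sampling grid around the kernel center.
TAPS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def deformable_response(frame, center, weights, offsets):
    """Response of one 3x3 deformable kernel at `center`.

    Each tap samples at its regular grid position plus a per-tap (dy, dx) offset.
    """
    cy, cx = center
    total = 0.0
    for (ty, tx), weight, (oy, ox) in zip(TAPS, weights, offsets):
        total += weight * bilinear_sample(frame, cy + ty + oy, cx + tx + ox)
    return total


def overflow_ratio(shape, center, offsets):
    """Fraction of taps whose sampling position falls entirely outside the frame."""
    height, width = shape
    cy, cx = center
    outside = 0
    for (ty, tx), (oy, ox) in zip(TAPS, offsets):
        y, x = cy + ty + oy, cx + tx + ox
        if y <= -1 or y >= height or x <= -1 or x >= width:
            outside += 1
    return outside / len(TAPS)


# Hypothetical toy data: a 5x5 horizontal ramp and an averaging kernel.
frame = [[float(col) for col in range(5)] for _ in range(5)]
weights = [1.0 / 9.0] * 9

zero_offsets = [(0.0, 0.0)] * 9     # behaves like an ordinary 3x3 convolution
huge_offsets = [(10.0, 10.0)] * 9   # every tap is pushed outside the 5x5 frame

print(deformable_response(frame, (2, 2), weights, zero_offsets))
print(deformable_response(frame, (2, 2), weights, huge_offsets))
print(overflow_ratio((5, 5), (2, 2), huge_offsets))
```

With zero offsets the kernel averages its regular 3x3 neighborhood. Once every offset leaves the frame, the response collapses to zero and carries no information from the neighboring frame, which is one plausible reading of why overflow hurts correlation modeling.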

Data availability

All data used to support the findings of this study are included within the article.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 62271336 and Grant No. 62211530110).

Author information

Authors and Affiliations

School of Electronic Information, Sichuan University, Chengdu, 610000, Sichuan, China
Hao Li, Xiaohai He, Shuhua Xiong & Honggang Chen
Chengdu Xitu Technology Co., Ltd, Organization, Chengdu, 610000, Sichuan, China
Haibo He

Contributions

H.L., X.H., and S.X. conceived the presented idea. H.H. and H.C. encouraged H.L. to investigate the critical path and supervised the findings of this work. H.L. carried out the experiments and wrote the main manuscript text. All authors discussed the results and revised the manuscript.

Corresponding author

Correspondence to Xiaohai He.

Ethics declarations

Conflict of interest

The authors declare that they have no known conflict of interest or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, H., He, X., Xiong, S. et al. A compressed video quality enhancement algorithm based on CNN and transformer hybrid network. J Supercomput 81, 144 (2025). https://doi.org/10.1007/s11227-024-06654-0

Accepted: 23 October 2024
Published: 07 November 2024
DOI: https://doi.org/10.1007/s11227-024-06654-0
