
Prediction and Reference Quality Adaptation for Learned Video Compression

Xihua Sheng, Li Li, Dong Liu, Houqiang Li
Date of current version June 20, 2024. X. Sheng, L. Li, D. Liu, and H. Li are with the MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei 230027, China (e-mail: xhsheng@mail.ustc.edu.cn, lil1@ustc.edu.cn, dongeliu@ustc.edu.cn, lihq@ustc.edu.cn). Corresponding author: Li Li.
Abstract

Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs. Traditional video codecs adaptively decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress. However, they ignore the prediction and reference quality adaptation, which leads to incorrect utilization of temporal prediction and reconstruction error propagation. Therefore, in this paper, we first propose a confidence-based prediction quality adaptation (PQA) module to provide explicit discrimination for the spatial and channel-wise prediction quality difference. With this module, the prediction with low quality will be suppressed and that with high quality will be enhanced. The codec can adaptively decide which spatial or channel location of predictions to use. Then, we further propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With the filters, it is easier for our codec to achieve the target reconstruction quality according to reference qualities, thus reducing the propagation of reconstruction errors. Experimental results show that our codec obtains higher compression performance than the reference software of H.266/VVC and the previous state-of-the-art learned video codecs in both RGB and YUV420 colorspaces.

Index Terms:
Learned video compression, temporal prediction, prediction quality adaptation, reference quality adaptation.

I Introduction

With the rapid growth of various emerging video applications, such as internet protocol television (IPTV), live streaming, and online meetings, video data now accounts for most internet traffic. The large amount of video data brings large transmission and storage costs. Efficient video compression is therefore in high demand.

Over the past several decades, a series of traditional video coding standards have been developed by the coding experts from the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), such as H.264/AVC [1], H.265/HEVC [2], and H.266/VVC [3]. Thanks to a variety of advanced coding technologies, the codecs based on these coding standards have significantly improved video compression performance. Among these techniques, temporal prediction plays a vital role in reducing temporal redundancy.

Various prediction coding modes exist in traditional video codecs. The decision of coding modes commonly adapts to prediction quality and reference quality. In terms of prediction quality, for example, for regions with translational motion, such as moving vehicles and pedestrians, merge mode and advanced motion vector prediction (AMVP) mode [4] tend to obtain higher prediction quality. For regions with non-translational motion, such as zooming and rotation, affine mode [5, 6] tends to be selected. In terms of reference quality, for example, if the quality of the reference frame is low, the proportion of skip mode [2, 3] will decrease.

Although traditional video codecs are still being refined, it is more and more challenging to achieve large improvements under limited coding complexity increases. To break through the bottleneck of compression performance, in the past few years, various learned video compression schemes on top of deep neural networks have been proposed. These schemes can be roughly divided into five classes: volume coding-based [7, 8], temporal entropy modeling-based [9, 10], implicit neural representation-based [11, 12, 13, 14], residual coding-based [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36], and temporal context mining-based [37, 38, 39, 40, 41, 42, 43, 44, 45, 46]. Among them, temporal context mining-based schemes have achieved the highest compression performance, and some recently proposed schemes [39, 46, 47] have even outperformed the reference software of H.266/VVC.

However, different from traditional video codecs, existing temporal context mining-based learned video codecs ignore the adaptation for prediction quality. They commonly use multi-channel motion compensation to learn temporal contexts as predictions. Since the complexities of video content and motion patterns are unevenly distributed in space, spatial-wise prediction quality difference exists in temporal contexts. In addition, since each channel of a temporal context is compensated by different motion vectors [46], channel-wise prediction quality difference also exists in temporal contexts. Without explicit discrimination, the temporal contexts are directly stacked into the contextual encoder-decoder as conditions to reduce temporal redundancy. However, due to the prediction quality difference, it is difficult for the codec to decide which spatial or channel location of contexts to use as predictions. Therefore, in this paper, we propose a simple yet effective prediction quality adaptation (PQA) module to adaptively explore temporal contexts. In this module, by comparing a temporal context and an intermediate feature of the contextual encoder or decoder, a confidence value is calculated for each spatial and channel location of the temporal context that indicates the correctness of the temporal context. With the confidence map, the prediction with low quality will be suppressed and the prediction with high quality will be enhanced. The codec can adaptively decide which spatial or channel location of the temporal context to use as predictions.

In addition to prediction quality, existing temporal context mining-based learned video codecs also ignore the adaptation for reference quality, which makes them suffer from reconstruction error propagation. The main reason is that in learned video codecs, the reconstruction distortion comes not only from quantization but also from the lossy transform. The non-linear transform networks generate an implicit quantization for input frames. The degree of the implicit quantization depends on the $\lambda$ in the rate-distortion loss function and the quality of the reference that is used to predict temporal contexts. The transform networks need to adapt to the reference quality to achieve the target reconstruction quality controlled by $\lambda$. Ignoring the reference quality adaptation, learned video codecs tend to suffer from continuous degradation of reconstruction quality. Therefore, to make the codec adapt to reference quality, we propose a reference quality adaptation (RQA) module. This module learns spatially variant filters from reference frames. These filters are then applied to the intermediate features of transform networks to conduct spatially variant filtering. Given different references, the module can learn different filters to help transform networks adapt to their qualities. In addition, to make the RQA module “see” a wide range of reference qualities during training, we further propose a repeat-long training strategy that combines repeated compression and long-sequence cascaded training. With the training strategy, a wider range of reference qualities can be generated.

Our contributions are summarized as follows:

  • We propose a confidence-based prediction quality adaptation (PQA) module to adapt to the spatial and channel-wise prediction quality difference of temporal contexts.

  • We propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to adapt to different qualities of reference, thus reducing the reconstruction error propagation.

  • With our proposed methods, our codec can outperform traditional video codecs and previous state-of-the-art learned video codecs in both RGB and YUV420 colorspaces.

The remainder of this paper is organized as follows. Section II gives a review of related work. Section III gives an overview of the learned video compression framework based on our proposed method. Section IV describes our proposed methods in detail. Section V presents the experimental results and ablation studies. Section VI concludes this paper.

II Related Work

Existing learned video compression schemes can be roughly divided into the following five classes.

Habibian et al. [7] pioneered volume coding-based schemes. They used 3D convolution to capture the temporal correlation between multiple frames. These frames form a 3D volume and are compressed into a compact 3D latent code with a 3D auto-encoder. To reduce the high computing costs caused by traditional 3D convolution, Sun et al. [8] proposed a frame-based 3D convolution for efficient multi-frame fusion. Currently, the compression ratio of this kind of scheme is only comparable with x265, the industrial software of H.265/HEVC, with the very fast preset.

Liu et al. [9] proposed the first temporal entropy modeling-based scheme. They first transformed each frame into the latent space independently. When estimating the probability distribution parameters of the current latent code, they built a temporal entropy model and used the latent codes of previous frames as priors to reduce temporal redundancy. Based on this work, Mentzer et al. [10] proposed to split the latent codes into a sequence of tokens. Then, they proposed a Transformer-based temporal entropy model that uses the transmitted tokens to predict the current token. Without motion-compensated prediction, the coding complexity of this kind of scheme is smaller. Currently, the compression performance of this kind of scheme is higher than x265 with the very slow preset.

Chen et al. [11] proposed NeRV, the first implicit neural representation-based learned video compression scheme. For video encoding, they fed the frame indexes into a neural network and fitted the network to regress video frames. Then, pruning, quantization, and entropy coding were performed on the parameters of the neural network. The decoding process is a simple neural network feedforward operation. To improve the video representation ability, Kwan et al. [14] further proposed to split each video frame into patches and feed the indexes of patches in different frames into neural networks. Chen et al. [12] proposed to replace indexes with content-adaptive embeddings learned from video frames. The embeddings and parameters are both transmitted to the decoder, where feeding the transmitted embeddings into the networks yields the reconstructed frames. Based on this work, Zhao et al. [13] further proposed to learn a content embedding from the current frame and another difference embedding from the difference between the current frame and adjacent frames. The two learned embeddings are used to reconstruct videos. Because all video frames need to be fitted, the encoding time of this kind of scheme is long. However, the decoding time is short since the transmitted decoder networks are lightweight. Currently, the best compression performance [48] of this kind of scheme is higher than x265 with the very slow preset.

Lu et al. [19] proposed DVC, the pioneering residual coding-based scheme. They followed the traditional hybrid video coding framework and used neural networks to implement most coding modules, such as motion estimation, motion compression, motion compensation, residue compression, and entropy models. Based on DVC, a series of works emerged [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 49, 50]. Most of them focused on improving the temporal prediction efficiency. Lin et al. [20] utilized multiple reference frames and associated multiple motion vectors to generate more accurate temporal prediction of the current frame and motion vector prediction. Agustsson et al. [23] replaced the commonly used space flow with a scale-space flow, which adds another scale dimension to handle disocclusions and fast motion. Hu et al. [17] proposed a resolution-adaptive flow coding method to effectively compress motion flows with multi-resolution flow representations. Hu et al. [21] further proposed to perform coding operations in feature space. They used the learnable offsets of deformable convolution to represent motion and deformable convolution to perform motion compensation. Currently, this kind of scheme has outperformed HM, the reference software of H.265/HEVC.

Figure 1: Overview of the learned video codec based on our proposed prediction quality adaptation module and reference quality adaptation module.
Figure 2: Architecture of the contextual encoder-decoder and frame generator. The proposed prediction quality adaptation (PQA) module is used to provide explicit spatial and channel-wise discrimination for the prediction quality of temporal contexts. The proposed reference quality adaptation (RQA) module is used to dynamically adjust the contextual encoder-decoder to adapt to reference quality to reduce reconstruction error propagation.

Li et al. [41] proposed DCVC, the first temporal context mining-based scheme. They performed motion compensation on the feature learned from a reference frame to generate a single-scale temporal context. Rather than calculating the residue, they regarded the temporal context as a condition and concatenated it with the input frame. The contextual encoder and decoder can learn how to take advantage of the temporal context to reduce temporal redundancy sufficiently. Based on DCVC, Sheng et al. [37] proposed DCVC-TCM, which uses reference features instead of reference frames to propagate the temporal information. From the reference features, they designed a temporal context mining (TCM) module to predict multi-scale temporal contexts and proposed a temporal context re-filling (TCR) method to make full use of the predicted multi-scale temporal contexts. Inheriting the TCM and TCR in DCVC-TCM, most extended schemes [37, 41, 42, 46, 39, 38] focus on generating more accurate temporal contexts. Sheng et al. [39] proposed a spatial decomposition-based motion model and a long-term temporal fusion module to handle motion inconsistency and motion occlusion. Li et al. [46, 47] designed a hierarchical quality structure and a feature refreshing method to periodically generate high-quality temporal contexts. Currently, this kind of scheme has outperformed VTM, the reference software of H.266/VVC.
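The difference between residual coding and conditional coding described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: tensors are nested lists in channel x height x width layout, and the function names and toy values are hypothetical rather than taken from any of the cited codecs.

```python
def residual_input(frame, prediction):
    """Residual coding: the encoder only sees frame - prediction."""
    return [
        [[f - p for f, p in zip(f_row, p_row)] for f_row, p_row in zip(f_plane, p_plane)]
        for f_plane, p_plane in zip(frame, prediction)
    ]


def conditional_input(frame, context):
    """Conditional coding: frame and temporal context are stacked along channels."""
    return frame + context


frame = [[[5.0, 7.0]]]                   # hypothetical 1 x 1 x 2 frame
prediction = [[[4.0, 7.5]]]              # hypothetical pixel-domain prediction
context = [[[0.3, 0.1]], [[0.9, 0.2]]]   # hypothetical 2-channel temporal context

print(residual_input(frame, prediction))        # one channel of differences
print(len(conditional_input(frame, context)))   # channels seen by the encoder
```

In the residual case the subtraction is fixed in advance, whereas in the conditional case the encoder network receives both inputs and can learn how to use the context.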

III Overview

To adapt to temporal contexts with different qualities caused by prediction quality difference and reference quality difference, we propose a temporal context quality adaptation method on top of our baseline, DCVC-SDD [39]. We first give an overview of the framework based on our proposed methods.

III-1 Motion Estimation

To estimate the motion between video frames, we use the structure and detail decomposition-based motion estimation method proposed in our baseline [39]. The input frame $x_t$ and reference frame $\hat{x}_{t-1}$ are first decomposed into structure and detail components. Then, the motion vectors ($v_t^s$, $v_t^d$) between their structure and detail components are estimated respectively using SpyNet [51].

III-2 Motion Coding

After obtaining the motion vectors of structure and detail components, a motion encoder compresses $v_t^s$ and $v_t^d$ jointly into a compact latent representation $\hat{m}_t$. After receiving the transmitted latent representation $\hat{m}_t$, the motion decoder inversely transforms it back to the reconstructed motion vectors $\hat{v}_t^s$ and $\hat{v}_t^d$. Similar to [46, 39], a learnable motion quantization step and an inverse motion quantization step are embedded into the motion encoder and decoder respectively to support rate adjustment with a single model.
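The effect of a quantization step on rate adjustment can be illustrated with a minimal sketch. It assumes a single scalar step and plain rounding; the function names and the step values are hypothetical, and in the codec the steps are learned and embedded inside the motion encoder and decoder rather than applied as a separate stage.

```python
def quantize(latent, step):
    """Scale by the quantization step, then round to integer symbols."""
    return [round(v / step) for v in latent]


def dequantize(symbols, step):
    """Inverse quantization step, applied on the decoder side."""
    return [s * step for s in symbols]


latent = [0.9, -2.3, 4.6, 0.2]  # hypothetical motion latent values

# A larger step gives coarser symbols: fewer distinct values to code,
# but a larger reconstruction error.
for step in (0.5, 2.0):
    symbols = quantize(latent, step)
    recon = dequantize(symbols, step)
    error = max(abs(a - b) for a, b in zip(latent, recon))
    print(step, symbols, recon, round(error, 3))
```

Switching between steps changes the rate-distortion operating point without changing the network weights, which is the idea behind supporting several rates with one model.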

III-3 Temporal Context Mining

Following our baseline [39], we use a structure and detail decomposition-based temporal context mining module [37, 39, 42, 46] to perform feature-domain motion compensation on the reference feature $\hat{F}_{t-1}$ to generate short-term multi-scale temporal contexts $\bar{C}_t^0, \bar{C}_t^1, \bar{C}_t^2$. To handle motion occlusion, we fuse a long-term reference feature $\hat{H}_{t-1}$ with $\bar{C}_t^0, \bar{C}_t^1, \bar{C}_t^2$ to generate long short-term fused temporal contexts $C_t^0, C_t^1, C_t^2$.
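Feature-domain motion compensation can be pictured as warping a reference feature channel with per-pixel motion vectors. The sketch below is an assumption-laden illustration, not the module used in the codec: it handles a single channel, uses backward warping with bilinear sampling, clamps coordinates at the border, and the name `warp_channel` and the toy values are hypothetical.

```python
import math


def warp_channel(feature, flow):
    """Backward-warp one feature channel with per-pixel motion vectors.

    feature: H x W nested list; flow: H x W nested list of (dx, dy) offsets.
    Each output sample is read from (x + dx, y + dy) in the reference
    channel with bilinear interpolation; coordinates are clamped to the border.
    """
    height, width = len(feature), len(feature[0])

    def sample(x, y):
        x = min(max(x, 0.0), width - 1.0)
        y = min(max(y, 0.0), height - 1.0)
        x0, y0 = int(math.floor(x)), int(math.floor(y))
        x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
        ax, ay = x - x0, y - y0
        top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
        bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
        return (1 - ay) * top + ay * bottom

    return [
        [sample(x + flow[y][x][0], y + flow[y][x][1]) for x in range(width)]
        for y in range(height)
    ]


reference = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]   # hypothetical reference channel
shift_right = [[(1.0, 0.0)] * 3 for _ in range(3)]
half_pixel = [[(0.5, 0.0)] * 3 for _ in range(3)]

print(warp_channel(reference, shift_right))
print(warp_channel(reference, half_pixel)[0])
```

Where the motion vectors are inaccurate, the warped values differ from the true current-frame feature, which is one source of the spatially varying prediction quality discussed later.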

III-4 PQA and RQA-Enhanced Contextual Encoder-Decoder

The contextual encoder and decoder are based on an auto-encoder structure as shown in Fig. 2. The input frame $x_t$ is transformed into a compact latent representation $y_t$. After quantization, the latent representation $\hat{y}_t$ is signaled into the bitstream by the arithmetic encoder (AE). After receiving the bitstream at the decoder side, the arithmetic decoder (AD) performs entropy decoding on the bitstream to obtain $\hat{y}_t$. Then, the contextual decoder and frame generator inversely transform $\hat{y}_t$ back to the reconstructed frame $\hat{x}_t$. Before obtaining $\hat{x}_t$, we regard the input feature $\hat{F}_t$ of the last convolutional layer as the reference feature to help compress the next frame $x_{t+1}$. Similar to the motion encoder and decoder, a learnable contextual quantization step and an inverse contextual quantization step are embedded into the contextual encoder and decoder respectively for a variable rate.
Following the paradigm of temporal context mining-based schemes [46, 37, 39], the learned multi-scale temporal contexts $C_t^0, C_t^1, C_t^2$ are fed into the contextual encoder and decoder as conditions [41]. Considering that spatial and channel-wise prediction quality differences exist in temporal contexts, we propose a prediction quality adaptation (PQA) module to provide explicit discrimination for the quality of temporal contexts $C_t^0, C_t^1, C_t^2$. After discrimination, the refined temporal contexts $\hat{C}_t^0, \hat{C}_t^1, \hat{C}_t^2$ serve as new conditions. In addition to prediction quality, the contextual encoder and decoder also need to adapt to reference quality to reduce reconstruction error propagation.
Therefore, we propose a reference quality adaptation (RQA) module to dynamically adjust the contextual encoder and decoder to adapt to different reference qualities. More details about PQA and RQA will be described in Section IV-A and Section IV-B.

III-5 Frame Generator

After obtaining the feature decoded by the contextual decoder, we use a frame generator comprised of two U-Nets [39, 42, 46] to reconstruct the feature back to the pixel-domain reconstructed frame $\hat{x}_t$. Before obtaining $\hat{x}_t$, we regard the input feature $\hat{F}_t$ of the last convolutional layer as the reference feature for compressing the next frame.

III-6 Long-Term Reference Generator

To handle motion occlusion, we follow our baseline [39] and introduce a ConvLSTM-based long-term reference generator. After obtaining the reference feature $\hat{F}_t$ for the next frame $x_{t+1}$, we feed $\hat{F}_t$ into the long-term reference generator to generate a long-term reference feature $\hat{H}_t$, which will be used in the temporal context mining module to provide long-term temporal contexts.

III-7 Entropy Model

We use a hyperprior entropy model [52], a quadtree partition-based spatial entropy model [46], and a conditional temporal entropy model [37, 39, 42, 46] to jointly estimate the probability distribution of the motion vector latent representation $\hat{m}_t$ and the contextual latent representation $\hat{y}_t$, which are both assumed to follow the Laplace distribution.
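To make the role of the Laplace assumption concrete, the following sketch estimates how many bits one quantized latent symbol would cost. It assumes the common practice of integrating the density over a unit-width bin around the integer symbol; the mean and scale, which the entropy models would predict, are hypothetical constants here, and the function names are illustrative only.

```python
import math


def laplace_cdf(x, mean, scale):
    """Cumulative distribution function of a Laplace(mean, scale) variable."""
    z = (x - mean) / scale
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def estimated_bits(symbol, mean, scale):
    """Bits for an integer symbol: -log2 of the mass on [symbol-0.5, symbol+0.5]."""
    upper = laplace_cdf(symbol + 0.5, mean, scale)
    lower = laplace_cdf(symbol - 0.5, mean, scale)
    return -math.log2(max(upper - lower, 1e-9))  # clamp to avoid log(0)


# A symbol close to the predicted mean is cheap; a distant one is expensive.
print(estimated_bits(0, mean=0.0, scale=1.0))
print(estimated_bits(3, mean=0.0, scale=1.0))
# A sharper (smaller-scale) prediction makes a correct guess even cheaper.
print(estimated_bits(0, mean=0.0, scale=0.2))
```

The better the priors and temporal contexts predict the mean and scale, the fewer bits the arithmetic coder needs.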

Figure 3: Visualization of the spatial and channel-wise prediction quality difference of temporal contexts.

IV Methodology

IV-A Prediction Quality Adaptation

Temporal contexts are generated by feature-domain temporal prediction. Therefore, the quality of temporal prediction affects the temporal context quality. Since the complexities of video content and motion patterns are unevenly distributed in space, spatial-wise prediction quality difference exists in temporal contexts. For example, as shown in Fig. 3 (a), the prediction error of the horse's tail in the red rectangle is large since it has complex textures and motion, while that of the horse's body is smaller since its textures are simpler and its motion is mainly simple translation. Using a temporal context with large spatial prediction errors may increase coding costs. Therefore, it is necessary to give explicit discrimination to the spatial-wise temporal context quality difference to help the contextual encoder and decoder decide whether to use the temporal context in a certain region.

In addition to the spatial-wise prediction quality difference, channel-wise prediction quality difference also exists in temporal contexts. Commonly, temporal contexts have multiple feature channels, and each channel is compensated by different motion vectors [46]. Therefore, temporal information is unevenly distributed across context channels. Different context channels may have different prediction qualities and contain different temporal information. For example, as shown in Fig. 3 (b), the $9^{th}$ and $47^{th}$ channels of the temporal context have more temporal information and their temporal prediction qualities are higher, while the $17^{th}$ channel of the temporal context has less temporal information and its temporal prediction quality is lower. Therefore, it is also necessary to give explicit discrimination to the channel-wise temporal context quality difference to help the contextual encoder and decoder decide which channel of temporal information to use.

To give explicit discrimination to the spatial and channel-wise prediction quality difference of temporal contexts, we design a simple yet effective prediction quality adaptation module as illustrated in Fig. 4. Given a certain scale temporal context $C_t^i$, we first concatenate (concat) it with an intermediate feature $I_{t,e}^i$ of the contextual encoder or an intermediate feature $I_{t,d}^i$ of the contextual decoder with the same scale. For the largest temporal context $C_t^0$ in the encoder, the input frame $x_t$ serves as $I_{t,e}^0$. Then, we feed them into a convolutional layer to obtain a feature with the same spatial and channel dimension as $C_t^i$. The feature is passed into a Sigmoid function to obtain a group of confidence maps, one for each channel of the temporal context $C_t^i$.
Considering that the temporal contexts for reducing temporal redundancy and video reconstruction may be different, given a temporal context $C_t^i$, we calculate different confidence maps for the contextual encoder and decoder.

$$
\begin{aligned}
w_{t,e}^{i} &= \sigma\left(\mathcal{F}_{e}\left(\mathrm{concat}(C_{t}^{i}, I_{t,e}^{i})\right)\right), \\
w_{t,d}^{i} &= \sigma\left(\mathcal{F}_{d}\left(\mathrm{concat}(C_{t}^{i}, I_{t,d}^{i})\right)\right), \\
i &= 0, 1, 2,
\end{aligned}
\tag{1}
$$

where $w_{t,e}^{i}$ is the confidence map group for the contextual encoder and $w_{t,d}^{i}$ is the confidence map group for the contextual decoder. $i$ is the scale index of temporal contexts. $\mathcal{F}_{e}$ and $\mathcal{F}_{d}$ denote 2D convolution operations with $3\times 3$ filter size. $\sigma$ refers to the Sigmoid function. Each element of $w_{t,e}^{i}$ and $w_{t,d}^{i}$ is in the range of 0 to 1. The greater the value of an element, the more important the temporal context at the corresponding position is.
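The confidence map computation of Eq. (1) can be sketched as follows. This is a toy illustration under stated assumptions: tensors are nested lists in channel x height x width layout, the convolution uses zero padding, and the weights are hand-picked rather than trained. The names `conv3x3` and `confidence_maps` are hypothetical.

```python
import math


def conv3x3(x, weight, bias):
    """3x3 convolution with zero padding.

    x: C_in x H x W nested lists; weight: C_out x C_in x 3 x 3; bias: C_out.
    """
    height, width = len(x[0]), len(x[0][0])
    out = []
    for kernels, b in zip(weight, bias):
        plane = [[b] * width for _ in range(height)]
        for channel, kernel in zip(x, kernels):
            for y in range(height):
                for col in range(width):
                    for ky in range(3):
                        for kx in range(3):
                            sy, sx = y + ky - 1, col + kx - 1
                            if 0 <= sy < height and 0 <= sx < width:
                                plane[y][col] += kernel[ky][kx] * channel[sy][sx]
        out.append(plane)
    return out


def confidence_maps(context, intermediate, weight, bias):
    """Eq. (1): sigmoid(conv(concat(context, intermediate)))."""
    stacked = context + intermediate  # channel-wise concatenation
    logits = conv3x3(stacked, weight, bias)
    return [[[1.0 / (1.0 + math.exp(-v)) for v in row] for row in plane]
            for plane in logits]


# Hypothetical toy tensors: one context channel, one intermediate channel, 2 x 2.
context = [[[1.0, 0.0], [0.5, 2.0]]]
intermediate = [[[1.0, 1.0], [0.5, -2.0]]]
# Untrained, hand-picked weights: 1 output channel, 2 input channels, centre taps only.
weight = [[[[0, 0, 0], [0, 1.0, 0], [0, 0, 0]],
           [[0, 0, 0], [0, 1.0, 0], [0, 0, 0]]]]
bias = [0.0]

maps = confidence_maps(context, intermediate, weight, bias)
print(maps)
```

The number of output channels of the convolution is chosen to match the number of context channels, so every spatial and channel location of the context receives its own confidence value between 0 and 1.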

After obtaining the confidence map for each channel of $C_{t}^{i}$, we perform an element-wise product between each confidence map and the corresponding channel of the temporal context:

$$\begin{aligned}
\hat{C}_{t,e}^{i,m} &= w_{t,e}^{i,m} \odot C_{t}^{i,m}, \\
\hat{C}_{t,d}^{i,m} &= w_{t,d}^{i,m} \odot C_{t}^{i,m}, \\
m &= 0, \cdots, M-1,
\end{aligned} \tag{2}$$

where $\odot$ is the element-wise product operation, $m$ is the channel index of the temporal context $C_{t}^{i}$, and $M$ is the number of channels of $C_{t}^{i}$.
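The confidence weighting of Eqs. (1) and (2) can be illustrated with a minimal NumPy sketch. It is not our implementation: the function names (`conv3x3`, `prediction_quality_adaptation`), the tensor sizes, the zero padding, and the random convolution weights are all assumptions chosen for illustration. The same routine stands in for either the encoder branch or the decoder branch.

```python
import numpy as np

def conv3x3(x, weight, bias):
    """Naive 3x3 convolution with zero padding (padding mode is an assumption).
    x: (C_in, H, W); weight: (C_out, C_in, 3, 3); bias: (C_out,)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]  # input shifted to this kernel tap
            # Contract the input-channel axis for this kernel tap.
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], patch)
    return out + bias[:, None, None]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prediction_quality_adaptation(context, feature, weight, bias):
    """context: temporal context C_t^i, shape (M, H, W).
    feature: intermediate feature I_t^i, shape (K, H, W).
    Returns the weighted context and the confidence maps, both (M, H, W)."""
    stacked = np.concatenate([context, feature], axis=0)   # concat(C, I)
    confidence = sigmoid(conv3x3(stacked, weight, bias))   # Eq. (1), values in (0, 1)
    return confidence * context, confidence                # Eq. (2), element-wise product

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
M, K, H, W = 4, 3, 8, 8
context = rng.standard_normal((M, H, W))
feature = rng.standard_normal((K, H, W))
weight = rng.standard_normal((M, M + K, 3, 3)) * 0.1
bias = np.zeros(M)
weighted, confidence = prediction_quality_adaptation(context, feature, weight, bias)
```

Because every confidence value lies strictly between 0 and 1, the weighted context can only attenuate the original context element by element. This matches the intent of suppressing low-quality predictions relative to high-quality ones.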

Figure 4: Architecture of our proposed prediction quality adaptation module.

IV-B Reference Quality Adaptation

In addition to prediction quality, reference quality also needs to be adapted to reduce reconstruction error propagation. As shown in Fig. 5, the continuous degradation of reconstruction quality is relatively limited in traditional video codecs, such as VTM. Their transforms, such as the DCT, are lossless, so their reconstruction distortion comes only from quantization. It is therefore easier for them to reach a target reconstruction quality by adjusting the quantization step. In learned video codecs, however, the reconstruction distortion comes not only from quantization but also from the non-linear transform networks, which apply an implicit quantization to the input frame. The degree of this implicit quantization depends on the $\lambda$ in the rate-distortion loss function and on the quality of the reference used to predict the temporal contexts. In other words, the transform networks need to adapt to the reference quality to achieve the target reconstruction quality controlled by $\lambda$.

However, current learned video compression schemes struggle to achieve this goal, for two main reasons. First, their transform networks cannot adapt well to different reference qualities. Although some learned video compression schemes add a periodically varying weight before $\lambda$ in the loss function [39, 46] to adapt to different reference qualities, it remains difficult to adapt to diverse reference qualities with a limited number of weights (1.2, 0.5, 0.9) and still achieve the target quality. As shown in Fig. 5, our baseline already uses this kind of reference-quality-adaptive loss function, yet the quality of its reconstructed frames still gradually decreases as the reference quality decreases. Second, the transform networks only “see” a limited range of reference qualities during training, so they cannot adapt to unseen reference qualities during testing.

Figure 5: Illustration of the reconstruction quality difference across video frames of the reference software of H.266/VVC [3] and our baseline—DCVC-SDD [39].
Figure 6: Architecture of our proposed reference quality adaptation module.

To make the transform networks adapt well to reference qualities, we propose a reference quality adaptation module, as presented in Fig. 6. We first feed the reference frame $\hat{x}_{t-1}$ into a convolutional layer $\mathcal{F}_{RQA}$ to obtain spatially variant filters $W_{t}^{\theta} \in \mathbb{R}^{H \times W \times k \times k}$ for each position of an intermediate feature $I_{t}$ of the contextual encoder or decoder, where $H$ and $W$ are the height and width of the intermediate feature, and $k$ is the filter size:

$$W_{t}^{\theta} = \mathcal{F}_{RQA}\left(\hat{x}_{t-1}\right). \tag{3}$$

Then, each filter $W_{t}^{\theta}(i,j)$ is applied to a $k \times k$ window of the intermediate feature $I_{t}$ centered at position $(i,j)$ to perform spatially variant filtering:

$$O_{t}(i,j,c) = \sum_{u=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \sum_{v=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} W_{t}^{\theta}(i,j,u,v) \times I_{t}(i+u, j+v, c). \tag{4}$$
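The filtering in Eq. (4) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than our implementation: the function name `spatially_variant_filter`, the channel-last layout, an odd filter size $k$, and zero padding at the borders are all choices made here, since the handling of border positions is not specified above. In the codec the filters come from $\mathcal{F}_{RQA}$; here they are simply given as an array.

```python
import numpy as np

def spatially_variant_filter(feature, filters):
    """Apply a different k x k filter at every spatial position (Eq. 4).
    feature: I_t with shape (H, W, C).
    filters: W_t^theta with shape (H, W, k, k), k odd; shared across channels.
    Returns O_t with shape (H, W, C)."""
    h, w, _ = feature.shape
    k = filters.shape[2]
    r = k // 2
    # Zero padding at the borders is an assumption of this sketch.
    padded = np.pad(feature, ((r, r), (r, r), (0, 0)))
    out = np.zeros_like(feature, dtype=float)
    for u in range(-r, r + 1):
        for v in range(-r, r + 1):
            # shifted[i, j, c] equals I_t(i + u, j + v, c) for every (i, j).
            shifted = padded[r + u:r + u + h, r + v:r + v + w, :]
            # The tap W(i, j, u, v) is broadcast over the channel axis.
            out += filters[:, :, u + r, v + r, None] * shifted
    return out

# Hypothetical sizes, for illustration only.
H, W, C, k = 5, 6, 2, 3
rng = np.random.default_rng(0)
feature = rng.standard_normal((H, W, C))

# A filter bank whose only non-zero tap is the center leaves the feature unchanged.
identity = np.zeros((H, W, k, k))
identity[:, :, k // 2, k // 2] = 1.0
unchanged = spatially_variant_filter(feature, identity)

# Random per-position filters give a position-dependent response.
filters = rng.standard_normal((H, W, k, k))
filtered = spatially_variant_filter(feature, filters)
```

Looping over the $k \times k$ taps instead of over positions keeps the sketch vectorized: each iteration handles one offset $(u, v)$ for all positions and channels at once.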

To make the RQA module “see” a wide range of reference qualities during training, we add a repeat-long training step, which combines repeated compression with long-sequence cascaded training, to the commonly used multi-step training strategy. This step is introduced in Section IV-C.

TABLE I: Training strategy of our scheme for encoding RGB videos when distortion is measured by RGB PSNR.
Frames  Network  Loss             LR    Epoch
2       Inter    $L_t^{meD}$      1e-4  2
2       Recon    $L_t^{recD}$     1e-4  1
3       Recon    $L_t^{recD}$     1e-4  1
6       Recon    $L_t^{recD}$     1e-4  1
6       Inter    $L_t^{meD}$      1e-4  2
6       Inter    $L_t^{meRD}$     1e-4  6
6       Recon    $L_t^{recD}$     1e-4  2
6       Recon    $L_t^{recRD}$    1e-4  6
6       All      $L_t^{all}$      1e-4  4
6       All      $L_t^{all}$      5e-5  3
6       All      $L_t^{all}$      1e-5  3
6       All      $L_t^{all}$      5e-6  4
6       All      $L_T^{all}$      5e-5  2
6       All      $L_T^{all}$      5e-6  2
6       All      $L_T^{all}$      1e-6  1
19      All      $L_T^{all-r-l}$  1e-6  1

IV-C Training Strategy

We propose a step-by-step training strategy to train our learned video compression scheme. The training details for encoding RGB videos when the reconstruction distortion is measured by RGB PSNR are listed in Table I. According to the training loss function, the training steps can be classified into seven classes: $L_t^{meD}$, $L_t^{meRD}$, $L_t^{recD}$, $L_t^{recRD}$, $L_t^{all}$, $L_T^{all}$, and $L_T^{all-r-l}$.

  • $L_t^{meD}$ consists of $D_t^{me}$, which denotes the distortion between $x_t$ and its predicted frame $\tilde{x}_t$, measured by MSE. The predicted frame $\tilde{x}_t$ is generated by warping the reference frame $\hat{x}_{t-1}$ using the decoded motion vectors $\hat{v}_t$. Using $L_t^{meD}$ as the loss function makes the motion decoder generate accurate motion vectors:

    $$L_t^{meD} = w_t \cdot \lambda \cdot D_t^{me}. \tag{5}$$
  • $L_t^{meRD}$ consists of $D_t^{me}$ and $R_t^{me}$, where $R_t^{me}$ denotes the joint bit rate used for encoding the quantized motion latent representation $\hat{m}_t$ and its associated hyperprior. Using $L_t^{meRD}$ as the loss function achieves a trade-off between the accuracy and the consumed bit rate of the motion vectors:

    $$L_t^{meRD} = w_t \cdot \lambda \cdot D_t^{me} + R_t^{me}. \tag{6}$$
  • $L_t^{recD}$ consists of $D_t^{rec}$, which denotes the distortion between $x_t$ and its reconstructed frame $\hat{x}_t$, measured by MSE. Using $L_t^{recD}$ as the loss function makes the contextual decoder and frame generator generate a high-quality reconstructed frame:

    $$L_t^{recD} = w_t \cdot \lambda \cdot D_t^{rec}. \tag{7}$$
  • $L_t^{recRD}$ consists of $D_t^{rec}$ and $R_t^{rec}$, where $R_t^{rec}$ denotes the joint bit rate used for encoding the quantized contextual latent representation $\hat{y}_t$ of $x_t$ and its associated hyperprior. Using $L_t^{recRD}$ as the loss function achieves a trade-off between the quality of the reconstructed frame $\hat{x}_t$ and the consumed bit rate of the contextual latent representation $\hat{y}_t$:

    $$L_t^{recRD} = w_t \cdot \lambda \cdot D_t^{rec} + R_t^{rec}. \tag{8}$$
  • $L_t^{all}$ consists of $D_t^{rec}$ and $R_t^{all}$, where $R_t^{all}$ denotes the total bit rate used for encoding the quantized contextual latent representation $\hat{y}_t$, the quantized motion latent representation $\hat{m}_t$, and their hyperpriors. Using $L_t^{all}$ as the loss function achieves a trade-off between the quality of the reconstructed frame and the total bit rate of the coded frame:

    $$\begin{aligned}
    L_t^{all} &= w_t \cdot \lambda \cdot D_t^{rec} + R_t^{all} \\
    &= w_t \cdot \lambda \cdot D_t^{rec} + R_t^{me} + R_t^{rec}.
    \end{aligned} \tag{9}$$
  • $L_T^{all}$ is the average $L_t^{all}$ loss of $T$ consecutive frames. Averaging the loss over multiple frames achieves a cascaded fine-tuning that reduces error propagation [37, 39, 42, 46]:

    $$\begin{aligned}
    L_T^{all} &= \frac{1}{T} \sum_{t} L_t^{all} \\
    &= \frac{1}{T} \sum_{t} \left\{ w_t \cdot \lambda \cdot D_t^{rec} + R_t^{all} \right\} \\
    &= \frac{1}{T} \sum_{t} \left\{ w_t \cdot \lambda \cdot D_t^{rec} + R_t^{me} + R_t^{rec} \right\}.
    \end{aligned} \tag{10}$$
    Figure 7: Illustration of our proposed repeat-long training step that combines repeated compression and long-sequence cascaded training.
  • $L_T^{all-r-l}$ is the RQA-associated repeat-long training loss proposed in this paper to make the RQA module “see” more reference qualities during training, as described in Section IV-B. To generate reference frames with various qualities, we first compress the first P-frame repeatedly, a number of times randomly selected from 0 to $N$. Then, we regard its reconstructed frame as the reference of the second P-frame and calculate the cascaded loss $L_T^{all}$ of the remaining $T$ frames. A similar repeated compression strategy was proposed by the coding experts of [53] and has been accepted into AVS-EEM-v2.0, the reference model of the AVS end-to-end video coding standard under development (see footnote 1). However, we find that directly applying the repeated compression strategy brings only a small compression performance improvement in our model. This is because the strategy can only generate references with various qualities for the first P-frame and cannot affect the reference qualities of subsequent P-frames. Therefore, we combine this strategy with long-sequence training: we increase the number of frames $T$ ($T = 17$) used to calculate the cascaded loss $L_T^{all}$ so as to generate various reference qualities for subsequent P-frames. With the repeated compression and long-sequence training strategies, the RQA module can adapt to a wider range of reference qualities.

    Footnote 1: The AVS video coding group is exploring a lightweight end-to-end learned video coding standard. AVS-EEM is the corresponding reference model. Its training and testing codes can be accessed at https://gitlab.com/xhsheng/avs-eem once authorized.
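The control flow of the repeat-long step can be sketched as follows. This is only an illustration under explicit assumptions. `toy_p_frame_codec` is a hypothetical stand-in for the learned P-frame codec, and its rate is a crude proxy rather than real entropy coding. The values of `lam`, `weights`, and `max_repeats` are placeholders. We also assume here that each repetition re-codes the first P-frame against its own previous reconstruction; the description above does not spell out this detail.

```python
import random
import numpy as np

def toy_p_frame_codec(frame, reference):
    """Hypothetical stand-in for the learned P-frame codec.
    Returns (reconstruction, rate proxy)."""
    residual = np.round((frame - reference) * 4) / 4   # coarse residual quantization
    recon = reference + residual
    rate = float(np.abs(residual).mean())              # crude proxy, not a real bit rate
    return recon, rate

def repeat_long_loss(frames, i_frame_recon, lam, weights, max_repeats, rng):
    """frames[0] is the first P-frame; frames[1:] are the T frames entering the loss."""
    first = frames[0]
    reference, _ = toy_p_frame_codec(first, i_frame_recon)
    repeats = rng.randint(0, max_repeats)              # inclusive: 0 .. N
    for _ in range(repeats):
        # Assumed interpretation: code the same frame again from its own reconstruction.
        reference, _ = toy_p_frame_codec(first, reference)
    total = 0.0
    for t, frame in enumerate(frames[1:]):
        recon, rate = toy_p_frame_codec(frame, reference)
        distortion = float(np.mean((frame - recon) ** 2))
        total += weights[t % len(weights)] * lam * distortion + rate
        reference = recon                              # cascade: next frame uses this output
    return total / (len(frames) - 1), repeats

# Hypothetical setup: 1 first P-frame + T = 17 frames that enter the loss.
T = 17
np_rng = np.random.default_rng(0)
frames = [np_rng.random((4, 4)) for _ in range(T + 1)]
i_frame_recon = np_rng.random((4, 4))
loss, repeats = repeat_long_loss(
    frames, i_frame_recon, lam=256.0, weights=(1.0,), max_repeats=3, rng=random.Random(0)
)
```

Together with the I-frame, the first P-frame and the $T = 17$ loss frames account for the 19 frames listed for this step in Table I.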

TABLE II: Fine-tuning strategy of our scheme for encoding RGB videos when distortion is measured by MS-SSIM.
Frames  Network  Loss             LR    Epoch
6       All      $L_T^{all}$      5e-5  4
6       All      $L_T^{all}$      5e-6  2
6       All      $L_T^{all}$      1e-6  1
19      All      $L_T^{all-r-l}$  1e-6  1
TABLE III: Fine-tuning strategy of our scheme for encoding YUV420 videos when distortion is measured by YUV PSNR.
Frames  Network  Loss             LR    Epoch
6       All      $L_t^{all}$      1e-4  4
6       All      $L_t^{all}$      5e-5  4
6       All      $L_t^{all}$      5e-6  1
6       All      $L_T^{all}$      5e-5  2
6       All      $L_T^{all}$      5e-6  2
6       All      $L_T^{all}$      1e-6  1
19      All      $L_T^{all-r-l}$  1e-6  1
Figure 8: Rate-distortion curves of different schemes on the HEVC, UVG, and MCL-JCV RGB video datasets. The quality is measured by RGB PSNR.
TABLE IV: BD-rate (%) comparison in RGB colorspace measured with PSNR. The anchor is VTM.
          HEVC Class B  HEVC Class C  HEVC Class D  HEVC Class E  HEVC Class RGB  UVG     MCL-JCV
VTM       0.0           0.0           0.0           0.0           0.0             0.0     0.0
HM        39.0          37.6          34.7          48.6          44.0            36.4    41.9
CANF-VC   58.2          73.0          48.8          116.8         87.5            56.3    60.5
DCVC      115.7         150.8         106.4         257.5         118.6           129.5   103.9
DCVC-TCM  32.8          62.1          29.0          75.8          25.4            23.1    38.2
DCVC-HEM  -0.7          16.1          -7.1          20.9          -15.6           -17.2   -1.6
DCVC-DC   -13.9         -8.8          -27.7         -19.1         -27.9           -25.9   -14.4
DCVC-FM   -8.8          -5.0          -23.3         -20.8         -18.6           -20.5   -7.4
DCVC-SDD  -13.7         -2.3          -24.9         -8.4          -17.5           -19.7   -7.1
Ours      -20.3         -10.4         -29.2         -15.0         -25.9           -29.2   -16.7
Figure 9: Rate-distortion curves of different schemes on the HEVC, UVG, and MCL-JCV RGB video datasets. The quality is measured by MS-SSIM.
TABLE V: BD-rate (%) comparison measured with MS-SSIM. The anchor is VTM.
          HEVC Class B  HEVC Class C  HEVC Class D  HEVC Class E  HEVC Class RGB  UVG     MCL-JCV
VTM       0.0           0.0           0.0           0.0           0.0             0.0     0.0
HM        36.8          38.7          34.9          38.4          37.3            37.1    43.7
CANF-VC   25.5          17.7          1.5           114.9         52.9            33.1    11.7
DCVC      35.9          24.9          2.7           90.0          43.7            11.9    39.1
DCVC-TCM  -20.5         -21.7         -36.2         -20.5         -21.1           -6.0    -18.6
DCVC-HEM  -47.4         -43.3         -55.5         -52.4         -45.8           -32.7   -44.0
DCVC-DC   -53.0         -54.6         -63.4         -60.7         -54.4           -36.7   -49.1
DCVC-FM   -12.5         -18.0         -30.6         -32.6         -16.6           -7.3    -5.0
DCVC-SDD  -48.0         -49.6         -60.0         -51.5         -46.3           -34.2   -46.3
Ours      -55.6         -55.4         -64.1         -59.1         -54.3           -39.7   -51.0

We use the Lagrangian multiplier $\lambda$ to control the rate-distortion (R-D) trade-off. We also add a periodically varying weight $w_t$ for each P-frame before the Lagrangian multiplier $\lambda$ to implement hierarchical quality [39, 46]. The detailed settings of $\lambda$ and $w_t$ can be found in Section V-A3.
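As a worked illustration of how $\lambda$ and $w_t$ enter Eqs. (9) and (10), the following sketch evaluates $L_t^{all}$ and $L_T^{all}$ on made-up numbers. The function names, the distortion and rate values, $\lambda = 100$, and the two-value weight cycle are hypothetical placeholders, not the settings of Section V-A3.

```python
def frame_loss(distortion, rate_motion, rate_recon, lam, weight):
    """L_t^{all} = w_t * lambda * D_t^{rec} + R_t^{me} + R_t^{rec} (Eq. 9)."""
    return weight * lam * distortion + rate_motion + rate_recon

def cascaded_loss(distortions, rates_motion, rates_recon, lam, weights):
    """L_T^{all}: average of L_t^{all} over T consecutive frames (Eq. 10).
    `weights` is cycled periodically over the frame index t."""
    losses = [
        frame_loss(d, rm, rr, lam, weights[t % len(weights)])
        for t, (d, rm, rr) in enumerate(zip(distortions, rates_motion, rates_recon))
    ]
    return sum(losses) / len(losses)

# Made-up per-frame values for T = 3 frames (illustrative only).
distortions = [0.001, 0.002, 0.004]   # D_t^{rec}, e.g. MSE
rates_motion = [0.01, 0.02, 0.03]     # R_t^{me}
rates_recon = [0.1, 0.2, 0.3]         # R_t^{rec}
value = cascaded_loss(distortions, rates_motion, rates_recon, lam=100.0, weights=(1.0, 0.5))
```

With these numbers the three per-frame losses are 0.21, 0.32, and 0.73, so their average is 0.42. A smaller $w_t$ lowers the penalty on distortion for that frame, which is how the periodic weight yields hierarchical quality.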

When using $L_t^{meD}$ as the loss function, we only train the motion estimation module, motion encoder, and motion decoder (Inter). When using $L_t^{meRD}$ as the loss function, we add the motion entropy model into the training loop (Inter). When using $L_t^{recD}$ as the loss function, we only train the temporal context mining module, contextual encoder, contextual decoder, and frame generator (Rec). When using $L_t^{recRD}$ as the loss function, we add the contextual entropy model into the training loop (Rec). When using $L_t^{all}$ or $L_T^{all}$ as the loss function, we train all modules.
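
The stage-to-module mapping described above can be written down as a small lookup table. This is only a sketch: the module names are hypothetical labels for the components named in the text, and the "all" entry covers only those named components, not every module of the full codec.

```python
# Hypothetical labels for the components named in the text.
MOTION = ["motion_estimation", "motion_encoder", "motion_decoder"]
RECON = ["temporal_context_mining", "contextual_encoder",
         "contextual_decoder", "frame_generator"]
ALL = MOTION + ["motion_entropy_model"] + RECON + ["contextual_entropy_model"]

# Which components receive gradient updates under each loss function.
TRAINABLE = {
    "meD": MOTION,
    "meRD": MOTION + ["motion_entropy_model"],
    "recD": RECON,
    "recRD": RECON + ["contextual_entropy_model"],
    "all": ALL,
}


def trainable_flags(loss_name):
    """Return {module_name: bool}; True means the module is updated."""
    active = set(TRAINABLE[loss_name])
    return {name: name in active for name in ALL}
```

In a PyTorch implementation, flags like these would typically be applied by calling `requires_grad_(flag)` on each corresponding module before building the optimizer.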

When the quality of reconstructed RGB videos is measured by MS-SSIM, we take the RGB-MSE model as it stands before being trained with $L_T^{all}$-repeat-long as the pre-trained model. Then, we change the distortion metric from MSE to 1 − MS-SSIM and fine-tune the pre-trained model. The detailed fine-tuning strategy is listed in Table II.

When encoding YUV420 videos, we take the RGB-MSE model as it stands before being trained with $L_T^{all}$ in RGB colorspace as the pre-trained model. Then, we fine-tune the pre-trained model with the loss functions calculated in the YUV colorspace. The detailed fine-tuning strategy is listed in Table III.

V Experiments

V-A Experimental Setup

V-A1 Training Data

Following most existing learned video coding schemes, we use the 7-frame videos of the Vimeo-90k dataset [54] for short-sequence training. We also follow M-LVC [20] to generate sequences containing 19 frames from raw Vimeo videos for long-sequence training. We randomly crop the original videos into 256×256 patches.
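
The cropping step can be sketched as follows. This assumes that one crop window is shared by all frames of a sequence so that motion between frames stays consistent; the source does not spell this out. The function name and the dummy frame size are hypothetical.

```python
import random


def random_crop_sequence(frames, patch=256, rng=random):
    """Crop the same patch x patch window from every frame of a sequence.

    frames: list of frames, each a list of rows (H x W), all the same size.
    Sharing one window across the sequence is an assumption of this sketch.
    """
    height, width = len(frames[0]), len(frames[0][0])
    if height < patch or width < patch:
        raise ValueError("frame is smaller than the patch size")
    top = rng.randint(0, height - patch)    # randint is inclusive at both ends
    left = rng.randint(0, width - patch)
    return [[row[left:left + patch] for row in frame[top:top + patch]]
            for frame in frames]


# Dummy 7-frame sequence of single-channel frames, 256 rows by 448 columns.
sequence = [[[0] * 448 for _ in range(256)] for _ in range(7)]
patches = random_crop_sequence(sequence)
```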

Figure 10: Rate-distortion curves of different schemes on the HEVC, UVG, and MCL-JCV YUV420 video datasets. The quality is measured by YUV PSNR.

V-A2 Testing Data

We evaluate our proposed scheme on videos in both RGB and YUV420 formats. When testing YUV420 videos, we use the original YUV420-format UVG [55], MCL-JCV [56], and HEVC [57] datasets and feed them into the learned video codecs without any change. When testing RGB videos, we convert the UVG, MCL-JCV, and HEVC datasets from YUV420 format to RGB format using FFmpeg and then feed them into the learned video codecs. In addition, following [37, 46], we also test 6 RGB-format videos from the HEVC RGB dataset [58].

V-A3 Implementation Details

We implement our proposed methods on top of our baseline [39]. Following [39, 42, 46], we first set 4 basic $\lambda$ values (85, 170, 380, 840) to control the R-D trade-off. For each $\lambda$, four learnable quantization steps are embedded into the motion encoder, motion decoder, contextual encoder, and contextual decoder. During testing, we interpolate the quantization steps to support variable rates [39, 46]. For the hierarchical quality structure, we follow our baseline [39] and set the hierarchical weight $w_t$ to (0.5, 1.2, 0.5, 0.9) for 4 consecutive frames. The weight $w_t$ of the first P-frame is 1.2. We set the number of training frames to 6 or 19 so that the weight $w_t$ of the last training frame is 1.2. For different $w_t$, the parameters of the first convolutional layer of the temporal context mining module are not shared [39, 46].

When PSNR is evaluated in RGB colorspace, we feed the original Vimeo training frames in RGB format into our model to obtain reconstructed frames. Then we calculate the distortion (MSE) between input frames and reconstructed frames in RGB format to train our model. When PSNR is evaluated in YUV420 colorspace, we first convert the Vimeo training frames from RGB to YUV444 and feed the frames in YUV444 format into our model to obtain reconstructed frames. Then we convert the input frames and reconstructed frames from YUV444 to YUV420 and calculate their MSE in YUV420 format to train our model. Although the component weighting of the compound YUV PSNR used for evaluation is 6:1:1 [59], we find that setting the weights of the MSE of the Y, U, and V components to 4:1:1 during training achieves more stable results. We implement our model with PyTorch. The AdamW [60] optimizer is used and the batch size is set to 8.
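
As an illustration of the 4:1:1 training weighting, the sketch below combines per-plane MSEs of a YUV420 frame. The function names are hypothetical, and dividing by the sum of the weights is an assumption of this sketch; the source only gives the ratio.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat sample lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def weighted_yuv_mse(orig, recon, weights=(4, 1, 1)):
    """Combine the Y, U, V plane MSEs with the given weights.

    orig / recon: dicts with flat sample lists under keys "y", "u", "v".
    Normalizing by the weight sum is an assumption made for this sketch.
    """
    planes = ("y", "u", "v")
    total = sum(w * mse(orig[p], recon[p]) for w, p in zip(weights, planes))
    return total / sum(weights)


# Toy YUV420 block: 4 luma samples and 1 sample per chroma plane.
original = {"y": [0, 0, 0, 0], "u": [0], "v": [0]}
decoded = {"y": [1, 1, 1, 1], "u": [2], "v": [0]}
loss_d = weighted_yuv_mse(original, decoded)  # (4*1 + 1*4 + 1*0) / 6
```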

V-A4 Test Configurations

We focus on the low-delay coding scenario in this paper. Following [37, 39, 41, 42], we test 96 frames for each video sequence. When testing RGB videos, we compare with CANF-VC [43], DCVC [41], DCVC-TCM [37], DCVC-HEM [42], DCVC-DC [46], DCVC-FM [47], and our baseline, DCVC-SDD [39]. When testing YUV420 videos, we only compare with DCVC-DC [46] and DCVC-FM [47] since they are the only learned video codecs that have released models for the YUV420 colorspace. Since DCVC-FM is a variable-bitrate model whose bitrate range is much wider, we align its bitrate range with that of the other learned video codecs by adjusting its q_index_i and q_index_p.

We also compare with traditional video codecs, including HM-16.20 [61], the reference software of H.265/HEVC, and VTM-13.2 [62], the reference software of H.266/VVC. When testing RGB videos, we use the encoder_lowdelay_main_rext configuration for HM-16.20 and the encoder_lowdelay_vtm configuration for VTM-13.2. Following [37], we set the internal colorspace to YUV444 for better compression performance. When testing YUV420 videos, we use the encoder_lowdelay_main configuration for HM-16.20 and the encoder_lowdelay_vtm configuration for VTM-13.2.

TABLE VI: BD-rate (%) comparison in YUV colorspace measured with PSNR. The anchor is VTM.

| | HEVC Class B | HEVC Class C | HEVC Class D | HEVC Class E | UVG | MCL-JCV |
|---|---|---|---|---|---|---|
| VTM | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| HM | 41.0 | 36.1 | 31.9 | 44.6 | 37.7 | 43.2 |
| DCVC-DC | –11.6 | –13.1 | –28.8 | –18.1 | –17.2 | –11.0 |
| DCVC-FM | –16.6 | –17.7 | –33.4 | –29.9 | –25.0 | –15.6 |
| Ours | –14.0 | –13.7 | –29.8 | –13.2 | –18.3 | –9.8 |

Figure 11: Subjective quality comparison on the 7th frame of HEVC Class D RaceHorses sequence and the 5th frame of MCL-JCV videoSRC14 sequence.

V-A5 Evaluation Metrics

When testing RGB videos, we use RGB PSNR to measure the distortion between reconstructed frames and original frames. When testing YUV420 videos, following DCVC-DC [46], we use compound YUV PSNR as the distortion metric. The weight of the Y, U, and V components is 6:1:1 [59]. We use bits per pixel (bpp) to measure the average number of bits for encoding each pixel in each frame.
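
These metrics can be sketched in a few lines. The sketch assumes 8-bit content (peak value 255) and that the compound PSNR is the 6:1:1 weighted average of the per-component PSNR values, as defined in [59]. The function names are hypothetical.

```python
import math


def psnr(mse, peak=255.0):
    """PSNR in dB for a given mean squared error (8-bit peak by default)."""
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)


def compound_yuv_psnr(psnr_y, psnr_u, psnr_v):
    """Weighted average of the per-component PSNR with 6:1:1 weights."""
    return (6.0 * psnr_y + psnr_u + psnr_v) / 8.0


def bits_per_pixel(total_bits, width, height, num_frames):
    """Average bits per pixel per frame, counted at luma resolution."""
    return total_bits / (width * height * num_frames)


yuv_psnr = compound_yuv_psnr(36.0, 40.0, 40.0)      # (216 + 40 + 40) / 8
bpp = bits_per_pixel(9_953_280, 1920, 1080, 96)     # 96 frames of 1080p
```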

V-B Experimental Results

V-B1 Objective Comparison Results for RGB Colorspace

When testing RGB videos and using RGB PSNR to measure the distortion, we present the rate-distortion curves of different codecs on the HEVC, UVG, and MCL-JCV RGB video datasets in Fig. 8. From the curves, we find that our proposed scheme outperforms our baseline DCVC-SDD [39] by a large margin and even achieves better compression performance than VTM and the previous SOTA schemes DCVC-DC and DCVC-FM. We list the detailed BD-rate comparison in Table IV. The results show that our proposed scheme achieves an average 21.0% bitrate saving over VTM. When using our baseline DCVC-SDD [39] and DCVC-DC [46] as the anchor, our proposed scheme achieves an average 9.3% and 2.4% bitrate saving, respectively. When testing RGB videos and using MS-SSIM to measure the distortion, we present the rate-distortion curves in Fig. 9 and list the BD-rate comparison in Table V. The comparison shows that our proposed scheme achieves a larger improvement under this metric. When using VTM as the anchor, our proposed scheme achieves an average bitrate saving of about 54%. It also outperforms our baseline DCVC-SDD [39], DCVC-DC [46], and DCVC-FM [47].
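
The average savings quoted above can be reproduced from the "Ours" rows of Table IV and Table V. The sketch assumes a plain arithmetic mean over the seven test sets, which matches the reported figures; the variable and function names are hypothetical.

```python
# "Ours" rows copied from Table IV (PSNR) and Table V (MS-SSIM); anchor is VTM.
OURS_PSNR = [-20.3, -10.4, -29.2, -15.0, -25.9, -29.2, -16.7]
OURS_MSSSIM = [-55.6, -55.4, -64.1, -59.1, -54.3, -39.7, -51.0]


def mean_saving(bd_rates):
    """Average bitrate saving in percent (positive = fewer bits than anchor)."""
    return -sum(bd_rates) / len(bd_rates)


psnr_saving = round(mean_saving(OURS_PSNR), 1)      # 21.0
msssim_saving = round(mean_saving(OURS_MSSSIM), 1)  # 54.2
```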

V-B2 Objective Comparison Results for YUV420 Colorspace

When compressing YUV420 videos and using YUV PSNR to measure the distortion, we illustrate the rate-distortion curves in Fig. 10 and list the corresponding BD-rate comparison results in Table VI. The results show that our proposed scheme also achieves better compression performance than HM and VTM in terms of YUV PSNR. We also compare with DCVC-DC and DCVC-FM. The compression performance of our scheme is better than that of DCVC-DC on most test datasets (HEVC Class B, C, D, and UVG) but is worse than that of DCVC-FM. This may be because DCVC-FM has a much longer YUV fine-tuning cycle than our YUV model and a more complex training strategy. We will explore how to improve our YUV compression performance in future work.

TABLE VII: Effectiveness of Proposed Technologies.

| Model Index | PQA | RQA | Repeat | Long | BD-Rate (%) |
|---|---|---|---|---|---|
| $M_1$ | | | | | 0.0 |
| $M_2$ | ✓ | | | | –2.1 |
| $M_3$ | ✓ | ✓ | | | –4.6 |
| $M_4$ | ✓ | ✓ | ✓ | | –5.4 |
| $M_5$ | ✓ | ✓ | | ✓ | –7.5 |
| $M_6$ | ✓ | ✓ | ✓ | ✓ | –8.9 |

V-B3 Subjective Comparison Results

We illustrate the reconstructed frames of HEVC Class D RaceHorses sequence and MCL-JCV videoSRC14 sequence in Fig. 11. By comparing the reconstructed frames of VTM, DCVC-DC [46], DCVC-SDD [39], and our scheme, we can observe that our scheme can reconstruct clearer textures with similar bitrate cost. For example, the edge of the saddle of the horse in our reconstructed RaceHorses sequence is sharper. The hair, ear, and earring of the dancing woman in our reconstructed videoSRC14 sequence retain more details.

V-C Ablation Studies

V-C1 Effectiveness of Proposed Technologies

In this paper, based on our baseline [39], we propose to improve the temporal context quality adaptation ability for learned video compression. To achieve this goal, we propose a prediction quality adaptation module, a reference quality adaptation module, and a training strategy that combines repeated compression and long-sequence cascaded training. To verify the effectiveness of these proposed technologies, we conduct an ablation study on the HEVC dataset by progressively enabling these technologies, as presented in Table VII. For fairness, the only difference between our training strategy and that of our baseline [39] is the last newly proposed training step. Comparing Models $M_1$, $M_2$, and $M_3$, we find that enabling our proposed prediction quality adaptation (PQA) module brings a 2.1% BD-Rate reduction. Enabling the proposed reference quality adaptation (RQA) module brings an additional 2.5% performance improvement. To verify the effectiveness of our proposed new training strategies, based on Model $M_3$, we add the repeat-compressing strategy and the long-sequence training strategy separately. Comparing Models $M_3$ and $M_4$, applying the repeat-compressing strategy brings only a 0.8% performance gain. Comparing Models $M_3$ and $M_5$, applying the long-sequence training strategy brings a 2.9% performance gain. If we apply the two strategies simultaneously, comparing Models $M_3$ and $M_6$, a higher performance gain (4.3%) is achieved.
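
The incremental gains quoted in this ablation are differences between the BD-Rate values of Table VII, all measured against $M_1$. The short check below reproduces them; it is a sketch of the arithmetic only, with hypothetical names, and does not recompute BD-Rate with a different anchor.

```python
# BD-Rate (%) of each ablation model relative to M1, copied from Table VII.
BD_RATE = {"M1": 0.0, "M2": -2.1, "M3": -4.6, "M4": -5.4, "M5": -7.5, "M6": -8.9}


def gain(base, model):
    """Extra BD-Rate reduction (percentage points) of `model` over `base`."""
    return round(BD_RATE[base] - BD_RATE[model], 1)


gains = {
    "PQA (M1 -> M2)": gain("M1", "M2"),
    "RQA (M2 -> M3)": gain("M2", "M3"),
    "repeat (M3 -> M4)": gain("M3", "M4"),
    "long (M3 -> M5)": gain("M3", "M5"),
    "repeat + long (M3 -> M6)": gain("M3", "M6"),
}
```

By this arithmetic, the combined gain of 4.3 points is larger than the sum of the two individual gains (0.8 + 2.9 = 3.7), which is consistent with using the two strategies together.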

Figure 12: Visualization of temporal contexts of different channels and their corresponding confidence maps.
Figure 13: Reconstructed frame quality comparison between our proposed scheme and our baseline [39] under the same rate points.

V-C2 Analysis of Prediction Quality Adaptation Module

To explore why our proposed PQA module brings a performance improvement, we visualize the temporal contexts generated by the model with and without the PQA module. As shown in Fig. 12, we can observe that a spatial-wise prediction difference exists within the temporal context of a given channel. For example, the confidence map of the temporal context of the 14th channel has larger confidence values at the edges of objects, which indicates that the temporal context of this channel mainly provides prediction for high-frequency object edges. The confidence map of the temporal context of the 29th channel has larger confidence values everywhere except at the edges of objects, which indicates that this channel mainly provides prediction for low-frequency regions. Comparing different channels, we find that a channel-wise prediction difference also exists for the temporal contexts. For the channels with higher prediction quality, such as the 10th, 21st, 27th, and 48th channels, the values of their confidence maps are larger, which means they provide the main temporal information. If they cannot provide enough temporal information for some regions, other channels provide supplementary information. Most other channels have lower temporal prediction quality and provide only limited temporal information, resulting in smaller values in their confidence maps. The analysis verifies that our proposed PQA module can provide explicit spatial and channel-wise discrimination for temporal contexts.

V-C3 Analysis of Reference Quality Adaptation Module

To analyze why our proposed RQA module brings a performance gain, we compare the reconstructed frame qualities of our model with the RQA module against those of our baseline [39]. As illustrated in Fig. 13, we take sequences from the HEVC Class B, C, D, and E datasets as examples. Although our baseline uses the periodically varying loss function to reduce error propagation, its reconstructed frame quality still gradually decreases as the reference quality decreases. For example, the reconstructed frame quality of RaceHorses in the HEVC Class D dataset drops by about 2 dB. In contrast, given a low-quality reference, our scheme can still achieve the target reconstructed frame quality, and error propagation is effectively alleviated. The analysis verifies that the RQA module helps our codec adapt to different reference qualities.

TABLE VIII: Average encoding/decoding time for a 1080p frame (in seconds).

| Schemes | Enc Time | Dec Time |
|---|---|---|
| DCVC [41] | 14.96 s | 44.01 s |
| DCVC-TCM [37] | 0.81 s | 0.48 s |
| DCVC-HEM [42] | 0.75 s | 0.26 s |
| DCVC-DC [46] | 0.82 s | 0.64 s |
| DCVC-SDD [39] | 0.94 s | 0.74 s |
| Ours | 1.11 s | 0.85 s |

V-D Running Time and Model Complexity

Following the setting of previous learned video coding schemes [37, 39, 42, 46], we include the time for model inference, entropy modeling, entropy coding, and data transfer between CPU and GPU when calculating the encoding and decoding time. Table VIII lists the detailed encoding and decoding time of different learned video codecs for a 1920×1080 video frame. We run all these codecs on an NVIDIA 3090 GPU. The comparison results show that our proposed technologies increase the encoding time by only 0.17 s and the decoding time by only 0.11 s compared with our baseline [39]. In addition, the number of trainable parameters of our proposed scheme is 19.3M, which is only 0.5M more than our baseline.

VI Conclusion

In this paper, we first propose a confidence-based prediction quality adaptation module to adapt to different prediction qualities. With this module, our codec can learn spatial and channel-wise confidence maps to adaptively decide which spatial or channel location of predictions to use. Then, we further propose a reference quality adaptation module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With this module, our codec can achieve the target reconstruction quality according to different reference qualities, thus reducing reconstruction error propagation. Experimental results show that our codec achieves better compression performance than VTM, the reference software of H.266/VVC, in both RGB and YUV420 colorspaces.

References

  • [1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
  • [2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
  • [3] B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, and J.-R. Ohm, “Overview of the versatile video coding (VVC) standard and its applications,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [4] W.-J. Chien, L. Zhang, M. Winken, X. Li, R.-L. Liao, H. Gao, C.-W. Hsu, H. Liu, and C.-C. Chen, “Motion vector coding and block merging in the versatile video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3848–3861, 2021.
  • [5] L. Li, H. Li, D. Liu, Z. Li, H. Yang, S. Lin, H. Chen, and F. Wu, “An efficient four-parameter affine motion model for video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 8, pp. 1934–1948, 2017.
  • [6] K. Zhang, Y.-W. Chen, L. Zhang, W.-J. Chien, and M. Karczewicz, “An improved framework of affine motion compensation in video coding,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1456–1469, 2018.
  • [7] A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen, “Video compression with rate-distortion autoencoders,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [8] W. Sun, C. Tang, W. Li, Z. Yuan, H. Yang, and Y. Liu, “High-quality single-model deep video compression with frame-conv3d and multi-frame differential modulation,” in European Conference on Computer Vision (ECCV), pp. 239–254, Springer, 2020.
  • [9] J. Liu, S. Wang, W.-C. Ma, M. Shah, R. Hu, P. Dhawan, and R. Urtasun, “Conditional entropy coding for efficient video compression,” in European Conference on Computer Vision (ECCV), pp. 453–468, Springer, 2020.
  • [10] F. Mentzer, G. Toderici, D. Minnen, S. Caelles, S. J. Hwang, M. Lucic, and E. Agustsson, “VCT: A video compression transformer,” in Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • [11] H. Chen, B. He, H. Wang, Y. Ren, S. N. Lim, and A. Shrivastava, “Nerv: Neural representations for videos,” Advances in Neural Information Processing Systems, vol. 34, pp. 21557–21568, 2021.
  • [12] H. Chen, M. Gwilliam, S.-N. Lim, and A. Shrivastava, “Hnerv: A hybrid neural representation for videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10270–10279, 2023.
  • [13] Q. Zhao, M. S. Asif, and Z. Ma, “Dnerv: Modeling inherent dynamics via difference neural representation for videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2031–2040, 2023.
  • [14] H. M. Kwan, G. Gao, F. Zhang, A. Gower, and D. Bull, “Hinerv: Video compression with hierarchical encoding-based neural representation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [15] H. Liu, H. Shen, L. Huang, M. Lu, T. Chen, and Z. Ma, “Learned video compression via joint spatial-temporal correlation exploration,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11580–11587, 2020.
  • [16] O. Rippel, A. G. Anderson, K. Tatwawadi, S. Nair, C. Lytle, and L. Bourdev, “ELF-VC: Efficient learned flexible-rate video coding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14479–14488, October 2021.
  • [17] Z. Hu, Z. Chen, D. Xu, G. Lu, W. Ouyang, and S. Gu, “Improving deep video compression by resolution-adaptive flow coding,” in European Conference on Computer Vision (ECCV), pp. 193–209, Springer, 2020.
  • [18] G. Lu, C. Cai, X. Zhang, L. Chen, W. Ouyang, D. Xu, and Z. Gao, “Content adaptive and error propagation aware deep video compression,” in European Conference on Computer Vision (ECCV), pp. 456–472, Springer, 2020.
  • [19] G. Lu, X. Zhang, W. Ouyang, L. Chen, Z. Gao, and D. Xu, “An end-to-end learning framework for video compression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [20] J. Lin, D. Liu, H. Li, and F. Wu, “M-LVC: multiple frames prediction for learned video compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3546–3554, 2020.
  • [21] Z. Hu, G. Lu, and D. Xu, “FVC: A new framework towards deep video compression in feature space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1502–1511, 2021.
  • [22] R. Yang, F. Mentzer, L. V. Gool, and R. Timofte, “Learning for video compression with hierarchical quality and recurrent enhancement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6628–6637, 2020.
  • [23] E. Agustsson, D. Minnen, N. Johnston, J. Balle, S. J. Hwang, and G. Toderici, “Scale-space flow for end-to-end optimized video compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8503–8512, 2020.
  • [24] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learning image and video compression through spatial-temporal energy compaction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10071–10080, 2019.
  • [25] O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, and L. Bourdev, “Learned video compression,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3454–3463, 2019.
  • [26] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers, “Neural inter-frame compression for video coding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6421–6429, 2019.
  • [27] R. Yang, F. Mentzer, L. Van Gool, and R. Timofte, “Learning for video compression with recurrent auto-encoder and recurrent probability model,” IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 2, pp. 388–401, 2021.
  • [28] C.-Y. Wu, N. Singhal, and P. Krahenbuhl, “Video compression through image interpolation,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 416–431, 2018.
  • [29] B. Liu, Y. Chen, S. Liu, and H.-S. Kim, “Deep learning in latent space for video prediction and compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 701–710, 2021.
  • [30] H. Liu, M. Lu, Z. Ma, F. Wang, Z. Xie, X. Cao, and Y. Wang, “Neural video coding using multiscale motion compensation and spatiotemporal context model,” IEEE Transactions on Circuits and Systems for Video Technology, 2020.
  • [31] M. A. Yılmaz and A. M. Tekalp, “End-to-end rate-distortion optimized learned hierarchical bi-directional video compression,” IEEE Transactions on Image Processing, vol. 31, pp. 974–983, 2021.
  • [32] Z. Chen, T. He, X. Jin, and F. Wu, “Learning for video compression,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 2, pp. 566–576, 2019.
  • [33] K. Lin, C. Jia, X. Zhang, S. Wang, S. Ma, and W. Gao, “DMVC: Decomposed motion modeling for learned video compression,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [34] Z. Guo, R. Feng, Z. Zhang, X. Jin, and Z. Chen, “Learning cross-scale weighted prediction for efficient neural video compression,” IEEE Transactions on Image Processing, 2023.
  • [35] H. Guo, S. Kwong, C. Jia, and S. Wang, “Enhanced motion compensation for deep video compression,” IEEE Signal Processing Letters, 2023.
  • [36] R. Yang, R. Timofte, and L. Van Gool, “Advancing learned video compression with in-loop frame prediction,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [37] X. Sheng, J. Li, B. Li, L. Li, D. Liu, and Y. Lu, “Temporal context mining for learned video compression,” IEEE Transactions on Multimedia, 2022.
  • [38] F. Wang, H. Ruan, F. Xiong, J. Yang, L. Li, and R. Wang, “Butterfly: Multiple reference frames feature propagation mechanism for neural video compression,” in 2023 Data Compression Conference (DCC), pp. 198–207, IEEE, 2023.
  • [39] X. Sheng, L. Li, D. Liu, and H. Li, “Spatial decomposition and temporal fusion based inter prediction for learned video compression,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • [40] X. Sheng, L. Li, D. Liu, and H. Li, “Vnvc: A versatile neural video coding framework for efficient human-machine vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [41] J. Li, B. Li, and Y. Lu, “Deep contextual video compression,” Advances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 18114–18125, 2021.
  • [42] J. Li, B. Li, and Y. Lu, “Hybrid spatial-temporal entropy modelling for neural video compression,” in Proceedings of the 30th ACM International Conference on Multimedia, pp. 1503–1511, 2022.
  • [43] Y.-H. Ho, C.-P. Chang, P.-Y. Chen, A. Gnutti, and W.-H. Peng, “Canf-vc: Conditional augmented normalizing flows for video compression,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVI, pp. 207–223, Springer, 2022.
  • [44] D. Jin, J. Lei, B. Peng, Z. Pan, L. Li, and N. Ling, “Learned video compression with efficient temporal context learning,” IEEE Transactions on Image Processing, 2023.
  • [45] R. Lin, M. Wang, P. Zhang, S. Wang, and S. Kwong, “Multiple hypotheses based motion compensation for learned video compression,” Neurocomputing, p. 126396, 2023.
  • [46] J. Li, B. Li, and Y. Lu, “Neural video compression with diverse contexts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22616–22626, 2023.
  • [47] J. Li, B. Li, and Y. Lu, “Neural video compression with feature modulation,” arXiv preprint arXiv:2402.17414, 2024.
  • [48] H. Kim, M. Bauer, L. Theis, J. R. Schwarz, and E. Dupont, “C3: High-performance and low-complexity neural compression from a single image or video,” arXiv preprint arXiv:2312.02753, 2023.
  • [49] Z. Hu, G. Lu, J. Guo, S. Liu, W. Jiang, and D. Xu, “Coarse-to-fine deep video coding with hyperprior-guided mode prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5921–5930, 2022.
  • [50] H. Liu, M. Lu, Z. Chen, X. Cao, Z. Ma, and Y. Wang, “End-to-end neural video coding using a compound spatiotemporal representation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 8, pp. 5650–5662, 2022.
  • [51] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4161–4170, 2017.
  • [52] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, OpenReview.net, 2018.
  • [53] Y. Shi, Y. Ge, J. Wang, and J. Mao, “Alphavc: High-performance and efficient learned video compression,” in European Conference on Computer Vision, pp. 616–631, Springer, 2022.
  • [54] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhancement with task-oriented flow,” International Journal of Computer Vision, vol. 127, no. 8, pp. 1106–1125, 2019.
  • [55] A. Mercat, M. Viitanen, and J. Vanne, “UVG dataset: 50/120fps 4k sequences for video codec analysis and development,” in Proceedings of the 11th ACM Multimedia Systems Conference, pp. 297–302, 2020.
  • [56] H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, “MCL-JCV: a JND-based H.264/AVC video quality assessment dataset,” in 2016 IEEE International Conference on Image Processing (ICIP), pp. 1509–1513, IEEE, 2016.
  • [57] F. Bossen, “Common hm test conditions and software reference configurations (JCTVC-l1100),” Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG, 2013.
  • [58] D. Flynn, D. Marpe, M. Naccari, T. Nguyen, C. Rosewarne, K. Sharman, J. Sole, and J. Xu, “Overview of the range extensions for the HEVC standard: Tools, profiles, and performance,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 1, pp. 4–19, 2015.
  • [59] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, “Comparison of the coding efficiency of video coding standards—including high efficiency video coding (hevc),” IEEE Transactions on circuits and systems for video technology, vol. 22, no. 12, pp. 1669–1684, 2012.
  • [60] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [61] “HM-16.20.” https://vcgit.hhi.fraunhofer.de/jvet/HM/. Accessed: 2022-07-05.
  • [62] “VTM-13.2.” https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/. Accessed: 2022-03-02.