Fig. 2. The proposed residual learning approach for the rate-distortion optimized quantization.
II. RATE-DISTORTION OPTIMIZED QUANTIZATION

A. Optimal Rate-Distortion Optimized Quantization

RDOQ finds the optimal quantized level [11] of each transform coefficient by minimizing the rate ($R$) and distortion ($D$) cost in (1) for a given TB:

$L^{*} = \arg\min_{L} J$, where $J = D + \lambda R$.  (1)
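To make the cost in (1) concrete, below is a minimal sketch of the per-coefficient candidate search; the squared-error distortion and the toy bit model are illustrative stand-ins, not HM's CABAC-based estimates.

```python
# Minimal sketch of the per-candidate cost in eq. (1) for a single transform
# coefficient; the distortion and rate models here are toys, not HM's.
import numpy as np

def rd_cost_pick(c, delta, lam, toy_bits):
    """Pick the level minimizing J = D + lam * R for one coefficient c."""
    l_sq = int(abs(c) // delta)                  # scalar-quantized level
    candidates = {l_sq, max(l_sq - 1, 0), 0}     # l_SQ, l_SQ - 1, and 0
    best_level, best_cost = 0, np.inf
    for level in candidates:
        dist = (abs(c) - level * delta) ** 2     # D: squared reconstruction error
        rate = toy_bits(level)                   # R: stand-in bit estimate
        cost = dist + lam * rate                 # J = D + lambda * R
        if cost < best_cost:
            best_level, best_cost = level, cost
    return best_level

toy_bits = lambda level: 1.0 + 2.0 * np.log2(1 + level)  # more bits for larger levels
print(rd_cost_pick(c=13.4, delta=8.0, lam=20.0, toy_bits=toy_bits))  # -> 1
```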
RDOQ has practical implementation issues due to (i) the huge number of candidates to search to achieve optimality, and the excessive amount of computation required to evaluate the (ii) rate and (iii) distortion of those candidates.
B. Rate-Distortion Optimized Quantization in HEVC

A practical compromise in RDOQ [11] can be made by limiting the number of candidates [16] and/or simplifying the rate calculation via lookup tables [16]. The CABAC process in HEVC splits the transform coefficients of a TB (which can be of size 4×4, 8×8, 16×16, or 32×32), denoted by $C$, into one or more coefficient groups (CGs) of size 4×4, and processes them in the five internal steps shown in Fig. 1.
Firstly, scalar quantization (SQ) is executed as $L_{SQ} = \lfloor |C| / \Delta \rfloor$, where $\Delta$ denotes a quantization step size and $\lfloor \cdot \rfloor$ represents the floor operator. Secondly, the level estimation (LE) process selects the best quantized levels of a given CG from a candidate list consisting of $l_{SQ}$ and $l_{SQ} - 1$; value 0 is further considered as a third candidate if $l_{SQ} = 2$, and the LE process is bypassed if $l_{SQ} = 0$. Thirdly, the All Zero (AZ) coding group detection process decides whether to set all levels in the given CG to zero based on the rate-distortion cost. Fourthly, the last non-zero coefficient (LAST) process detects the best location for the last non-zero level. Finally, the Sign Bit Hiding (SBH) process is used to hide a sign bit for the given CG.
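These five steps can be summarized by the following skeleton, a sketch in which only SQ is concretized; the LE, AZ, LAST, and SBH steps are left as placeholders since their actual HM implementations are RD-cost driven.

```python
# Skeleton of the five internal RDOQ steps over a TB; only SQ is concretized,
# the remaining steps are placeholders for HM's RD-cost-driven logic.
import numpy as np

def scalar_quantize(C, delta):
    """Step 1 (SQ): L_SQ = floor(|C| / delta), element-wise over the TB."""
    return np.floor(np.abs(C) / delta).astype(np.int32)

def split_into_cgs(L):
    """Split an NxN TB of levels into 4x4 coefficient groups (CGs)."""
    n = L.shape[0]
    return [L[r:r + 4, c:c + 4] for r in range(0, n, 4) for c in range(0, n, 4)]

C = np.random.randn(8, 8) * 40.0          # toy DCT coefficients of an 8x8 TB
L_sq = scalar_quantize(C, delta=8.0)      # step 1: SQ
for cg in split_into_cgs(L_sq):
    pass  # steps 2-3 (LE, AZ) run per CG; step 4 (LAST) per TB; step 5 (SBH) per CG
print(L_sq)
```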
III. DEEP LEARNING-BASED RDOQ (DL-RDOQ)

A. Simplified RDOQ

This section addresses the problem of deep learning-based RDOQ, namely DL-RDOQ, which predicts the optimal quantized levels of a whole TB without estimating rate and distortion. For practical reasons, we investigate DL-RDOQ as a supervised learning problem, with possible inputs and outputs given in Table 1. In addition, noting that residual learning has proven performance benefits [10], we implement DL-RDOQ on HM following the residual learning scheme, that is, predicting the residual of a given TB, $Re \triangleq L_{SQ} - L_{RDOQ}$, as shown in Fig. 2.

To find out which RDOQ process is suitable for DL-RDOQ, we analyze the values in $Re$ at various stages. As shown in Fig. 3, except after SBH, the residual signal assumes only the three values {0, 1, 2}, with a very small probability of the residual value being equal to 2, $\Pr(re = 2)$, where $re$ is an element in the matrix $Re$. After SBH, $Re$ can additionally take the value -1. Moreover, SBH is related to the signs of the coefficients in a CG, so a single misprediction in SBH will cause a level $l$ to become $-l$, which significantly increases distortion. We therefore use the output after the LAST process as the prediction target and set the value 2 to 1 for simplification. Now, an element $re$ in the simplified $Re$ only assumes values of 0 or 1.
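The residual computation and the value-2 simplification described above amount to the following sketch (toy values):

```python
# Sketch of the residual Re = L_SQ - L_RDOQ and its simplification to a
# binary map; the arrays are toy values.
import numpy as np

def simplify_residual(L_sq, L_rdoq_after_last):
    """After LAST, elements of Re lie in {0, 1, 2}; clip the rare 2 to 1."""
    re = L_sq - L_rdoq_after_last
    return np.minimum(re, 1)   # set value 2 to 1, so re is in {0, 1}

L_sq   = np.array([[3, 1], [2, 0]])
L_rdoq = np.array([[1, 1], [1, 0]])        # toy RDOQ levels after LAST
print(simplify_residual(L_sq, L_rdoq))     # -> [[1 0] [1 0]]
```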
B. Proposed Deep Learning-Based RDOQ (DL-RDOQ)

As deep learning has drawn significant interest in the video coding community, this work studies DL-RDOQ. However, as the first research on DL-RDOQ, it is challenging to find a suitable network for RDOQ, because DL is often applied to signals in the spatial domain, as in image enhancement, while RDOQ deals with DCT-transformed signals, which have less correlation and different characteristics.

1) Deep Convolution Neural Network

Convolutional neural networks (CNNs) are a class of deep learning methods that show high performance in many recognition tasks. CNNs are well known for their low complexity, translation invariance, and weight-sharing characteristics.
Fig. 3. Distribution (%) of the residual values in $Re$ (vertical axis) at the various RDOQ stages in Fig. 1 for the Kimono sequence, Intra, TB sizes of 8×8 and 32×32. This analysis shows that we can simplify the residual values by removing the case of value 2 in $Re$, as it rarely occurs.
Fig. 4. Average L1 prediction error from the RDOQ output in eq. (2) (vertical axis) over iterations (horizontal axis, ×1000) by DL-RDOQ (implemented using FCN_VGG [14]) and scalar quantization (SQ), for the intra-coded 1st frames of Kimono, BQTerrace, and BasketballDrive with QPs 22~37. The percent numbers indicate the error reduction ratio in (3).
The reason for using a CNN to predict RDOQ is that local correlation exists in optimal quantization: rate estimation is based on context modeling for each level, depending on the previous levels, its frequency location, and its quantized value. In addition, RDOQ is processed in CG units of size 4×4. Therefore, a CNN filter could learn to predict the optimal levels of RDOQ by exploiting this local/context characteristic. This work fully follows the fully convolutional network FCN_VGG [14], which only uses convolution and ReLU activation layers, to validate the effectiveness of DL. DL-RDOQ can be parallelized on a GPU thanks to the nature of CNN frameworks such as Caffe [15]. The training input/output pair is selected as $L_{SQ}$ and $Re = L_{SQ} - L_{RDOQ}$.
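For illustration, a minimal conv+ReLU-only fully convolutional predictor is sketched below in PyTorch; the actual network follows FCN_VGG [14] under Caffe [15], so the depth and width here are assumptions rather than the authors' exact architecture.

```python
# Minimal fully convolutional stand-in for a conv+ReLU-only residual
# predictor; layer count and width are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualPredictor(nn.Module):
    """Predicts per-element logits for the binary residual Re of a TB."""
    def __init__(self, width=64, depth=5):
        super().__init__()
        layers = [nn.Conv2d(1, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(width, 1, 3, padding=1)]   # 1-channel logit map
        self.net = nn.Sequential(*layers)

    def forward(self, l_sq):        # l_sq: (batch, 1, N, N) SQ levels
        return self.net(l_sq)       # logits; sigmoid > 0.5 gives Re

model = ResidualPredictor()
logits = model(torch.randn(2, 1, 8, 8))   # two toy 8x8 TBs
print(logits.shape)                       # torch.Size([2, 1, 8, 8])
```

Being fully convolutional, one such network handles any TB size in principle, though the paper trains a separate network per TB size.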
2) Dataset Collection

To enable DL-RDOQ, a training dataset is collected from HM 16.15 under the Random Access (RA) coding configuration with QP 22, 27, 32, 37. We collect the values of scalar quantization and of RDOQ after each internal stage (LE, AZ, LAST, and SBH), together with information on the current TB (i.e., prediction mode, scan mode, CU size, TB size). The dataset is then grouped according to TB size (4×4, 8×8, 16×16, 32×32). The dataset of each size is further divided into training, testing, and validation sets with a proportion of 90:5:5.
3) Loss Function

Since the residual output is a 2D matrix having only 0 and 1 as element values, we model the problem similarly to semantic segmentation with only two labels, and employ the logistic loss function. In fact, the network structure of FCN_VGG remains identical to that for semantic segmentation [14]. It should be noted that the proposed network only mimics RDOQ and does not utilize any rate information.
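Interpreted as per-element two-label classification, the logistic loss corresponds to a binary cross-entropy over the residual map; below is a minimal PyTorch sketch, an assumed equivalent of the Caffe loss layer used here.

```python
# Two-label logistic loss over the residual map, as per-element binary
# cross-entropy; assumed equivalent of the Caffe loss layer.
import torch

loss_fn = torch.nn.BCEWithLogitsLoss()               # per-element logistic loss
logits  = torch.randn(2, 1, 8, 8)                    # network output, two 8x8 TBs
target  = torch.randint(0, 2, (2, 1, 8, 8)).float()  # ground-truth Re in {0, 1}
print(loss_fn(logits, target).item())
```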
4) Training

Four networks corresponding to the different TB sizes (4×4, 8×8, 16×16, and 32×32) are implemented under the Caffe framework [15], each trained with the dataset of the corresponding TB size. We set the learning rate to 0.0001, the momentum to 0.99, and the mini-batch size to 512, and train for a total of 80,000 iterations.

IV. EXPERIMENTAL RESULTS

A. Prediction Performance

To evaluate the prediction performance with respect to the RDOQ output, the average prediction error is computed as:

$E = \frac{1}{m} \sum_{i=1}^{m} \| \widehat{Re}_i - Re_i \|_1$,  (2)

where $\widehat{Re}_i$ denotes the predicted residual of a given TB, $Re_i$ denotes the residual of the RDOQ output, and $m$ is the total number of TBs used for evaluation. Since the residual contains only 0 or 1, the L1 error is equivalent to the average number of differing levels between DL-RDOQ and RDOQ.

The prediction results over iterations are shown in Fig. 4. The fact that the error is reduced much more by DL-RDOQ than by SQ over iterations clearly shows that DL can predict the RDOQ output successfully. It is noteworthy that the degree of reduction differs across TB sizes, which hints that designing a different network for each TB size is a necessary further investigation for future work.

To further evaluate the difference across TB sizes, we normalize the error by the TB size and name it the error reduction ratio. It represents the ratio of prediction errors between DL-RDOQ and SQ at the last iteration, and is computed as:

error reduction ratio $= \frac{E_{SQ} - E_{DL}}{E_{SQ}} \times 100$.  (3)

We observe that DL-RDOQ reduces the L1 error by 50%~63% on average compared to SQ and thus has the better error reduction ratio. We also observe that variations in the residual characteristic affect the prediction results: the smaller the probability of 1 (i.e., large QPs, large TB sizes), the poorer the prediction performance of DL-RDOQ.
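Below is a small sketch of the two metrics, assuming eq. (2) averages the per-TB L1 distances and eq. (3) measures the relative reduction of DL-RDOQ's error with respect to SQ's.

```python
# Sketch of the evaluation metrics in eqs. (2) and (3); assumes (2) averages
# per-TB L1 distances and (3) is the relative error reduction versus SQ.
import numpy as np

def avg_l1_error(pred_residuals, rdoq_residuals):
    """Eq. (2): average L1 distance between predicted and RDOQ residuals."""
    return np.mean([np.abs(p - r).sum()
                    for p, r in zip(pred_residuals, rdoq_residuals)])

def error_reduction_ratio(e_dl, e_sq):
    """Eq. (3): percentage of SQ's error removed by DL-RDOQ."""
    return (e_sq - e_dl) / e_sq * 100.0

# Toy example with two 4x4 TBs and one mispredicted level.
rdoq = [np.random.randint(0, 2, (4, 4)) for _ in range(2)]
pred = [r.copy() for r in rdoq]
pred[0][0, 0] ^= 1                                     # flip one level
print(avg_l1_error(pred, rdoq))                        # -> 0.5
print(error_reduction_ratio(e_dl=0.5, e_sq=1.25))      # -> 60.0
```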
Fig. 6. Coding performance of DL-RDOQ implemented on HM 16.15 for Kimono (1920x1080, 50fps) at All Intra (left) and Random-Access (right).
Fig. 5. The proposed residual DL-RDOQ implemented on HM 16.15 (block diagram: per-TB-size DL networks for 4×4, 8×8, 16×16, and 32×32; sign information is fed to SBH*). *SBH considers distortion only (similarly in SQDZ).

B. Coding Performance

To test DL-RDOQ in an HEVC encoding scenario, we implement the trained networks on top of the reference software HM 16.15 through the Caffe C++ interface [15]. The four trained FCN_VGG networks are used to generate the output when scalar quantized data is given as input. The DL-RDOQ prediction process does not consider sign information. The TB size is provided to choose the residual DL-RDOQ network corresponding to the given size. The predicted residual signal is subtracted from the input to obtain the RDOQ prediction after LAST. Sign information is then used in SBH to deliver the final RDOQ output, as shown in Fig. 5.
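The inference path just described can be summarized with the sketch below; the function names and the dummy predictor are placeholders for the actual HM/Caffe C++ hooks, which the paper does not list.

```python
# Sketch of the encoding-time integration: the predicted binary residual is
# subtracted from the SQ levels to approximate the RDOQ output after LAST,
# then the original signs are reapplied before SBH. Names are placeholders.
import numpy as np

def dl_rdoq_levels(C, delta, predict_residual):
    """Approximate signed RDOQ levels for a TB of transform coefficients C."""
    L_sq = np.floor(np.abs(C) / delta).astype(np.int32)   # unsigned SQ levels
    Re_hat = predict_residual(L_sq)                       # binary residual {0, 1}
    L_pred = np.maximum(L_sq - Re_hat, 0)                 # RDOQ prediction after LAST
    return L_pred * np.sign(C).astype(np.int32)           # reapply signs for SBH

# Dummy predictor standing in for the per-TB-size CNN.
dummy_net = lambda L_sq: (L_sq == 1).astype(np.int32)
C = np.array([[33.0, -9.5], [7.9, -18.2]])
print(dl_rdoq_levels(C, delta=8.0, predict_residual=dummy_net))
```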
We compare the coding performance of DL-RDOQ with RDOQ-On and RDOQ-Off on HM 16.15 [16]. SBH is on in both testing cases. For SBH, both rate and distortion are computed in the RDOQ-On case, but only distortion is computed in the DL-RDOQ and RDOQ-Off cases, because the rate is estimated in RDOQ-On but not in RDOQ-Off nor in DL-RDOQ. The rate-distortion curves for all intra (AI) and random access (RA) are shown in Fig. 6, using 10 frames of the sequence Kimono (1920x1080, 50fps) for AI and 64 frames for RA under the common test conditions [17].
DL-RDOQ performs better than RDOQ-Off (or SQDZ) while fairly approximating the performance of RDOQ-On, especially at high bit-rates. This is because, at high bit-rates, there are more residual values of 1, which leads to a smaller prediction error for DL-RDOQ. As RDOQ-Off (SQDZ) is better than SQ, DL-RDOQ shows better performance than its initial input, SQ. On the other hand, this work only utilizes deep learning in DL-RDOQ as a black-box solution, so its performance can be boosted further by fine tuning, such as adding more layers or customizing the network structure to better utilize knowledge about RDOQ.

V. CONCLUSION

This paper proposed a DL-based method to predict RDOQ without rate-distortion estimation via a residual CNN. The proposed method demonstrated that, despite using DL simply as a black box, we were able to predict RDOQ quite well and produced much better performance than RDOQ-Off.

REFERENCES

[1] G. Sullivan et al., "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. Circ. Syst. Video Tech., vol. 22, no. 12, pp. 1649-1668, 2012.
[2] M. Xu et al., "Reducing complexity of HEVC: A deep learning approach," IEEE Trans. Image Process., vol. 27, no. 10, 2018.
[3] L. Thorsten and O. Jorn, "Deep learning based intra prediction mode decision for HEVC," Proc. IEEE Picture Coding Symposium, 2016.
[4] J. Li et al., "Fully connected network-based intra prediction for image coding," IEEE Trans. Image Process., vol. 27, no. 7, 2018.
[5] R. Lin et al., "Deep CNN for Decompressed Video Enhancement," Proc. IEEE Data Compression Conference, 2016.
[6] R. Yang, M. Xu, Z. Wang, and T. Li, "Multi-Frame Quality Enhancement for Compressed Video," arXiv:1803.04680, 2018.
[7] A. Oord et al., "Conditional image generation with PixelCNN decoders," Inter. Conf. Neural Info. Process. Sys., pp. 4797-4805, 2016.
[8] F. Jiang et al., "An end-to-end compression framework based on convolutional neural networks," IEEE Trans. Circ. Syst. Video Tech., 2017.
[9] C. Dong et al., "Image super-resolution using deep convolutional networks," arXiv:1501.00092, Jul. 2015.
[10] K. Zhang et al., "Beyond a Gaussian Denoiser: Residual learning of deep CNN for image denoising," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142-3155, 2017.
[11] M. Karczewicz et al., "Rate Distortion Optimized Quantization," document ITU-T SG16 Q.6, VCEG-AH21, 2008.
[12] H. Lee et al., "Fast quantization method with simplified rate-distortion optimized quantization for an HEVC encoder," IEEE Trans. Circ. Syst. Video Tech., vol. 26, no. 1, pp. 106-116, 2016.
[13] M. Xu et al., "Simplified rate-distortion optimized quantization for HEVC," Proc. IEEE Inter. Sym. Broad. Mul. Sys. Broadcast., 2018.
[14] E. Shelhamer et al., "Fully convolutional networks for semantic segmentation," IEEE Trans. Patt. Anal. Mach. Intell., vol. 39, no. 4, pp. 640-651, 2016.
[15] Y. Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding," ACM Inter. Conf. Multimedia, pp. 674-678, 2014.
[16] High Efficiency Video Coding Test Software 16.15, available at https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.15.
[17] F. Bossen, "Common HM Test Conditions and Software Reference Configurations," Joint Collaborative Team on Video Coding, JCTVC-L1100.
[18] X. Zhang et al., "Optimizing the Hierarchical Prediction and Coding in HEVC for Surveillance and Conference Videos With Background Modeling," IEEE Trans. Image Process., vol. 23, no. 10, pp. 4511-4526, 2014.