A Study on the Effect of Color Spaces in Learned Image Compression

Abstract

In this work, we compare the color spaces YUV, LAB, and RGB and their effect on learned image compression. For this, we use the structure and color based learned image codec (SLIC) from our prior work, which consists of two branches: one for the luminance component (Y or L) and another for the chrominance components (UV or AB). For the RGB variant, however, all three channels are input to a single branch, similar to most learned image codecs operating in RGB. The models are trained for multiple bitrate configurations in each color space. We report the findings from our experiments by evaluating them on various datasets and compare the results to state-of-the-art image codecs. The YUV model performs better than the LAB variant in terms of MS-SSIM, with a Bjøntegaard delta bitrate (BD-BR) gain of 7.5% using VTM intra-coding mode as the baseline, whereas the LAB variant performs better than the YUV model in terms of CIEDE2000, with a BD-BR gain of 8%. Overall, the RGB variant of SLIC achieves the best performance, with a BD-BR gain of 13.14% in terms of MS-SSIM and a gain of 17.96% in CIEDE2000, at the cost of a higher model complexity.

Index Terms— Deep learning, learned image compression, image compression, variational autoencoder, color learning, color spaces

1 Introduction

Deep neural networks are now widely researched and applied across many fields, and they have been established in image processing and computer vision for many years. Recently, there has been growing interest in the development of learned image codecs. A learned image codec uses non-linear transforms realized by neural networks consisting of several layers, whereas traditional image codecs such as JPEG [1] use orthogonal linear transforms such as the discrete cosine transform (DCT). Learned image codecs are catching up with their traditional counterparts such as HEVC [2] and the state-of-the-art VVC [3] in the intra-coding mode.

Typically, learned image codecs are trained end-to-end in order to learn the encoder and decoder parameters jointly. The encoding process involves the transformation of an image into a latent representation, quantization and entropy coding. The quantized latent is extracted by the entropy decoder and reconstructed as an image by the non-linear decoder. This non-linear transform coding introduced in [4] forms the basis for most learning based image codecs. The rate-distortion optimization (RDO) for such a system can be written as:

$$\min_{\boldsymbol{\theta},\boldsymbol{\phi}}\{L\},\quad\text{with } L(\boldsymbol{\theta},\boldsymbol{\phi})=R(\boldsymbol{\theta})+\lambda\cdot D(\boldsymbol{\theta},\boldsymbol{\phi}), \tag{1}$$

where $L$ is the loss term, $R$ is the rate measured in bits per pixel (bpp), $D$ represents distortion and $\lambda$ is the Lagrangian multiplier. The learnable parameters of the encoding and decoding networks are indicated by $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ respectively.
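As a minimal sketch of how the objective in (1) can be evaluated, the following Python snippet estimates the rate from symbol likelihoods and adds the weighted distortion. The function names `rate_bpp` and `rd_loss` and the example numbers are assumptions made for illustration; they are not taken from the SLIC implementation.

```python
import math

def rate_bpp(likelihoods, num_pixels):
    """Estimated rate in bits per pixel from the entropy model's symbol likelihoods."""
    total_bits = sum(-math.log2(p) for p in likelihoods)
    return total_bits / num_pixels

def rd_loss(likelihoods, num_pixels, distortion, lam):
    """Rate-distortion objective L = R + lambda * D, following Eq. (1)."""
    return rate_bpp(likelihoods, num_pixels) + lam * distortion

# Hypothetical example: four latent symbols with probability 0.5 cost one bit each,
# spread over a 16-pixel image, with an illustrative distortion of 2.0.
loss = rd_loss([0.5, 0.5, 0.5, 0.5], num_pixels=16, distortion=2.0, lam=0.01)
```

In an actual training loop the likelihoods would come from the learned entropy model and the distortion from the chosen metric, so that gradients flow to both the encoder and decoder parameters.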

Most learned image codecs operate in the RGB color space. One reason for this is the availability of image datasets in this color space. However, properties of the human visual system are then not well exploited. Although the use of the YUV color space in image compression is not new, in this work we build on our prior works [5, 6] to shed some light on the effect of color spaces in learned image compression. The performance is compared with state-of-the-art image codecs by means of rate-distortion curves, Bjøntegaard delta bitrate [7], and distortion values. Although we consider only RGB, YUV, and LAB, to the best of our knowledge this is the first work in the domain of learned image compression focused on color spaces.

2 Related work

Numerous contributions addressing specific problems in learned image compression have been made. The base framework for the majority of learned codecs is the variational autoencoder. To improve entropy coding, a hyperprior model for forward adaptation is introduced in [8] and extended with backward adaptation in [9]. Cheng2020 [10] was one of the first works to achieve competitive performance, and ELIC, introduced in [11], outperforms the VVC intra-coding mode [3].

The models in [8, 9, 10, 11] are all trained and operate in RGB color space. There are some works such as [12, 13] that work in the YUV color space. Learned image coding is also being made practical with standardization activities in JPEG AI [14], which aims to address several image processing and computer vision tasks in addition to compression.

The effect of color space has been studied in general image classification [15] and robustness of deep learning [16] to name a few. In our prior work [5] and [6], we present our findings on designing a learning based codec that splits the task of image compression into two - structure from luminance channel and color from chrominance channels. It shows the advantage of optimizing networks with a color difference metric in addition to the other terms in the loss function.

3 Color spaces in learned image compression

Here we provide an overview of the SLIC model and its variants, the model architecture, workflow, loss function, and the implementation details including the training methodology.

Fig. 2: RD curves of learned image codecs operating in RGB for the Kodak dataset.

Fig. 3: RD curves of learned image codecs operating in YUV (JPEG AI and SLIC–YUV), SLIC–LAB, SLIC–RGB, and VTM for the Kodak dataset.

3.1 Overview

We build on the structure and color based learned image codec (SLIC) from our prior work [6], which is an image codec operating in YUV. In this paper, we introduce two new variants of the SLIC model called SLIC–LAB and SLIC–RGB. For clarity, the original SLIC model is henceforth referred to as SLIC–YUV. SLIC–LAB has exactly the same architecture as the SLIC–YUV model except for the operating color space. To have an equivalent RGB model with the same set of layers, we introduce SLIC–RGB, which has a single branch. The model architectures are shown in Fig. 1: Fig. 1(a) shows the SLIC–RGB model and Fig. 1(b) shows the block diagram for the YUV or LAB variant, wherein the input and output images are in RGB but the operating color space is either YUV or LAB. The internal details of the models are discussed in greater detail in our prior work [6].

In the RGB model, the single branch consists of 192 channels. In the YUV and LAB models, the luminance branch consists of 128 channels and the chrominance branch has 64 channels. Due to the larger number of channels in a single branch, SLIC–RGB has a higher complexity. The two-branch models have around 20 million parameters and need 1,512 kilo multiply-accumulate operations per pixel (kMACs/pixel), including encoder and decoder. The RGB variant has about 36 million parameters and needs 4,840 kMACs/pixel.

In the SLIC–RGB model in Fig. 1(a), the analysis transform $g_{a}$ translates the image $x$ into a latent space representation $y$. To learn the statistical distribution of the latent, this is further transformed into a hyperlatent representation $z$ through the hyper analysis transform $h_{a}$. A factorized prior model helps in encoding the quantized hyperlatents $\hat{z}$. The quantized latent $\hat{y}$ is efficiently encoded by backward adaptation through an autoregressive context model (CM) using masked convolution as proposed in [9]. The entropy parameter estimation module (EP) combines the output $\gamma$ of the hyper synthesis transform $h_{s}$ and $\tau$ from the context model to predict mean $\mu$ and variance $\sigma$. In other words, the probability distribution of the latent is estimated and used for entropy coding. The entropy decoded latent $\hat{y}$ is then transformed back to image space through the synthesis transform $g_{s}$ as image $\hat{x}$.
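The role of the predicted parameters in entropy coding can be illustrated with a short sketch. It assumes a Gaussian entropy model integrated over the unit-width quantization bin, in the spirit of [9], and treats $\sigma$ as a standard deviation; the helper names and the lower bound on the likelihood are assumptions for illustration only.

```python
import math

def _std_normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def symbol_likelihood(y_hat, mu, sigma):
    """Probability mass of the quantized value y_hat under N(mu, sigma^2),
    integrated over the unit-width quantization bin centered on y_hat."""
    upper = _std_normal_cdf((y_hat + 0.5 - mu) / sigma)
    lower = _std_normal_cdf((y_hat - 0.5 - mu) / sigma)
    # Assumed lower bound to avoid log(0) for very unlikely symbols.
    return max(upper - lower, 1e-9)

def symbol_bits(y_hat, mu, sigma):
    """Ideal code length in bits for one quantized latent value."""
    return -math.log2(symbol_likelihood(y_hat, mu, sigma))

# A value close to the predicted mean is cheap; a value far from it is expensive.
cheap = symbol_bits(0.0, mu=0.0, sigma=1.0)
costly = symbol_bits(3.0, mu=0.0, sigma=1.0)
```

The better the context model and hyperprior predict $\mu$ and $\sigma$ for each latent element, the fewer bits the entropy coder spends on it.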

For the YUV and LAB models shown in Fig. 1(b), the same workflow holds true. However, an input image is converted from RGB into the respective color space and split into luminance and chrominance components. These are then fed into their corresponding branches. The symbols with subscripts $\cdot_{L}$ and $\cdot_{C}$ indicate luminance and chrominance components respectively. The outputs of these networks are combined and converted back to an RGB image.
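The conversion and split described above can be sketched as follows. The text does not state which YUV definition is used, so the BT.601 weights below are an assumption, as are the function names; the LAB case would follow the same pattern with a different conversion.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (values in [0, 1]) to YUV using BT.601 weights (assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v

def yuv_to_rgb(y, u, v):
    """Inverse of rgb_to_yuv, used after the two branches are recombined."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b

def split_luma_chroma(pixels):
    """Split a list of RGB pixels into a luminance plane and a chrominance plane."""
    luma, chroma = [], []
    for r, g, b in pixels:
        y, u, v = rgb_to_yuv(r, g, b)
        luma.append(y)         # input to the luminance branch
        chroma.append((u, v))  # input to the chrominance branch
    return luma, chroma

luma, chroma = split_luma_chroma([(0.5, 0.5, 0.5), (0.2, 0.4, 0.6)])
```

For a gray pixel the chrominance components are zero, which is why a grayscale input carries essentially no information for the chrominance branch.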

3.2 Loss Function

RDO is the backbone of image codec optimization. In this work, we use an objective function that combines three distortion metrics and the rate term. The metrics used are mean squared error (MSE), multi-scale structural similarity index measure (MS-SSIM) [17], and the color difference metric CIEDE2000 ($\Delta E_{00}^{12}$) [18]. The loss function, as used in our prior works [5, 6], is:

$$\min_{\boldsymbol{\theta},\boldsymbol{\phi}}\{L\},\quad\text{with } L(\boldsymbol{\theta},\boldsymbol{\phi})=R+\lambda_{1}\cdot\mathrm{MSE}(x,\hat{x})+\lambda_{2}\cdot(1.0-\mathrm{MS\text{-}SSIM}(x,\hat{x}))+\lambda_{3}\cdot\Delta E_{00}^{12}(x,\hat{x}), \tag{2}$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the Lagrangian multipliers for the metrics MSE, MS-SSIM, and CIEDE2000 respectively. It should be noted that MSE and MS-SSIM are estimated in RGB color space, since the data used for training and evaluating the models are RGB images. All the SLIC variants are trained with this same loss function.

The rate term $R$ comprises the bits required to encode the image $x$. For the RGB model, it consists of the latent bits and the hyperlatent bits. In the YUV or LAB model, however, both the luma and chroma branches have latent and hyperlatent bits, contributing a total of four components.
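A minimal sketch of how the loss in (2) could be assembled for the two-branch models is given below. It assumes the rate components and the three metric values have already been computed elsewhere; the function names and all numeric values in the example are hypothetical.

```python
def total_rate(luma_latent, luma_hyper, chroma_latent, chroma_hyper):
    """Sum of the four rate components (in bpp) of a two-branch model."""
    return luma_latent + luma_hyper + chroma_latent + chroma_hyper

def slic_loss(rate, mse, ms_ssim, delta_e, lam1, lam2, lam3):
    """Loss of Eq. (2): rate plus weighted MSE, (1 - MS-SSIM), and CIEDE2000 terms."""
    return rate + lam1 * mse + lam2 * (1.0 - ms_ssim) + lam3 * delta_e

# Hypothetical values, chosen only to show how the terms combine.
rate = total_rate(0.30, 0.02, 0.10, 0.01)
loss = slic_loss(rate, mse=50.0, ms_ssim=0.98, delta_e=2.0,
                 lam1=0.01, lam2=2.4, lam3=0.24)
```

For the single-branch RGB model the same loss applies, with the rate reduced to the latent and hyperlatent components of that one branch.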

3.3 Implementation details

The models are implemented in the Python programming language using the PyTorch framework (https://pytorch.org) and the CompressAI library [19].

The model variants are trained individually for each operating color space, with the loss function in (2) and four operating points with Lagrangian values from our prior works [5, 6]: $\lambda_{1}=\{0.001,0.005,0.01,0.02\}$, $\lambda_{2}=\{0.01,0.12,2.4,4.8\}$, and $\lambda_{3}=\{0.024,0.12,0.24,0.48\}$ for MSE, MS-SSIM, and $\Delta E_{00}^{12}$ respectively. The models are trained for 120 epochs with the COCO2017 training dataset [20], comprising around 118,000 images. The validation data consists of about 5,000 randomly chosen images from the ImageNet dataset [21], spanning various classes. The Adam optimizer [22], initialized with a learning rate of 1e-4, is employed in tandem with a learning rate scheduler.
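The training setup above can be summarized as a small configuration sketch. It assumes that the i-th values of the three multiplier sets belong to the same operating point, which the text implies but does not state explicitly; the class and variable names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatingPoint:
    lam_mse: float       # lambda_1, weight of the MSE term
    lam_ms_ssim: float   # lambda_2, weight of the (1 - MS-SSIM) term
    lam_delta_e: float   # lambda_3, weight of the CIEDE2000 term

LAMBDA_MSE = (0.001, 0.005, 0.01, 0.02)
LAMBDA_MS_SSIM = (0.01, 0.12, 2.4, 4.8)
LAMBDA_DELTA_E = (0.024, 0.12, 0.24, 0.48)

# Assumption: values at the same index form one operating point.
OPERATING_POINTS = [
    OperatingPoint(*values)
    for values in zip(LAMBDA_MSE, LAMBDA_MS_SSIM, LAMBDA_DELTA_E)
]

COLOR_SPACES = ("RGB", "YUV", "LAB")
TRAINING = {"epochs": 120, "optimizer": "Adam", "learning_rate": 1e-4}

# One model is trained per color space and operating point.
runs = [(space, point) for space in COLOR_SPACES for point in OPERATING_POINTS]
```

Under this pairing, the three color spaces and four operating points give twelve separately trained models.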

4 Experiments and results

In this section, we report our findings from various experiments. We start with the rate-distortion performance, where we compare various codecs using the metrics PSNR, MS-SSIM, and CIEDE2000 for model configurations resulting in a bitrate range of 0 to 1 bpp. Following that, we compare the effect of the number of chroma channels in the SLIC–LAB and SLIC–YUV models. Finally, we illustrate the effect of color spaces on the latent channels through the channel impulse responses and visual comparison.

Table I: BD-Rate and BD-Distortion values of various image codecs with VTM-intra baseline for the Kodak dataset.

| Codec Name | PSNR: BD-BR (%) | PSNR: BD-PSNR (dB) | MS-SSIM: BD-BR (%) | MS-SSIM: BD-MS-SSIM | CIEDE2000: BD-BR (%) | CIEDE2000: BD-1/CIEDE2000 |
|---|---|---|---|---|---|---|
| Cheng2020 [10] | 3.40 | -0.1461 | -3.32 | 0.1333 | 20.82 | -0.0175 |
| ELIC MSE [11] | -7.07 | 0.3260 | – | – | – | – |
| ELIC MS-SSIM [11] | – | – | -12.87 | 0.5961 | – | – |
| JPEG AI [14] | 55.75 | -1.7562 | -20.16 | 0.9121 | 69.68 | -0.6138 |
| SLIC–RGB (Ours) | 12.60 | -0.5298 | -13.14 | 0.4772 | -17.96 | 0.0302 |
| SLIC–YUV (Ours) | 21.73 | -0.8274 | -7.50 | 0.2052 | -4.66 | 0.0080 |
| SLIC–LAB (Ours) | 22.65 | -0.8305 | -6.23 | 0.2342 | -7.99 | 0.0157 |

Bold indicates the best values and underline represents the second best.

Fig. 4: RD curves of chroma channel variants of SLIC–LAB and SLIC–YUV models for the Tecnick RGB dataset.

4.1 Rate-distortion performance

We measure the rate-distortion (RD) performance of various codecs and compare them with our models using the Kodak dataset, which consists of 24 images of resolution $512\times 768$ in either orientation. This experiment is split into two parts. First, we compare the RGB codecs and then the codecs operating in YUV. In both cases, we compute RGB PSNR and MS-SSIM with the original and reconstructed images for all bitrate configurations for each codec.

The PSNR for each RGB image is computed from the mean squared error averaged over all pixels and all three channels. Similarly, the MS-SSIM metric is calculated according to [17], using equal weights for the RGB channels.
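A minimal sketch of the PSNR computation described above follows. It assumes 8-bit images with a peak value of 255 and represents an image as a list of (R, G, B) tuples; the function name is hypothetical.

```python
import math

def psnr_rgb(original, reconstructed, peak=255.0):
    """PSNR in dB from the squared error averaged over all pixels and channels."""
    squared_error = 0.0
    count = 0
    for pixel_a, pixel_b in zip(original, reconstructed):
        for a, b in zip(pixel_a, pixel_b):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Every sample is off by one, so the MSE is 1 and the PSNR is 20*log10(255) dB.
original = [(10, 20, 30), (40, 50, 60)]
reconstructed = [(11, 21, 31), (41, 51, 61)]
value = psnr_rgb(original, reconstructed)
```

Averaging the squared error jointly over the three channels gives a single PSNR value per image, rather than one value per channel.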

The CIEDE2000 metric, indicated by $\Delta E_{00}^{12}$, requires a color conversion from RGB to LAB. Moreover, we report the metric as $5.0-\Delta E_{00}^{12}$ so that, like the other quality metrics, higher values indicate better quality; the offset $5.0$ is chosen based on the range of values.

4.1.1 RGB image codecs

A comparison of rate-distortion performance of SLIC–RGB is made with the Cheng2020 [10], ELIC [11], Hyper Prior [8], and Factorized Prior [4] models. The rate-distortion curves are reported in Fig. 2 for PSNR and MS-SSIM. We use the RD values of ELIC from CompressAI, where PSNR values are provided for MSE optimized ELIC and MS-SSIM values for MS-SSIM optimized models. They are indicated as ELIC–MSE and ELIC–MS-SSIM respectively.

In terms of PSNR, ELIC–MSE has the best performance and Cheng2020 is better than SLIC–RGB. When we observe the MS-SSIM curves, ELIC–MS-SSIM has the best performance. However, SLIC–RGB outperforms Cheng2020, and is comparable to ELIC at bitrates larger than 0.5 bpp.

4.1.2 YUV image codecs

We compare JPEG AI [14] and the intra-coding mode of the VVC test model (VTM) [3] with the SLIC variants. The RD curves are presented in Fig. 3 with PSNR, MS-SSIM, and $\Delta E_{00}^{12}$ metrics at various bitrates. For JPEG AI, the verification model vm-release-v4.5 is used in the default evaluation mode. VTM has the best PSNR performance. With MS-SSIM, JPEG AI outperforms all codecs under consideration for bitrates lower than 0.5 bpp. VTM, SLIC–LAB, and SLIC–YUV are close in terms of MS-SSIM, and SLIC–RGB has slightly higher values.

JPEG AI [14] has the worst performance in terms of CIEDE2000. VTM has a comparable performance to SLIC–RGB for bitrates lower than 0.2 bpp. Overall, the SLIC variants have superior performance, which suggests that including a color difference metric in the loss function can improve color fidelity.

4.1.3 BD-Rate and BD-Distortion

The Bjøntegaard delta bitrate [7] and distortion values are measured for the PSNR, MS-SSIM, and CIEDE2000 metrics and reported in Table I, with VTM as the baseline. A negative BD-BR value indicates a bitrate saving over VTM at equal quality and is referred to as a gain. In each column, the best value is indicated in bold and the second best value is underlined. The MSE optimized ELIC model has the largest gains in PSNR, in both bitrate and distortion, with values of $7.07\%$ and $0.326$ dB respectively. In the MS-SSIM column, SLIC–RGB has a BD-BR gain of $13.14\%$. However, JPEG AI has the highest gain, with $20.16\%$ and $0.9121$ dB in bitrate and distortion respectively.

Lastly, for the color difference metric CIEDE2000, only the SLIC variants achieve a BD-BR gain: $17.96\%$ for SLIC–RGB, $7.99\%$ for SLIC–LAB, and $4.66\%$ for SLIC–YUV. This is also reflected in the RD curves in Fig. 3. With the color difference metric, we also observe that Cheng2020 is better than JPEG AI.
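For readers unfamiliar with the metric, the following sketch illustrates a Bjøntegaard delta bitrate computation in the spirit of [7]. It assumes exactly four RD points per codec, so that the cubic fit of log-rate over quality reduces to an interpolating polynomial, and it is not the tool used to produce Table I. All names and RD values are hypothetical.

```python
import math

def _lagrange(xs, ys, x):
    """Evaluate the polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def _mean_log_rate(rates, quality, lo, hi):
    """Average of the fitted log10(rate) curve over the quality range [lo, hi]."""
    log_rates = [math.log10(r) for r in rates]
    mid = 0.5 * (lo + hi)
    # Simpson's rule integrates a cubic polynomial exactly.
    integral = (hi - lo) / 6.0 * (
        _lagrange(quality, log_rates, lo)
        + 4.0 * _lagrange(quality, log_rates, mid)
        + _lagrange(quality, log_rates, hi)
    )
    return integral / (hi - lo)

def bd_rate(rates_ref, quality_ref, rates_test, quality_test):
    """BD-BR in percent; negative values mean the test codec saves bitrate."""
    lo = max(min(quality_ref), min(quality_test))  # overlapping quality range
    hi = min(max(quality_ref), max(quality_test))
    diff = (_mean_log_rate(rates_test, quality_test, lo, hi)
            - _mean_log_rate(rates_ref, quality_ref, lo, hi))
    return (10.0 ** diff - 1.0) * 100.0

# Hypothetical RD points: the test codec reaches the same PSNR at 90% of the rate.
anchor_rates = [0.1, 0.25, 0.5, 1.0]
anchor_psnr = [28.0, 31.0, 34.0, 37.0]
test_rates = [0.9 * r for r in anchor_rates]
saving = bd_rate(anchor_rates, anchor_psnr, test_rates, anchor_psnr)  # close to -10.0
```

The sign convention matches Table I: a negative value is a bitrate saving relative to the baseline, which the text reports as a gain.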

4.2 Effect of channels in chroma branch

In this experiment, we reduce the number of channels in the chroma branches of the SLIC–YUV and SLIC–LAB models from 64 to 32, 16, and 8. This is done in order to understand the effect of colors through the chroma branch. It can be interpreted as the feature space equivalent of color sub-sampling in the image space. For each number of chroma channels, the models are trained for four bitrate configurations. With four chroma channel counts in each of the two color spaces, eight RD curves are obtained and shown in Fig. 4, in which we use JPEG AI as an anchor. We used 100 RGB images of dimensions $400\times 400$ from the Tecnick dataset [23].

Overall, the SLIC–YUV models perform better than SLIC–LAB. Quality increases with the number of chroma channels. The difference between the variants is larger in PSNR and MS-SSIM, whereas in CIEDE2000 both color spaces have a similar performance. JPEG AI outperforms all the variants in terms of MS-SSIM up to a bitrate of 0.5 bpp, after which the 32 and 64 channel SLIC–YUV models are better. However, in CIEDE2000, the YUV and LAB variants with 16 or more chroma channels outperform JPEG AI.

4.3 Channel impulse responses and color spaces

The channel impulse responses of a learned image codec provide an insight into the overall features captured by the analysis transform in the form of latent representations. In this experiment, we visualize the impulse responses of the SLIC variants. The impulse response computation is done in the same way as in our prior work [5]. The models with the highest bitrate configuration are chosen. Two images of dimensions $400\times 400$ from the Tecnick dataset [23] are considered here. The impulse responses are first arranged in decreasing order of importance, measured by their bitrate contribution. For the RGB model, the 48 most important channels are considered. For YUV and LAB, as there are two branches, the 32 highest channels from the luma branch and the 16 highest from the chroma branch are considered.
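The channel selection described above amounts to ranking latent channels by their bitrate contribution and keeping the first few. A minimal sketch, assuming the per-channel bit counts are already available as a list, is shown below; the function name and the numbers are hypothetical.

```python
def top_channels(bits_per_channel, k):
    """Indices of the k latent channels with the largest bitrate contribution."""
    order = sorted(
        range(len(bits_per_channel)),
        key=lambda channel: bits_per_channel[channel],
        reverse=True,
    )
    return order[:k]

# Hypothetical bit counts for a four-channel latent.
selected = top_channels([0.1, 2.0, 0.5, 1.5], k=2)
```

For the single-branch RGB model this would be called once with k = 48, and for the two-branch models once per branch with k = 32 for luma and k = 16 for chroma.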

The grayscale image GRAY_R03_0400x0400_014.png and its color counterpart RGB_R03_0400x0400_014.png are encoded and reconstructed with all the SLIC variants, and shown in Fig. 5 and Fig. 6. The impulse responses for each color space are shown below the reconstructed image. Each patch has dimensions $16\times 16$ in the impulse response and represents individual channels in the latent. The luminance impulse responses are shown in the first two rows and the chroma impulse responses are shown in the third row for the LAB and YUV models. The rate and distortion metrics are provided below each image for reference. Comparing the visual quality of reconstructed images, they are very similar, which is also clear from the rate and PSNR values. Irrespective of the model color space, the quality and bitrates are comparable for the example images.

Observing the impulse responses of the grayscale image across all color space variants, it is clear that, as one would expect, no color is captured. The main information contained is related to structure, which is represented in the first two rows of the SLIC–YUV and SLIC–LAB images in Fig. 5. Similar behaviour can be observed for SLIC–RGB, where the top 48 channels resemble structural filters. For the color image counterparts in Fig. 6, the structural responses behave similarly, but the last rows of SLIC–YUV and SLIC–LAB are now populated with colored regions. In the case of SLIC–RGB, the impulse responses are a mix of both color and structural components, since there is no explicit separation of luminance and chrominance components. For the image considered, a large part is the blue sky background, and this is reflected in the second most important, that is, second highest bitrate contributing, channel.

When we consider the topmost channel and use only this channel to perform the synthesis transform while setting the rest to zero, a low resolution version of the original image is obtained. This means that, irrespective of color space, the first channel is most often similar to a low pass filter. The successive channels capture other finer details and colors in the image. Using channel impulse responses, it can be inferred that the number of structural features captured by a learned image codec's non-linear transform is oftentimes higher than that of color features. By separating the luminance and chrominance channels with a color transform, we gain control over the constituents. When such a split is not made, deep neural networks learn them implicitly, but this results in a lack of optimization and control of the components.
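The single-channel reconstruction described above can be sketched as a masking step applied to the latent before the synthesis transform. The representation of the latent as a list of flattened channels and the function name are assumptions for illustration; the synthesis transform itself is not shown.

```python
def keep_channels(latent, keep):
    """Return a copy of the latent in which all channels except `keep` are zeroed."""
    keep = set(keep)
    return [
        list(channel) if index in keep else [0.0] * len(channel)
        for index, channel in enumerate(latent)
    ]

# Hypothetical three-channel latent, each channel flattened to two values.
latent = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
only_top = keep_channels(latent, keep=[1])
```

Passing the masked latent through the synthesis transform would then show what the retained channel alone contributes to the reconstruction.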

Hence, we can conclude that the features captured by the YUV, LAB, and RGB variants of the SLIC model have similarities. With an explicit separation of structure and color, the two can be independently optimized and tuned. For the RGB model, such granular control is not directly possible with the single branch structure. Nevertheless, training with the loss function in (2) improves performance for all variants.

5 Conclusion

In this paper, we report our findings on the effect of color space in learned image compression. Building on our prior work, we compare the rate-distortion performance of our SLIC model variants with other codecs. It is shown that the YUV and LAB models have similar performance, while the RGB model outperforms them at the cost of a higher complexity. The measurements also show that the RGB model has about $1.8$ times as many parameters (around 36 million versus 20 million) and requires $3.2$ times as many kMACs/pixel as the YUV and LAB variants. With the channel impulse responses, it is shown that the features captured by the color space variants of SLIC have similarities. However, the split model architecture has the benefit of reducing complexity. The experiments can be extended to other color spaces such as HSV and XYZ.

References

[1] G.K. Wallace, “The jpeg still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, 1992.
[2] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, “Overview of the high efficiency video coding (hevc) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[3] Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer Ohm, “Overview of the versatile video coding (vvc) standard and its applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3736–3764, 2021.
[4] Johannes Ballé, Valero Laparra, and Eero P Simoncelli, “End-to-end optimized image compression,” in Proc. of the International Conference on Learning Representations (ICLR), 2017.
[5] Srivatsa Prativadibhayankaram, Thomas Richter, Heiko Sparenberg, and Siegfried Foessel, “Color learning for image compression,” in Proc. of the IEEE International Conference on Image Processing (ICIP), 2023, pp. 2330–2334.
[6] Srivatsa Prativadibhayankaram, Mahadev Prasad Panda, Thomas Richter, Heiko Sparenberg, Siegfried Fößel, and André Kaup, “Slic: A learned image codec using structure and color,” in 2024 Data Compression Conference (DCC), 2024, pp. 3–12.
[7] Gisle Bjontegaard, “Calculation of average psnr differences between rd-curves,” ITU SG16 Doc. VCEG-M33, 2001.
[8] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston, “Variational image compression with a scale hyperprior,” in Proc. of the International Conference on Learning Representations (ICLR), 2018.
[9] David Minnen, Johannes Ballé, and George D Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems. 2018, vol. 31, Curran Associates, Inc.
[10] Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto, “Learned image compression with discretized gaussian mixture likelihoods and attention modules,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7939–7948.
[11] Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang, “Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 5718–5727.
[12] Anna Meyer and André Kaup, “A novel cross-component context model for end-to-end wavelet image coding,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), June 2023, pp. 1–5.
[13] Panqi Jia, Ahmet Burakhan Koyuncu, Georgii Gaikov, Alexander Karabutov, Elena Alshina, and André Kaup, “Learning-based conditional image coder using color separation,” in Proc. of the Picture Coding Symposium (PCS), Dec. 2022, pp. 49–53.
[14] João Ascenso, Elena Alshina, and Touradj Ebrahimi, “The jpeg ai standard: Providing efficient human and machine visual data consumption,” IEEE MultiMedia, vol. 30, no. 1, pp. 100–111, 2023.
[15] Shreyank N Gowda and Chun Yuan, “Colornet: Investigating the importance of color spaces for image classification,” in Proc. of the Asian Conference on Computer Vision (ACCV). Springer, 2019, pp. 581–596.
[16] Kanjar De and Marius Pedersen, “Impact of colour on robustness of deep neural networks,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 21–30.
[17] Zhou Wang, Eero P Simoncelli, and Alan C Bovik, “Multiscale structural similarity for image quality assessment,” in Proc. of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 2003, pp. 1398–1402, IEEE.
[18] Gaurav Sharma, Wencheng Wu, and Edul N Dalal, “The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations,” Color Research & Application, vol. 30, no. 1, pp. 21–30, 2005.
[19] Jean Bégaint, Fabien Racapé, Simon Feltman, and Akshay Pushparaja, “Compressai: a pytorch library and evaluation platform for end-to-end compression research,” arXiv preprint arXiv:2011.03029, 2020.
[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft coco: Common objects in context,” in Proc. of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
[21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[22] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2017.
[23] Nicola Asuni and Andrea Giachetti, “Testimages: A large data archive for display and algorithm testing,” Journal of Graphics Tools, vol. 17, no. 4, pp. 113–125, 2013.