
Towards Extreme Image Compression with Latent Feature Guidance and Diffusion Prior

Zhiyuan Li, Yanhui Zhou, Hao Wei, Chenyang Ge, Jingwen Jiang

This work was supported by the National Natural Science Foundation of China (62376208, 62088102HZ) and the Natural Science Foundation of Shaanxi Province (No. 2021JZ-04). (Corresponding author: Chenyang Ge.) The authors are with the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an 710049, China (e-mail: lizhiyuan2839@163.com; zhouyh@mail.xjtu.edu.cn; haowei@stu.xjtu.edu.cn; cyge@mail.xjtu.edu.cn; jiangjingwen@stu.xjtu.edu.cn).

Abstract

Image compression at extremely low bitrates (below 0.1 bits per pixel (bpp)) is a significant challenge due to substantial information loss. In this work, we propose a novel two-stage extreme image compression framework that exploits the powerful generative capability of pre-trained diffusion models to achieve realistic image reconstruction at extremely low bitrates. In the first stage, we treat the latent representation of images in the diffusion space as guidance, employing a VAE-based compression approach to compress images and initially decode the compressed information into content variables. The second stage leverages pre-trained stable diffusion to reconstruct images under the guidance of content variables. Specifically, we introduce a small control module to inject content information while keeping the stable diffusion model fixed to maintain its generative capability. Furthermore, we design a space alignment loss to force the content variables to align with the diffusion space and provide the necessary constraints for optimization. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in terms of visual performance at extremely low bitrates.

Index Terms:
Image compression, diffusion models, content variables, extremely low bitrates.

I Introduction

With the explosive growth of image data on the Internet, how to achieve efficient data transmission and storage has become increasingly important. Lossy image compression is a popular solution for saving storage and transmission bandwidth by exploiting the spatial and perceptual redundancy of images. Traditional compression standards, such as JPEG2000 [1], BPG [2], and VVC [3], are widely used in practice. However, these algorithms produce severe blocking artifacts at extremely low bitrates due to their block-based processing, see Fig. 1(b).

Learning-based image compression has attracted significant interest and shows great potential to outperform traditional codecs. Based on their optimization objectives, learning-based methods can be roughly categorized into distortion-oriented [4, 5, 6, 7] and perception-oriented [8, 9, 10, 11] methods. Distortion-oriented methods are optimized for the rate-distortion function, which often leads to unrealistic reconstructions at low bitrates, typically manifested as blurring. Perception-oriented methods, on the other hand, aim to optimize the rate-distortion-perception function, leveraging techniques such as adversarial training [12] to improve perceptual quality. While these methods achieve significant improvements in visual quality, they often introduce unpleasant visual artifacts, especially at extremely low bitrates, as shown in Fig. 1(c).

Recently, diffusion models have exhibited impressive generation ability in image and video generation [13, 14, 15], encouraging researchers to develop various diffusion-based perception-driven compression methods [16, 17, 18, 19]. For extreme image compression, some works leverage pre-trained text-to-image diffusion models as prior knowledge to achieve realistic reconstructions at extremely low bitrates. For instance, Pan et al. [20] encode images as textual embeddings with extremely low bitrates, using pre-trained text-to-image diffusion models for realistic reconstruction. Lei et al. [21] directly transmit short text prompts and compressed image sketches, employing the pre-trained ControlNet [14] to produce reconstructions with high perceptual quality and semantic fidelity. However, these methods treat pre-trained text-to-image diffusion models as independent components, which limits their ability to fully exploit the generative capabilities of pre-trained diffusion models, resulting in reconstruction results that are inconsistent with the original image (see Fig. 1(d)). Therefore, how to develop an effective diffusion-based extreme generative compression method is worth further exploration.

(a) Original (b) VVC, 0.0205 bpp (c) MS-ILLM, 0.0447 bpp (d) Text+Sketch, 0.0281 bpp (e) DiffEIC (Ours), 0.0201 bpp

Figure 1: Visual examples of the reconstructed results on the Kodak [22] dataset. The proposed DiffEIC produces much better results in terms of perception and fidelity. For example, the small attic is well reconstructed.

In this work, we develop an end-to-end Diffusion-based Extreme Image Compression (DiffEIC) model that effectively combines compressive variational autoencoders (VAEs) [23] with a fixed stable diffusion model. First, to effectively convey information, we develop a VAE-based latent feature-guided compression module (LFGCM) that can adaptively select information essential for reconstruction, rather than using explicit information, such as text prompts and sketches in [21], to represent images. Specifically, this module employs a VAE-based compression method to compress images and initially decode the compressed information into content variables. To effectively utilize the knowledge encapsulated in the fixed stable diffusion model, these content variables are expected to align with the diffusion space. However, learning to map images to the diffusion space from scratch is challenging. To address this issue, in the latent feature-guided compression module, we introduce the latent representation of images in the diffusion space as external guidance to correct intermediate features and content variables. Second, we introduce a conditional diffusion decoding module (CDDM) to reconstruct images with the guidance of content variables. This module employs the well-trained stable diffusion as a fixed decoder and injects external condition information via a trainable control module. Leveraging the powerful generative capability of stable diffusion, the proposed DiffEIC can produce realistic reconstructions even at extremely low bitrates. Furthermore, to optimize the model in an end-to-end manner, we design a space alignment loss to force content variables to align with the diffusion space and provide necessary constraints for optimization. With the help of these mentioned components, the proposed DiffEIC achieves favorable results compared to state-of-the-art approaches, as demonstrated in Fig. 1(e).

In summary, the main contributions of this work are as follows:

1) To the best of our knowledge, we propose the first extreme image compression framework that combines compressive VAEs with pre-trained text-to-image diffusion models in an end-to-end manner.

2) We develop a latent feature-guided compression module to adaptively select information essential for reconstruction. By introducing external guidance, we effectively improve reconstruction fidelity at extremely low bitrates.

3) We propose a conditional diffusion decoding module that fully exploits the powerful diffusion prior contained in the well-trained stable diffusion to facilitate extreme image compression and improve realistic reconstruction. In addition, we design a simple yet effective space alignment loss to enable end-to-end model training.

The remainder of this paper is organized as follows. The related works are summarized in Section II. The proposed method is described in Section III. The experiment results and analysis are presented in Section IV and Section V, respectively. Finally, we conclude our work in Section VI.

II Related Work

II-A Lossy Image Compression

Lossy image compression plays a crucial role in image storage and transmission. Traditional compression standards, such as JPEG [24] and JPEG2000 [1], are widely used in practice. However, they tend to introduce block artifacts because they do not consider the spatial correlation between image blocks. In recent years, learned image compression has made significant progress and achieved impressive rate-distortion performance [25, 26]. The success of these methods is mainly attributed to the development of various transform networks and entropy models. For instance, Liu et al. [27] introduce a non-local attention module to improve transform networks. In [28], He et al. employ invertible neural networks (INNs) to mitigate the information loss problem. Zhu et al. [29] construct nonlinear transforms using Swin Transformers, achieving superior compression performance compared to CNN-based transforms. In [18], Yang et al. use conditional diffusion models as decoders. Furthermore, several methods [6, 7, 30] enhance performance by improving entropy models. For example, Minnen et al. [30] combine hierarchical priors with autoregressive models to reduce spatial redundancy within latent features. In [31], He et al. assume that the redundancies along the spatial and channel dimensions are orthogonal and propose a multi-dimension entropy model. Qian et al. [32] utilize a transformer to enable entropy models to capture long-range dependencies. Guo et al. [26] capture dependencies along both the spatial and channel dimensions by using causal global contextual prediction.

II-B Extreme Image Compression

In some practical scenarios, such as underwater wireless communication, the bandwidth is too narrow to transmit the images or videos. To overcome this dilemma, extreme image compression towards low bitrates (e.g., below 0.1 bpp) is urgently needed. Several algorithms [10, 11, 33, 34, 35] leverage generative adversarial networks (GANs) for realistic reconstructions and bit savings. In [10], Agustsson et al. incorporate a multi-scale discriminator to synthesize details that cannot be stored at extremely low bitrates. Mentzer et al. [34] explore normalization layers, generator and discriminator architectures, training strategies, as well as perceptual losses, achieving visually pleasing reconstructions at low bitrates. However, these approaches suffer from the unstable training of GANs and inevitably introduce unpleasant visual artifacts.

In addition, some approaches use prior knowledge to achieve extreme image compression. Yue et al. [36] describe input images based on a down-sampled version and handcrafted features, and use these descriptions to reconstruct the images from a large-scale image database. Their method can achieve impressive compression performance when the large-scale image database contains images that are highly correlated with the input images. Gao et al. [37] leverage invertible networks, significantly reducing information loss during the compression process. Similarly, Wei et al. [38] employ invertible and generative priors to achieve extreme compression by rescaling images with extreme scaling factors (i.e., 16× and 32×). In [39], Li et al. employ physical priors for extreme underwater image compression. Jiang et al. [40] utilize the text descriptions of images as a prior to guide image compression. Inspired by the tremendous success of diffusion models in image generation, some methods [20, 21] use more powerful pre-trained text-to-image diffusion models as prior knowledge. In [20], Pan et al. encode images into short text embeddings and then generate high-quality images by feeding these embeddings to pre-trained text-to-image diffusion models. Lei et al. [21] directly compress short text prompts and binary contour sketches on the encoder side, and then use them as input to the pre-trained text-to-image diffusion model for reconstruction on the decoder side. However, these diffusion-based methods treat pre-trained text-to-image diffusion models as independent components, which limits their ability to fully exploit the generative capabilities of pre-trained diffusion models.

In this work, we propose DiffEIC, a framework that efficiently incorporates compressive VAEs with pre-trained text-to-image diffusion models in an end-to-end manner. Leveraging the nonlinear capability of compressive VAEs and the powerful generative capability of pre-trained text-to-image diffusion models, our DiffEIC achieves both high perceptual quality and high-fidelity image reconstruction at extremely low bitrates.

II-C Diffusion Models

Inspired by non-equilibrium statistical physics [41], diffusion models convert real data distributions into simple, known distributions (e.g., Gaussian) through a gradual process of adding random noise, known as the diffusion process. Subsequently, they learn to reverse this diffusion process and construct desired data samples from noise (i.e., the reverse process). Denoising diffusion probabilistic models (DDPM) [42] improve upon the original diffusion model and have profoundly influenced subsequent research. The latent diffusion model (LDM) [13] significantly reduces computational costs by performing diffusion and reverse steps in the latent space. Stable diffusion is a widely used large-scale implementation of LDM. Owing to their flexibility, tractability, and superior generative capability, diffusion models have achieved remarkable success in various vision tasks.

Due to the complexity of the diffusion process, training diffusion models from scratch is computationally demanding and time-consuming. To address this problem, some algorithms [14, 43, 44] introduce additional trainable networks to inject external conditions into fixed, pre-trained diffusion models. This strategy simplifies the exhaustive training from scratch while maintaining the robust capability of pre-trained diffusion models. In [14], Zhang et al. employ pre-trained text-to-image diffusion models (e.g., stable diffusion) as a strong backbone with fixed parameters and reuse their encoding layers for controllable image generation. Similarly, Mou et al. [43] introduce lightweight T2I-Adapters to provide extra guidance for pre-trained text-to-image diffusion models. In [44], Lin et al. use the latent representation of coarse restored images as conditions to help the pre-trained diffusion models generate clean results. We note that the main success of these algorithms on image generation and restoration is due to the use of pre-trained diffusion models. The robust generative capability of such models motivates us to explore effective approaches for extreme image compression at low bitrates.

III Methodology

Figure 2: The two-stage pipeline of the proposed DiffEIC. Image Compression: Initially, we leverage the VAE-based latent feature-guided compression module (LFGCM) to adaptively select information essential for reconstruction and obtain $z_c$. Image Reconstruction: We leverage the conditional diffusion decoding module (CDDM) for realistic image reconstruction and obtain $\hat{x}$. The CDDM contains a trainable control module and a fixed noise estimator. Note that the control module and noise estimator are connected with zero convolutions (zero-initialized convolution layers).

In this section, we propose DiffEIC for extreme image compression. As shown in Fig. 2, the proposed DiffEIC consists of two primary stages: image compression and image reconstruction. Specifically, the former stage aims to compress images and generate content-related variables. The latter stage is designed for decoding the content variables into reconstructed images. Furthermore, a space alignment loss is introduced to force content variables to align with the diffusion space and provide necessary constraints for optimization.

III-A Image Compression with Compressive VAEs

As shown in Fig. 2(a), we propose a latent feature-guided compression module (LFGCM) based on compressive VAEs [23]. This module leverages an additional guidance branch that utilizes the latent representation of images in the diffusion space to correct intermediate features and content variables. The encoding process, decoding process, and network details of LFGCM are introduced below.

III-A1 Encoding Process

Given an input image $x$, we first obtain external guidance $z_g$ with stable diffusion’s encoder $\mathcal{E}$ as follows:

$$z_g = \mathcal{E}(x). \tag{1}$$

Then $z_g$ is used to guide the extraction of the latent representation $y$ and the side information $z$, sequentially, which can be expressed as:

$$y = \mathcal{N}_e(x, z_g), \quad z = \mathcal{N}_{he}(y, z_g), \tag{2}$$

where $\mathcal{N}_e$ denotes the encoder network and $\mathcal{N}_{he}$ denotes the hyper-encoder network. Then we apply a hyper-decoder to draw a parameter $\psi$ from the quantized side information $\hat{z}$:

$$\hat{z} = \mathcal{Q}(z), \quad \psi = \mathcal{N}_{hd}(\hat{z}), \tag{3}$$

where $\mathcal{N}_{hd}$ denotes the hyper-decoder and $\mathcal{Q}(\cdot)$ denotes the quantization operation, i.e., adding uniform noise during training and performing rounding operation during inference. Finally, the context model $\mathcal{C}_m$ uses $\psi$ and the quantized latent representation $\hat{y} = \mathcal{Q}(y)$ to predict the Gaussian entropy parameters $(\mu, \sigma)$ for approximating the distribution of $\hat{y}$.
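As a rough illustration of the quantization operation $\mathcal{Q}(\cdot)$ and of how Gaussian entropy parameters $(\mu, \sigma)$ can be turned into a probability for each quantized symbol, the following minimal Python sketch may help. The function names are hypothetical, and the unit-width-bin likelihood is the convention commonly used in hyperprior codecs; the paper does not spell out this detail, so treat it as an assumption.

```python
import math
import random

def quantize(values, training, rng=random):
    """Q(.): additive uniform noise in [-0.5, 0.5] during training, rounding at inference."""
    if training:
        return [v + rng.uniform(-0.5, 0.5) for v in values]
    # Python's round() resolves exact .5 ties to the nearest even integer.
    return [float(round(v)) for v in values]

def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def likelihood(y_hat, mu, sigma):
    """Assumed convention: probability mass of the unit-width bin centred on y_hat."""
    p = gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)
    return max(p, 1e-9)  # floor to avoid log(0) when the mass underflows
```

A symbol close to the predicted mean receives a higher probability, and therefore costs fewer bits, than a symbol far from it, which is why accurate $(\mu, \sigma)$ predictions from the context model reduce the bitrate.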

III-A2 Decoding Process

Given the quantized $\hat{y}$ and $\hat{z}$, we first use the information extraction network $f_c$ to extract a representation $w$ from $\hat{z}$, which can be expressed as:

$$w = f_c(\hat{z}). \tag{4}$$

The external guidance information, originally contained in $z_g$, is captured in $w$. This effectively compensates for the unavailability of $z_g$ during the decoding process. Instead of directly reconstructing the original input image, we initially decode $\hat{y}$ into a content variable $z_c$:

$$z_c = \mathcal{N}_d(\hat{y}, w), \tag{5}$$

where $\mathcal{N}_d$ denotes the decoder network. The content variable $z_c$ is further decoded in the subsequent image reconstruction stage using diffusion prior.

III-A3 Network Details

Fig. 3 illustrates the network architecture of LFGCM. The information extraction network $f_c$ has the same network structure as the hyper-decoder $\mathcal{N}_{hd}$, and we adopt context model $\mathcal{C}_m$ proposed by He et al. [31]. The guidance components (the elements denoted by red arrows, SFT and SFT Resblk) first use a series of convolutions to resize the external feature $G$ to the appropriate dimensions. Then the SFT layers [45] are employed to inject the network with external guidance information. Specifically, given an external feature $G$ and an intermediate feature map $F$, a pair of affine transformation parameters (i.e., $\alpha$ for scaling and $\beta$ for shifting) is generated as follows:

$$\alpha, \beta = \Phi_\theta(G), \tag{6}$$

where $\Phi_\theta$ denotes a stack of convolutions. Then the tuned feature map $F'$ can be generated by:

$$F' = \mathrm{SFT}(F, G) = \alpha \otimes F + \beta, \tag{7}$$

where $\otimes$ denotes element-wise product.
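The SFT modulation of Eqs. (6) and (7) can be sketched in a few lines of Python. This is a simplified illustration only: features are flattened into lists, and `phi_theta` is a hypothetical stand-in that replaces the stack of convolutions $\Phi_\theta$ with per-element affine maps.

```python
def phi_theta(g, w_alpha, b_alpha, w_beta, b_beta):
    """Hypothetical stand-in for Phi_theta (Eq. 6): maps the guidance G to (alpha, beta)."""
    alpha = [w_alpha * v + b_alpha for v in g]
    beta = [w_beta * v + b_beta for v in g]
    return alpha, beta

def sft(f, g, params):
    """Eq. (7): F' = alpha (element-wise product) F + beta."""
    alpha, beta = phi_theta(g, *params)
    return [a * x + b for a, x, b in zip(alpha, f, beta)]

# Guidance-dependent scaling and shifting of a two-element feature map.
tuned = sft([1.0, 2.0], [0.5, -1.0], (2.0, 1.0, 0.0, 0.5))
```

When the stand-in produces $\alpha = 1$ and $\beta = 0$, the layer reduces to the identity, so the guidance branch can modulate the main features without replacing them.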

Figure 3: The architecture of the proposed LFGCM. Conv k3s1 denotes convolution with 3×3 filters and stride 1. Tconv k3s1 denotes transposed convolution with 3×3 filters and stride 1. RB denotes residual block [46]. RBneck denotes residual bottleneck block [46]. LReLU denotes the LeakyReLU function. AE and AD denote arithmetic encoder and decoder, respectively. $C_m$ denotes context model. The black and red arrows denote main and guidance flow, respectively.

III-B Image Reconstruction with Diffusion Prior

As shown in Fig. 2(b), we propose a conditional diffusion decoding module (CDDM) to reconstruct images with the guidance of content variables. To maintain the generative capability of stable diffusion, we keep it fixed and employ a small control module to inject content information. In this section, we introduce stable diffusion and the proposed CDDM sequentially.

III-B1 Stable Diffusion

Stable diffusion first employs an encoder $\mathcal{E}$ to encode an image $x$ into a latent representation $z_0 = \mathcal{E}(x)$. Then $z_0$ is progressively corrupted by adding Gaussian noise through a Markov chain. The intensity of the added noise at each step is controlled by a default noise schedule $\beta_t$. This process can be expressed as follows:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad t = 1, 2, \cdots, T, \tag{8}$$

where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is a sample from a standard Gaussian distribution, $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. The corrupted representation $z_t$ approaches a Gaussian distribution as $t$ increases. To iteratively convert $z_T$ back to $z_0$, a noise estimator $\epsilon_\theta$ with U-Net [47] architecture is learned to predict the added noise $\epsilon$ at each time step $t$:

$$\mathcal{L}_{sd} = \mathbb{E}_{z_0, c, t, \epsilon} \|\epsilon - \epsilon_\theta(z_t, c, t)\|^2, \tag{9}$$

where $c$ denotes control conditions such as text prompts and images. After completing the iterative denoising process, a decoder $\mathcal{D}$ is used to map $z_0$ back into pixel space.
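To make the forward process of Eq. (8) concrete, here is a minimal Python sketch that builds $\bar{\alpha}_t$ from a noise schedule and draws a noisy latent in closed form. The linear schedule and its endpoints are assumptions chosen for illustration; they are not necessarily the default schedule used by stable diffusion.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear noise schedule beta_1..beta_T (illustrative values)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def alpha_bar(betas):
    """Cumulative products alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(z0, t, alpha_bars, rng=random):
    """Eq. (8) with 1-indexed t: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps
```

Because $\bar{\alpha}_t$ shrinks monotonically toward zero, the signal term vanishes and $z_T$ is close to pure Gaussian noise, which is what the reverse process starts from.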

III-B2 Conditional Diffusion Decoding Module

The CDDM is designed to leverage the powerful generative capability of fixed stable diffusion to reconstruct image $x$ with realistic details at extremely low bitrates. Inspired by ControlNet [14], we introduce a control module to inject content information contained in $z_c$ into the denoising process. This control module has the same encoder and middle block architecture as the noise estimator $\epsilon_\theta$. Notably, we reduce the channel number of the control module to 20% of the original, which results in a slight performance decrease but significantly enhances inference speed (see Section V-B). In addition, we increase the channel number of the first convolution layer to 6 to accommodate the concatenated input of the content variable $z_c$ and the latent noise $z_t$. Through the control module, we obtain a series of conditional features that contain content information and align with the internal knowledge of stable diffusion. These conditional features are then added to the encoder and decoder of the noise estimator $\epsilon_\theta$ using 1×1 convolutions. Leveraging the powerful generative capability encapsulated in pre-trained stable diffusion, we can obtain a high perceptual quality reconstruction $\hat{x}$ even at extremely low bitrates.
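The injection step can be illustrated with a small Python sketch. It assumes, following the Fig. 2 caption, that the 1×1 convolutions connecting the control module to the noise estimator are zero-initialized. The list-based feature layout and the function names are hypothetical simplifications, not the authors' implementation.

```python
def conv1x1(feature, weight, bias):
    """1x1 convolution: the same channel-mixing matrix is applied at every pixel.

    feature: list of pixels, each a list of input-channel values
    weight:  [out_channels][in_channels], bias: [out_channels]
    """
    return [[sum(w * c for w, c in zip(row, pixel)) + b
             for row, b in zip(weight, bias)]
            for pixel in feature]

def inject(frozen_feature, control_feature, weight, bias):
    """Add the projected control feature to a feature of the frozen noise estimator."""
    projected = conv1x1(control_feature, weight, bias)
    return [[f + p for f, p in zip(f_pix, p_pix)]
            for f_pix, p_pix in zip(frozen_feature, projected)]
```

With all-zero weights and biases, `inject` returns the frozen feature unchanged, so training starts from the behaviour of the pre-trained model and the control signal is introduced gradually as the weights move away from zero. The 1×1 projection also maps the narrower control features to the channel width expected by the noise estimator.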

III-C Model Objectives

III-C1 Noise Estimation Loss

Due to the external condition $z_c$ introduced by the proposed CDDM, Eq. (9) is modified as:

$$\mathcal{L}_{ne} = \mathbb{E}_{z_0, c, t, \epsilon, z_c} \|\epsilon - \epsilon_\theta(z_t, c, t, z_c)\|^2, \tag{10}$$

where the text prompt $c$ is set to empty.
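A single-sample Monte Carlo estimate of Eq. (10) might look like the sketch below. The `estimator` callable is a hypothetical placeholder for the noise estimator with its control module, and uniform sampling of the time step is an assumption (it is the usual choice, but the text above does not state it).

```python
import math
import random

def noise_estimation_loss(z0, z_c, alpha_bars, estimator, rng=random):
    """One-sample estimate of Eq. (10) with an empty text prompt."""
    t = rng.randint(1, len(alpha_bars))           # assumed: t uniform in {1, ..., T}
    a = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]       # target noise
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    pred = estimator(z_t, "", t, z_c)             # c is the empty prompt
    return sum((e - p) ** 2 for e, p in zip(eps, pred))
```

An estimator that recovers the injected noise exactly drives this value to zero. In the full model only the control module and LFGCM receive gradients from it, since the stable diffusion weights stay fixed.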

III-C2 Rate Loss

We employ the rate loss $\mathcal{L}_{rate}$ to optimize the rate performance as:

$$\mathcal{L}_{rate} = R(\hat{y}) + R(\hat{z}), \tag{11}$$

where $R(\cdot)$ denotes the bitrate.
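As a sketch of how the two rate terms in Eq. (11) are typically evaluated, the snippet below converts per-symbol probabilities from the entropy models into bits and normalizes by the number of pixels. The normalization to bits per pixel and the function names are assumptions for illustration; the paper only states that $R(\cdot)$ denotes the bitrate.

```python
import math

def rate_bits(likelihoods):
    """Ideal code length in bits: sum of -log2 p over all quantized symbols."""
    return sum(-math.log2(p) for p in likelihoods)

def rate_loss_bpp(y_likelihoods, z_likelihoods, height, width):
    """Eq. (11), R(y_hat) + R(z_hat), expressed in bits per pixel (assumed normalization)."""
    return (rate_bits(y_likelihoods) + rate_bits(z_likelihoods)) / (height * width)

# Toy example: 8 symbols at p = 0.5 and 4 symbols at p = 0.25 for a 4x4 image.
bpp = rate_loss_bpp([0.5] * 8, [0.25] * 4, 4, 4)
```

For a sense of scale, a 768×512 image at 0.1 bpp has a budget of 0.1 × 768 × 512 ≈ 39,322 bits, or about 4.8 KiB for the whole image.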

III-C3 Space Alignment Loss

As the noise estimation loss is unable to provide effective constraints for LFGCM, we design a space alignment loss to force the content variables to align with the diffusion space, providing necessary constraints for optimization:

sa=zc(x)2.subscript𝑠𝑎superscriptnormsubscript𝑧𝑐𝑥2\mathcal{L}_{sa}=\|z_{c}-\mathcal{E}(x)\|^{2}.caligraphic_L start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT = ∥ italic_z start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT - caligraphic_E ( italic_x ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (12)

In summary, the total loss of DiffEIC is defined as:

$$\mathcal{L}_{total}=\lambda\mathcal{L}_{rate}+\lambda_{sa}\mathcal{L}_{sa}+\lambda_{ne}\mathcal{L}_{ne}, \tag{13}$$

where $\lambda_{sa}$ and $\lambda_{ne}$ denote the weights for the space alignment loss and the noise estimation loss, respectively, and $\lambda$ controls the trade-off between rate and distortion.
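As a rough illustration of how Eq. (13) combines the three terms, the following minimal Python sketch evaluates the total loss from precomputed quantities. The function names, the flattened-list representation of tensors, and the single-sample estimate of the expectation in Eq. (10) are assumptions made for illustration; this is not the authors' implementation.

```python
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two flattened tensors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of elements")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_loss(
    rate_y: float,
    rate_z: float,
    z_c: Sequence[float],
    latent: Sequence[float],
    eps: Sequence[float],
    eps_pred: Sequence[float],
    lam: float,
    lam_sa: float = 2.0,
    lam_ne: float = 1.0,
) -> float:
    """Combine the rate, space alignment, and noise estimation terms."""
    l_rate = rate_y + rate_z            # Eq. (11): R(y_hat) + R(z_hat)
    l_sa = squared_l2(z_c, latent)      # Eq. (12): ||z_c - E(x)||^2
    l_ne = squared_l2(eps, eps_pred)    # Eq. (10), one-sample estimate
    return lam * l_rate + lam_sa * l_sa + lam_ne * l_ne  # Eq. (13)
```

The default weights of 2 and 1 follow the values reported in the implementation details; all inputs in any call are hypothetical toy values.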

IV Experiments

Figure 4: Quantitative comparisons with state-of-the-art methods in terms of perceptual quality (LPIPS↓ / NIQE↓ / DISTS↓ / FID↓ / KID↓) on the Kodak [22], Tecnick [48], and CLIC2020 [49] datasets.

Figure 5: Quantitative comparisons with state-of-the-art methods in terms of pixel fidelity (MS-SSIM↑ / PSNR↑) on the Kodak [22], Tecnick [48], and CLIC2020 [49] datasets.

Panels: (a) Original, (b) ELIC, (c) HiFiC, (d) MS-ILLM, (e) Text+Sketch, (f) Wei et al., (g) DiffEIC (Ours). The bpp / DISTS↓ values shown beneath the images are:

| Example | (b) ELIC | (c) HiFiC | (d) MS-ILLM | (e) Text+Sketch | (f) Wei et al. | (g) DiffEIC (Ours) |
|---|---|---|---|---|---|---|
| Row 1 | 0.1784 / 0.1296 | 0.1901 / 0.1178 | 0.1662 / 0.0741 | 0.0246 / 0.3135 | 0.0792 / 0.1431 | 0.0844 / 0.0658 |
| Row 2 | 0.0947 / 0.2136 | 0.0800 / 0.1193 | 0.0893 / 0.0766 | 0.0228 / 0.2661 | 0.0838 / 0.0891 | 0.0664 / 0.0775 |
| Row 3 | 0.0538 / 0.2610 | 0.0453 / 0.1789 | 0.0501 / 0.1207 | 0.0286 / 0.2688 | 0.0216 / 0.2046 | 0.0182 / 0.1804 |

Figure 6: Visual comparisons of the proposed DiffEIC framework with the MSE-optimized ELIC [31], the GANs-based HiFiC [34] and MS-ILLM [35], the diffusion-based Text+Sketch [21], and the method by Wei et al. [38] on the CLIC2020 [49] dataset. For each method, the bpp and DISTS values are shown beneath images. Compared to other methods, our method produces more realistic and faithful reconstructions with lower bpp.

IV-A Experimental Settings

IV-A1 Implementation

We train DiffEIC on the LSDIR [50] dataset, which contains 84,991 high-quality training images. The images are randomly cropped to 512×512 resolution. In our experiments, we use Stable Diffusion 2.1-base (https://huggingface.co/stabilityai/stable-diffusion-2-1-base) as the diffusion prior. We train our model in an end-to-end manner using Eq. (13), where $\lambda_{sa}$ and $\lambda_{ne}$ are set to 2 and 1, respectively. To achieve different coding bitrates, we choose $\lambda$ from $\{1,2,4,8,16\}$. For optimization, we use the Adam [51] optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ and set the learning rate to $1\times 10^{-4}$. The training batch size is set to 4. Inspired by previous work [6], we first train DiffEIC with $\lambda=1$ for 300K iterations, and then adapt it using the target $\lambda$ for another 200K iterations. We set the learning rate to $2\times 10^{-5}$ during this fine-tuning process. For inference, we adopt spaced DDPM sampling [52] with 50 steps to reconstruct the images. All experiments are conducted on a single NVIDIA GeForce RTX 4090 GPU. The source code and trained models will be available on the authors’ homepage after acceptance of the manuscript.
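The two-phase schedule described above can be summarized in a short Python sketch. The function name and the 0-based iteration convention are assumptions for illustration; only the iteration counts, the rate weights, and the learning rates come from the text.

```python
from typing import Tuple

PRETRAIN_ITERS = 300_000   # phase 1: lambda fixed to 1
FINETUNE_ITERS = 200_000   # phase 2: target lambda


def training_phase(iteration: int, target_lambda: float) -> Tuple[float, float]:
    """Return (lambda, learning_rate) for a 0-based training iteration."""
    if not 0 <= iteration < PRETRAIN_ITERS + FINETUNE_ITERS:
        raise ValueError("iteration is outside the 500K-iteration schedule")
    if iteration < PRETRAIN_ITERS:
        return 1.0, 1e-4
    return float(target_lambda), 2e-5
```

For a target rate weight of 8, for instance, the first 300K iterations would use a weight of 1 with a learning rate of 1e-4, and the remaining 200K iterations would use a weight of 8 with a learning rate of 2e-5.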

IV-A2 Test Data

For evaluation, we use three commonly used benchmarks: the Kodak [22], Tecnick [48], and CLIC2020 [49] datasets. The Kodak dataset contains 24 natural images with a resolution of 768×512. The Tecnick dataset contains 140 images with 1200×1200 resolution. The CLIC2020 dataset has 428 high-quality images. For the Tecnick and CLIC2020 datasets, we resize the images so that the shorter dimension is equal to 768 px. We then center-crop each image to 768×768 resolution for evaluation [18].
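The resize-then-crop geometry for the Tecnick and CLIC2020 images can be sketched as follows. The function name and the rounding rule are assumptions for illustration; the text only specifies the 768 px shorter side and the 768×768 center crop.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def resize_and_center_crop_box(
    width: int, height: int, target: int = 768
) -> Tuple[Tuple[int, int], Box]:
    """Return the resized size and the center-crop box in resized coordinates."""
    shorter = min(width, height)
    # Scale so that the shorter side becomes `target` pixels.
    new_w = max(target, round(width * target / shorter))
    new_h = max(target, round(height * target / shorter))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)
```

A 1200×1200 Tecnick image would simply be resized to 768×768 and kept whole, whereas a hypothetical 2048×1365 image would be resized to 1152×768 and cropped horizontally.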

IV-A3 Metrics

For quantitative evaluation, we employ several established metrics to assess the perceptual quality of results, including the Learned Perceptual Image Patch Similarity (LPIPS) [53], Naturalness Image Quality Evaluator (NIQE) [54], Deep Image Structure and Texture Similarity (DISTS) [55], Fréchet Inception Distance (FID) [56], and Kernel Inception Distance (KID) [57]. Meanwhile, we employ the Peak Signal-to-Noise Ratio (PSNR) and Multi-Scale Structural Similarity Index (MS-SSIM) [58] to measure the fidelity of reconstruction results. Furthermore, bits per pixel (bpp) is used to evaluate rate performance. Note that FID and KID are calculated on patches of 256×256 resolution, following [34]. Since the Kodak dataset is too small to calculate FID and KID, we do not report FID or KID results on it.
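For readers less familiar with the rate and fidelity measures, the following Python sketch shows the standard definitions of bpp and PSNR. The function names are illustrative, and the 8-bit peak value of 255 is an assumption.

```python
import math


def bits_per_pixel(num_bytes: int, width: int, height: int) -> float:
    """Bitrate of a compressed image: total bits divided by pixel count."""
    return 8.0 * num_bytes / (width * height)


def psnr(mse: float, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB from the mean squared error."""
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under these definitions, the 0.1 bpp threshold corresponds to roughly 4.9 KB for a 768×512 Kodak image.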

IV-B Comparisons With State-of-the-art Methods

We compare our DiffEIC with state-of-the-art learned image compression methods, including ELIC [31], HiFiC [34], Text+Sketch [21], Wei et al. [38], and MS-ILLM [35]. In addition, we compare with the traditional image compression methods BPG [2] and VVC [3]. For the BPG software, we optimize image quality and compression efficiency with the following settings: “YUV444” subsampling mode, “x265” HEVC implementation, “8-bit” depth, and “YCbCr” color space. For VVC, we employ the reference software VTM-23.0 (https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tree/master) with intra configuration.

TABLE I: Encoding and decoding speed (in seconds) on the Kodak [22] dataset.

| Type | Method | Denoising Steps | Encoding Speed (s) | Decoding Speed (s) | Platform |
|---|---|---|---|---|---|
| Traditional method | VVC | - | 13.862 ± 9.821 | 0.066 ± 0.006 | 13th Core i9-13900K |
| VAE-based method | ELIC | - | 0.056 ± 0.006 | 0.081 ± 0.011 | RTX4090 |
| GAN-based methods | HiFiC | - | 0.038 ± 0.004 | 0.059 ± 0.004 | RTX4090 |
| GAN-based methods | MS-ILLM | - | 0.038 ± 0.004 | 0.059 ± 0.004 | RTX4090 |
| GAN-based methods | Wei et al. | - | 0.050 ± 0.003 | 0.179 ± 0.005 | RTX4090 |
| Diffusion-based methods | Text+Sketch | 25 | 62.045 ± 0.516 | 12.028 ± 0.413 | RTX4090 |
| Diffusion-based methods | PerCo | 5 | 0.080 ± 0.018 | 0.665 ± 0.009 | A100 |
| Diffusion-based methods | PerCo | 20 | 0.080 ± 0.018 | 2.551 ± 0.018 | A100 |
| Diffusion-based methods | DiffEIC (Ours) | 20 | 0.128 ± 0.005 | 1.964 ± 0.009 | RTX4090 |
| Diffusion-based methods | DiffEIC (Ours) | 50 | 0.128 ± 0.005 | 4.574 ± 0.006 | RTX4090 |

IV-B1 Quantitative Comparisons

Fig. 4 shows the rate-perception curves at low bitrates for different methods over the three datasets. The proposed DiffEIC performs much better than BPG, VVC, and ELIC [31] on all perceptual metrics. Although Text+Sketch [21] achieves the best NIQE value of all the methods, it fails to preserve fidelity to the original image, as reflected by its LPIPS value being the highest. Compared with the other generative image compression methods, DiffEIC yields lower DISTS, FID, and KID values, indicating that it better preserves perceptual integrity and produces reconstructions with smaller perceptual differences from the originals.

The rate-distortion performance comparison is shown in Fig. 5. Since Text+Sketch [21] ignores the pixel-level fidelity of the reconstruction results, we do not report its rate-distortion performance. Compared to Wei et al. [38], the proposed DiffEIC achieves better PSNR and MS-SSIM values. However, DiffEIC remains behind the other compared methods on these fidelity metrics. The reason is that DiffEIC relies on the stable diffusion prior to reconstruct realistic details at extremely low bitrates, which does not ensure pixel-level accuracy. To further demonstrate this, we report the PSNR and MS-SSIM values of the stable diffusion autoencoder (see the black horizontal line in Fig. 5), which can be treated as the upper bound of the performance of DiffEIC. Although it sacrifices some fidelity, the proposed DiffEIC is able to capture realism at extremely low bitrates.

IV-B2 Qualitative Comparisons

Fig. 6 shows visual comparisons among the evaluated methods at extremely low bitrates. Compared to other methods, DiffEIC yields reconstructions with higher perceptual quality, fewer artifacts, and more realistic detail. For example, DiffEIC preserves the texture and details of the background that are lost or distorted by other methods (see the first row). Similarly, DiffEIC produces more realistic facial detail than other methods (see the second row).

IV-B3 Complexity Comparisons

We further compare the proposed DiffEIC with state-of-the-art image compression methods in terms of complexity. For PerCo [19], we directly report the results from their paper, since the source code is not available. Table I summarizes the average encoding/decoding time in seconds, with its standard deviation, on the Kodak dataset. On the one hand, the diffusion-based methods have higher encoding and decoding complexity than the VAE-based and GAN-based methods. On the other hand, the proposed DiffEIC encoder is significantly faster than that of Text+Sketch [21]. Compared to PerCo [19], DiffEIC achieves comparable encoding speed and faster decoding with the same number of denoising steps (1.964 s versus 2.551 s at 20 steps), although the two methods were timed on different GPUs.

V Analysis and Discussions

To better analyze the proposed method, we perform ablation studies and discuss its limitations.

TABLE II: Ablation of Latent Feature Guidance (LFG), Denoising Steps (DS), and the Channel Number (CN) of the control module. BD-rate [59] is calculated on the CLIC2020 [49] dataset, with DISTS and LPIPS as the metrics.

| Methods | CN (%) | Denoising Steps | BD-Rate, DISTS (%) | BD-Rate, LPIPS (%) | Encoding Speed (s) | Decoding Speed (s) |
|---|---|---|---|---|---|---|
| DiffEIC (CN) | 100 | 50 | -5.36 | -2.57 | 0.128 ± 0.005 | 6.012 ± 0.012 |
| DiffEIC (CN) | 50 | 50 | -2.99 | -2.28 | 0.128 ± 0.005 | 5.068 ± 0.020 |
| DiffEIC (W/o LFG) | 20 | 50 | 23.88 | 13.19 | 0.062 ± 0.009 | 4.574 ± 0.006 |
| DiffEIC (Ours) | 20 | 50 | 0 | 0 | 0.128 ± 0.005 | 4.574 ± 0.006 |
| DiffEIC (DS) | 20 | 20 | 22.51 | 6.20 | 0.128 ± 0.005 | 1.964 ± 0.009 |
| DiffEIC (DS) | 20 | 10 | 37.83 | 13.26 | 0.128 ± 0.005 | 1.089 ± 0.009 |
| DiffEIC (DS) | 20 | 5 | 49.93 | 21.68 | 0.128 ± 0.005 | 0.646 ± 0.005 |
| DiffEIC (DS) | 20 | 0 | 59.50 | 35.77 | 0.128 ± 0.005 | 0.212 ± 0.006 |

V-A Ablation of Latent Feature Guidance

In this part, we analyze the proposed Latent Feature Guidance (LFG), which is used to correct content variables. Specifically, we remove the guidance components and retrain the model from scratch using the same experimental settings.

Figure 7: Ablation studies of latent feature guidance on the CLIC2020 [49] dataset. (a) Euclidean distance between content variables and corresponding latent representations; (b) Proportion of bits allocated to the hyper prior.

Figure 8: Impact of latent feature guidance on reconstruction results. Panels: (a) Original; (b) W/o LFG (bpp / MS-SSIM↑ / DISTS↓: 0.0076 / 0.55 / 0.2231); (c) Ours (bpp / MS-SSIM↑ / DISTS↓: 0.0085 / 0.60 / 0.2000).

Fig. 7(a) demonstrates that the distance between the content variables and the corresponding latent representations is significantly reduced after introducing the LFG strategy, which implies that more accurate information is provided for the subsequent denoising process. As shown in Table II, removing the guidance components results in a slightly faster encoding speed but a noticeable degradation in performance, with a 23.88% increase in bitrate at the same DISTS value and a 13.19% increase in bitrate at the same LPIPS value. The visual comparison is presented in Fig. 8. As seen from this example, with the help of the LFG strategy, DiffEIC achieves more accurate facial reconstruction at extremely low bitrates. This further demonstrates that the LFG strategy contributes to increased fidelity.
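To make the BD-rate figures quoted above more concrete, the sketch below computes the average bitrate change of a test codec relative to an anchor at equal quality. It is a simplified variant: the Bjontegaard method [59] fits a polynomial to log-rate as a function of the quality metric, whereas this sketch uses piecewise-linear interpolation, so its numbers would differ somewhat from those in Table II. The function names and the requirement that each curve has distinct metric values are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (bpp, metric value)


def _interp(x: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Linear interpolation at x; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the interpolation range")


def _prepare(points: Sequence[Point]) -> Tuple[List[float], List[float]]:
    """Sort by metric value and convert rates to log-rates."""
    pts = sorted((metric, math.log(rate)) for rate, metric in points)
    return [p[0] for p in pts], [p[1] for p in pts]


def bd_rate_linear(anchor: Sequence[Point], test: Sequence[Point]) -> float:
    """Average rate change (%) of `test` versus `anchor` at equal metric."""
    qa, la = _prepare(anchor)
    qt, lt = _prepare(test)
    lo, hi = max(qa[0], qt[0]), min(qa[-1], qt[-1])
    if hi <= lo:
        raise ValueError("the two curves do not overlap in metric range")
    # Both curves are piecewise linear, so the trapezoidal rule on the
    # union of their knots integrates the log-rate difference exactly.
    knots = sorted({lo, hi, *[q for q in qa + qt if lo < q < hi]})
    area = 0.0
    for a, b in zip(knots, knots[1:]):
        diff_a = _interp(a, qt, lt) - _interp(a, qa, la)
        diff_b = _interp(b, qt, lt) - _interp(b, qa, la)
        area += 0.5 * (diff_a + diff_b) * (b - a)
    return (math.exp(area / (hi - lo)) - 1.0) * 100.0
```

A positive result means the test codec needs more bits than the anchor for the same metric value, which matches the sign convention of Table II, where the default DiffEIC setting is the anchor at 0.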

Note that the representation $w$ on the decoder side is extracted from the quantized side information $\hat{z}$. The bitrate of the additional information contained in $\hat{z}$ therefore needs to be analyzed further. To evaluate the impact of LFG on the bitrate of the hyper prior, we compare the bits allocated to the hyper prior with and without LFG. Specifically, we compute the proportion of bits allocated to the hyper prior as:

$$P=\frac{R(\hat{z})}{R(\hat{z})+R(\hat{y})}, \tag{14}$$

where $R(\cdot)$ is the bitrate. As shown in Fig. 7(b), using LFG does not significantly affect the proportion of bits allocated to the hyper prior. The reason is that the hyper prior $\hat{z}$ requires far fewer bits than the quantized latent representation $\hat{y}$, which is also observed in [23], so the additional information conveyed by the hyper prior is small and its bit consumption can be ignored.
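The proportion in Eq. (14) is a simple ratio, sketched below in Python. The function name is illustrative, and the bit counts in the usage note are hypothetical rather than measured values.

```python
def hyperprior_bit_fraction(bits_z: float, bits_y: float) -> float:
    """Eq. (14): share of the total bits spent on the hyper prior z_hat."""
    total = bits_z + bits_y
    if total <= 0:
        raise ValueError("total number of bits must be positive")
    return bits_z / total
```

For example, a hypothetical bitstream with 100 bits for the hyper prior and 900 bits for the latent representation gives a proportion of 0.1.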

Figure 9: Quantitative comparisons of different denoising steps on the CLIC2020 [49] dataset. 0 steps denotes directly using the decoder $\mathcal{D}$ to decode the content variables.

Panels: (a) Original, (b) 0 steps, (c) 5 steps, (d) 10 steps, (e) 20 steps, (f) 50 steps. The MS-SSIM↑ / DISTS↓ values shown beneath the images are:

| Bitrate | (b) 0 steps | (c) 5 steps | (d) 10 steps | (e) 20 steps | (f) 50 steps |
|---|---|---|---|---|---|
| 0.0739 bpp | 0.9682 / 0.0824 | 0.9695 / 0.0748 | 0.9703 / 0.0728 | 0.9667 / 0.0700 | 0.9662 / 0.0677 |
| 0.0174 bpp | 0.8842 / 0.2585 | 0.8733 / 0.2293 | 0.8755 / 0.1993 | 0.8649 / 0.1752 | 0.8641 / 0.1701 |

Figure 10: Visual comparisons of different denoising steps.

V-B Effect of the Channel Number in Control Module

We further analyze how the number of channels in the control module affects the performance and complexity of DiffEIC. In our default setting, we reduce the number of channels to 20% of the original. We also evaluate larger control modules by setting the percentage to 50% and 100%. As shown in Table II, using more channels brings a slight improvement in performance, with BD-rate savings of up to 5.36% (DISTS) and 2.57% (LPIPS). However, it inevitably increases the decoding complexity. For example, decoding with 100% of the channels takes about 31% longer than with the default setting (6.012 s versus 4.574 s). To balance performance and inference speed, we choose 20% in the proposed DiffEIC.

V-C Effect of Denoising Steps

For the proposed DiffEIC, the decoding complexity is tied to the number of denoising steps. As shown in Table II, the decoding complexity can be reduced by using fewer denoising steps. Fig. 9 shows the reconstruction performance using different numbers of denoising steps. Increasing the number of denoising steps improves the perceptual quality of the decoded results, as reflected by better perceptual metrics (DISTS, LPIPS, and NIQE). The visual comparisons in Fig. 10 further show that more denoising steps improve the reconstruction; for example, the details of the hair are better reconstructed.
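One common way to trade quality for speed is to run the sampler on an evenly spaced subset of the training timesteps. The sketch below selects such a subset; the function name, the rounding rule, and the assumption of 1000 training timesteps are illustrative and do not reproduce the exact spacing rule of [52].

```python
from typing import List


def spaced_timesteps(num_train_steps: int = 1000,
                     num_sample_steps: int = 50) -> List[int]:
    """Evenly spaced timesteps, ordered from the noisiest step down to 0."""
    if not 0 <= num_sample_steps <= num_train_steps:
        raise ValueError("num_sample_steps must lie in [0, num_train_steps]")
    if num_sample_steps == 0:
        return []  # no denoising: the content variables are decoded directly
    if num_sample_steps == 1:
        return [num_train_steps - 1]
    stride = (num_train_steps - 1) / (num_sample_steps - 1)
    steps = [round(i * stride) for i in range(num_sample_steps)]
    return steps[::-1]
```

With 50 sampling steps this yields a decreasing sequence that starts at timestep 999 and ends at 0; using 20, 10, or 5 steps shortens the sequence and hence the decoding time, in line with the trend in Table II.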

V-D Effectiveness of Space Alignment Loss

The proposed space alignment loss is used to provide constraints for LFGCM. To illustrate the necessity of this loss, we attempt to train DiffEIC without the space alignment loss by removing $\mathcal{L}_{sa}$ from Eq. (13).

As shown in Fig. 11(a), without the space alignment loss, the bits per pixel (bpp) curves invariably converge to zero during training, regardless of the selected values for $\lambda$ and $\lambda_{ne}$. We attribute this phenomenon to the noise estimation loss being independent of the input images, thus failing to provide effective constraints for LFGCM. In contrast, Fig. 11(b) demonstrates the effectiveness of incorporating the space alignment loss. With this loss in place, the bpp curves stabilize and converge to meaningful values during training, indicating that the space alignment loss successfully enforces the necessary constraints. Furthermore, the space alignment loss forces content variables to align with the diffusion space, contributing to enhanced reconstruction quality, as mentioned in Section V-A.

Figure 11: Effectiveness of the space alignment loss for end-to-end training. (a) W/o space alignment loss; (b) W/ space alignment loss.

V-E Robustness to Different Image Resolutions

Since we use stable diffusion for image reconstruction, one may wonder whether our method can compress images of different resolutions. To answer this question, we evaluate on images with different resolutions, such as 256×256, 512×768, and 512×1538. As shown in Fig. 12, the proposed DiffEIC is able to reconstruct visually pleasing results under different image resolutions. In addition, we believe that our method is capable of processing ultra-high-definition images (i.e., 4K and 8K) using a block-based processing strategy when computational resources are limited.
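The block-based strategy is not specified further in the text. As one possible reading, the Python sketch below splits a large image into fixed-size tiles that could be compressed independently. The function name, the 512-pixel tile size, the optional overlap, and the choice to shift the last tile back to the image border are all assumptions for illustration; how decoded tiles would be merged is left open.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def tile_boxes(width: int, height: int,
               tile: int = 512, overlap: int = 0) -> List[Box]:
    """Cover an image with tiles of at most `tile` x `tile` pixels."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    step = tile - overlap

    def starts(size: int) -> List[int]:
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)  # last tile ends exactly at the border
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]
```

Under these assumptions, a 3840×2160 image would be covered by 8×5 tiles, each of which fits the 512×512 resolution used for training crops.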

Figure 12: Reconstruction results at different resolutions (avg bpp / avg DISTS: 0.0915 / 0.0851). On the left are the original images and on the right are the decoded results.

V-F Limitation

Although the proposed DiffEIC framework achieves favorable reconstructions at extremely low bitrates, it still has some limitations. 1) While text is an important component in pre-trained text-to-image diffusion models, its application has not yet been explored within our framework. The work of Text+Sketch [21] demonstrates the powerful ability of text in extracting image semantics, encouraging us to further leverage text to enhance our method in future work. 2) Due to using a diffusion model as the decoder, the DiffEIC framework requires more computational resources and longer inference times compared to other VAE-based compression methods. Using more advanced sampling methods may be a solution to alleviate the computing burden. 3) Due to the limitations of the stable diffusion autoencoder, DiffEIC exhibits lower performance on pixel-wise distortion metrics compared to other methods. Future work will focus on improving the balance between pixel-wise accuracy and perceptual quality.

VI Conclusion

In this paper, we propose a novel extreme image compression framework, named DiffEIC, which combines compressive VAEs with pre-trained text-to-image diffusion models to achieve realistic and high-fidelity reconstructions at extremely low bitrates (below 0.1 bpp). First, we introduce a VAE-based latent feature-guided compression module to adaptively select information essential for reconstruction. This module compresses images and initially decodes them into content variables. The latent feature guidance strategy effectively improves reconstruction fidelity. Second, we propose a conditional diffusion decoding module that leverages the powerful generative capability of pretrained stable diffusion to reconstruct images with realistic details. Finally, we design a simple yet effective space alignment loss to optimize DiffEIC within a unified framework. Extensive experiments demonstrate the superiority of DiffEIC and the effectiveness of the proposed modules.

References

  • [1] David S Taubman, Michael W Marcellin, and Majid Rabbani. JPEG2000: Image compression fundamentals, standards and practice. Journal of Electronic Imaging, 11(2):286–287, 2002.
  • [2] Fabrice Bellard. BPG image format. https://bellard.org/bpg/.
  • [3] Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J Sullivan, and Jens-Rainer Ohm. Overview of the versatile video coding (VVC) standard and its applications. IEEE Transactions on Circuits and Systems for Video Technology, 31(10):3736–3764, 2021.
  • [4] Jinming Liu, Heming Sun, and Jiro Katto. Learned image compression with mixed transformer-cnn architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14388–14397, 2023.
  • [5] Renjie Zou, Chunfeng Song, and Zhaoxiang Zhang. The devil is in the details: Window-based attention for image compression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17492–17501, 2022.
  • [6] David Minnen and Saurabh Singh. Channel-wise autoregressive entropy models for learned image compression. In 2020 IEEE International Conference on Image Processing (ICIP), pages 3339–3343. IEEE, 2020.
  • [7] Dailan He, Yaoyan Zheng, Baocheng Sun, Yan Wang, and Hongwei Qin. Checkerboard context model for efficient learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14771–14780, 2021.
  • [8] Oren Rippel and Lubomir Bourdev. Real-time adaptive image compression. In International Conference on Machine Learning, pages 2922–2930. PMLR, 2017.
  • [9] Yochai Blau and Tomer Michaeli. Rethinking lossy compression: The rate-distortion-perception tradeoff. In International Conference on Machine Learning, pages 675–685. PMLR, 2019.
  • [10] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative adversarial networks for extreme learned image compression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 221–231, 2019.
  • [11] Shubham Dash, Giridharan Kumaravelu, Vijayakrishna Naganoor, Suraj Kiran Raman, Aditya Ramesh, and Honglak Lee. Compressnet: Generative compression at extremely low bitrates. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2314–2322. IEEE, 2020.
  • [12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • [13] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [14] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • [15] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
  • [16] Lucas Theis, Tim Salimans, Matthew D Hoffman, and Fabian Mentzer. Lossy compression with gaussian diffusion. arXiv preprint arXiv:2206.08889, 2022.
  • [17] Noor Fathima Ghouse, Jens Petersen, Auke Wiggers, Tianlin Xu, and Guillaume Sautière. A residual diffusion model for high perceptual quality codec augmentation. arXiv preprint arXiv:2301.05489, 2023.
  • [18] Ruihan Yang and Stephan Mandt. Lossy image compression with conditional diffusion models. arXiv preprint arXiv:2209.06950, 2022.
  • [19] Marlene Careil, Matthew J. Muckley, Jakob Verbeek, and Stéphane Lathuilière. Towards image compression with perfect realism at ultra-low bitrates. In The Twelfth International Conference on Learning Representations, 2024.
  • [20] Zhihong Pan, Xin Zhou, and Hao Tian. Extreme generative image compression by learning text embedding from diffusion models. arXiv preprint arXiv:2211.07793, 2022.
  • [21] Eric Lei, Yigit Berkay Uslu, Hamed Hassani, and Shirin Saeedi Bidokhti. Text+ sketch: Image compression at ultra low rates. In ICML 2023 Workshop Neural Compression: From Information Theory to Applications, 2023.
  • [22] Eastman Kodak Company. Kodak lossless true color image suite. http://r0k.us/graphics/kodak/.
  • [23] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In International Conference on Learning Representations, 2018.
  • [24] Gregory K Wallace. The jpeg still picture compression standard. Communications of the ACM, 34(4):30–44, 1991.
  • [25] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. In 5th International Conference on Learning Representations, ICLR 2017, 2017.
  • [26] Zongyu Guo, Zhizheng Zhang, Runsen Feng, and Zhibo Chen. Causal contextual prediction for learned image compression. IEEE Transactions on Circuits and Systems for Video Technology, 32(4):2329–2341, 2021.
  • [27] Haojie Liu, Tong Chen, Peiyao Guo, Qiu Shen, Xun Cao, Yao Wang, and Zhan Ma. Non-local attention optimized deep image compression. arXiv preprint arXiv:1904.09757, 2019.
  • [28] Yueqi Xie, Ka Leong Cheng, and Qifeng Chen. Enhanced invertible encoding for learned image compression. In Proceedings of the 29th ACM international conference on multimedia, pages 162–170, 2021.
  • [29] Yinhao Zhu, Yang Yang, and Taco Cohen. Transformer-based transform coding. In International Conference on Learning Representations, 2021.
  • [30] David Minnen, Johannes Ballé, and George D Toderici. Joint autoregressive and hierarchical priors for learned image compression. Advances in neural information processing systems, 31, 2018.
  • [31] Dailan He, Ziming Yang, Weikun Peng, Rui Ma, Hongwei Qin, and Yan Wang. Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5718–5727, 2022.
  • [32] Yichen Qian, Xiuyu Sun, Ming Lin, Zhiyu Tan, and Rong Jin. Entroformer: A transformer-based entropy model for learned image compression. In International Conference on Learning Representations, 2021.
  • [33] Shoma Iwai, Tomo Miyazaki, Yoshihiro Sugaya, and Shinichiro Omachi. Fidelity-controllable extreme image compression with generative adversarial networks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 8235–8242. IEEE, 2021.
  • [34] Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity generative image compression. Advances in Neural Information Processing Systems, 33:11913–11924, 2020.
  • [35] Matthew J Muckley, Alaaeldin El-Nouby, Karen Ullrich, Hervé Jégou, and Jakob Verbeek. Improving statistical fidelity for neural image compression with implicit local likelihood models. In International Conference on Machine Learning, pages 25426–25443. PMLR, 2023.
  • [36] Huanjing Yue, Xiaoyan Sun, Jingyu Yang, and Feng Wu. Cloud-based image coding for mobile devices—toward thousands to one compression. IEEE transactions on multimedia, 15(4):845–857, 2013.
  • [37] Fangyuan Gao, Xin Deng, Junpeng Jing, Xin Zou, and Mai Xu. Extremely low bit-rate image compression via invertible image generation. IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [38] Hao Wei, Chenyang Ge, Zhiyuan Li, Xin Qiao, and Pengchao Deng. Towards extreme image rescaling with generative prior and invertible prior. IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • [39] Mengyao Li, Liquan Shen, Yufei Lin, Kun Wang, and Jinbo Chen. Extreme underwater image compression using physical priors. IEEE Transactions on Circuits and Systems for Video Technology, 33(4):1937–1951, 2022.
  • [40] Xuhao Jiang, Weimin Tan, Tian Tan, Bo Yan, and Liquan Shen. Multi-modality deep network for extreme learned image compression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 1033–1041, 2023.
  • [41] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • [42] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [43] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  • [44] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Ben Fei, Bo Dai, Wanli Ouyang, Yu Qiao, and Chao Dong. Diffbir: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070, 2023.
  • [45] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 606–615, 2018.
  • [46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [47] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • [48] Nicola Asuni and Andrea Giachetti. Testimages: a large-scale archive for testing visual devices and basic image processing algorithms. In STAG, pages 63–70, 2014.
  • [49] George Toderici, Lucas Theis, Nick Johnston, Eirikur Agustsson, Fabian Mentzer, Johannes Ballé, Wenzhe Shi, and Radu Timofte. Clic 2020: Challenge on learned image compression, 2020. http://www.compression.cc.
  • [50] Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Demandolx, et al. Lsdir: A large scale dataset for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1775–1787, 2023.
  • [51] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [52] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021.
  • [53] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  • [54] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal processing letters, 20(3):209–212, 2012.
  • [55] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE transactions on pattern analysis and machine intelligence, 44(5):2567–2581, 2020.
  • [56] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [57] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
  • [58] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
  • [59] Gisle Bjontegaard. Calculation of average PSNR differences between RD-curves. ITU SG16 Doc. VCEG-M33, 2001.
Zhiyuan Li is currently pursuing a master’s degree with the Institute of Artificial Intelligence and Robotics at Xi’an Jiaotong University. He received his bachelor’s degree from Xidian University in 2022. His research interests include image compression, image rescaling, and other visual problems.
Yanhui Zhou received the M.S. and Ph.D. degrees in electrical engineering from Xi’an Jiaotong University, Xi’an, China, in 2005 and 2011, respectively. She is currently an associate professor with the School of Information and Telecommunication, Xi’an Jiaotong University. Her current research interests include image/video compression, computer vision, and deep learning.
Hao Wei is currently a Ph.D. candidate with the Institute of Artificial Intelligence and Robotics at Xi’an Jiaotong University. He received his B.Sc. and M.Sc. degrees from Yangzhou University and Nanjing University of Science and Technology in 2018 and 2021, respectively. His research interests include image deblurring, image compression, and other low-level vision problems.
Chenyang Ge is currently an associate professor at Xi’an Jiaotong University. He received the B.A., M.S., and Ph.D. degrees from Xi’an Jiaotong University in 1999, 2002, and 2009, respectively. His research interests include computer vision, 3D sensing, new display processing, and SoC design.
Jingwen Jiang is currently pursuing a master’s degree at Xi’an Jiaotong University. He received his bachelor’s degree from Sichuan University in 2023. His research interests include image compression and video compression.