Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration
Abstract
Photo-realistic image restoration algorithms are typically evaluated by distortion measures (e.g., PSNR, SSIM) and by perceptual quality measures (e.g., FID, NIQE), where the desire is to attain the lowest possible distortion without compromising on perceptual quality. To achieve this goal, current methods typically attempt to sample from the posterior distribution, or to optimize a weighted sum of a distortion loss (e.g., MSE) and a perceptual quality loss (e.g., GAN). Unlike previous works, this paper is concerned specifically with the optimal estimator that minimizes the MSE under a constraint of perfect perceptual index, namely where the distribution of the reconstructed images is equal to that of the ground-truth ones. A recent theoretical result shows that such an estimator can be constructed by optimally transporting the posterior mean prediction (MMSE estimate) to the distribution of the ground-truth images. Inspired by this result, we introduce Posterior-Mean Rectified Flow (PMRF), a simple yet highly effective algorithm that approximates this optimal estimator. In particular, PMRF first predicts the posterior mean, and then transports the result to a high-quality image using a rectified flow model that approximates the desired optimal transport map. We investigate the theoretical utility of PMRF and demonstrate that it consistently outperforms previous methods on a variety of image restoration tasks. Our codes are available at https://github.com/ohayonguy/PMRF.
1 Introduction
Photo-realistic image restoration (PIR) is the task of reconstructing visually appealing images from degraded measurements (e.g., noisy, blurry). This is a long-standing research problem with diverse applications in mobile photography, surveillance, remote sensing, medical imaging, and more. PIR algorithms are commonly evaluated by distortion measures (e.g., PSNR, SSIM (Wang et al., 2004), LPIPS (Zhang et al., 2018)), which quantify some type of discrepancy between the reconstructed images and the ground-truth ones, and by perceptual quality measures (e.g., FID (Heusel et al., 2017), KID (Bińkowski et al., 2018), NIQE (Mittal et al., 2013), NIMA (Talebi & Milanfar, 2018)), which are intended to predict the extent to which the reconstructions would look natural to human observers. Since distortion and perceptual quality are typically at odds with each other (Blau & Michaeli, 2018), the core challenge in PIR is to achieve minimal distortion without sacrificing perceptual quality.
A common way to approach this task is through posterior sampling (Daras et al., 2024; Kawar et al., 2021a; b; 2022; Man et al., 2023; Ohayon et al., 2021; Bendel et al., 2023; Chung et al., 2023; Wang et al., 2023a; Song et al., 2023; Zhu et al., 2023; Saharia et al., 2023; Murata et al., 2023; Saharia et al., 2022). Specifically, letting $X$ and $Y$ denote the random vectors corresponding to the ground-truth image and its degraded measurement, respectively, posterior sampling generates a reconstruction $\hat{X}$ by sampling from $p_{X|Y}$ (such that $p_{\hat{X}|Y}=p_{X|Y}$). This solution is appealing as it theoretically guarantees a perfect perceptual index ($p_{\hat{X}}=p_{X}$), where the perceptual index of $\hat{X}$ is formally defined as the statistical divergence between $p_{\hat{X}}$ and $p_{X}$. Interestingly, however, the Mean Squared Error (MSE) that this solution achieves is not the minimal possible under the perfect perceptual index constraint. Indeed, the MSE achieved by posterior sampling is precisely twice the Minimum MSE (MMSE) that can be achieved without a constraint on the perceptual index (Blau & Michaeli, 2018), whereas the minimal MSE achievable under a perfect perceptual index constraint is typically strictly smaller (Blau & Michaeli, 2018; Freirich et al., 2021), as illustrated in Figure 1. Throughout this paper, we denote by $\hat{X}_0$ the estimator that minimizes the MSE under a perfect perceptual quality constraint. Its formal definition is provided in Section 2.2.
Table 1: Blind face image restoration results on the CelebA-Test benchmark (see Section 5.1.2). Perceptual quality measures: FID, KID, NIQE, Precision. Distortion measures: PSNR, SSIM, LPIPS, Deg, LMD.
Method | FID | KID | NIQE | Precision | PSNR | SSIM | LPIPS | Deg | LMD
DOT | 100.2 | 0.0914 | 6.462 | 0.1600 | 21.32 | 0.6636 | 0.4756 | 43.87 | 2.876
RestoreFormer++ | 41.15 | 0.0290 | 4.187 | 0.6877 | 25.31 | 0.6703 | 0.3441 | 29.63 | 2.043
RestoreFormer | 42.30 | 0.0301 | 4.405 | 0.7010 | 24.62 | 0.6460 | 0.3655 | 32.13 | 2.299
CodeFormer | 53.16 | 0.0425 | 4.649 | 0.6940 | 25.15 | 0.6700 | 0.3432 | 37.28 | 2.470
VQFRv1 | 41.79 | 0.0297 | 3.693 | 0.6593 | 24.07 | 0.6446 | 0.3515 | 35.75 | 2.429
VQFRv2 | 46.77 | 0.0346 | 4.169 | 0.6590 | 23.23 | 0.6412 | 0.3624 | 44.38 | 3.053
GFPGAN | 46.72 | 0.0350 | 4.415 | 0.6970 | 24.99 | 0.6774 | 0.3643 | 36.05 | 2.443
DiffBIR | 59.06 | 0.0509 | 6.084 | 0.5643 | 25.39 | 0.6536 | 0.3878 | 32.94 | 2.006
DifFace | 38.43 | 0.0258 | 4.288 | 0.7413 | 24.80 | 0.6726 | 0.3999 | 45.79 | 2.965
BFRffusion | 41.53 | 0.0301 | 4.966 | 0.6623 | 26.21 | 0.6917 | 0.3619 | 30.98 | 1.992
PMRF (Ours) | 37.46 | 0.0257 | 4.118 | 0.7073 | 26.37 | 0.7073 | 0.3470 | 30.67 | 2.030
Another common way to solve PIR tasks is to train a model by minimizing a weighted sum of a distortion loss (e.g., MSE) and a GAN loss (Goodfellow et al., 2014; Gu et al., 2022; Wang et al., 2021; 2023b; 2022; Zhou et al., 2022; Yang et al., 2021; Ledig et al., 2017; Wang et al., 2018; Zhang et al., 2021; Wang et al., ). As explained by Blau & Michaeli (2018), this is a principled way to traverse the distortion-perception tradeoff, where the GAN loss coefficient acts as a Lagrange multiplier that controls the desired perceptual index. Thus, in principle, one can approximate $\hat{X}_0$ by selecting a sufficiently large such coefficient. Despite the elegance of this approach, diffusion methods that aim for posterior sampling tend to perform better in practice, both in terms of distortion and perceptual quality (see Table 1), implying that current GAN-based methods fail to approximate $\hat{X}_0$. Such a shortcoming can be partially attributed to the fact that GANs are extremely difficult to optimize, especially when the GAN loss coefficient is significantly larger than that of the distortion loss.
In this paper, we propose Posterior-Mean Rectified Flow (PMRF), a straightforward framework to directly approximate $\hat{X}_0$. Interestingly, Freirich et al. (2021) proved that $\hat{X}_0$ can be constructed by first predicting the posterior mean $\hat{X}^* := \mathbb{E}[X|Y]$, and then optimally transporting the result to the ground-truth image distribution (see Section 2.2 for a formal explanation). Motivated by this result, PMRF first approximates the posterior mean by using a model that minimizes the MSE between the reconstructed outputs and the ground-truth images. Then, we train a rectified flow model (Liu et al., 2023) to predict the direction of the straight path between corresponding pairs of posterior mean predictions and ground-truth images. Given a degraded measurement at test time, PMRF solves an ODE using such a flow model, with the posterior mean prediction set as the initial condition. As we explain in Section 3, PMRF approximates the desired estimator $\hat{X}_0$, aiming for a solution that minimizes the MSE under a perfect perceptual index constraint.
Our paper is organized as follows. In Section 2 we provide the necessary background and set mathematical notations. In Section 3 we describe our proposed method, and provide intuition via theoretical results and a toy example with closed-form solutions. In Section 4 we discuss related work. In Section 5 we demonstrate the utility of PMRF on a variety of face image restoration tasks, including denoising, super-resolution, inpainting, colorization, and blind restoration. We show that PMRF sets a new state-of-the-art on several benchmarks in the challenging blind face image restoration task, and is either on-par or outperforms previous frameworks in the rest of the tasks. Finally, in Section 6 we conclude our work and discuss its limitations.
2 Background
We adopt the Bayesian perspective for solving inverse problems (Davison, 2003; Kaipio & Somersalo, 2005), where a natural image $x$ is regarded as a realization of a random vector $X$ with probability density function $p_X$. The degraded measurement $y$ (e.g., a noisy or low-resolution image) is a realization of a random vector $Y$, which is related to $X$ via the conditional probability density function $p_{Y|X}$. Given a degraded measurement $y$, an image restoration algorithm generates a prediction $\hat{x}$ by sampling from $p_{\hat{X}|Y}(\cdot|y)$, such that $\hat{X}$ adheres to the Markov chain $X\rightarrow Y\rightarrow\hat{X}$ (i.e., $X$ and $\hat{X}$ are statistically independent given $Y$).
2.1 Distortion and perceptual index
Image restoration algorithms are typically evaluated by their average distortion $\mathbb{E}[\Delta(X,\hat{X})]$, where $\Delta(x,\hat{x})$ is some distortion measure that quantifies the discrepancy between $x$ and $\hat{x}$, and the expectation is taken over the joint distribution $p_{X,\hat{X}}$. Common examples for $\Delta(x,\hat{x})$ are the absolute error $\|x-\hat{x}\|_1$, the squared error $\|x-\hat{x}\|^2$, and LPIPS (Zhang et al., 2018). Moreover, as the goal in PIR is to produce reconstructions that would look natural to humans, PIR algorithms are also evaluated by perceptual quality measures. The ideal way to evaluate perceptual quality is to assess the ability of humans to distinguish between samples of ground-truth images and samples of reconstructed ones. This is typically done by conducting experiments where human observers vote on whether the generated images are real or fake (Isola et al., 2017; Zhang et al., 2016; Salimans et al., 2016; Denton et al., 2015; Dahl et al., 2017; Iizuka et al., 2016; Zhang et al., 2017; Guadarrama et al., 2017). However, such experiments are too costly and impractical for optimizing models. A practical and sensible alternative to quantify the perceptual quality is via some perceptual index $d(p_X,p_{\hat{X}})$, where $d(\cdot,\cdot)$ is a statistical divergence between probability distributions (e.g., Kullback–Leibler, Wasserstein) (Blau & Michaeli, 2018). Quantifying the perceptual index for high-dimensional distributions is both statistically and computationally intractable, so it is common to resort to approximations. Popular examples include the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Kernel Inception Distance (KID) (Bińkowski et al., 2018).
2.2 Optimal estimators for the squared error distortion
Due to the distortion-perception tradeoff (Blau & Michaeli, 2018), it has become common practice to compare image restoration algorithms on the distortion-perception plane, where the goal is to obtain optimal estimators with the lowest possible distortion given a prescribed level of perceptual index. This goal can be formalized by the distortion-perception function (Blau & Michaeli, 2018),
$$D(P)=\min_{p_{\hat{X}|Y}}\mathbb{E}[\Delta(X,\hat{X})]\quad\text{s.t.}\quad d(p_X,p_{\hat{X}})\leq P. \tag{1}$$
Perhaps the most common points of interest on $D(P)$ are $D(\infty)$ and $D(0)$, where the first point corresponds to the estimator achieving minimal average distortion under no constraint, and the second corresponds to the estimator achieving minimal average distortion under a perfect perceptual index constraint. Considering the squared error distortion, these points are defined by
$$D(\infty)=\min_{p_{\hat{X}|Y}}\mathbb{E}[\|X-\hat{X}\|^2], \tag{2}$$
$$D(0)=\min_{p_{\hat{X}|Y}}\mathbb{E}[\|X-\hat{X}\|^2]\quad\text{s.t.}\quad p_{\hat{X}}=p_X, \tag{3}$$
respectively. It is well-known that the unique solution to Problem (2) is the posterior mean $\hat{X}^* := \mathbb{E}[X|Y]$, which typically produces overly-smooth reconstructions (Blau & Michaeli, 2018). Therefore, in PIR tasks, it is more appropriate to aim for the solution to Problem (3). Interestingly, Freirich et al. (2021) proved that a solution to Problem (3) can be obtained by solving the optimal transport problem
$$p_{U,V}\in\arg\min_{p_{U',V'}\in\Pi(p_X,p_{\hat{X}^*})}\mathbb{E}[\|U'-V'\|^2], \tag{4}$$
where $\Pi(p_X,p_{\hat{X}^*})$ is the set of all joint probabilities $p_{U',V'}$ with marginals $p_{U'}=p_X$ and $p_{V'}=p_{\hat{X}^*}$. Namely, the optimal solution to Problem (3) can be constructed as follows: Given a degraded measurement $y$, first predict the posterior mean $\hat{x}^*=\mathbb{E}[X|Y=y]$, and then sample from $p_{U|V}(\cdot|\hat{x}^*)$, which is the optimal transport plan from $p_{\hat{X}^*}$ to $p_X$. Similarly to Freirich et al. (2021), we denote such a solution to Problem (3) by $\hat{X}_0$.
As discussed before, one of the most common and appealing solutions for PIR tasks is the estimator $\hat{X}$ that samples from the posterior distribution $p_{X|Y}$, such that $p_{\hat{X}|Y}=p_{X|Y}$. While such an estimator always attains a perfect perceptual index (Blau & Michaeli, 2018), its MSE is typically larger than that of $\hat{X}_0$ (Blau & Michaeli, 2018; Freirich et al., 2021) (see Figure 1). In other words, to design an algorithm with minimal MSE under a perfect perceptual index constraint, one should often not resort to posterior sampling, but rather to solving Problem (3). This is our goal in this paper. Lastly, one may wonder whether sampling from $p_{X|\hat{X}^*}$, instead of using the optimal transport plan from Equation 4, may also be effective in terms of MSE. However, in Section A.1 we prove that such an approach leads to precisely the same MSE as sampling from the posterior.
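The factor of two between the MSE of posterior sampling and the MMSE can be checked in a few lines. The derivation below is a sketch under notation assumed here: $\hat{X}^*=\mathbb{E}[X|Y]$ is the posterior mean and $\hat{X}$ is a posterior sample, so that $X$ and $\hat{X}$ are independent and identically distributed given $Y$.

```latex
\begin{align*}
\mathbb{E}\big[\|X-\hat{X}\|^2\big]
&= \mathbb{E}\big[\|(X-\hat{X}^*)-(\hat{X}-\hat{X}^*)\|^2\big] \\
&= \mathbb{E}\big[\|X-\hat{X}^*\|^2\big]
 + \mathbb{E}\big[\|\hat{X}-\hat{X}^*\|^2\big]
 - 2\,\mathbb{E}\big[(X-\hat{X}^*)^\top(\hat{X}-\hat{X}^*)\big] \\
&= 2\,\mathbb{E}\big[\|X-\hat{X}^*\|^2\big]
 = 2\,\mathrm{MMSE}.
\end{align*}
```

The cross term vanishes because, conditioned on $Y$, the two factors are independent and $\mathbb{E}[X-\hat{X}^*\,|\,Y]=0$. The first two terms are equal because $\hat{X}$ and $X$ share the same conditional distribution given $Y$.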
2.3 Flow matching and rectified flows
Flow matching.
Flow matching algorithms (Liu et al., 2023; Lipman et al., 2023; Albergo & Vanden-Eijnden, 2023) are generative models defined via the ODE
$$dZ_t=v(Z_t,t)\,dt, \tag{5}$$
where $v$ is often called a vector field, and $Z_t$ is some forward process such that $p_{Z_0}$ is the source distribution, from which we can easily sample (e.g., isotropic Gaussian noise), and $p_{Z_1}$ is the target distribution from which we aim to sample (e.g., natural images). In principle, one can generate samples from the target distribution by solving Equation 5, where samples from the source distribution are set as the initial conditions for the ODE solver. Nevertheless, given a particular forward process $Z_t$, there are possibly many different vector fields $v$ that satisfy Equation 5. The goal in flow matching is to find an appropriate vector field with desirable practical and theoretical properties, e.g., one for which the solution to Equation 5 is unique.
Rectified flow.
Rectified flow (Liu et al., 2023) is a flow matching algorithm defined via the particular forward process
$$Z_t=tZ_1+(1-t)Z_0, \tag{6}$$
which connects samples from $p_{Z_0}$ and $p_{Z_1}$ with straight lines. Here, $Z_0$ and $Z_1$ can be statistically independent, as is typically the case when learning a flow model from Gaussian noise to image data, but they can also have any joint distribution $p_{Z_0,Z_1}$. This forward process clearly adheres to the ODE $dZ_t=(Z_1-Z_0)\,dt$, where $v(Z_t,t)=Z_1-Z_0$ is the corresponding vector field. However, this is not a practical generative model, since it requires knowing the “destination” realization of $Z_1$ at any time step $t$ (i.e., the solution is not causal). To solve this issue, Liu et al. (2023) propose to use instead the vector field
$$v_{\text{RF}}(Z_t,t)=\mathbb{E}[Z_1-Z_0\,|\,Z_t], \tag{7}$$
which is causal, and generates the target distribution if the solution to Equation 5 exists and is unique when adopting such a vector field (Theorem 3.3 in (Liu et al., 2023)). Interestingly, solving the ODE in Equation 5 with $v_{\text{RF}}$ often approximates the optimal transport map from the source distribution to the target one, especially when the process is repeated several times (i.e., reflow) or when $p_{Z_0,Z_1}$ is close to the optimal transport plan between $p_{Z_0}$ and $p_{Z_1}$ (Liu et al., 2023; Tong et al., 2024). To learn $v_{\text{RF}}$, one can simply train a model $v_\theta$ by minimizing the loss
$$\int_0^1\mathbb{E}\big[\|(Z_1-Z_0)-v_\theta(Z_t,t)\|^2\big]\,dt, \tag{8}$$
where the expectation is taken over the joint distribution $p_{Z_0,Z_1}$ (Liu et al., 2023).
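The rectified flow loss can be estimated on a mini-batch by sampling one time step per pair, interpolating along the straight line, and regressing onto the path direction. The following is a minimal NumPy sketch under stated assumptions, not the authors' training code: `rectified_flow_loss`, `v_theta`, and the toy coupling are hypothetical names chosen for illustration, and the time integral is approximated by uniform sampling of `t`.

```python
import numpy as np

def rectified_flow_loss(v_theta, z0, z1, rng):
    """Monte Carlo estimate of the rectified flow loss on one batch.

    v_theta: callable (z_t, t) -> predicted velocity with the shape of z_t
    z0, z1:  arrays of shape (batch, dim) holding paired source/target samples
    """
    batch = z0.shape[0]
    t = rng.uniform(0.0, 1.0, size=(batch, 1))  # one time step per pair
    z_t = t * z1 + (1.0 - t) * z0               # straight-line interpolation
    target = z1 - z0                            # direction of the straight path
    residual = target - v_theta(z_t, t)
    return float(np.mean(np.sum(residual ** 2, axis=1)))

# Toy coupling (an assumption for illustration): the target is a shifted source,
# so the ideal vector field is the constant shift.
rng = np.random.default_rng(0)
z0 = rng.normal(size=(256, 2))
shift = np.array([3.0, -1.0])
z1 = z0 + shift

oracle = lambda z_t, t: np.broadcast_to(shift, z_t.shape)
zero_field = lambda z_t, t: np.zeros_like(z_t)
loss_oracle = rectified_flow_loss(oracle, z0, z1, rng)
loss_zero = rectified_flow_loss(zero_field, z0, z1, rng)
```

With this deterministic coupling the oracle field drives the loss to (numerically) zero, while the zero field pays the squared norm of the shift. In practice `v_theta` would be a neural network trained by gradient descent on this quantity.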
3 Posterior-Mean Rectified Flow
We now describe our proposed algorithm, which we coin Posterior-Mean Rectified Flow (PMRF) (Section 3). Our method consists of two simple training stages. First, we train a model $f_\omega$ to predict the posterior mean by minimizing the MSE loss,
$$\omega^*=\arg\min_\omega\mathbb{E}\big[\|X-f_\omega(Y)\|^2\big]. \tag{9}$$
Note that this training stage can often be skipped, whenever there exists an off-the-shelf algorithm that attains sufficiently small MSE (high PSNR) in the desired restoration task. In the second stage, we train a rectified flow model $v_\theta$ (a vector field) to solve
$$\theta^*=\arg\min_\theta\int_0^1\mathbb{E}\big[\|(X-Z_0)-v_\theta(Z_t,t)\|^2\big]\,dt, \tag{10}$$
where $Z_t=tX+(1-t)Z_0$. Here, $Z_0=f_{\omega^*}(Y)+\sigma_s\epsilon$, where $\epsilon\sim\mathcal{N}(0,I)$ is statistically independent of $Y$ and $X$, and $\sigma_s$ is a hyper-parameter that controls the level of the Gaussian noise added to the posterior mean prediction. As shown by Albergo et al. (2023), adding such noise is critical when the source and target distributions lie on low and high dimensional manifolds, respectively. Specifically, it alleviates the singularities resulting from learning a deterministic mapping between such distributions. Note, however, that adding noise to $f_{\omega^*}(Y)$ may harm the MSE of the reconstructions produced by PMRF, and so $\sigma_s$ should be taken to be sufficiently small.
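At test time, the two trained models are combined by solving the flow ODE from the (noisy) posterior mean prediction. The sketch below illustrates this with a fixed-step Euler solver, which is an assumption made here for simplicity. The names `pmrf_sample`, `posterior_mean_fn`, `velocity_fn`, `num_steps`, and `sigma_s` are hypothetical stand-ins for the trained posterior mean predictor, the trained vector field, the number of flow steps, and the noise level.

```python
import numpy as np

def pmrf_sample(posterior_mean_fn, velocity_fn, y, num_steps, sigma_s, rng):
    """Euler integration of the flow ODE, started at the posterior mean prediction."""
    x_mmse = posterior_mean_fn(y)                         # stage-1 prediction
    z = x_mmse + sigma_s * rng.normal(size=x_mmse.shape)  # noisy initial condition
    dt = 1.0 / num_steps
    for i in range(num_steps):
        z = z + dt * velocity_fn(z, i * dt)               # one Euler step at time i * dt
    return z

# Toy stand-ins (assumptions for illustration only): a linear "posterior mean"
# and a constant vector field, so the result can be checked by hand.
rng = np.random.default_rng(0)
y = np.array([[2.0, -4.0]])
toy_mean = lambda y: 0.5 * y
toy_velocity = lambda z, t: np.full_like(z, 2.0)
restored = pmrf_sample(toy_mean, toy_velocity, y, num_steps=10, sigma_s=0.0, rng=rng)
```

In this toy setting the flow simply adds the constant velocity integrated over the unit time interval to the posterior mean prediction. Fewer steps trade accuracy of the ODE solution for inference speed.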
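To explain why PMRF approximates the desired estimator $\hat{X}_0$, we prove an important proposition and demonstrate it on a simple example with closed-form solutions. Specifically, let
$$d\hat{Z}_t=v_{\text{RF}}(\hat{Z}_t,t)\,dt,\quad\text{with}\quad\hat{Z}_0=Z_0, \tag{11}$$
be the ODE in PMRF, where $v_{\text{RF}}(Z_t,t)=\mathbb{E}[X-Z_0\,|\,Z_t]$ and $\hat{Z}_t$ is the random vector generated by PMRF at time step $t$. In Section A.2 we prove the following: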
Proposition 1.
Suppose that $Z_0=\hat{X}^*$, and let us assume that the solution of the ODE in Equation 11 exists and is unique. Then,
(a) The PMRF output $\hat{Z}_1$ attains a perfect perceptual index ($p_{\hat{Z}_1}=p_X$).
(b) The MSE of $\hat{Z}_1$ cannot be larger than that of the posterior sampler.
(c) If the distribution of $(Z_1-Z_0)\,|\,Z_t=z_t$ is non-degenerate for almost every $z_t$ and $t$, then the MSE of $\hat{Z}_1$ is strictly smaller than that of the posterior sampler.
Note that assumptions (a) and (b) are the same as the ones in (Liu et al., 2023), so they are not more limiting. Whether assumption (c) holds depends on the nature of the restoration task. For example, if $X$ can be reconstructed from $Y$ with zero error (i.e., $p_{X|Y}(\cdot|y)$ is a Dirac delta function for almost every $y$), then $\hat{X}^*=X$ almost surely and assumption (c) does not hold. Yet, this is not an interesting setting, as the degradation is not invertible in most practical scenarios. To gain intuition into a more common scenario, consider the following example from (Blau & Michaeli, 2018):
Example 1.
Let , where and are statistically independent and . Then, the MSE of is strictly smaller than that of the posterior sampler. Moreover, when , all the assumptions in Proposition 1 hold, and we have almost surely.
See Section A.3 for the proof of Example 1. This example shows that PMRF not only outperforms posterior sampling, but may even coincide with the desired estimator $\hat{X}_0$ in certain cases.
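To make the gap between the three estimators concrete, the sketch below works through a scalar Gaussian denoising problem. The setup is an assumption chosen for illustration (a unit-variance Gaussian signal plus independent Gaussian noise with standard deviation `sigma_n`) and is not claimed to match the exact parameters of Example 1. In one dimension, optimally transporting the posterior mean to the signal distribution reduces to a rescaling, so all three MSEs have closed forms that a Monte Carlo simulation can confirm.

```python
import numpy as np

def gaussian_toy_mse(sigma_n):
    """Closed-form MSEs for Y = X + N with X ~ N(0, 1) and N ~ N(0, sigma_n^2)."""
    var_y = 1.0 + sigma_n ** 2
    mmse = sigma_n ** 2 / var_y          # MSE of the posterior mean Y / var_y
    posterior_sampling = 2.0 * mmse      # posterior sampling pays twice the MMSE
    # The posterior mean has variance 1 / var_y; transporting it to N(0, 1) in 1-D
    # is a rescaling by sqrt(var_y), i.e. the estimator Y / sqrt(var_y).
    transported = 2.0 * (1.0 - 1.0 / np.sqrt(var_y))
    return mmse, transported, posterior_sampling

def monte_carlo_check(sigma_n, n=200_000, seed=0):
    """Empirical MSEs of the same three estimators."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = x + sigma_n * rng.normal(size=n)
    var_y = 1.0 + sigma_n ** 2
    post_mean = y / var_y
    post_sample = post_mean + np.sqrt(sigma_n ** 2 / var_y) * rng.normal(size=n)
    transported = y / np.sqrt(var_y)
    mse = lambda est: float(np.mean((x - est) ** 2))
    return mse(post_mean), mse(transported), mse(post_sample)

closed = gaussian_toy_mse(1.0)
simulated = monte_carlo_check(1.0)
```

For any positive noise level, the transported estimator sits strictly between the posterior mean and the posterior sampler in MSE, while (unlike the posterior mean) its outputs follow the signal distribution.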
4 Related work
Before moving on to demonstrate the effectiveness of our approach, it is instructive to note the difference between our PMRF method and existing techniques that may superficially seem similar.
Diffusion and flow-based posterior samplers.
Diffusion or flow-based image restoration algorithms often attempt to sample from the posterior distribution $p_{X|Y}$ by training a conditional model that takes $Y$ (or some function of $Y$, such as a posterior mean prediction) as an additional input (Zhu et al., 2024; Lin et al., 2024). Some works avoid training a conditional model for each task separately, and rather modify the sampling process of a trained unconditional diffusion model (Kawar et al., 2022; Chung et al., 2023). In Section 5.2 we perform a controlled experiment on various inverse problems, which shows that our PMRF method consistently outperforms posterior samplers with the same architecture.
Flow from degraded image.
Some diffusion/flow models are trained on corresponding pairs of ground-truth images and degraded measurements (Albergo et al., 2023; Delbracio & Milanfar, 2023; Li et al., 2023). In this approach, the idea is to obtain a high-quality image by solving an ODE/SDE with the degraded measurement set as the initial condition. For example, Albergo et al. (2023) trained a rectified flow model whose forward process starts from an up-sampled version of $Y$ that matches the dimensionality of $X$. These algorithms are closely related to PMRF, in the sense that they learn to transport an intermediate signal (instead of pure noise) to the ground-truth image distribution. Yet, they have two critical disadvantages compared to PMRF. First, the flow model’s design is not agnostic to the type of degradation, as the degraded signals can have varying dimensionalities or lie in a different domain than that of the ground-truth images (e.g., in MRI image reconstruction). Thus, the task of the flow model may be harder than necessary, as it needs to translate signals from one domain to another. On the other hand, in PMRF the flow model always operates in the image domain, where the dimensionalities of the source and target signals are the same. Second, the theoretical motivation for flowing from $Y$ is not clear, at least from a reconstruction performance standpoint (e.g., distortion). In contrast, the theoretical motivation underlying PMRF is clear: it approximates $\hat{X}_0$, which achieves the minimal possible MSE under the constraint of perfect perceptual index. As we show in Section 5.2, PMRF always either outperforms or is on-par with the solution that flows from $Y$ (see Figure 4).
Methods that aim for $\hat{X}_0$ directly.
To the best of our knowledge, Deep Optimal Transport (DOT) (Adrai et al., 2023) is the only existing method that, like PMRF, attempts to approximate $\hat{X}_0$ directly. Specifically, DOT approximates the desired optimal transport map (Equation 4) via a linear transformation in the latent space of a variational auto-encoder (VAE) (Kingma & Welling, 2014). This transformation is computed in closed-form using the empirical means and covariances (in latent space) of the source distribution (that of the posterior mean predictions) and the target distribution (that of the ground-truth images), under the assumption that both are Gaussian. This method is computationally efficient, but the use of a VAE imposes a performance ceiling. Moreover, the optimal transport in DOT occurs in latent space and assumes that the source and target distributions are Gaussians, unlike Equation 4 which occurs in pixel space and does not make such an assumption. In contrast, PMRF does not use a VAE, and approximates the optimal transport directly in pixel space. In Section 5 we show that PMRF significantly outperforms DOT (see Figure 4).
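The closed-form linear transformation mentioned for DOT is the standard optimal transport map between two Gaussians under the squared error cost. The sketch below illustrates that generic formula on plain arrays. It is not DOT's implementation: the function names `gaussian_ot_map` and `_sqrtm_psd` and the example covariances are assumptions, and in DOT the statistics would be empirical means and covariances of VAE latents.

```python
import numpy as np

def _sqrtm_psd(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)   # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_ot_map(mu_s, cov_s, mu_t, cov_t):
    """Optimal transport map from N(mu_s, cov_s) to N(mu_t, cov_t).

    Returns the map x -> mu_t + A (x - mu_s) and the symmetric matrix A.
    Assumes cov_s is invertible.
    """
    s_half = _sqrtm_psd(cov_s)
    s_half_inv = np.linalg.inv(s_half)
    a = s_half_inv @ _sqrtm_psd(s_half @ cov_t @ s_half) @ s_half_inv
    transport = lambda x: mu_t + (x - mu_s) @ a.T
    return transport, a

# Hypothetical 2-D statistics, for illustration only.
mu_s, cov_s = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 0.5]])
mu_t, cov_t = np.array([2.0, -1.0]), np.array([[2.0, -0.4], [-0.4, 1.0]])
transport, a = gaussian_ot_map(mu_s, cov_s, mu_t, cov_t)
mapped_cov = a @ cov_s @ a.T
mapped_mean = transport(mu_s)
```

The map sends the source mean to the target mean and the source covariance to the target covariance, which is exact only when both distributions are Gaussian. This is the assumption that PMRF avoids by learning the transport with a flow model.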
5 Experiments
5.1 Blind face image restoration
We train PMRF to solve the challenging blind face image restoration task, and compare its performance with leading methods. As in previous works (e.g., (Wang et al., 2021)), we use the FFHQ data set (Karras et al., 2019) with images of size to train our model. Similarly to previous works, we adopt a complex and random degradation process to synthesize the degraded images,
$$Y=\big[(X\circledast k_\sigma)\downarrow_R+N_\delta\big]_{\text{JPEG}_Q}, \tag{12}$$
where denotes convolution, is a Gaussian blur kernel of size and variance , is bilinear down-sampling by a factor , is white Gaussian noise of variance , and is JPEG compression-decompression with quality factor . Similarly to (Yue & Loy, 2024), we synthesize the degraded images by sampling and uniformly from and , respectively. See Section B.1 for additional implementation details.
5.1.1 Evaluation settings
For evaluation, we consider the common synthetic CelebA-Test benchmark, as well as the real-world data sets LFW-Test (Wang et al., 2021; Huang et al., 2008), WebPhoto-Test (Wang et al., 2021), CelebAdult-Test (Wang et al., 2021), and WIDER-Test (Zhou et al., 2022). CelebA-Test consists of 3,000 high-quality images taken from the test partition of CelebA-HQ (Karras et al., 2018), and the degraded images were synthesized by Wang et al. (2021). For the real-world data sets, the degradations are unknown and there is no access to the clean ground-truth images. We compare our performance with DOT (Adrai et al., 2023) and leading blind face restoration models, including BFRffusion (Chen et al., 2024), DiffBIR (Lin et al., 2024), DifFace (Yue & Loy, 2024), CodeFormer (Zhou et al., 2022), GFPGAN (Wang et al., 2021), VQFRv1 and VQFRv2 (Gu et al., 2022), RestoreFormer and RestoreFormer++ (Wang et al., 2022; 2023b). Notably, these restoration methods also use the degradation model from Equation 12, though the ranges of the degradation parameters differ across methods. The ranges we choose, those from (Yue & Loy, 2024), are the most severe among all the compared methods. For example, the range of we use is , whereas Wang et al. (2021) use . Thus, PMRF attempts to solve a more difficult restoration task than some of the compared methods. In the following experiments, we use flow steps in PMRF (Section 3). Refer to Section B.2 for an evaluation of additional numbers of flow steps, and to Section B.3 for the implementation details of DOT.
5.1.2 Results on CelebA-Test
For the CelebA-Test benchmark, we measure the perceptual quality by FID (Heusel et al., 2017), KID (Bińkowski et al., 2018), NIQE (Mittal et al., 2013), and Precision (Kynkäänniemi et al., 2019), and measure the distortion by the PSNR, SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018). Similarly to previous works (Wang et al., 2021; Gu et al., 2022), we also compute the identity metric Deg (using the embedding angle of ArcFace (Deng et al., 2019)) and the landmark distance LMD. Both of these can be considered as distortion measures, as they quantify some type of discrepancy between each reconstructed image and its ground-truth counterpart.
The results are reported in Table 1. Notably, PMRF outperforms all other methods in FID, KID, PSNR, and SSIM, achieves the second best scores in NIQE, Precision, and Deg, and the third best scores in LPIPS and LMD. Interestingly, no other method attains such a consensus in performance, namely, one where none of the measures is significantly compromised compared to the state-of-the-art. For example, while DifFace achieves the highest Precision, it attains worse LMD, Deg, LPIPS, SSIM, and PSNR compared to the third best method in each of these metrics. This demonstrates that PMRF produces robust reconstructions, in the sense that it does not “over-fit” particular perceptual quality or distortion measures, but rather achieves high performance in all of them simultaneously. Visual results are provided in Figure 2 and in Figure 5 in the appendix.
5.1.3 Results on real-world degraded images
Evaluating the distortion for real-world degraded images is impossible, as there is no access to the ground-truth images. Consequently, previous works conduct only a perceptual quality evaluation (e.g., FID) on real-world data sets such as WIDER-Test and LFW-Test. Yet, high perceptual quality alone is clearly not indicative of reconstruction performance (to attain high perceptual quality, one may simply ignore the inputs and generate samples from $p_X$). Thus, we consider a measure which indicates the Root MSE (RMSE) and allows ranking algorithms according to their (approximate) RMSE, without access to the ground-truth images. Specifically, for any estimator $\hat{X}$ it holds that
$$\mathbb{E}\big[\|X-\hat{X}\|^2\big]\approx\mathbb{E}\big[\|\hat{X}-f(Y)\|^2\big]+c, \tag{13}$$
where $f(Y)$ is an approximation of the true posterior mean predictor $\hat{X}^*$, and $c$ is a constant that does not depend on $\hat{X}$ (see Appendix D for an explanation). Thus, the square root of $\mathbb{E}[\|\hat{X}-f(Y)\|^2]$, which we denote by IndRMSE, indicates the true RMSE. We utilize the posterior mean predictor trained by Yue & Loy (2024) as $f$, and compute the IndRMSE of all the evaluated algorithms on the LFW-Test, WebPhoto-Test, CelebAdult-Test, and WIDER-Test data sets. Importantly, the exact same posterior mean predictor model (and weights) is also used by other methods such as DifFace and DiffBIR, so this is a fair evaluation. As before, we evaluate perceptual quality by FID, KID, NIQE, and Precision. In Figure 3 we provide visual results on inputs from the WIDER-Test data set, and compare the algorithms on a “distortion”-perception plane (IndRMSE vs. FID). DOT is not plotted as it achieves far worse FID compared to other methods. Our algorithm attains the best (smallest) IndRMSE on all data sets, while achieving on-par perceptual quality compared to the state-of-the-art. This indicates that PMRF achieves superior distortion on such real-world data sets, while not compromising perceptual quality. In the appendix, we report the rest of the perceptual quality measures in Tables 7, 10, 9 and 8, provide visual results in Figures 7, 6 and 8, and also report the performance of DOT.
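Computing such an RMSE indicator only requires the reconstructions and the outputs of a posterior mean predictor on the same inputs. The following is a minimal sketch under assumptions made here: the function name `ind_rmse`, the array layout, and the per-pixel averaging are illustrative choices, and the exact normalization used for the reported numbers is not specified in this section.

```python
import numpy as np

def ind_rmse(reconstructions, mmse_predictions):
    """RMSE between reconstructions and posterior mean predictions.

    Both inputs have shape (num_images, ...) and are aligned image by image.
    The squared error is averaged over pixels and then over images (assumed).
    """
    n = reconstructions.shape[0]
    diff = reconstructions.reshape(n, -1) - mmse_predictions.reshape(n, -1)
    per_image_mse = np.mean(diff ** 2, axis=1)
    return float(np.sqrt(np.mean(per_image_mse)))

# Toy arrays standing in for images (assumption for illustration).
rng = np.random.default_rng(0)
mmse_out = rng.uniform(size=(4, 3, 8, 8))   # posterior mean predictor outputs
close_method = mmse_out + 0.1               # reconstructions near the predictor
far_method = mmse_out + 2.0                 # reconstructions far from it
score_close = ind_rmse(close_method, mmse_out)
score_far = ind_rmse(far_method, mmse_out)
```

Because the additive constant in the approximation does not depend on the estimator, ranking methods by this quantity is intended to mirror their ranking by true RMSE, provided the posterior mean predictor is accurate.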
5.2 Comparing PMRF with previous frameworks in controlled experiments
One may wonder whether the performance of PMRF is attributed to the framework itself (Section 3), or rather to the model architecture, the rectified flow training approach, the chosen hyper-parameters, etc. Could we have done better by training a flow to sample from the posterior, or by adopting the approach of (Albergo et al., 2023) and flowing directly from $Y$? Here, we conduct a controlled study where we demonstrate that the high performance of PMRF is indeed attributed to the proposed framework itself (Section 3). Specifically, we consider the image denoising, super-resolution, inpainting, and colorization tasks, where we train PMRF and several baseline methods on the “same grounds”. In each task we train two conditional rectified flow models, where one is conditioned on the degraded measurement $Y$ (we call this method flow conditioned on $Y$), and the other is conditioned on the posterior mean prediction $\hat{X}^*$ (we call this method flow conditioned on $\hat{X}^*$). The first model represents posterior sampling methods, and the second model allows for a fair comparison of model capacity with PMRF (since PMRF is comprised of a posterior mean predictor and a flow model). In fact, theoretically speaking, the second approach achieves precisely the same MSE as the posterior sampler (see Section A.1), and is often used in practice (e.g., in (Lin et al., 2024; Zhu et al., 2024)). In addition, we train an unconditional rectified flow model whose forward process starts from the degraded measurement, up-scaled such that it matches the dimensionality of $X$ (we call this method flow from $Y$). This method represents the frameworks in (Albergo et al., 2023; Li et al., 2023; Delbracio & Milanfar, 2023), which we discuss in Section 4. All of the models are trained with the same hyper-parameters as PMRF, using the same architecture, learning rate, weight decay, number of training epochs, etc.
Moreover, for PMRF and the flow conditioned on $\hat{X}^*$ method, we use the exact same architecture and weights for the posterior mean predictor. To clarify the differences between the mathematical formulations of the baseline methods, in Table 11 in the appendix we summarize the definitions of the training loss and the forward process of all methods. Moreover, in the appendix we provide pseudo-code for the training and inference procedures of the baseline methods. While DOT is not a flow method, we still evaluate its performance as it is related to PMRF.
In Figure 4 we compare the algorithms on the distortion-perception plane (RMSE vs. FID), using the same number of flow steps for each flow algorithm. We clearly see that for the inpainting and colorization tasks, PMRF dominates all other methods, achieving notably smaller RMSE without compromising FID. This demonstrates that PMRF achieves our desired goal, which is to attain low distortion without compromising on perceptual quality. For the denoising task, we observe that PMRF and flow from $Y$ attain similar performance, and both dominate the posterior sampling approaches. We hypothesize that, in some tasks (e.g., denoising), flowing from $Y$ may be as effective as PMRF in terms of approximating $\hat{X}_0$. To support this, we prove in Section C.4 that flowing from $Y$ is optimal in the toy problem in Example 1 (just like PMRF). Yet, our experiments demonstrate that PMRF generally leads to better performance compared to previous frameworks. To assess the effectiveness of each method given different inference time constraints, in Figure 9 in the appendix we vary the number of flow inference steps for each method. Interestingly, we observe that PMRF is still either on-par with or dominates the other methods for any given number of inference steps. These results further demonstrate that the superior performance of PMRF is attributed to our framework itself, rather than to the chosen hyper-parameters. See Appendix C for more details, and refer to Figures 10, 11, 12 and 13 in the appendix for visual comparisons.
6 Conclusion and limitations
The goal in this paper was to design an algorithm that directly approximates $\hat{X}_0$, which is the estimator that minimizes the MSE under a perfect perceptual index constraint (Equation 3). To achieve this goal, we introduced Posterior-Mean Rectified Flow (PMRF), a simple yet highly effective image restoration algorithm that outperforms previous frameworks (e.g., posterior sampling, flow from $Y$, and GAN-based methods) in a variety of image restoration tasks. As we explained in Section 3, PMRF alleviates the issues resulting from solving the ODE by adding Gaussian noise to the posterior mean predictions. We note that the noise level should be carefully tuned, as taking it to be too large or too small may cause the MSE or the perceptual quality of PMRF to degrade, respectively. While the flow from $Y$ method (Appendix D) suffers from the same limitation (and, unlike PMRF, does not provide a theoretical guarantee on the MSE), this may be considered a disadvantage of PMRF compared to posterior sampling methods (e.g., Appendix D), which do not require such a hyper-parameter. Moreover, we proved in Proposition 1 that, under some conditions, PMRF is guaranteed to achieve a smaller MSE than the posterior sampler. However, as in (Liu et al., 2023), one could argue that the assumptions in Proposition 1 may be too limiting in some cases. Finally, we did not provide experiments on general-content images, as this requires training significantly larger models (Crowson et al., 2024). However, we believe that our experiments demonstrate the strength and potential of the PMRF approach, as we showcased its superiority on five different tasks, including the highly challenging blind face image restoration problem.
Reproducibility statement
Our code is available at https://github.com/ohayonguy/PMRF. We provide all the explanations and checkpoints necessary to reproduce our results, including training, inference, and the computation of the distortion and perceptual quality measures in Section 5. In addition to our code, the paper discloses all the implementation details required to reproduce the results, including architecture details and training hyper-parameters. Refer to Sections 5.1 and 5.2 and Appendices B and C for implementation details, and to Table 12 in the appendix for a summary of our training hyper-parameters.
References
- Adrai et al. (2023) Theo Adrai, Guy Ohayon, Michael Elad, and Tomer Michaeli. Deep optimal transport: A practical algorithm for photo-realistic image restoration. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 61777–61791. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/c281c5a17ad2e55e1ac1ca825071f991-Paper-Conference.pdf.
- Albergo et al. (2023) Michael S. Albergo, Mark Goldstein, Nicholas M. Boffi, Rajesh Ranganath, and Eric Vanden-Eijnden. Stochastic interpolants with data-dependent couplings. arXiv, 2023. URL https://arxiv.org/abs/2310.03725.
- Albergo & Vanden-Eijnden (2023) Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=li7qeBbCR1t.
- Bendel et al. (2023) Matthew C Bendel, Rizwan Ahmad, and Philip Schniter. A regularized conditional GAN for posterior sampling in image recovery problems. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=z4vKRmq7UO.
- Bińkowski et al. (2018) Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1lUOzWCW.
- Blau & Michaeli (2018) Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- Chen et al. (2024) Xiaoxu Chen, Jingfan Tan, Tao Wang, Kaihao Zhang, Wenhan Luo, and Xiaochun Cao. Towards real-world blind face restoration with generative diffusion prior. IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2024. doi: 10.1109/TCSVT.2024.3383659.
- Chung et al. (2023) Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OnD9zGAGT0k.
- Crowson et al. (2024) Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=WRIn2HmtBS.
- Dahl et al. (2017) Ryan Dahl, Mohammad Norouzi, and Jonathon Shlens. Pixel recursive super resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- Daras et al. (2024) Giannis Daras, Hyungjin Chung, Chieh-Hsin Lai, Yuki Mitsufuji, Peyman Milanfar, Alexandros G. Dimakis, Chul Ye, and Mauricio Delbracio. A survey on diffusion models for inverse problems. 2024. URL https://giannisdaras.github.io/publications/diffusion_survey.pdf.
- Davison (2003) A. C. Davison. Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2003. doi: 10.1017/CBO9780511815850.
- Delbracio & Milanfar (2023) Mauricio Delbracio and Peyman Milanfar. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=VmyFF5lL3F. Featured Certification.
- Deng et al. (2019) Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4685–4694, 2019. doi: 10.1109/CVPR.2019.00482.
- Denton et al. (2015) Emily L Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/aa169b49b583a2b5af89203c2b78c67c-Paper.pdf.
- Freirich et al. (2021) Dror Freirich, Tomer Michaeli, and Ron Meir. A theory of the distortion-perception tradeoff in wasserstein space. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 25661–25672. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/d77e68596c15c53c2a33ad143739902d-Paper.pdf.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
- Gu et al. (2022) Yuchao Gu, Xintao Wang, Liangbin Xie, Chao Dong, Gen Li, Ying Shan, and Ming-Ming Cheng. Vqfr: Blind face restoration with vector-quantized dictionary and parallel decoder. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVIII, pp. 126–143, Berlin, Heidelberg, 2022. Springer-Verlag. ISBN 978-3-031-19796-3. doi: 10.1007/978-3-031-19797-0_8. URL https://doi.org/10.1007/978-3-031-19797-0_8.
- Guadarrama et al. (2017) Sergio Guadarrama, Ryan Dahl, David Bieber, Jonathon Shlens, Mohammad Norouzi, and Kevin Murphy. Pixcolor: Pixel recursive colorization. In British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4-7, 2017. BMVA Press, 2017. URL https://www.dropbox.com/s/wmnk861irndf8xe/0447.pdf.
- Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/8a1d694707eb0fefe65871369074926d-Paper.pdf.
- Huang et al. (2008) Gary B. Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. In Workshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, October 2008. Erik Learned-Miller and Andras Ferencz and Frédéric Jurie. URL https://inria.hal.science/inria-00321923.
- Iizuka et al. (2016) Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (Proc. of SIGGRAPH 2016), 35(4), 2016.
- Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
- Kaipio & Somersalo (2005) Jari Kaipio and Erkki Somersalo. Statistical and Computational Inverse Problems. Springer, Dordrecht, 2005. doi: 10.1007/b138659. URL https://cds.cern.ch/record/1338003.
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.
- Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4396–4405, 2019. doi: 10.1109/CVPR.2019.00453.
- Kawar et al. (2021a) Bahjat Kawar, Gregory Vaksman, and Michael Elad. Stochastic image denoising by sampling from the posterior distribution. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 1866–1875, 2021a. doi: 10.1109/ICCVW54120.2021.00213.
- Kawar et al. (2021b) Bahjat Kawar, Gregory Vaksman, and Michael Elad. Snips: Solving noisy inverse problems stochastically. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 21757–21769. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/b5c01503041b70d41d80e3dbe31bbd8c-Paper.pdf.
- Kawar et al. (2022) Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.
- Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6114.
- Kynkäänniemi et al. (2019) Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/0234c510bc6d908b28c70ff313743079-Paper.pdf.
- Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 105–114, 2017. doi: 10.1109/CVPR.2017.19.
- Li et al. (2023) Bo Li, Kaitao Xue, Bin Liu, and Yu-Kun Lai. Bbdm: Image-to-image translation with brownian bridge diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1952–1961, 2023.
- Liang et al. (2021) Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp. 1833–1844, October 2021.
- Lin et al. (2024) Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, and Chao Dong. Diffbir: Towards blind image restoration with generative diffusion prior. arXiv, 2024. URL https://arxiv.org/abs/2308.15070.
- Lipman et al. (2023) Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t.
- Liu et al. (2023) Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z.
- Loshchilov & Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- Man et al. (2023) Sean Man, Guy Ohayon, Theo Adrai, and Michael Elad. High-perceptual quality jpeg decoding via posterior sampling. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1272–1282, 2023. doi: 10.1109/CVPRW59228.2023.00134.
- Mittal et al. (2013) Anish Mittal, Rajiv Soundararajan, and Alan C. Bovik. Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2013. doi: 10.1109/LSP.2012.2227726.
- Murata et al. (2023) Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, and Stefano Ermon. GibbsDDRM: A partially collapsed Gibbs sampler for solving blind inverse problems with denoising diffusion restoration. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 25501–25522. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/murata23a.html.
- Ohayon et al. (2021) Guy Ohayon, Theo Adrai, Gregory Vaksman, Michael Elad, and Peyman Milanfar. High perceptual quality image denoising with a posterior sampling cgan. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 1805–1813, 2021. doi: 10.1109/ICCVW54120.2021.00207.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022.
- Saharia et al. (2022) Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, SIGGRAPH ’22, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393379. doi: 10.1145/3528233.3530757. URL https://doi.org/10.1145/3528233.3530757.
- Saharia et al. (2023) Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2023. doi: 10.1109/TPAMI.2022.3204461.
- Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved techniques for training gans. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/8a3363abe792db2d8761d6403605aeb7-Paper.pdf.
- Song et al. (2023) Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9_gsMA8MRKQ.
- Talebi & Milanfar (2018) Hossein Talebi and Peyman Milanfar. Nima: Neural image assessment. IEEE Transactions on Image Processing, 27(8):3998–4011, 2018. doi: 10.1109/TIP.2018.2831899.
- Tong et al. (2024) Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=CD9Snc73AW. Expert Certification.
- (50) Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In International Conference on Computer Vision Workshops (ICCVW).
- Wang et al. (2018) Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In The European Conference on Computer Vision Workshops (ECCVW), September 2018.
- Wang et al. (2021) Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. Towards real-world blind face restoration with generative facial prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- Wang et al. (2023a) Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. The Eleventh International Conference on Learning Representations, 2023a.
- Wang et al. (2004) Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861.
- Wang et al. (2022) Zhouxia Wang, Jiawei Zhang, Runjian Chen, Wenping Wang, and Ping Luo. Restoreformer: High-quality blind face restoration from undegraded key-value pairs. 2022.
- Wang et al. (2023b) Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, and Ping Luo. Restoreformer++: Towards real-world blind face restoration from undegraded key-value pairs. 2023b.
- Yang et al. (2021) Tao Yang, Peiran Ren, Xuansong Xie, and Lei Zhang. Gan prior embedded network for blind face restoration in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- Yue & Loy (2024) Zongsheng Yue and Chen Change Loy. Difface: Blind face restoration with diffused error contraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–15, 2024. doi: 10.1109/TPAMI.2024.3432651.
- Zhang et al. (2021) Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE International Conference on Computer Vision, pp. 4791–4800, 2021.
- Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
- Zhang et al. (2017) Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG), 9(4), 2017.
- Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
- Zhou et al. (2022) Shangchen Zhou, Kelvin C.K. Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. In NeurIPS, 2022.
- Zhu et al. (2024) Yixuan Zhu, Wenliang Zhao, Ao Li, Yansong Tang, Jie Zhou, and Jiwen Lu. Flowie: Efficient image enhancement via rectified flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13–22, June 2024.
- Zhu et al. (2023) Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, and Luc Van Gool. Denoising diffusion models for plug-and-play image restoration. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (NTIRE), 2023.
Appendix A Supplementary explanations for PMRF
A.1 Proof that conditioning on the posterior mean achieves the same MSE as posterior sampling
Proposition 2.
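Let $\hat{X}$ be the estimator which, given any degraded measurement $y$, first predicts the posterior mean $\hat{x}^{*}=\mathbb{E}[X|Y=y]$ and then samples from $p_{X|\hat{X}^{*}}(\cdot|\hat{x}^{*})$. (Note that $\hat{X}$ is a “posterior sampler” which is conditioned on $\hat{X}^{*}$. Thus, Appendix D represents such an algorithm, which is one of the baseline methods we evaluate in Section 5.2.) Then, the MSE of $\hat{X}$ equals twice the MMSE, which is the MSE attained by the posterior sampler.
Proof.
The MSE of $\hat{X}$ is given by
$$\mathbb{E}\left[\|X-\hat{X}\|^{2}\right]=\mathbb{E}\left[\|X-\hat{X}^{*}\|^{2}\right]+\mathbb{E}\left[\|\hat{X}^{*}-\hat{X}\|^{2}\right], \tag{14}$$
where this equality follows from Lemma 2 in (Freirich et al., 2021) (Appendix B.1). By the definition of $\hat{X}$ we have $p_{\hat{X}|\hat{X}^{*}}=p_{X|\hat{X}^{*}}$, so
$$\mathbb{E}\left[\|\hat{X}^{*}-\hat{X}\|^{2}\right]=\mathbb{E}\left[\|\hat{X}^{*}-X\|^{2}\right]. \tag{15}$$
Substituting this result into Equation 14, we get
$$\mathbb{E}\left[\|X-\hat{X}\|^{2}\right]=2\,\mathbb{E}\left[\|X-\hat{X}^{*}\|^{2}\right]. \tag{16}$$
Namely, $\hat{X}$ attains precisely the same MSE as the posterior sampler, which is equal to twice the MMSE (Blau & Michaeli, 2018). Thus, in theory, one should not expect to improve the MSE of a conditional diffusion/flow model by supplying $\hat{X}^{*}$ as a condition instead of $Y$. ∎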
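As an illustrative sanity check of Proposition 2, the following minimal sketch computes the relevant MSEs exactly on a small discrete toy problem. The toy distribution (`XS`, `P_Y`, `POST`) and all function names are hypothetical and chosen only for illustration; they are not taken from the paper or its code. Two of the three measurements share the same posterior mean, so conditioning on the posterior mean genuinely discards information, yet the resulting sampler still attains twice the MMSE.

```python
from fractions import Fraction as F

# Hypothetical toy problem: X takes values in XS, Y takes three values.
XS = [-1, 0, 1]
P_Y = [F(1, 4), F(1, 4), F(1, 2)]          # p(y)
POST = [                                    # p(x | y), one row per y
    [F(1, 2), F(0), F(1, 2)],               # posterior mean 0
    [F(0), F(1), F(0)],                     # posterior mean 0 as well
    [F(1, 10), F(2, 10), F(7, 10)],         # posterior mean 3/5
]

def mean(p):
    """Mean of a distribution p over XS."""
    return sum(pi * x for pi, x in zip(p, XS))

def mmse():
    """MSE of the posterior mean, i.e. the expected posterior variance."""
    return sum(py * sum(pi * (x - mean(p)) ** 2 for pi, x in zip(p, XS))
               for py, p in zip(P_Y, POST))

def sampler_mse(cond):
    """MSE of an estimator that, given y, samples from cond[y] independently of X."""
    return sum(py * pi * qj * (x - xh) ** 2
               for py, p, q in zip(P_Y, POST, cond)
               for pi, x in zip(p, XS)
               for qj, xh in zip(q, XS))

def posterior_given_mean():
    """For each y, the distribution p(x | posterior mean of y)."""
    groups = {}
    for py, p in zip(P_Y, POST):
        groups.setdefault(mean(p), []).append((py, p))
    merged = {}
    for m, members in groups.items():
        total = sum(py for py, _ in members)
        merged[m] = [sum(py * p[i] for py, p in members) / total
                     for i in range(len(XS))]
    return [merged[mean(p)] for p in POST]

print("MMSE:", mmse())
print("posterior sampler MSE:", sampler_mse(POST))
print("sampler conditioned on posterior mean:", sampler_mse(posterior_given_mean()))
```

Exact rational arithmetic is used so that the equality can be checked without numerical tolerance.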
A.2 Proof of Proposition 1
Below we provide the proof of Proposition 1, whose statement appears in the main text.
Proof.
We first prove (a) and (b) assuming that the solution for the ODE in Equation 11 exists and is unique for . Then, we will prove (c) by also assuming that the distribution of is non-degenerate for almost every and .
From Theorem 3.3 in (Liu et al., 2023) we have for every . This implies that , i.e., PMRF attains a perfect perceptual index when . This proves (a).
Next, without additional assumptions, we will prove (b) by showing that
(17) |
which will imply that the MSE of can only be smaller than that of the posterior sampler. Since , we have . Following similar arguments to those in the proof of Theorem 3.5 in (Liu et al., 2023), it holds that
(18) | ||||
(19) | ||||
(20) | ||||
(21) | ||||
(22) | ||||
(23) | ||||
(24) | ||||
(25) |
where Equation 18 follows from the definition of and , Equation 19 follows from the fact that , Equation 20 follows from Jensen’s inequality, Equation 21 follows from the definition of , Equation 22 follows from Jensen’s inequality, Equation 23 follows from the linearity of the integral operator, and Equation 24 follows from the law of total expectation. Thus, we have . Combining this result with Lemma 2 from (Freirich et al., 2021) (Appendix B.1), we conclude that
(26) |
where the left hand side is the MSE of PMRF, and the right hand side is the MSE of the posterior sampler, which always equals twice the MMSE (Blau & Michaeli, 2018).
Finally, to prove (c), let us further assume that is a non-degenerate random vector for every and . Thus, the inequality in Equation 22 becomes strict (from Jensen’s inequality for strictly convex functions), and hence we have . Combining this result with Lemma 2 from (Freirich et al., 2021) (Appendix B.1), we conclude that
(27) |
Namely, the MSE of (left hand side) is strictly smaller than that of the posterior sampler (right hand side). ∎
A.3 Proof of the results in Example 1
From (Freirich et al., 2021; Blau & Michaeli, 2018), we know that in Example 1 attains a MSE that is strictly smaller than that of the posterior sampler (assuming that ). Specifically, the closed-form solution of in Example 1 is given by (Freirich et al., 2021):
(28) |
Moreover, in this example, it is well known that the posterior mean is given by
(29) |
Next, we will prove that:
(a) All the assumptions in Proposition 1 hold.
(b) The output of PMRF coincides with the desired optimal estimator almost surely.
Proof of (a).
Since , we have and . Below, we show that
(30) | |||
(31) |
Since and are jointly Gaussian (both can be written as a linear transformation of , which are jointly Gaussian random variables), we have
(32) | ||||
(33) |
where Equation 32 follows from the fact that and . One can verify that the solution of for any initial condition is unique and is given by
(34) |
To show that the distribution of is non-degenerate for almost every and , note that
(35) |
Thus, for any , and assuming , the correlation between and is given by
(36) |
Namely, the correlation between and is strictly smaller than for every . Moreover, for the correlation between and clearly equals zero, so such a correlation is smaller than 1 for every . This implies that the distribution of is non-degenerate for almost every and , and so all the assumptions in Proposition 1 hold.
Proof of (b).
The proof follows directly from Equation 34. Specifically, for the initial condition , we have
(39) | ||||
(40) |
Thus, in Example 1, PMRF with coincides with the desired optimal estimator .
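To make the MSE gap discussed in this section concrete, here is a minimal numerical sketch. It assumes that the toy problem is the standard scalar Gaussian denoising setup, $X\sim\mathcal{N}(0,1)$ and $Y=X+N$ with independent $N\sim\mathcal{N}(0,\sigma_N^2)$; this parameterization, like the function name `toy_mse`, is an assumption made for illustration, since the definition of Example 1 is not reproduced in this appendix. Under that assumption the posterior mean is $Y/(1+\sigma_N^2)$, and rescaling it to unit variance gives $Y/\sqrt{1+\sigma_N^2}$, which is the optimal transport map between two zero-mean Gaussians.

```python
import math

def toy_mse(sigma_n):
    """Closed-form MSEs in the assumed scalar Gaussian toy problem."""
    var_y = 1.0 + sigma_n ** 2              # Var(Y) when Var(X) = 1

    def mse(gain):
        # E[(X - gain * Y)^2] = 1 - 2 * gain * Cov(X, Y) + gain^2 * Var(Y),
        # with Cov(X, Y) = 1 in this setup.
        return 1.0 - 2.0 * gain + gain ** 2 * var_y

    mmse = mse(1.0 / var_y)                 # posterior mean estimator
    return {
        "mmse": mmse,
        "posterior_sampler": 2.0 * mmse,    # always twice the MMSE
        "optimal_perceptual": mse(1.0 / math.sqrt(var_y)),
    }

for s in (0.1, 0.5, 1.0, 3.0):
    print(s, toy_mse(s))
```

For every positive noise level the three values are ordered as MMSE, then the perfect-perceptual-quality optimum, then the posterior sampler, which matches the strict inequality stated above.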
A.4 Reflow (optional)
To potentially improve the MSE of PMRF further, one may conduct a reflow procedure (Liu et al., 2023), where a sequence of flow models are trained, and the flow model at index learns to flow from the source distribution to the distribution generated by the flow model at index . Specifically, let be the random vector generated by PMRF (Section 3), where replaces the role of in Section 3 and ( remains unchanged). Thus, from Theorem 3.5 in (Liu et al., 2023), we have , which implies the reflowing may only improve the MSE of PMRF, and hence improve the approximation of the desired optimal transport map (Equation 4). We leave this possibility for future work.
Appendix B Supplementary details and experiments in blind face image restoration
Unfortunately, we do not compare with FlowIE (Zhu et al., 2024), as the checkpoints in the official repository of this method did not appear to work at the time of writing. Note that FlowIE is a conditional method that utilizes a ControlNet (similarly to DiffBIR), so it differs substantially from our PMRF algorithm.
B.1 Implementation details of PMRF
During training, we only use random horizontal flips for data augmentation. We use the SwinIR (Liang et al., 2021) model trained by Yue & Loy (2024) as the posterior mean predictor in Section 3, and use . This model was trained using the same synthetic degradation as in Equation 12, with the same ranges for , , , and we mentioned in Section 5.1. The SwinIR model’s weights are kept frozen during the vector field’s training stage, and the same weights are utilized during inference as well. The vector field is a HDiT model (Crowson et al., 2024), which we train from scratch. As in (Crowson et al., 2024), we sample uniformly from using a stratified sampling strategy. The vector field is trained for 3850 epochs using the AdamW optimizer (Loshchilov & Hutter, 2019), with a learning rate of , , and a weight decay of (as in (Crowson et al., 2024)). In the last 350 epochs, we reduce the learning rate gradually, multiplying it by at the end of every epoch. The training batch size is set to 256 and is kept fixed. We compute the exponential moving average (EMA) of the model’s weights, using a decay of . The EMA weights of the model are then used in all evaluations. Our model is trained using bfloat16 mixed precision. A summary of the vector field training hyper-parameters is provided in Table 12.
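The stratified sampling of the flow time mentioned above can be sketched as follows. This is a generic illustration, not the authors' implementation: the function name `stratified_times` is hypothetical, and the sampling interval is assumed to be $[0,1)$ because the actual interval is not reproduced in this text. The idea is to split the interval into as many equal-width bins as there are samples in the batch and draw one uniform sample per bin, which reduces the variance of the training loss compared to fully independent draws.

```python
import random

def stratified_times(batch_size, rng):
    """Draw one time per equal-width bin of [0, 1) (assumed interval)."""
    # Bin i covers [i / batch_size, (i + 1) / batch_size); a uniform offset
    # inside the bin keeps each marginal sample uniform on [0, 1).
    return [(i + rng.random()) / batch_size for i in range(batch_size)]

rng = random.Random(0)
print(stratified_times(4, rng))
```

In practice the resulting times would typically be shuffled or paired with a shuffled batch, so that no training example is systematically tied to a specific bin.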
B.2 Varying the number of flow steps in PMRF
B.3 Details of DOT
We use the official code of DOT (Adrai et al., 2023) as provided by the authors. This method performs optimal transport between the source and target distributions in latent space, using the closed-form solution for the optimal transport map between two Gaussians. As in (Adrai et al., 2023), we use the VAE (Kingma & Welling, 2014) of stable-diffusion (Rombach et al., 2022). For computing the latent empirical mean and covariance of the target distribution, we provide to the code the first 1000 images from FFHQ, with images of size (the default is 100 images, so using 1000 images instead ensures that the performance of DOT is not compromised, as explained by Adrai et al. (2023)). For computing the latent empirical mean and covariance of the source distribution, we randomly synthesize degraded images according to Equation 12 from the first 1000 images in FFHQ, and reconstruct each image using the SwinIR model with the pre-trained weights from (Yue & Loy, 2024) (the same weights we use in PMRF). Given a degraded image at test time, the code of Adrai et al. (2023) first predicts the posterior mean using the SwinIR model, encodes it to latent space, optimally transports the result using the pre-computed empirical means and covariances, and finally uses the decoder to obtain the reconstructed image.
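For reference, the closed-form optimal transport map between two Gaussians that this paragraph refers to is the standard result below. The symbols $\mu_s,\Sigma_s$ (source mean and covariance) and $\mu_t,\Sigma_t$ (target mean and covariance) are notation introduced here for illustration, and $\Sigma_s$ is assumed to be invertible.

```latex
% Optimal transport map (squared Euclidean cost) from N(\mu_s, \Sigma_s) to N(\mu_t, \Sigma_t)
T(z) = \mu_t + A\,(z - \mu_s),
\qquad
A = \Sigma_s^{-1/2}\left(\Sigma_s^{1/2}\,\Sigma_t\,\Sigma_s^{1/2}\right)^{1/2}\Sigma_s^{-1/2}.
```

In the scalar case this reduces to $T(z)=\mu_t+(\sigma_t/\sigma_s)(z-\mu_s)$, i.e., a shift and a rescaling that match the first two moments of the target.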
B.4 Computation of FID, KID, and Precision
For each data set and algorithm, the FID, KID, and Precision are computed between the entire FFHQ training set, and the reconstructed images produced for the degraded images in the test data set (as in previous works). For example, for the evaluations on the CelebA-Test data, this means that the FID is computed between the 70,000 FFHQ images, and the 3,000 CelebA-Test reconstructed images.
Appendix C Supplementary details on Section 5.2
C.1 Degradations
The degraded images in each task in the controlled experiments are synthesized according to the following degradations:
1. Denoising: We apply additive white Gaussian noise with standard deviation .
2. Super-resolution: We use the bicubic down-sampling operator, and add Gaussian noise with standard deviation .
3. Inpainting: We randomly mask of the pixels in the ground-truth image, and add Gaussian noise with standard deviation .
4. Colorization: We average the color channels in the ground-truth image (with a weight of for each color channel), and add Gaussian noise with standard deviation .
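The degradations listed above can be sketched roughly as follows. This is a simplified illustration on nested Python lists (height × width × channels), not the authors' pipeline. All function names are hypothetical, the noise level and masking probability are left as parameters because their values are not reproduced here, masking is assumed to be independent per pixel, the channel weights are assumed to be equal, and bicubic super-resolution is omitted because it requires an image-processing library.

```python
import random

def add_gaussian_noise(img, sigma, rng):
    """Add i.i.d. Gaussian noise with standard deviation sigma to every value."""
    return [[[v + rng.gauss(0.0, sigma) for v in px] for px in row] for row in img]

def mask_pixels(img, mask_prob, rng):
    """Zero out each pixel independently with probability mask_prob (assumption)."""
    return [[[0.0] * len(px) if rng.random() < mask_prob else list(px)
             for px in row] for row in img]

def average_channels(img):
    """Collapse the color channels into one channel using equal weights (assumption)."""
    return [[[sum(px) / len(px)] for px in row] for row in img]

def degrade(img, task, sigma, rng, mask_prob=0.5):
    """Apply the task-specific operator, then add Gaussian noise."""
    if task == "denoising":
        out = [[list(px) for px in row] for row in img]
    elif task == "inpainting":
        out = mask_pixels(img, mask_prob, rng)
    elif task == "colorization":
        out = average_channels(img)
    else:
        raise ValueError(f"unsupported task: {task}")
    return add_gaussian_noise(out, sigma, rng)

rng = random.Random(0)
image = [[[0.2, 0.4, 0.6], [1.0, 1.0, 1.0]]]   # a 1 x 2 RGB image
print(degrade(image, "colorization", sigma=0.05, rng=rng))
```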
C.2 Implementation details of the flow methods
Training.
For all restoration tasks in Section 5.2, the models are trained on the FFHQ data set with images of size (we down-sample the original images to ). Unlike in the blind face image restoration experiments, where the model is trained on images of size , here we choose to use a smaller image resolution to save computational resources and achieve shorter training times. During training, we only use random horizontal flips for data augmentation.
Choice of .
As expected, we observe that using in both PMRF (Section 3) and the flow from method (Appendix D) leads to blurry results with small MSE and large FID. Thus, for a fair comparison, we use the same value of in both methods. For the denoising task we use , and for the rest of the tasks (inpainting, colorization, and super-resolution), we use . Note that the “optimal” value of depends on the severity of the restoration task. For example, in a mild image denoising task, the posterior mean may already be close to the ground-truth image, so should be smaller compared to a case where the noise is severe.
Posterior mean predictor.
The posterior mean predictor is a SwinIR model with 4.4M parameters, which we train from scratch for each task. (We use the official code for the SwinIR architecture from https://github.com/JingyunLiang/SwinIR. Implementation details and hyper-parameters are provided in our code.) In all tasks, this model is trained for 1000 epochs, with a fixed batch size of 256, using the AdamW optimizer with a learning rate of , , without weight decay, and without learning rate scheduling. When utilizing this model in the flow process (e.g., in PMRF), we use the EMA weights computed with a decay of .
Vector field.
Similarly to Section B.1, the vector field is a HDiT model. The time in Sections 3, D, D and D is sampled from using a stratified sampling strategy. For all baseline methods and PMRF, we train the vector field for 1000 epochs, use a fixed batch size of 256, adopt the AdamW optimizer with a learning rate of , , and a weight decay of . As in (Crowson et al., 2024), we do not apply learning rate scheduling. Finally, we use the EMA weights for evaluation, using a decay of . A summary of the hyper-parameters is provided in Table 12.
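The exponential moving average (EMA) of the weights used for evaluation can be sketched as below. This is a generic illustration on a flat list of parameters, not the authors' training code; the function name `ema_update` is hypothetical, and the decay value in the usage line is a placeholder because the actual value is not reproduced in this text.

```python
def ema_update(ema_params, params, decay):
    """One EMA step: move each averaged weight slightly towards the current weight."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Usage: call once after every optimizer step, and evaluate with the EMA weights.
ema = [0.0, 2.0]
ema = ema_update(ema, [1.0, 2.0], decay=0.9)   # placeholder decay for illustration
print(ema)
```

A decay closer to one averages over a longer training history, which usually yields smoother evaluation weights at the cost of reacting more slowly to recent updates.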
Evaluation.
We test all models on the CelebA-Test data set, with images of size . The FID of each method is computed between the entire FFHQ training set, and the images produced by the algorithm for the synthesized CelebA-Test degraded images. The MSE is computed between the reconstructed images and the corresponding ground-truth images.
C.3 Details of DOT
We use DOT (Adrai et al., 2023) similarly to Section B.3, using images of size instead of , and adopting the official codes of the authors. For the source distribution, we randomly synthesize degraded images according to the degradation of each task (Section C.1) from the first 1000 images in FFHQ, reconstruct each image using the SwinIR model we trained for each task (the same weights we use in PMRF), and finally compute the empirical mean and covariance of the reconstructions in latent space.
C.4 Proving that flow from $Y$ is also optimal in Example 1
In Section 5.2 we show that, for the denoising task, PMRF and flow from $Y$ are on par in terms of both perceptual quality and MSE. To provide intuition for this result, we show that flow from $Y$ leads to the desired estimator in Example 1 (just like PMRF does).
Specifically, as in Example 1, suppose that , , , and . In flow from with we have , and thus . Below, we show that
(41)
(42)
Hence,
(43)
(44)
where Equation 43 holds since and . One can verify that the solution of for any initial condition is given by
(45)
Namely, we have
(46)
where the last equality follows from Equation 28. It follows that flow from Y is also optimal in Example 1, just like PMRF.
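The conclusion above can be checked numerically. The sketch below assumes a scalar Gaussian setting, with the source signal drawn from a standard normal distribution and the measurement obtained by adding independent Gaussian noise of standard deviation `sigma_n`, and a flow that starts directly at the noisy measurement with no extra noise added. Under these assumptions the velocity field has a closed form (derived by us from Gaussian conditioning), and Euler integration of the ODE should approach a rescaling of the measurement by 1/sqrt(1 + sigma_n^2), whose output has unit variance like the source. All names here are hypothetical.

```python
import numpy as np

def velocity(z, t, sigma_n):
    """E[X - Y | Z_t = z] for Z_t = t*X + (1-t)*Y under the assumed Gaussian model."""
    s2 = sigma_n ** 2
    return -(1.0 - t) * s2 * z / (1.0 + (1.0 - t) ** 2 * s2)

def euler_flow(y, sigma_n, num_steps):
    """Integrate dz/dt = velocity(z, t) from t=0 (z=y) to t=1 with Euler steps."""
    z = np.array(y, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        z = z + dt * velocity(z, i * dt, sigma_n)
    return z

sigma_n = 0.5
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                 # source signal, X ~ N(0, 1)
y = x + sigma_n * rng.normal(size=x.shape)   # noisy measurement, Y = X + N

x_hat = euler_flow(y, sigma_n, num_steps=1000)
closed_form = y / np.sqrt(1.0 + sigma_n ** 2)  # rescaling that matches the source variance
```

With enough Euler steps, `x_hat` agrees with `closed_form` up to discretization error, and its empirical variance is close to one.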
Appendix D Indicator RMSE (IndRMSE) derivation
The MSE of any estimator can always be written as
(49)
(50)
where is the MMSE estimator, Equation 49 follows from Lemma 2 in (Freirich et al., 2021) (Appendix B.1), and is some constant that does not depend on . Thus, if , we have
(51)
so may be used as an indicator for . Future work should investigate the effectiveness of this measure.
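To make the idea concrete, the following sketch computes an indicator of this kind as the RMSE between an estimator's outputs and the outputs of a posterior mean predictor, and checks the underlying MSE decomposition on a toy scalar Gaussian problem where the exact posterior mean is known. The function name `ind_rmse`, the averaging over all entries, and the toy model are assumptions made for illustration; the normalization used for the reported numbers may differ.

```python
import numpy as np

def ind_rmse(reconstructions, posterior_means):
    """RMSE between reconstructions and (approximate) posterior-mean predictions."""
    diff = (np.asarray(reconstructions, dtype=np.float64)
            - np.asarray(posterior_means, dtype=np.float64))
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy check: X ~ N(0, 1), Y = X + N with N ~ N(0, sigma_n^2) independent of X.
sigma_n = 0.5
rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
y = x + sigma_n * rng.normal(size=x.shape)

post_mean = y / (1.0 + sigma_n ** 2)      # exact MMSE estimator in this toy model
x_hat = y / np.sqrt(1.0 + sigma_n ** 2)   # some other estimator that depends only on y

mse_hat = np.mean((x - x_hat) ** 2)       # MSE of the other estimator
mse_mmse = np.mean((x - post_mean) ** 2)  # MSE of the posterior mean (the constant term)
gap = ind_rmse(x_hat, post_mean) ** 2     # squared indicator
```

In this toy problem `mse_hat` matches `mse_mmse + gap` up to sampling error, so ranking estimators by the indicator ranks them by MSE. The indicator is zero by construction when the estimator is the posterior mean predictor itself.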
Columns FID through Precision measure perceptual quality; columns PSNR through LMD measure distortion.
| | FID | KID | NIQE | Precision | PSNR | SSIM | LPIPS | Deg | LMD |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 81.81 | 0.0811 | 8.9012 | 0.2820 | 27.668 | 0.7669 | 0.3582 | 31.41 | 2.0340 |
| 5 | 63.77 | 0.0581 | 7.4568 | 0.4563 | 27.498 | 0.7601 | 0.3401 | 30.80 | 2.0294 |
| 10 | 44.39 | 0.0342 | 5.2648 | 0.6427 | 27.017 | 0.7388 | 0.3314 | 30.49 | 2.0215 |
| 25 | 37.46 | 0.0257 | 4.1179 | 0.7073 | 26.373 | 0.7073 | 0.3470 | 30.67 | 2.0303 |
| 50 | 36.63 | 0.0244 | 3.8492 | 0.7050 | 26.028 | 0.6896 | 0.3591 | 30.89 | 2.0409 |
| 100 | 36.57 | 0.0240 | 3.7311 | 0.7010 | 25.810 | 0.6787 | 0.3662 | 31.06 | 2.0409 |
| | FID | KID | NIQE | Precision | IndRMSE |
---|---|---|---|---|---|
3 | 78.2331 | 0.0692 | 8.2315 | 0.3477 | 3.3934 |
5 | 64.3121 | 0.0524 | 6.8733 | 0.5143 | 3.8008 |
10 | 51.9845 | 0.0387 | 4.9896 | 0.6546 | 4.8648 |
25 | 49.3151 | 0.0366 | 4.0028 | 0.6692 | 6.1382 |
50 | 49.5581 | 0.0375 | 3.7126 | 0.6826 | 6.7960 |
100 | 49.6561 | 0.0377 | 3.6242 | 0.6710 | 7.2004 |
| | FID | KID | NIQE | Precision | IndRMSE |
---|---|---|---|---|---|
3 | 85.0361 | 0.0704 | 9.9988 | 0.2742 | 5.3486 |
5 | 65.2563 | 0.0451 | 8.4650 | 0.5381 | 5.7665 |
10 | 42.5002 | 0.0179 | 5.5677 | 0.7144 | 7.1134 |
25 | 41.2685 | 0.0160 | 4.0726 | 0.7144 | 9.2164 |
50 | 41.4446 | 0.0174 | 3.6953 | 0.6845 | 10.3403 |
100 | 42.9437 | 0.0183 | 3.5704 | 0.6907 | 11.0674 |
| | FID | KID | NIQE | Precision | IndRMSE |
---|---|---|---|---|---|
3 | 128.7858 | 0.0996 | 9.1626 | 0.3907 | 3.2961 |
5 | 113.4734 | 0.0782 | 7.5893 | 0.5553 | 3.7371 |
10 | 91.3677 | 0.0484 | 5.4199 | 0.6413 | 4.8369 |
25 | 81.0642 | 0.0347 | 4.2402 | 0.6462 | 6.3098 |
50 | 78.7174 | 0.0324 | 3.9512 | 0.6265 | 7.0159 |
100 | 79.1239 | 0.0313 | 3.7990 | 0.5602 | 7.6887 |
| | FID | KID | NIQE | Precision | IndRMSE |
---|---|---|---|---|---|
3 | 122.8780 | 0.0551 | 6.6818 | 0.3944 | 3.7339 |
5 | 113.7837 | 0.0426 | 5.5810 | 0.4444 | 4.3313 |
10 | 105.7426 | 0.0319 | 4.4119 | 0.6111 | 5.4908 |
25 | 102.8914 | 0.0293 | 3.7367 | 0.5500 | 6.7145 |
50 | 102.1454 | 0.0276 | 3.5609 | 0.6278 | 7.3004 |
100 | 102.0568 | 0.0279 | 3.4878 | 0.5944 | 7.7286 |
Method | FID | KID | NIQE | Precision | IndRMSE |
---|---|---|---|---|---|
SwinIR (Posterior mean) | 87.34 | 0.0808 | 8.595 | 0.2513 | 0 |
DOT | 97.09 | 0.0891 | 5.705 | 0.1806 | 26.24 |
RestoreFormer++ | 50.80 | 0.0386 | 3.911 | 0.6330 | 9.429 |
RestoreFormer | 49.04 | 0.0355 | 4.168 | 0.6674 | 12.21 |
CodeFormer | 52.82 | 0.0387 | 4.484 | 0.6756 | 9.534 |
VQFRv1 | 51.31 | 0.0399 | 3.590 | 0.6014 | 11.26 |
VQFRv2 | 51.16 | 0.0378 | 3.761 | 0.6154 | 16.15 |
GFPGAN | 47.59 | 0.0308 | 4.554 | 0.6400 | 9.842 |
DiffBIR | 40.97 | 0.0234 | 5.738 | 0.5804 | 9.105 |
DifFace | 46.48 | 0.0329 | 4.024 | 0.7411 | 11.33 |
BFRffusion | 50.93 | 0.0377 | 4.963 | 0.6850 | 7.210 |
PMRF (Ours) | 49.32 | 0.0366 | 4.003 | 0.6692 | 6.138 |
Method | FID | KID | NIQE | Precision | IndRMSE |
---|---|---|---|---|---|
SwinIR (Posterior mean) | 91.96 | 0.0780 | 10.16 | 0.1649 | 0 |
DOT | 82.15 | 0.0618 | 7.633 | 0.4082 | 14.900 |
RestoreFormer++ | 45.41 | 0.0209 | 3.759 | 0.6505 | 14.466 |
RestoreFormer | 50.23 | 0.0251 | 3.894 | 0.6505 | 14.200 |
CodeFormer | 39.27 | 0.0138 | 4.164 | 0.7227 | 12.185 |
VQFRv1 | 44.21 | 0.0192 | 3.055 | 0.5959 | 17.042 |
VQFRv2 | 38.70 | 0.0157 | 3.995 | 0.6381 | 16.368 |
GFPGAN | 41.28 | 0.0182 | 4.450 | 0.7876 | 11.840 |
DiffBIR | 35.87 | 0.0114 | 5.659 | 0.6361 | 11.106 |
DifFace | 37.38 | 0.0131 | 4.383 | 0.7856 | 10.418 |
BFRffusion | 56.82 | 0.0307 | 4.647 | 0.5825 | 11.759 |
PMRF (Ours) | 41.27 | 0.0160 | 4.073 | 0.7144 | 9.2164 |
Method | FID | KID | NIQE | Precision | IndRMSE |
---|---|---|---|---|---|
SwinIR (Posterior mean) | 132.1 | 0.1022 | 9.638 | 0.2383 | 0 |
DOT | 125.6 | 0.0865 | 7.397 | 0.3071 | 20.69 |
RestoreFormer++ | 75.60 | 0.0291 | 4.080 | 0.6143 | 18.43 |
RestoreFormer | 77.80 | 0.0334 | 4.460 | 0.6265 | 11.55 |
CodeFormer | 84.17 | 0.0406 | 4.709 | 0.6830 | 8.952 |
VQFRv1 | 75.57 | 0.0312 | 3.608 | 0.5774 | 12.53 |
VQFRv2 | 83.52 | 0.0411 | 4.620 | 0.5848 | 14.48 |
GFPGAN | 88.43 | 0.0494 | 4.941 | 0.6781 | 9.240 |
DiffBIR | 92.82 | 0.0541 | 6.069 | 0.5307 | 9.152 |
DifFace | 80.05 | 0.0341 | 4.405 | 0.7273 | 10.31 |
BFRffusion | 84.83 | 0.0388 | 5.612 | 0.5872 | 7.222 |
PMRF (Ours) | 81.06 | 0.0347 | 4.240 | 0.6462 | 6.310 |
Method | FID | KID | NIQE | Precision | IndRMSE |
---|---|---|---|---|---|
SwinIR (Posterior mean) | 143.80 | 0.0811 | 7.477 | 0.4222 | 0 |
DOT | 208.54 | 0.1634 | 6.018 | 0.0444 | 44.24 |
RestoreFormer++ | 103.81 | 0.0313 | 4.006 | 0.5167 | 11.43 |
RestoreFormer | 103.96 | 0.0315 | 4.320 | 0.5556 | 14.97 |
CodeFormer | 111.62 | 0.0427 | 4.544 | 0.5722 | 10.49 |
VQFRv1 | 105.59 | 0.0336 | 3.756 | 0.5944 | 11.14 |
VQFRv2 | 104.72 | 0.0337 | 3.999 | 0.6056 | 18.51 |
GFPGAN | 109.19 | 0.0395 | 4.423 | 0.5111 | 11.90 |
DiffBIR | 109.74 | 0.0411 | 5.650 | 0.5000 | 9.853 |
DifFace | 98.780 | 0.0243 | 3.901 | 0.6833 | 12.66 |
BFRffusion | 103.06 | 0.0290 | 4.702 | 0.6056 | 8.037 |
PMRF (Ours) | 102.89 | 0.0293 | 3.737 | 0.5500 | 6.715 |
| | Forward process | Flow training loss |
---|---|---|
PMRF (Ours) | ||
Flow cond. on | ||
Flow cond. on | ||
Flow from | ||
| Hyper-parameter | | |
|---|---|---|
| Parameters | 160M | 121M |
| Training Epochs | 3850 | 1000 |
| Batch Size | 256 | 256 |
| Image Size | 512×512 | 256×256 |
| Precision | bfloat16 mixed | bfloat16 mixed |
| Training Hardware | 16 A100 40GiB | 4 L40 48GiB |
| Training Time | 12 days | 2.5 days |
| Patch Size | 4 | 4 |
| Levels (Local + Global Attention) | 2 + 1 | 1 + 1 |
| Depth | (2,2,8) | (2,11) |
| Widths | (256,512,1024) | (384,768) |
| Attention Heads (Width / Head Dim) | (4,8,16) | (6,12) |
| Attention Head Dim | 64 | 64 |
| Neighborhood Kernel Size | 7 | 7 |
| Mapping Depth | 1 | 1 |
| Mapping Width | 768 | 768 |
| Optimizer | AdamW | AdamW |
| Learning Rate | | |
| Learning Rate Scheduler | Multi-step, last 350 epochs | Not applied |
| AdamW betas | (0.9, 0.95) | (0.9, 0.95) |
| AdamW eps | | |
| Weight Decay | | |
| EMA Decay | 0.9999 | 0.9999 |