
Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration

Guy Ohayon, Tomer Michaeli, Michael Elad
Technion–Israel Institute of Technology
{ohayonguy@cs,tomer.m@ee,elad@cs}.technion.ac.il
Abstract

Photo-realistic image restoration algorithms are typically evaluated by distortion measures (e.g., PSNR, SSIM) and by perceptual quality measures (e.g., FID, NIQE), where the desire is to attain the lowest possible distortion without compromising on perceptual quality. To achieve this goal, current methods typically attempt to sample from the posterior distribution, or to optimize a weighted sum of a distortion loss (e.g., MSE) and a perceptual quality loss (e.g., GAN). Unlike previous works, this paper is concerned specifically with the optimal estimator that minimizes the MSE under a constraint of perfect perceptual index, namely where the distribution of the reconstructed images is equal to that of the ground-truth ones. A recent theoretical result shows that such an estimator can be constructed by optimally transporting the posterior mean prediction (MMSE estimate) to the distribution of the ground-truth images. Inspired by this result, we introduce Posterior-Mean Rectified Flow (PMRF), a simple yet highly effective algorithm that approximates this optimal estimator. In particular, PMRF first predicts the posterior mean, and then transports the result to a high-quality image using a rectified flow model that approximates the desired optimal transport map. We investigate the theoretical utility of PMRF and demonstrate that it consistently outperforms previous methods on a variety of image restoration tasks. Our code is available at https://github.com/ohayonguy/PMRF.

1 Introduction

Figure 1: Illustration of the distortion-perception tradeoff, where distortion is measured by MSE. Many photo-realistic image restoration methods aim for posterior sampling. Theoretically, this approach achieves a perfect perceptual index ($p_{\hat{X}}=p_{X}$) but its MSE is twice the MMSE. In contrast, we aim for the estimator $\hat{X}_{0}$ that minimizes the MSE under a perfect perceptual index constraint (Eq. (3)), which typically achieves a smaller MSE than posterior sampling.

Photo-realistic image restoration (PIR) is the task of reconstructing visually appealing images from degraded measurements (e.g., noisy, blurry). This is a long-standing research problem with diverse applications in mobile photography, surveillance, remote sensing, medical imaging, and more. PIR algorithms are commonly evaluated by distortion measures (e.g., PSNR, SSIM (Wang et al., 2004), LPIPS (Zhang et al., 2018)), which quantify some type of discrepancy between the reconstructed images and the ground-truth ones, and by perceptual quality measures (e.g., FID (Heusel et al., 2017), KID (Bińkowski et al., 2018), NIQE (Mittal et al., 2013), NIMA (Talebi & Milanfar, 2018)), which are intended to predict the extent to which the reconstructions would look natural to human observers. Since distortion and perceptual quality are typically at odds with each other (Blau & Michaeli, 2018), the core challenge in PIR is to achieve minimal distortion without sacrificing perceptual quality.

Figure 2: Visual results of PMRF (our method) on the CelebA-Test blind face image restoration data set. Our algorithm produces sharp and visually appealing details while maintaining incredibly low distortion according to a variety of measures simultaneously. See Table 1.

A common way to approach this task is through posterior sampling (Daras et al., 2024; Kawar et al., 2021a; 2022; b; Man et al., 2023; Ohayon et al., 2021; Bendel et al., 2023; Chung et al., 2023; Wang et al., 2023a; Song et al., 2023; Zhu et al., 2023; Saharia et al., 2023; Murata et al., 2023; Saharia et al., 2022). Specifically, letting $X$ and $Y$ denote the random vectors corresponding to the ground-truth image and its degraded measurement, respectively, posterior sampling generates a reconstruction $\hat{X}$ by sampling from $p_{X|Y}$ (such that $p_{\hat{X}|Y}=p_{X|Y}$). This solution is appealing as it theoretically guarantees a perfect perceptual index ($p_{\hat{X}}=p_{X}$), where, formally, the perceptual index of $\hat{X}$ is defined as the statistical divergence between $p_{\hat{X}}$ and $p_{X}$. Interestingly, however, the Mean Squared Error (MSE) that this solution achieves is not the minimal possible under the perfect perceptual index constraint. Indeed, the MSE achieved by posterior sampling is precisely twice the Minimum MSE (MMSE) that can be achieved without a constraint on the perceptual index (Blau & Michaeli, 2018). This is while the minimal MSE achievable under a perfect perceptual index constraint is typically strictly smaller (Blau & Michaeli, 2018; Freirich et al., 2021), as illustrated in Figure 1. Throughout this paper, we denote by $\hat{X}_{0}$ the estimator that minimizes the MSE under a perfect perceptual quality constraint. Its formal definition is provided in Section 2.2.
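The factor of two follows from a short derivation, sketched here under two assumptions: that $\hat{X}$ is drawn from $p_{X|Y}$ independently of $X$ given $Y$, and that $\hat{X}^{*}=\mathbb{E}[X|Y]$ denotes the posterior mean. Conditioned on $Y$, both $X-\hat{X}^{*}$ and $\hat{X}-\hat{X}^{*}$ have zero mean and are independent, so the cross term vanishes, and the two remaining terms are equal because $X$ and $\hat{X}$ share the same conditional distribution:

```latex
\begin{aligned}
\mathbb{E}\big[\lVert X-\hat{X}\rVert^{2}\big]
&= \mathbb{E}\big[\lVert X-\hat{X}^{*}\rVert^{2}\big]
 + \mathbb{E}\big[\lVert \hat{X}-\hat{X}^{*}\rVert^{2}\big]
 - 2\,\mathbb{E}\big[\langle X-\hat{X}^{*},\,\hat{X}-\hat{X}^{*}\rangle\big] \\
&= \mathbb{E}\big[\lVert X-\hat{X}^{*}\rVert^{2}\big]
 + \mathbb{E}\big[\lVert X-\hat{X}^{*}\rVert^{2}\big] - 0
 = 2\,\mathrm{MMSE}.
\end{aligned}
```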

Table 1: Quantitative evaluation of state-of-the-art blind face image restoration algorithms on the CelebA-Test benchmark. Red, blue and green indicate the best, the second best, and the third best scores, respectively. Our method achieves the best FID, KID, PSNR and SSIM, and the second or third best scores in the rest of the perceptual quality and distortion measures. A visual comparison is provided in Figure 2 and Figure 5 in the appendix.
Perceptual quality measures: FID, KID, NIQE, Precision. Distortion measures: PSNR, SSIM, LPIPS, Deg, LMD. (↓: lower is better; ↑: higher is better.)

| Method | FID↓ | KID↓ | NIQE↓ | Precision↑ | PSNR↑ | SSIM↑ | LPIPS↓ | Deg↓ | LMD↓ |
|---|---|---|---|---|---|---|---|---|---|
| DOT | 100.2 | 0.0914 | 6.462 | 0.1600 | 21.32 | 0.6636 | 0.4756 | 43.87 | 2.876 |
| RestoreFormer++ | 41.15 | 0.0290 | 4.187 | 0.6877 | 25.31 | 0.6703 | 0.3441 | 29.63 | 2.043 |
| RestoreFormer | 42.30 | 0.0301 | 4.405 | 0.7010 | 24.62 | 0.6460 | 0.3655 | 32.13 | 2.299 |
| CodeFormer | 53.16 | 0.0425 | 4.649 | 0.6940 | 25.15 | 0.6700 | 0.3432 | 37.28 | 2.470 |
| VQFRv1 | 41.79 | 0.0297 | 3.693 | 0.6593 | 24.07 | 0.6446 | 0.3515 | 35.75 | 2.429 |
| VQFRv2 | 46.77 | 0.0346 | 4.169 | 0.6590 | 23.23 | 0.6412 | 0.3624 | 44.38 | 3.053 |
| GFPGAN | 46.72 | 0.0350 | 4.415 | 0.6970 | 24.99 | 0.6774 | 0.3643 | 36.05 | 2.443 |
| DiffBIR | 59.06 | 0.0509 | 6.084 | 0.5643 | 25.39 | 0.6536 | 0.3878 | 32.94 | 2.006 |
| DifFace | 38.43 | 0.0258 | 4.288 | 0.7413 | 24.80 | 0.6726 | 0.3999 | 45.79 | 2.965 |
| BFRffusion | 41.53 | 0.0301 | 4.966 | 0.6623 | 26.21 | 0.6917 | 0.3619 | 30.98 | 1.992 |
| PMRF (Ours) | 37.46 | 0.0257 | 4.118 | 0.7073 | 26.37 | 0.7073 | 0.3470 | 30.67 | 2.030 |

Another common way to solve PIR tasks is to train a model by minimizing a weighted sum of a distortion loss (e.g., MSE) and a GAN loss (Goodfellow et al., 2014; Gu et al., 2022; Wang et al., 2021; 2023b; 2022; Zhou et al., 2022; Yang et al., 2021; Ledig et al., 2017; Wang et al., 2018; Zhang et al., 2021; Wang et al., ). As explained by Blau & Michaeli (2018), this is a principled way to traverse the distortion-perception tradeoff, where the GAN loss coefficient acts as a Lagrange multiplier that controls the desired perceptual index. Thus, in principle, one can approximate $\hat{X}_{0}$ by choosing a sufficiently large coefficient. Despite the elegance of this approach, diffusion methods that aim for posterior sampling tend to perform better in practice, both in terms of distortion and perceptual quality (see Table 1), implying that current GAN-based methods fail to approximate $\hat{X}_{0}$. Such a shortcoming can be partially attributed to the fact that GANs are extremely difficult to optimize, especially when the GAN loss coefficient is significantly larger than that of the distortion loss.

In this paper, we propose Posterior-Mean Rectified Flow (PMRF), a straightforward framework to directly approximate $\hat{X}_{0}$. Interestingly, Freirich et al. (2021) proved that $\hat{X}_{0}$ can be constructed by first predicting the posterior mean $\hat{X}^{*}\coloneqq\mathbb{E}[X|Y]$, and then optimally transporting the result to the ground-truth image distribution (see Section 2.2 for a formal explanation). Motivated by this result, PMRF first approximates the posterior mean by using a model that minimizes the MSE between the reconstructed outputs and the ground-truth images. Then, we train a rectified flow model (Liu et al., 2023) to predict the direction of the straight path between corresponding pairs of posterior mean predictions and ground-truth images. Given a degraded measurement at test time, PMRF solves an ODE using such a flow model, with the posterior mean prediction set as the initial condition. As we explain in Section 3, PMRF approximates the desired estimator $\hat{X}_{0}$, aiming for a solution that minimizes the MSE under a perfect perceptual index constraint.

Our paper is organized as follows. In Section 2 we provide the necessary background and set the mathematical notation. In Section 3 we describe our proposed method, and provide intuition via theoretical results and a toy example with closed-form solutions. In Section 4 we discuss related work. In Section 5 we demonstrate the utility of PMRF on a variety of face image restoration tasks, including denoising, super-resolution, inpainting, colorization, and blind restoration. We show that PMRF sets a new state-of-the-art on several benchmarks in the challenging blind face image restoration task, and is on par with or outperforms previous frameworks in the remaining tasks. Finally, in Section 6 we conclude our work and discuss its limitations.

2 Background

We adopt the Bayesian perspective for solving inverse problems (Davison, 2003; Kaipio & Somersalo, 2005), where a natural image $x$ is regarded as a realization of a random vector $X$ with probability density function $p_{X}$. The degraded measurement $y$ (e.g., a noisy or low-resolution image) is a realization of a random vector $Y$, which is related to $X$ via the conditional probability density function $p_{Y|X}$. Given a degraded measurement $y$, an image restoration algorithm generates a prediction $\hat{x}$ by sampling from $p_{\hat{X}|Y}(\cdot|y)$, such that $\hat{X}$ adheres to the Markov chain $X\rightarrow Y\rightarrow\hat{X}$ (i.e., $X$ and $\hat{X}$ are statistically independent given $Y$).

2.1 Distortion and perceptual index

Image restoration algorithms are typically evaluated by their average distortion $\mathbb{E}[\Delta(X,\hat{X})]$, where $\Delta(x,\hat{x})$ is some distortion measure that quantifies the discrepancy between $x$ and $\hat{x}$, and the expectation is taken over the joint distribution $p_{X,\hat{X}}$. Common examples for $\Delta(x,\hat{x})$ are the absolute error $\lVert x-\hat{x}\rVert_{1}$, the squared error $\lVert x-\hat{x}\rVert^{2}$, and LPIPS (Zhang et al., 2018). Moreover, as the goal in PIR is to produce reconstructions that would look natural to humans, PIR algorithms are also evaluated by perceptual quality measures. The ideal way to evaluate perceptual quality is to assess the ability of humans to distinguish between samples of ground-truth images and samples of reconstructed ones. This is typically done by conducting experiments where human observers vote on whether the generated images are real or fake (Isola et al., 2017; Zhang et al., 2016; Salimans et al., 2016; Denton et al., 2015; Dahl et al., 2017; Iizuka et al., 2016; Zhang et al., 2017; Guadarrama et al., 2017). However, such experiments are too costly and impractical for optimizing models. A practical and sensible alternative to quantify the perceptual quality is via some perceptual index $d(p_{X},p_{\hat{X}})$, where $d(\cdot,\cdot)$ is a statistical divergence between probability distributions (e.g., Kullback–Leibler, Wasserstein) (Blau & Michaeli, 2018). Quantifying the perceptual index for high-dimensional distributions is both statistically and computationally intractable, so it is common to resort to approximations. Popular examples include the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Kernel Inception Distance (KID) (Bińkowski et al., 2018).
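For reference, FID fits a Gaussian to the Inception features of the ground-truth images and another to those of the reconstructed images, and then computes the Fréchet distance between the two. In the expression below, $(\mu_{X},\Sigma_{X})$ and $(\mu_{\hat{X}},\Sigma_{\hat{X}})$ are notation introduced here for the feature means and covariances; they are not symbols used elsewhere in this paper:

```latex
\mathrm{FID}
= \lVert \mu_{X}-\mu_{\hat{X}} \rVert^{2}
+ \operatorname{Tr}\!\left(\Sigma_{X}+\Sigma_{\hat{X}}
- 2\big(\Sigma_{X}\Sigma_{\hat{X}}\big)^{1/2}\right)
```

This quantity equals the squared 2-Wasserstein distance between the two fitted Gaussians, which is why it can serve as a tractable proxy for $d(p_{X},p_{\hat{X}})$.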

2.2 Optimal estimators for the squared error distortion

Due to the distortion-perception tradeoff (Blau & Michaeli, 2018), it has become common practice to compare image restoration algorithms on the distortion-perception plane, where the goal is to obtain optimal estimators with the lowest possible distortion given a prescribed level of perceptual index. This goal can be formalized by the distortion-perception function (Blau & Michaeli, 2018),

$$D(P)=\min_{p_{\hat{X}|Y}}\mathbb{E}[\Delta(X,\hat{X})]\quad\text{s.t.}\quad d(p_{X},p_{\hat{X}})\leq P. \tag{1}$$

Perhaps the most common points of interest on $D(P)$ are $D(\infty)$ and $D(0)$, where the first point corresponds to the estimator achieving minimal average distortion under no constraint, and the second corresponds to the estimator achieving minimal average distortion under a perfect perceptual index constraint. Considering the squared error distortion, these points are defined by

$$\min_{p_{\hat{X}|Y}}\mathbb{E}[\lVert X-\hat{X}\rVert^{2}]\quad\text{and} \tag{2}$$
$$\min_{p_{\hat{X}|Y}}\mathbb{E}[\lVert X-\hat{X}\rVert^{2}]\quad\text{s.t.}\quad p_{\hat{X}}=p_{X}, \tag{3}$$

respectively. It is well-known that the unique solution to Problem (2) is the posterior mean $\hat{X}^{*}\coloneqq\mathbb{E}[X|Y]$, which typically produces overly-smooth reconstructions (Blau & Michaeli, 2018). Therefore, in PIR tasks, it is more appropriate to aim for the solution to Problem (3). Interestingly, Freirich et al. (2021) proved that a solution to Problem (3) can be obtained by solving the optimal transport problem

$$p_{U,V}\in\operatorname*{arg\,min}_{p_{U',V'}\in\Pi(p_{X},p_{\hat{X}^{*}})}\mathbb{E}[\lVert U'-V'\rVert^{2}], \tag{4}$$

where $\Pi(p_{X},p_{\hat{X}^{*}})\coloneqq\{p_{U',V'}\,:\,p_{U'}=p_{X},\,p_{V'}=p_{\hat{X}^{*}}\}$ is the set of all joint probabilities $p_{U',V'}$ with marginals $p_{U'}=p_{X}$ and $p_{V'}=p_{\hat{X}^{*}}$. Namely, the optimal solution to Problem (3) can be constructed as follows: Given a degraded measurement $y$, first predict the posterior mean $\hat{x}^{*}=\mathbb{E}[X|Y=y]$, and then sample from $p_{U|V}(\cdot|\hat{x}^{*})$, which is the optimal transport plan from $p_{\hat{X}^{*}}$ to $p_{X}$. Similarly to Freirich et al. (2021), we denote such a solution to Problem (3) by $\hat{X}_{0}$.

As discussed before, one of the most common and appealing solutions for PIR tasks is the estimator $\hat{X}$ that samples from the posterior distribution $p_{X|Y}$, such that $p_{\hat{X}|Y}=p_{X|Y}$. While such an estimator always attains a perfect perceptual index (Blau & Michaeli, 2018), its MSE is typically larger than that of $\hat{X}_{0}$ (Blau & Michaeli, 2018; Freirich et al., 2021) (see Figure 1). In other words, to design an algorithm with minimal MSE under a perfect perceptual index constraint, one should often not resort to posterior sampling, but rather to solving Problem (3). This is our goal in this paper. Lastly, one may wonder whether sampling from $p_{X|\hat{X}^{*}}$ instead of using the optimal transport plan from Equation 4 may also be effective in terms of MSE. However, in Section A.1 we prove that such an approach leads to precisely the same MSE as sampling from the posterior.
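To make the gap between these estimators concrete, the following sketch works through a scalar Gaussian case chosen here purely for illustration: $X\sim\mathcal{N}(0,1)$ and $Y=X+\sigma N$ with $N\sim\mathcal{N}(0,1)$. In this setting the posterior mean is $Y/(1+\sigma^{2})$, and the optimal transport map between two zero-mean one-dimensional Gaussians is a rescaling, so $\hat{X}_{0}=Y/\sqrt{1+\sigma^{2}}$. The function names are hypothetical and the Monte Carlo part is only a sanity check of the closed forms.

```python
import math
import random


def closed_form_mse(sigma):
    """Closed-form MSEs for X ~ N(0, 1), Y = X + sigma * N, N ~ N(0, 1)."""
    # MSE of the posterior mean E[X|Y] = Y / (1 + sigma^2), i.e. the MMSE.
    mmse = sigma ** 2 / (1.0 + sigma ** 2)
    # Posterior sampling attains exactly twice the MMSE.
    posterior_sampling = 2.0 * mmse
    # X0_hat = Y / sqrt(1 + sigma^2) has unit variance, matching p_X.
    perfect_perception = 2.0 * (1.0 - 1.0 / math.sqrt(1.0 + sigma ** 2))
    return mmse, posterior_sampling, perfect_perception


def monte_carlo_mse(sigma, n=100_000, seed=0):
    """Empirical MSEs of the same three estimators."""
    rng = random.Random(seed)
    post_std = math.sqrt(sigma ** 2 / (1.0 + sigma ** 2))  # std of X given Y
    scale = math.sqrt(1.0 + sigma ** 2)  # OT rescaling of the posterior mean
    totals = [0.0, 0.0, 0.0]
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + sigma * rng.gauss(0.0, 1.0)
        x_star = y / (1.0 + sigma ** 2)                   # posterior mean
        x_post = x_star + post_std * rng.gauss(0.0, 1.0)  # posterior sample
        x_zero = scale * x_star                           # transported posterior mean
        totals[0] += (x - x_star) ** 2
        totals[1] += (x - x_post) ** 2
        totals[2] += (x - x_zero) ** 2
    return tuple(total / n for total in totals)


exact = closed_form_mse(1.0)
estimate = monte_carlo_mse(1.0)
```

For $\sigma=1$ the closed-form values are $0.5$ (MMSE), $1.0$ (posterior sampling), and $2-\sqrt{2}\approx 0.586$ (perfect perceptual index with minimal MSE), so in this toy case $\hat{X}_{0}$ sits well below posterior sampling while both match $p_{X}$.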

2.3 Flow matching and rectified flows

Flow matching.

Flow matching algorithms (Liu et al., 2023; Lipman et al., 2023; Albergo & Vanden-Eijnden, 2023) are generative models defined via the ODE

$$dZ_{t}=v(Z_{t},t)\,dt, \tag{5}$$

where $v$ is often called a vector field, and $Z_{t}$ is some forward process such that $p_{Z_{0}}$ is the source distribution, from which we can easily sample (e.g., isotropic Gaussian noise), and $p_{Z_{1}}$ is the target distribution from which we aim to sample (e.g., natural images). In principle, one can generate samples from the target distribution $p_{Z_{1}}$ by solving Equation 5, where samples from the source distribution $p_{Z_{0}}$ are set as the initial conditions for the ODE solver. Nevertheless, given a particular forward process $Z_{t}$, there are possibly many different vector fields that satisfy Equation 5. The goal in flow matching is to somehow find an appropriate vector field with desirable practical and theoretical properties, e.g., where the solution to Equation 5 is unique.
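As a minimal illustration of sampling by solving Equation 5, the sketch below integrates the ODE from $t=0$ to $t=1$ with the forward Euler method for scalar states. The name `euler_solve` and the toy vector field $v(z,t)=z$ are assumptions made for this example only; in practice $v$ would be a learned network acting on images.

```python
import math


def euler_solve(v, z0, num_steps):
    """Integrate dz = v(z, t) dt from t = 0 to t = 1 with forward Euler."""
    z = z0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt              # left endpoint of the current step
        z = z + dt * v(z, t)    # one explicit Euler update
    return z


# Toy vector field v(z, t) = z: the exact solution at t = 1 is z0 * e.
coarse = euler_solve(lambda z, t: z, 1.0, 10)
fine = euler_solve(lambda z, t: z, 1.0, 1000)
exact = math.e
```

Comparing `coarse` and `fine` against `exact` shows the usual tradeoff: more steps reduce the discretization error at the cost of more evaluations of the vector field.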

Rectified flow.

Rectified flow (Liu et al., 2023) is a flow matching algorithm defined via the particular forward process

$$Z_{t}=tZ_{1}+(1-t)Z_{0}, \tag{6}$$

which connects samples from $p_{Z_{1}}$ and $p_{Z_{0}}$ with straight lines. Here, $Z_{0}$ and $Z_{1}$ can be statistically independent, as is typically the case when learning a flow model from Gaussian noise to image data, but they can also have any joint distribution $p_{Z_{0},Z_{1}}$. This forward process clearly adheres to the ODE $dZ_{t}=(Z_{1}-Z_{0})\,dt$, where $Z_{1}-Z_{0}$ is the corresponding vector field. However, this is not a practical generative model, since it requires knowing the “destination” realization of $Z_{1}$ at any time step $t<1$ (i.e., the solution is not causal). To solve this issue, Liu et al. (2023) propose instead to use the vector field

$$v_{\text{RF}}(Z_{t},t)=\mathbb{E}[Z_{1}-Z_{0}|Z_{t}], \tag{7}$$

which is causal, and generates the target distribution if the solution to Equation 5 exists and is unique when adopting such a vector field (Theorem 3.3 in Liu et al. (2023)). Interestingly, solving the ODE in Equation 5 with $v_{\text{RF}}$ often approximates the optimal transport map from the source distribution to the target one, especially when the process is repeated several times (i.e., reflow) or when $p_{Z_{1},Z_{0}}$ is close to the optimal transport plan between $p_{Z_{0}}$ and $p_{Z_{1}}$ (Liu et al., 2023; Tong et al., 2024). To learn $v_{\text{RF}}$, one can simply train a model $v_{\theta}$ by minimizing the loss

$$\int_{0}^{1}\mathbb{E}\left[\lVert(Z_{1}-Z_{0})-v_{\theta}(Z_{t},t)\rVert^{2}\right]dt, \tag{8}$$

where the expectation is taken over the joint distribution $p_{Z_{1},Z_{0}}$ (Liu et al., 2023).
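The loss in Equation 8 can be estimated by Monte Carlo: draw a pair $(Z_{0},Z_{1})$, draw $t$ uniformly from $[0,1]$, form $Z_{t}$ as in Equation 6, and regress $v_{\theta}(Z_{t},t)$ onto $Z_{1}-Z_{0}$. The sketch below does this for scalar samples; `rectified_flow_loss` is a hypothetical helper and the deterministic coupling $Z_{1}=Z_{0}+3$ is a toy choice made here so that the ideal vector field is the constant 3.

```python
import random


def rectified_flow_loss(v_theta, pairs, seed=0):
    """Monte Carlo estimate of the loss in Eq. (8) for scalar samples.

    pairs: list of (z0, z1) tuples drawn from the joint distribution.
    v_theta: callable (z_t, t) -> predicted velocity.
    """
    rng = random.Random(seed)
    total = 0.0
    for z0, z1 in pairs:
        t = rng.random()                  # t ~ U[0, 1]
        z_t = t * z1 + (1.0 - t) * z0     # forward process, Eq. (6)
        target = z1 - z0                  # direction of the straight path
        total += (target - v_theta(z_t, t)) ** 2
    return total / len(pairs)


# Toy coupling: every source sample is shifted by 3 to reach its target.
data_rng = random.Random(1)
pairs = [(z0, z0 + 3.0) for z0 in (data_rng.gauss(0.0, 1.0) for _ in range(1000))]

loss_good = rectified_flow_loss(lambda z, t: 3.0, pairs)  # matches E[Z1 - Z0 | Zt]
loss_bad = rectified_flow_loss(lambda z, t: 0.0, pairs)   # ignores the shift
```

In a real training loop the same quantity would be computed on mini-batches of images and minimized with a gradient-based optimizer; that part is omitted here.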

3 Posterior-Mean Rectified Flow

Training

       Stage 1: Solve $\omega^{*}\leftarrow\arg\min_{\omega}\mathbb{E}\left[\lVert X-f_{\omega}(Y)\rVert^{2}\right]$
       Stage 2: Solve $\theta^{*}\leftarrow\operatorname*{arg\,min}_{\theta}\mathbb{E}\left[\lVert (X-Z_{0})-v_{\theta}(Z_{t},t)\rVert^{2}\right]$
        // $Z_{t}\coloneqq tX+(1-t)(f_{\omega^{*}}(Y)+\sigma_{s}\epsilon)$, where $t$ is sampled from $U[0,1]$.

Inference (using Euler's method with $K$ steps to solve the ODE)

       Sample $\epsilon\sim\mathcal{N}(0,I)$
       $\hat{x}\leftarrow f_{\omega^{*}}(y)+\sigma_{s}\epsilon$
        // $y$ is the given degraded measurement
       for $i\leftarrow 0,\ldots,K-1$ do
             $\hat{x}\leftarrow\hat{x}+\frac{1}{K}v_{\theta^{*}}(\hat{x},\frac{i}{K})$
       Return $\hat{x}$

Algorithm: Posterior-Mean Rectified Flow (PMRF)
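The inference procedure in the algorithm above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `pmrf_inference`, the representation of images as flat lists of floats, and the `velocity` callable (standing in for the trained $v_{\theta^{*}}$) are all assumptions made for the example, and the posterior-mean prediction $f_{\omega^{*}}(y)$ is assumed to be computed beforehand.

```python
import random
from typing import Callable, List

Vector = List[float]


def pmrf_inference(
    posterior_mean: Vector,
    velocity: Callable[[Vector, float], Vector],
    sigma_s: float,
    num_steps: int,
    rng: random.Random,
) -> Vector:
    """Euler integration of the flow ODE from the noisy posterior-mean prediction.

    posterior_mean: precomputed f(y) (hypothetical input for this sketch).
    velocity:       stand-in for the trained vector field v(x, t).
    """
    # x_hat <- f(y) + sigma_s * eps, with eps ~ N(0, I)
    x_hat = [m + sigma_s * rng.gauss(0.0, 1.0) for m in posterior_mean]
    for i in range(num_steps):
        t = i / num_steps                      # time i/K of the current step
        v = velocity(x_hat, t)
        # x_hat <- x_hat + (1/K) * v(x_hat, i/K)
        x_hat = [x + vi / num_steps for x, vi in zip(x_hat, v)]
    return x_hat


def constant_field(x: Vector, t: float) -> Vector:
    """Hypothetical toy vector field used only to exercise the loop."""
    return [2.0 for _ in x]


restored = pmrf_inference([0.5, -1.0], constant_field, sigma_s=0.0,
                          num_steps=25, rng=random.Random(0))
```

With a constant field, the $K$ Euler steps simply shift the starting point by the field value, which makes the loop easy to sanity-check before plugging in a learned model.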

We now describe our proposed algorithm, which we call Posterior-Mean Rectified Flow (PMRF) (see the algorithm listing above). Our method consists of two simple training stages. First, we train a model $f_{\omega}$ to predict the posterior mean by minimizing the MSE loss,

$$\omega^{*}=\operatorname*{arg\,min}_{\omega}\mathbb{E}\left[\lVert X-f_{\omega}(Y)\rVert^{2}\right]. \tag{9}$$

Note that this training stage can be skipped whenever there exists an off-the-shelf algorithm that attains sufficiently small MSE (high PSNR) in the desired restoration task. In the second stage, we train a rectified flow model $v_{\theta}$ (a vector field) to solve

$$\theta^{*}=\operatorname*{arg\,min}_{\theta}\int_{0}^{1}\mathbb{E}\left[\lVert (X-Z_{0})-v_{\theta}(Z_{t},t)\rVert^{2}\right]dt, \tag{10}$$

where $Z_{t}\coloneqq tX+(1-t)Z_{0}$. Here, $Z_{0}\coloneqq f_{\omega^{*}}(Y)+\sigma_{s}\epsilon$, where $\epsilon\sim\mathcal{N}(0,I)$ is statistically independent of $Y$ and $X$, and $\sigma_{s}$ is a hyper-parameter that controls the level of the Gaussian noise added to the posterior mean prediction. As shown by Albergo et al. (2023), adding such noise is critical when the source and target distributions lie on low and high dimensional manifolds, respectively. Specifically, it alleviates the singularities resulting from learning a deterministic mapping between such distributions. Note, however, that adding noise to $f_{\omega^{*}}(Y)$ may harm the MSE of the reconstructions produced by PMRF, and so $\sigma_{s}$ should be taken to be sufficiently small.
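To make the second training stage concrete, the following sketch builds a single training example for the flow loss: it samples $t$, forms $Z_{0}$ and $Z_{t}$, and returns the regression target $X-Z_{0}$. It is a minimal illustration under stated assumptions, not the released training code: the names `make_flow_training_example` and `squared_error` are hypothetical, images are flat lists of floats, and the posterior-mean prediction is passed in precomputed.

```python
import random
from typing import List, Tuple

Vector = List[float]


def make_flow_training_example(
    x: Vector, posterior_mean: Vector, sigma_s: float, rng: random.Random
) -> Tuple[Vector, float, Vector]:
    """Return (z_t, t, target) for one sample of the stage-2 objective."""
    t = rng.random()  # uniform on [0, 1), used here in place of U[0, 1]
    # Z_0 = f(Y) + sigma_s * eps, with eps ~ N(0, I) independent of X and Y
    z0 = [m + sigma_s * rng.gauss(0.0, 1.0) for m in posterior_mean]
    # Z_t = t * X + (1 - t) * Z_0
    z_t = [t * xi + (1.0 - t) * z0i for xi, z0i in zip(x, z0)]
    # Regression target of the vector field: X - Z_0
    target = [xi - z0i for xi, z0i in zip(x, z0)]
    return z_t, t, target


def squared_error(prediction: Vector, target: Vector) -> float:
    """Per-sample loss ||target - prediction||^2 for a model output v(z_t, t)."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target))


z_t, t, target = make_flow_training_example(
    [1.0, 2.0, 3.0], [0.5, 1.5, 2.5], sigma_s=0.1, rng=random.Random(0)
)
```

A useful sanity check is the identity $Z_{t}+(1-t)(X-Z_{0})=X$, which follows directly from the definition of $Z_{t}$ and holds for every sampled $t$.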

To explain why PMRF approximates the desired estimator $\hat{X}_{0}$, we prove an important proposition and demonstrate it on a simple example with closed-form solutions. Specifically, let

$$d\hat{Z}_{t}=v_{\text{RF}}(\hat{Z}_{t},t)\,dt\quad\text{with}\quad\hat{Z}_{0}=Z_{0} \tag{11}$$

be the ODE in PMRF, where $v_{\text{RF}}(z,t)=\mathbb{E}[X-Z_{0}|Z_{t}=z]$ and $\hat{Z}_{t}$ is the random vector generated by PMRF at time step $t\in[0,1]$. In Section A.2 we prove the following:

Proposition 1.

Suppose that $\sigma_{s}=0$, and let us assume that the solution of the ODE in Equation 11 exists and is unique. Then,

  (a) $\hat{Z}_{1}$ attains a perfect perceptual index ($p_{\hat{Z}_{1}}=p_{X}$).

  (b) The MSE of $\hat{Z}_{1}$ cannot be larger than that of the posterior sampler.

  (c) If the distribution of $(X-\hat{X}^{*})|Z_{t}=z_{t}$ is non-degenerate for almost every $z_{t}\in\operatorname{supp}p_{Z_{t}}$ and $t\in[0,1]$, then the MSE of $\hat{Z}_{1}$ is strictly smaller than that of the posterior sampler.

Note that the assumptions underlying (a) and (b) are the same as the ones in (Liu et al., 2023), so they are not more limiting. Whether the additional assumption in (c) holds depends on the nature of the restoration task. For example, if $X$ can be reconstructed from $Y$ with zero error (i.e., $p_{X|Y}(\cdot|y)$ is a Dirac delta function for almost every $y$), then $X-\hat{X}^{*}=0$ almost surely and the assumption in (c) does not hold. Yet, this is not an interesting setting, as the degradation is not invertible in most practical scenarios. To gain intuition into a more common scenario, consider the following example from (Blau & Michaeli, 2018):

Example 1.

Let $Y=X+N$, where $X\sim\mathcal{N}(0,1)$ and $N\sim\mathcal{N}(0,\sigma_{N}^{2})$ are statistically independent and $\sigma_{N}>0$. Then, the MSE of $\hat{X}_{0}$ is strictly smaller than that of the posterior sampler. Moreover, when $\sigma_{s}=0$, all the assumptions in Proposition 1 hold, and we have $\hat{Z}_{1}=\hat{X}_{0}$ almost surely.

See Section A.3 for the proof of Example 1. This example shows that PMRF not only outperforms posterior sampling, but may even coincide with the desired estimator $\hat{X}_{0}$ in certain cases.
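The scalar Gaussian setting of Example 1 can also be checked numerically. The sketch below is an illustration derived here for this note, not taken from the paper's appendix. It assumes the standard closed forms for jointly Gaussian variables: with $a=1/(1+\sigma_{N}^{2})$, the posterior mean is $aY$, the MMSE is $1-a$, the posterior sampler has twice that MSE, and the optimal transport map from $\mathcal{N}(0,a)$ to $\mathcal{N}(0,1)$ is a rescaling, giving $\hat{X}_{0}=Y/\sqrt{1+\sigma_{N}^{2}}$ with MSE $2(1-\sqrt{a})$. For $\sigma_{s}=0$ the conditional expectation defining the vector field works out to $v_{\text{RF}}(z,t)=(1-a)\,t\,z/(a+(1-a)t^{2})$. The function names are hypothetical.

```python
import math
from typing import Tuple


def example1_mse(sigma_n: float) -> Tuple[float, float, float]:
    """Closed-form MSEs for Y = X + N, X ~ N(0, 1), N ~ N(0, sigma_n^2).

    Returns (MMSE, MSE of the posterior sampler, MSE of X_hat_0).
    """
    a = 1.0 / (1.0 + sigma_n ** 2)       # posterior mean is a * Y, with variance a
    mmse = 1.0 - a                       # equals sigma_n^2 / (1 + sigma_n^2)
    posterior_sampler = 2.0 * mmse       # posterior sampling doubles the MMSE
    x_hat_0 = 2.0 * (1.0 - math.sqrt(a)) # X_hat_0 = Y / sqrt(1 + sigma_n^2)
    return mmse, posterior_sampler, x_hat_0


def flow_endpoint(z0: float, sigma_n: float, num_steps: int) -> float:
    """Euler integration of dz/dt = (1 - a) t z / (a + (1 - a) t^2), sigma_s = 0."""
    a = 1.0 / (1.0 + sigma_n ** 2)
    z = z0
    for i in range(num_steps):
        t = i / num_steps
        z += (1.0 - a) * t * z / (a + (1.0 - a) * t * t) / num_steps
    return z


mmse, posterior_mse, x_hat_0_mse = example1_mse(sigma_n=1.0)
y = 0.8
z1 = flow_endpoint(z0=0.5 * y, sigma_n=1.0, num_steps=2000)  # starts at a * y
```

Under these assumptions the three MSE values are ordered as MMSE, then $\hat{X}_{0}$, then the posterior sampler, and the integrated flow started at the posterior mean $aY$ ends close to $Y/\sqrt{1+\sigma_{N}^{2}}$, in line with the statement that $\hat{Z}_{1}=\hat{X}_{0}$ in this example.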

4 Related work

Before moving on to demonstrate the effectiveness of our approach, it is instructive to note the difference between our PMRF method and existing techniques that may superficially seem similar.

Diffusion and flow-based posterior samplers.

Diffusion or flow-based image restoration algorithms often attempt to sample from the posterior distribution by training a conditional model that takes $Y$ (or some function of $Y$, like $\hat{X}^{*}$) as an additional input (Zhu et al., 2024; Lin et al., 2024). Some works avoid training a conditional model for each task separately, and rather modify the sampling process of a trained unconditional diffusion model (Kawar et al., 2022; Chung et al., 2023). In Section 5.2 we perform a controlled experiment on various inverse problems, which shows that our PMRF method consistently outperforms posterior samplers with the same architecture.

Flow from degraded image.

Some diffusion/flow models are trained on corresponding pairs of ground-truth images and degraded measurements (Albergo et al., 2023; Delbracio & Milanfar, 2023; Li et al., 2023). In this approach, the idea is to obtain a high-quality image by solving an ODE/SDE with the degraded measurement set as the initial condition. For example, Albergo et al. (2023) trained a rectified flow model for the forward process $Z_{t}=tX+(1-t)Y^{\dagger}$, where $Y^{\dagger}$ is an up-sampled version of $Y$ such that it matches the dimensionality of $X$. These algorithms are closely related to PMRF, in the sense that they learn to transport an intermediate signal (instead of pure noise) to the ground-truth image distribution. Yet, they have two critical disadvantages compared to PMRF. First, the flow model's design is not agnostic to the type of degradation, as the degraded signals can have varying dimensionalities or lie in a different domain than that of the ground-truth images (e.g., in MRI image reconstruction). Thus, the task of the flow model may be harder than necessary, as it needs to translate signals from one domain to another. On the other hand, in PMRF the flow model always operates in the image domain, where the dimensionalities of the source and target signals are the same. Second, the theoretical motivation for flowing from $Y$ is not clear, at least from a reconstruction performance standpoint (e.g., distortion). In contrast, the theoretical motivation underlying PMRF is clear: it approximates $\hat{X}_{0}$, which achieves the minimal possible MSE under the constraint of perfect perceptual index. As we show in Section 5.2, PMRF always either outperforms or is on par with the solution that flows from $Y$ (see Figure 4).

Methods that aim for $\hat{X}_{0}$ directly.

To the best of our knowledge, Deep Optimal Transport (DOT) (Adrai et al., 2023) is the only existing method that, like PMRF, attempts to approximate $\hat{X}_{0}$ directly. Specifically, DOT approximates the desired optimal transport map (Equation 4) via a linear transformation in the latent space of a variational auto-encoder (VAE) (Kingma & Welling, 2014). This transformation is computed in closed-form using the empirical means and covariances (in latent space) of the source distribution (that of the posterior mean predictions) and the target distribution (that of the ground-truth images), under the assumption that both are Gaussian. This method is computationally efficient, but the use of a VAE imposes a performance ceiling. Moreover, the optimal transport in DOT occurs in latent space and assumes that the source and target distributions are Gaussians, unlike Equation 4 which occurs in pixel space and does not make such an assumption. In contrast, PMRF does not use a VAE, and approximates the optimal transport directly in pixel space. In Section 5 we show that PMRF significantly outperforms DOT (see Figure 4).
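To give a feel for the kind of closed-form Gaussian transport described above, the following sketch fits the one-dimensional analogue: for scalar Gaussians, the optimal transport map from a source with mean $\mu_{s}$ and standard deviation $\sigma_{s}$ to a target with mean $\mu_{t}$ and standard deviation $\sigma_{t}$ is the affine map $z\mapsto\mu_{t}+(\sigma_{t}/\sigma_{s})(z-\mu_{s})$. This is only a scalar illustration with a hypothetical function name; DOT itself operates on empirical means and covariances of VAE latents, which this sketch does not attempt to reproduce.

```python
import statistics
from typing import Callable, Sequence


def fit_gaussian_ot_1d(
    source: Sequence[float], target: Sequence[float]
) -> Callable[[float], float]:
    """Fit the affine OT map between two scalar Gaussians from samples.

    The Gaussian assumption is what makes the map affine; the empirical
    mean and (population) standard deviation stand in for the true moments.
    """
    mu_s, mu_t = statistics.fmean(source), statistics.fmean(target)
    sd_s, sd_t = statistics.pstdev(source), statistics.pstdev(target)
    scale = sd_t / sd_s  # assumes the source samples are not all identical

    def transport(z: float) -> float:
        return mu_t + scale * (z - mu_s)

    return transport


source_samples = [0.0, 1.0, 2.0, 3.0]     # stand-in for posterior-mean statistics
target_samples = [10.0, 14.0, 18.0, 22.0] # stand-in for ground-truth statistics
transport = fit_gaussian_ot_1d(source_samples, target_samples)
mapped = [transport(z) for z in source_samples]
```

By construction, the mapped samples match the empirical mean and standard deviation of the target, and the map is monotone, which is the defining property of optimal transport in one dimension under the squared-error cost.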

5 Experiments

5.1 Blind face image restoration

We train PMRF to solve the challenging blind face image restoration task, and compare its performance with leading methods. As in previous works (e.g., Wang et al., 2021), we use the FFHQ data set (Karras et al., 2019) with images of size $512\times 512$ to train our model. Similarly to previous works, we adopt a complex and random degradation process to synthesize the degraded images,

$$Y=\left[(X\circledast k_{\sigma})\downarrow_{R}+N_{\delta}\right]_{\text{JPEG}_{Q}}, \tag{12}$$

where $\circledast$ denotes convolution, $k_{\sigma}$ is a Gaussian blur kernel of size $41\times 41$ and variance $\sigma^{2}$, $\downarrow_{R}$ is bilinear down-sampling by a factor $R$, $N_{\delta}$ is white Gaussian noise of variance $\delta^{2}$, and $[\cdot]_{\text{JPEG}_{Q}}$ is JPEG compression-decompression with quality factor $Q$. Similarly to (Yue & Loy, 2024), we synthesize the degraded images by sampling $\sigma$, $R$, $\delta$, and $Q$ uniformly from $[0.1,15]$, $[0.8,32]$, $[0,20]$, and $[30,100]$, respectively. See Section B.1 for additional implementation details.
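As a small illustration of the random degradation described above, the sketch below draws one set of degradation parameters from the stated ranges. The dictionary keys and the function name are hypothetical, and the sketch only samples the parameters; it does not apply the blur, down-sampling, noise, or JPEG steps. Whether the quality factor is rounded to an integer is not stated here, so it is left as a continuous value.

```python
import random
from typing import Dict, Tuple

# Sampling ranges for (sigma, R, delta, Q), as listed in the text.
DEGRADATION_RANGES: Dict[str, Tuple[float, float]] = {
    "sigma": (0.1, 15.0),   # standard deviation parameter of the blur kernel
    "R": (0.8, 32.0),       # bilinear down-sampling factor
    "delta": (0.0, 20.0),   # standard deviation parameter of the Gaussian noise
    "Q": (30.0, 100.0),     # JPEG quality factor (kept continuous in this sketch)
}


def sample_degradation_params(rng: random.Random) -> Dict[str, float]:
    """Draw each degradation parameter independently and uniformly."""
    return {
        name: rng.uniform(low, high)
        for name, (low, high) in DEGRADATION_RANGES.items()
    }


params = sample_degradation_params(random.Random(0))
```

Drawing a fresh parameter set per training image is what makes the restoration task blind: the model never observes which blur, scale, noise level, or compression quality produced a given input.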

5.1.1 Evaluation settings

For evaluation, we consider the common synthetic CelebA-Test benchmark, as well as the real-world data sets LFW-Test (Wang et al., 2021; Huang et al., 2008), WebPhoto-Test (Wang et al., 2021), CelebAdult-Test (Wang et al., 2021), and WIDER-Test (Zhou et al., 2022). CelebA-Test consists of 3,000 high-quality images taken from the test partition of CelebA-HQ (Karras et al., 2018), and the degraded images were synthesized by Wang et al. (2021). For the real-world data sets, the degradations are unknown and there is no access to the clean ground-truth images. We compare our performance with DOT (Adrai et al., 2023) and leading blind face restoration models, including BFRffusion (Chen et al., 2024), DiffBIR (Lin et al., 2024), DifFace (Yue & Loy, 2024), CodeFormer (Zhou et al., 2022), GFPGAN (Wang et al., 2021), VQFRv1 and VQFRv2 (Gu et al., 2022), RestoreFormer and RestoreFormer++ (Wang et al., 2022; 2023b). Notably, these restoration methods also use the degradation model from Equation 12, though the ranges of $\sigma$, $R$, $\delta$, and $Q$ differ across methods. The ranges we choose, those from (Yue & Loy, 2024), are the most severe among all the compared methods. For example, the range of $R$ we use is $[0.8,32]$, whereas Wang et al. (2021) use $[1,8]$. Thus, PMRF attempts to solve a more difficult restoration task than some of the compared methods. In the following experiments, we use $K=25$ flow steps in PMRF (Section 3). Refer to Section B.2 for an evaluation of additional values of $K$, and to Section B.3 for the implementation details of DOT.

5.1.2 Results on CelebA-Test

For the CelebA-Test benchmark, we measure the perceptual quality by FID (Heusel et al., 2017), KID (Bińkowski et al., 2018), NIQE (Mittal et al., 2013), and Precision (Kynkäänniemi et al., 2019), and measure the distortion by the PSNR, SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018). Similarly to previous works (Wang et al., 2021; Gu et al., 2022), we also compute the identity metric Deg (using the embedding angle of ArcFace (Deng et al., 2019)) and the landmark distance LMD. Both of these can be considered as distortion measures, as they quantify some type of discrepancy between each reconstructed image and its ground-truth counterpart.

The results are reported in Table 1. Notably, PMRF outperforms all other methods in FID, KID, PSNR, and SSIM, achieves the second best scores in NIQE, Precision, and Deg, and the third best scores in LPIPS and LMD. Interestingly, no other method attains such a consensus in performance as PMRF, namely, one where none of the measures is significantly compromised compared to the state-of-the-art. For example, while DifFace achieves the highest Precision, it attains worse LMD, Deg, LPIPS, SSIM, and PSNR compared to the third best method in each of these metrics. This demonstrates that PMRF produces robust reconstructions, in the sense that it does not “over-fit” particular perceptual quality or distortion measures, but rather achieves high performance in all of them simultaneously. Visual results are provided in Figure 2 and in Figure 5 in the appendix.

5.1.3 Results on real-world degraded images

Evaluating the distortion for real-world degraded images is impossible, as there is no access to the ground-truth images. Consequently, previous works conduct only a perceptual quality evaluation (e.g., FID) on real-world data sets such as WIDER-Test and LFW-Test. Yet, high perceptual quality alone is clearly not indicative of reconstruction performance (to attain high perceptual quality, one may simply ignore the inputs and generate samples from $p_{X}$). Thus, we consider a measure which indicates the Root MSE (RMSE) and allows ranking algorithms according to their (approximate) RMSE, without access to the ground-truth images. Specifically, for any estimator $\hat{X}$ it holds that

$$\mathbb{E}[\lVert X-\hat{X}\rVert^{2}]\approx\mathbb{E}[\lVert\hat{X}-f(Y)\rVert^{2}]+m, \tag{13}$$

where $f(Y)\approx\hat{X}^{*}$ is an approximation of the true posterior mean predictor $\hat{X}^{*}$, and $m$ is a constant that does not depend on $\hat{X}$ (see Appendix D for an explanation). Thus, the square root of $\mathbb{E}[\lVert\hat{X}-f(Y)\rVert^{2}]$, which we denote by IndRMSE, indicates the true RMSE. We utilize the posterior mean predictor trained by (Yue & Loy, 2024) as $f$, and compute the IndRMSE of all the evaluated algorithms on the LFW-Test, WebPhoto-Test, CelebAdult-Test, and WIDER-Test data sets. Importantly, the exact same posterior mean predictor model (and weights) is also used by other methods such as DifFace and DiffBIR, so this is a fair evaluation. As before, we evaluate perceptual quality by FID, KID, NIQE, and Precision. In Figure 3 we provide visual results on inputs from the WIDER-Test data set, and compare the algorithms on a “distortion”-perception plane (IndRMSE vs. FID). DOT is not plotted as it achieves far worse FID compared to other methods. Our algorithm attains the best (smallest) IndRMSE on all data sets, while achieving on-par perceptual quality compared to the state-of-the-art. This indicates that PMRF achieves superior distortion on such real-world data sets, while not compromising perceptual quality. In the appendix, we report the rest of the perceptual quality measures in Tables 7, 10, 9 and 8, provide visual results in Figures 7, 6 and 8, and also report the performance of DOT.
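The IndRMSE described above is straightforward to compute once the restored images and the posterior-mean predictions are available. The sketch below is a minimal illustration, not the paper's evaluation script: the function name `ind_rmse` is hypothetical, images are represented as flat lists of floats, and the expectation is approximated by an average over the data set. Any per-pixel normalization or intensity scaling used in the reported numbers is not specified here and is therefore omitted.

```python
import math
from typing import Sequence


def ind_rmse(
    restored: Sequence[Sequence[float]],
    posterior_means: Sequence[Sequence[float]],
) -> float:
    """Square root of the average squared distance between x_hat and f(y).

    restored:        outputs x_hat of the evaluated restoration algorithm.
    posterior_means: outputs f(y) of the (approximate) posterior-mean predictor
                     on the same inputs, in the same order.
    """
    if len(restored) != len(posterior_means) or not restored:
        raise ValueError("expected two non-empty collections of equal length")
    total = 0.0
    for x_hat, f_y in zip(restored, posterior_means):
        # Squared Euclidean norm ||x_hat - f(y)||^2 for one image
        total += sum((a - b) ** 2 for a, b in zip(x_hat, f_y))
    return math.sqrt(total / len(restored))


score = ind_rmse([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 4.0]])
```

Because the constant $m$ in Equation 13 does not depend on the estimator, ranking methods by this quantity gives the same order as ranking them by the approximated RMSE, even though the absolute values differ.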

Figure 3: Real-world face image restoration. Top: Qualitative results on inputs from the WIDER-Test data set. Bottom: Comparison on the “distortion”-perception plane (IndRMSE vs. FID), where IndRMSE indicates the RMSE of each method (the true distortion cannot be computed as there is no access to the ground-truth images). Our algorithm outperforms all other methods in IndRMSE, while achieving on-par perceptual quality compared to the state-of-the-art.

5.2 Comparing PMRF with previous frameworks in controlled experiments

One may wonder whether the performance of PMRF is attributed to the framework itself (Section 3), or whether it is attributed to the model architecture, the rectified flow training approach, the chosen hyper-parameters, etc. Could we have done better by training a flow to sample from the posterior, or by adopting the approach of (Albergo et al., 2023) and flowing directly from $Y$? Here, we conduct a controlled study where we demonstrate that the high performance of PMRF is indeed attributed to the proposed framework itself (Section 3). Specifically, we consider the image denoising, super-resolution, inpainting, and colorization tasks, where we train PMRF and several baseline methods on the “same grounds”. In each task we train two conditional rectified flow models, where one is conditioned on the degraded measurement $Y$ (we call this method flow conditioned on $Y$), and the other is conditioned on the posterior mean predictor $f_{\omega^{*}}(Y)$ (we call this method flow conditioned on $\hat{X}^{*}$). The first model represents posterior sampling methods, and the second model allows for a fair comparison of model capacity with PMRF (since PMRF is comprised of $f_{\omega^{*}}(Y)$ and a flow model). In fact, theoretically speaking, the second approach achieves precisely the same MSE as the posterior sampler (see Section A.1), and is often used in practice (e.g., in (Lin et al., 2024; Zhu et al., 2024)). In addition, we train an unconditional rectified flow model, where the forward process is defined as $Z_{t}=tX+(1-t)Z_{0}$, $Z_{0}=Y^{\dagger}+\sigma_{s}\epsilon$, $\epsilon\sim\mathcal{N}(0,I)$, and $Y^{\dagger}$ is the up-scaled version of the degraded measurement $Y$ such that it matches the dimensionality of $X$ (we call this method flow from $Y$). This method represents the frameworks in (Albergo et al., 2023; Li et al., 2023; Delbracio & Milanfar, 2023), which we discuss in Section 4. All of the models are trained with the same hyper-parameters as PMRF, using the same architecture, learning rate, weight decay, number of training epochs, etc. Moreover, for PMRF and the flow conditioned on $\hat{X}^{*}$ method, we use the exact same architecture and weights for $f_{\omega^{*}}(Y)$. To clarify the differences between the mathematical formulations of the baseline methods, in Table 11 in the appendix we summarize the definitions of the training loss and the forward process of all methods. Moreover, in Appendix D we provide pseudo-code for the training and inference procedures of the baseline methods. While DOT is not a flow method, we still evaluate its performance as it is related to PMRF.

Figure 4: A controlled experiment comparing PMRF (our method) with several baseline methods, where the models are trained with the same architecture, hyper-parameters, etc. (see Section 5.2). Top: Qualitative comparison of PMRF and the baseline methods on several tasks. Bottom: Quantitative comparison on the distortion-perception plane (RMSE vs. FID). DOT is not a flow model, but rather another approach that attempts to approximate $\hat{X}_{0}$ (like PMRF). These experiments demonstrate that PMRF is either superior to or on par with previous frameworks (i.e., posterior sampling or flowing from $Y$) on a variety of image restoration tasks. See Section 5.2 for more details.

In Figure 4 we compare the algorithms on the distortion-perception plane (RMSE vs. FID), using $K=100$ flow steps for each flow algorithm. We clearly see that for the inpainting and colorization tasks, PMRF dominates all other methods, achieving notably smaller RMSE without compromising FID. This demonstrates that PMRF achieves our desired goal, which is to attain low distortion without compromising on perceptual quality. For the denoising task, we observe that PMRF and flow from $Y$ attain similar performance, and both dominate the posterior sampling approaches. We hypothesize that, in some tasks (e.g., denoising), flowing from $Y$ may be as effective as PMRF in terms of approximating $\hat{X}_{0}$. To support this hypothesis, we prove in Section C.4 that flowing from $Y$ is optimal in the toy problem in Example 1 (just like PMRF). Yet, our experiments demonstrate that PMRF generally leads to better performance compared to previous frameworks. To assess the effectiveness of each method given different inference time constraints, in Figure 9 in the appendix we vary the number of flow inference steps $K$ for each method. Interestingly, we observe that PMRF is still either on par with or dominates the other methods for any given number of inference steps. These results further demonstrate that the superior performance of PMRF is attributed to our framework itself, rather than to the chosen hyper-parameters. See Appendix C for more details, and refer to Figures 10, 11, 12 and 13 in the appendix for visual comparisons.

6 Conclusion and limitations

The goal in this paper was to design an algorithm that directly approximates $\hat{X}_{0}$, which is the estimator that minimizes the MSE under a perfect perceptual index constraint (Equation 3). To achieve this goal, we introduced Posterior-Mean Rectified Flow (PMRF), a simple yet highly effective image restoration algorithm that outperforms previous frameworks (e.g., posterior sampling, flow from $Y$, and GAN-based methods) in a variety of image restoration tasks. As we explained in Section 3, PMRF alleviates the issues resulting from solving the ODE by adding Gaussian noise to the posterior mean predictions. We note that the noise level $\sigma_{s}$ should be carefully tuned, as taking it to be too large or too small may cause the MSE or the perceptual quality of PMRF to degrade, respectively. While the flow from $Y$ method (Appendix D) suffers from the same limitation (and, unlike PMRF, does not provide a theoretical guarantee on the MSE), this may be considered a disadvantage of PMRF compared to posterior sampling methods (e.g., Appendix D), which do not require such a hyper-parameter. Moreover, we proved in Proposition 1 that, under some conditions, PMRF is guaranteed to achieve a smaller MSE than the posterior sampler. However, as in (Liu et al., 2023), one could argue that the assumptions in Proposition 1 may be too limiting in some cases. Finally, we did not provide experiments on general-content images, as this requires training significantly larger models (Crowson et al., 2024). However, we believe that our experiments demonstrate the strength and potential of the PMRF approach, as we showcased its superiority on five different tasks, including the highly challenging blind face image restoration problem.

Reproducibility statement

Our codes are available at https://github.com/ohayonguy/PMRF. We provide all the explanations and checkpoints necessary to reproduce our results, including training, inference, and the computation of the distortion and perceptual quality measures in Section 5. Besides our code, our paper discloses all the implementation details required to reproduce the results, including architecture details, training hyper-parameters, etc. Refer to Sections 5.1, B, 5.2 and C for implementation details, and to Table 12 in the appendix for a summary of our training hyper-parameters.

References

  • Adrai et al. (2023) Theo Adrai, Guy Ohayon, Michael Elad, and Tomer Michaeli. Deep optimal transport: A practical algorithm for photo-realistic image restoration. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  61777–61791. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/c281c5a17ad2e55e1ac1ca825071f991-Paper-Conference.pdf.
  • Albergo et al. (2023) Michael S. Albergo, Mark Goldstein, Nicholas M. Boffi, Rajesh Ranganath, and Eric Vanden-Eijnden. Stochastic interpolants with data-dependent couplings. arXiv, 2023. URL https://arxiv.org/abs/2310.03725.
  • Albergo & Vanden-Eijnden (2023) Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=li7qeBbCR1t.
  • Bendel et al. (2023) Matthew C Bendel, Rizwan Ahmad, and Philip Schniter. A regularized conditional GAN for posterior sampling in image recovery problems. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=z4vKRmq7UO.
  • Bińkowski et al. (2018) Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1lUOzWCW.
  • Blau & Michaeli (2018) Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • Chen et al. (2024) Xiaoxu Chen, Jingfan Tan, Tao Wang, Kaihao Zhang, Wenhan Luo, and Xiaochun Cao. Towards real-world blind face restoration with generative diffusion prior. IEEE Transactions on Circuits and Systems for Video Technology, pp.  1–1, 2024. doi: 10.1109/TCSVT.2024.3383659.
  • Chung et al. (2023) Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OnD9zGAGT0k.
  • Crowson et al. (2024) Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=WRIn2HmtBS.
  • Dahl et al. (2017) Ryan Dahl, Mohammad Norouzi, and Jonathon Shlens. Pixel recursive super resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • Daras et al. (2024) Giannis Daras, Hyungjin Chung, Chieh-Hsin Lai, Yuki Mitsufuji, Peyman Milanfar, Alexandros G. Dimakis, Chul Ye, and Mauricio Delbracio. A survey on diffusion models for inverse problems. 2024. URL https://giannisdaras.github.io/publications/diffusion_survey.pdf.
  • Davison (2003) A. C. Davison. Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2003. doi: 10.1017/CBO9780511815850.
  • Delbracio & Milanfar (2023) Mauricio Delbracio and Peyman Milanfar. Inversion by direct iteration: An alternative to denoising diffusion for image restoration. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=VmyFF5lL3F. Featured Certification.
  • Deng et al. (2019) Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  4685–4694, 2019. doi: 10.1109/CVPR.2019.00482.
  • Denton et al. (2015) Emily L Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/aa169b49b583a2b5af89203c2b78c67c-Paper.pdf.
  • Freirich et al. (2021) Dror Freirich, Tomer Michaeli, and Ron Meir. A theory of the distortion-perception tradeoff in wasserstein space. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  25661–25672. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/d77e68596c15c53c2a33ad143739902d-Paper.pdf.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
  • Gu et al. (2022) Yuchao Gu, Xintao Wang, Liangbin Xie, Chao Dong, Gen Li, Ying Shan, and Ming-Ming Cheng. Vqfr: Blind face restoration with vector-quantized dictionary and parallel decoder. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVIII, pp.  126–143, Berlin, Heidelberg, 2022. Springer-Verlag. ISBN 978-3-031-19796-3. doi: 10.1007/978-3-031-19797-0_8. URL https://doi.org/10.1007/978-3-031-19797-0_8.
  • Guadarrama et al. (2017) Sergio Guadarrama, Ryan Dahl, David Bieber, Jonathon Shlens, Mohammad Norouzi, and Kevin Murphy. Pixcolor: Pixel recursive colorization. In British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4-7, 2017. BMVA Press, 2017. URL https://www.dropbox.com/s/wmnk861irndf8xe/0447.pdf.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/8a1d694707eb0fefe65871369074926d-Paper.pdf.
  • Huang et al. (2008) Gary B. Huang, Marwan Mattar, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. In Workshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, October 2008. Erik Learned-Miller and Andras Ferencz and Frédéric Jurie. URL https://inria.hal.science/inria-00321923.
  • Iizuka et al. (2016) Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (Proc. of SIGGRAPH 2016), 35(4), 2016.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
  • Kaipio & Somersalo (2005) Jari Kaipio and Erkki Somersalo. Statistical and Computational Inverse Problems. Springer, Dordrecht, 2005. doi: 10.1007/b138659. URL https://cds.cern.ch/record/1338003.
  • Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.
  • Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  4396–4405, 2019. doi: 10.1109/CVPR.2019.00453.
  • Kawar et al. (2021a) Bahjat Kawar, Gregory Vaksman, and Michael Elad. Stochastic image denoising by sampling from the posterior distribution. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp.  1866–1875, 2021a. doi: 10.1109/ICCVW54120.2021.00213.
  • Kawar et al. (2021b) Bahjat Kawar, Gregory Vaksman, and Michael Elad. Snips: Solving noisy inverse problems stochastically. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  21757–21769. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/b5c01503041b70d41d80e3dbe31bbd8c-Paper.pdf.
  • Kawar et al. (2022) Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.
  • Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6114.
  • Kynkäänniemi et al. (2019) Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/0234c510bc6d908b28c70ff313743079-Paper.pdf.
  • Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  105–114, 2017. doi: 10.1109/CVPR.2017.19.
  • Li et al. (2023) Bo Li, Kaitao Xue, Bin Liu, and Yu-Kun Lai. Bbdm: Image-to-image translation with brownian bridge diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1952–1961, 2023.
  • Liang et al. (2021) Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp.  1833–1844, October 2021.
  • Lin et al. (2024) Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, and Chao Dong. Diffbir: Towards blind image restoration with generative diffusion prior. arXiv, 2024. URL https://arxiv.org/abs/2308.15070.
  • Lipman et al. (2023) Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t.
  • Liu et al. (2023) Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z.
  • Loshchilov & Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Man et al. (2023) Sean Man, Guy Ohayon, Theo Adrai, and Michael Elad. High-perceptual quality jpeg decoding via posterior sampling. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.  1272–1282, 2023. doi: 10.1109/CVPRW59228.2023.00134.
  • Mittal et al. (2013) Anish Mittal, Rajiv Soundararajan, and Alan C. Bovik. Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2013. doi: 10.1109/LSP.2012.2227726.
  • Murata et al. (2023) Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, and Stefano Ermon. GibbsDDRM: A partially collapsed Gibbs sampler for solving blind inverse problems with denoising diffusion restoration. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  25501–25522. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/murata23a.html.
  • Ohayon et al. (2021) Guy Ohayon, Theo Adrai, Gregory Vaksman, Michael Elad, and Peyman Milanfar. High perceptual quality image denoising with a posterior sampling cgan. In 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp.  1805–1813, 2021. doi: 10.1109/ICCVW54120.2021.00207.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  10684–10695, June 2022.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, SIGGRAPH ’22, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393379. doi: 10.1145/3528233.3530757. URL https://doi.org/10.1145/3528233.3530757.
  • Saharia et al. (2023) Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2023. doi: 10.1109/TPAMI.2022.3204461.
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/8a3363abe792db2d8761d6403605aeb7-Paper.pdf.
  • Song et al. (2023) Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9_gsMA8MRKQ.
  • Talebi & Milanfar (2018) Hossein Talebi and Peyman Milanfar. Nima: Neural image assessment. IEEE Transactions on Image Processing, 27(8):3998–4011, 2018. doi: 10.1109/TIP.2018.2831899.
  • Tong et al. (2024) Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=CD9Snc73AW. Expert Certification.
  • (50) Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In International Conference on Computer Vision Workshops (ICCVW).
  • Wang et al. (2018) Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In The European Conference on Computer Vision Workshops (ECCVW), September 2018.
  • Wang et al. (2021) Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. Towards real-world blind face restoration with generative facial prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Wang et al. (2023a) Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. The Eleventh International Conference on Learning Representations, 2023a.
  • Wang et al. (2004) Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. doi: 10.1109/TIP.2003.819861.
  • Wang et al. (2022) Zhouxia Wang, Jiawei Zhang, Runjian Chen, Wenping Wang, and Ping Luo. Restoreformer: High-quality blind face restoration from undegraded key-value pairs. 2022.
  • Wang et al. (2023b) Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, and Ping Luo. Restoreformer++: Towards real-world blind face restoration from undegraded key-value pairs. 2023b.
  • Yang et al. (2021) Tao Yang, Peiran Ren, Xuansong Xie, and Lei Zhang. Gan prior embedded network for blind face restoration in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Yue & Loy (2024) Zongsheng Yue and Chen Change Loy. Difface: Blind face restoration with diffused error contraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.  1–15, 2024. doi: 10.1109/TPAMI.2024.3432651.
  • Zhang et al. (2021) Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE International Conference on Computer Vision, pp.  4791–4800, 2021.
  • Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
  • Zhang et al. (2017) Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG), 9(4), 2017.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • Zhou et al. (2022) Shangchen Zhou, Kelvin C.K. Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. In NeurIPS, 2022.
  • Zhu et al. (2024) Yixuan Zhu, Wenliang Zhao, Ao Li, Yansong Tang, Jie Zhou, and Jiwen Lu. Flowie: Efficient image enhancement via rectified flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  13–22, June 2024.
  • Zhu et al. (2023) Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, and Luc Van Gool. Denoising diffusion models for plug-and-play image restoration. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (NTIRE), 2023.

Appendix A Supplementary explanations for PMRF

A.1 Proof that conditioning on $\hat{X}^{*}$ achieves the same MSE as posterior sampling

Proposition 2.

Let $\hat{X}'$ be the estimator which, given any degraded measurement $y$, first predicts the posterior mean $\hat{x}^{*}=\mathbb{E}[X|Y=y]$ and then samples from $p_{X|\hat{X}^{*}}(\cdot|\hat{x}^{*})$.³ Then, the MSE of $\hat{X}'$ equals twice the MMSE, which is the MSE attained by the posterior sampler.

³ Note that $\hat{X}'$ is a “posterior sampler” which is conditioned on $\hat{X}^{*}$. Thus, Appendix D represents such an algorithm, which is one of the baseline methods we evaluate in Section 5.2.

Proof.

The MSE of $\hat{X}'$ is given by

$$\mathbb{E}[\lVert X-\hat{X}'\rVert^{2}]=\mathbb{E}[\lVert X-\hat{X}^{*}\rVert^{2}]+\mathbb{E}[\lVert \hat{X}'-\hat{X}^{*}\rVert^{2}], \tag{14}$$

where this equality follows from Lemma 2 in (Freirich et al., 2021) (Appendix B.1). By the definition of $\hat{X}'$ we have $p_{\hat{X}',\hat{X}^{*}}=p_{X,\hat{X}^{*}}$, so

$$\mathbb{E}[\lVert \hat{X}'-\hat{X}^{*}\rVert^{2}]=\mathbb{E}[\lVert X-\hat{X}^{*}\rVert^{2}]. \tag{15}$$

Substituting this result into Equation 14, we get

$$\mathbb{E}[\lVert X-\hat{X}'\rVert^{2}]=2\,\mathbb{E}[\lVert X-\hat{X}^{*}\rVert^{2}]. \tag{16}$$

Namely, $\hat{X}'$ attains precisely the same MSE as the posterior sampler, which is equal to twice the MMSE (Blau & Michaeli, 2018). Thus, in theory, one should not expect to improve the MSE of a conditional diffusion/flow model by supplying $\hat{X}^{*}$ as a condition instead of $Y$. ∎
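The decomposition in Equations 14 to 16 can be checked numerically in a simple case. The sketch below is a Monte Carlo check in a one-dimensional Gaussian toy problem, $X\sim\mathcal{N}(0,1)$ and $Y=X+\sigma_N N$ with $N\sim\mathcal{N}(0,1)$, which is an assumption of this example. In this setting $\hat{X}^{*}=Y/(1+\sigma_N^{2})$ is an invertible function of $Y$, so $p_{X|\hat{X}^{*}}(\cdot|\hat{x}^{*})$ is Gaussian with mean $\hat{x}^{*}$ and variance $\sigma_N^{2}/(1+\sigma_N^{2})$. The function name `check_proposition_2` is hypothetical.

```python
import numpy as np


def check_proposition_2(sigma_n=1.0, n=500_000, seed=0):
    """Estimate the three terms of Equation 14 in a 1D Gaussian toy problem."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)                        # X ~ N(0, 1)
    y = x + sigma_n * rng.standard_normal(n)          # Y = X + sigma_n * N
    x_star = y / (1.0 + sigma_n ** 2)                 # posterior mean E[X | Y]
    post_var = sigma_n ** 2 / (1.0 + sigma_n ** 2)    # Var(X | Y), the MMSE here
    # Sample X' from p_{X | X*}; in this toy problem it is N(x_star, post_var).
    x_prime = x_star + np.sqrt(post_var) * rng.standard_normal(n)
    mmse = float(np.mean((x - x_star) ** 2))          # E||X - X*||^2
    spread = float(np.mean((x_prime - x_star) ** 2))  # E||X' - X*||^2
    mse_prime = float(np.mean((x - x_prime) ** 2))    # E||X - X'||^2
    return mmse, spread, mse_prime
```

Up to Monte Carlo error, the third returned value should match the sum of the first two (Equation 14), and it should be close to twice the first (Equation 16).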

A.2 Proof of Proposition 1

For completeness, we first restate Proposition 1 and then provide its proof. See Proposition 1.

Proof.

We first prove (a) and (b) assuming that the solution for the ODE in Equation 11 exists and is unique for $\sigma_s=0$. Then, we will prove (c) by also assuming that the distribution of $(X-\hat{X}^{*})|Z_t=z_t$ is non-degenerate for almost every $z_t$ and $t\in[0,1]$.

From Theorem 3.3 in (Liu et al., 2023) we have $p_{\hat{Z}_t}=p_{Z_t}$ for every $t\in[0,1]$. This implies that $p_{\hat{Z}_1}=p_{Z_1}=p_X$, i.e., PMRF attains a perfect perceptual index when $\sigma_s=0$. This proves (a).

Next, without additional assumptions, we will prove (b) by showing that

$$\mathbb{E}[\lVert\hat{Z}_1-\hat{X}^{*}\rVert^{2}]\leq\mathbb{E}[\lVert X-\hat{X}^{*}\rVert^{2}], \tag{17}$$

which will imply that the MSE of $\hat{Z}_1$ can only be smaller than that of the posterior sampler. Since $\sigma_s=0$, we have $Z_0=\hat{X}^{*}+\sigma_s\epsilon=\hat{X}^{*}$. Following similar arguments to those in the proof of Theorem 3.5 in (Liu et al., 2023), it holds that

\begin{align}
\mathbb{E}[\lVert\hat{Z}_1-\hat{X}^{*}\rVert^{2}] &= \mathbb{E}\left[\left\lVert\int_0^1 v_{\text{RF}}(\hat{Z}_t,t)\,dt\right\rVert^{2}\right] \tag{18}\\
&= \mathbb{E}\left[\left\lVert\int_0^1 v_{\text{RF}}(Z_t,t)\,dt\right\rVert^{2}\right] \tag{19}\\
&\leq \mathbb{E}\left[\int_0^1 \left\lVert v_{\text{RF}}(Z_t,t)\right\rVert^{2}dt\right] \tag{20}\\
&= \mathbb{E}\left[\int_0^1 \left\lVert\mathbb{E}[X-\hat{X}^{*}|Z_t]\right\rVert^{2}dt\right] \tag{21}\\
&\leq \mathbb{E}\left[\int_0^1 \mathbb{E}[\lVert X-\hat{X}^{*}\rVert^{2}|Z_t]\,dt\right] \tag{22}\\
&= \int_0^1 \mathbb{E}\left[\mathbb{E}[\lVert X-\hat{X}^{*}\rVert^{2}|Z_t]\right]dt \tag{23}\\
&= \int_0^1 \mathbb{E}[\lVert X-\hat{X}^{*}\rVert^{2}]\,dt \tag{24}\\
&= \mathbb{E}[\lVert X-\hat{X}^{*}\rVert^{2}], \tag{25}
\end{align}

where Equation 18 follows from the definition of $\hat{Z}_1$ and $\hat{X}^*$, Equation 19 follows from the fact that $p_{\hat{Z}_t} = p_{Z_t}$, Equation 20 follows from Jensen’s inequality, Equation 21 follows from the definition of $v_{\text{RF}}(Z_t, t)$, Equation 22 follows from Jensen’s inequality, Equation 23 follows from the linearity of the integral operator, and Equation 24 follows from the law of total expectation. Thus, we have $\mathbb{E}[\lVert \hat{Z}_1 - \hat{X}^* \rVert^2] \leq \mathbb{E}[\lVert X - \hat{X}^* \rVert^2]$. Combining this result with Lemma 2 from (Freirich et al., 2021) (Appendix B.1), we conclude that

\begin{align*}
\mathbb{E}[\lVert X - \hat{Z}_1 \rVert^2] &= \mathbb{E}[\lVert X - \hat{X}^* \rVert^2] + \mathbb{E}[\lVert \hat{Z}_1 - \hat{X}^* \rVert^2] \\
&\leq 2\mathbb{E}[\lVert X - \hat{X}^* \rVert^2], \tag{26}
\end{align*}

where the left hand side is the MSE of PMRF, and the right hand side is the MSE of the posterior sampler, which always equals twice the MMSE (Blau & Michaeli, 2018).
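As a sanity check on the equality in Equation 26, the sketch below estimates both sides by Monte Carlo. It assumes a scalar toy setup that is not taken from this proof: $X \sim \mathcal{N}(0,1)$, $Y = X + N$ with independent $N \sim \mathcal{N}(0,\sigma_N^2)$, and a hypothetical estimator $\tanh(Y)$ standing in for $\hat{Z}_1$. The decomposition only needs the estimator to depend on $X$ through $Y$, so any such function would do; the function name and sample size are illustrative.

```python
import math
import random

def check_mse_decomposition(sigma_n=1.0, num_samples=100_000, seed=0):
    """Monte Carlo estimate of the three terms in the equality of Equation 26.

    Assumed toy setup: X ~ N(0, 1), Y = X + N, N ~ N(0, sigma_n^2),
    with tanh(Y) as a hypothetical stand-in for the estimator Z_1.
    """
    rng = random.Random(seed)
    shrink = 1.0 / (1.0 + sigma_n ** 2)  # posterior mean is shrink * Y here
    mse = mmse = gap = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, sigma_n)
        x_star = shrink * y        # posterior mean E[X | Y]
        z = math.tanh(y)           # any function of Y could be used instead
        mse += (x - z) ** 2        # E||X - Z_1||^2
        mmse += (x - x_star) ** 2  # E||X - X*||^2
        gap += (z - x_star) ** 2   # E||Z_1 - X*||^2
    return mse / num_samples, mmse / num_samples, gap / num_samples

mse, mmse, gap = check_mse_decomposition()
print(f"MSE = {mse:.4f}, MMSE + gap = {mmse + gap:.4f}")
```

The two printed values should agree up to sampling noise. The inequality in Equation 26 additionally requires the gap term to be at most the MMSE, which is what Equations 18 to 25 establish for PMRF and which an arbitrary estimator such as $\tanh(Y)$ need not satisfy.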

Finally, to prove (c), let us further assume that $(X - \hat{X}^*) \mid Z_t = z_t$ is a non-degenerate random vector for every $z_t \in \operatorname{supp} p_{Z_t}$ and $t \in [0,1]$. Thus, the inequality in Equation 22 becomes strict (from Jensen’s inequality for strictly convex functions), and hence we have $\mathbb{E}[\lVert \hat{Z}_1 - \hat{X}^* \rVert^2] < \mathbb{E}[\lVert X - \hat{X}^* \rVert^2]$. Combining this result with Lemma 2 from (Freirich et al., 2021) (Appendix B.1), we conclude that

\begin{align*}
\mathbb{E}[\lVert X - \hat{Z}_1 \rVert^2] < 2\mathbb{E}[\lVert X - \hat{X}^* \rVert^2]. \tag{27}
\end{align*}

Namely, the MSE of $\hat{Z}_1$ (left hand side) is strictly smaller than that of the posterior sampler (right hand side). ∎
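The strictness argument relies on a standard fact: for a random variable $V$, the Jensen gap $\mathbb{E}[V^2] - (\mathbb{E}[V])^2$ equals the variance of $V$, which vanishes only when $V$ is degenerate. A minimal sketch for a discrete scalar $V$ (the helper name `jensen_gap` and the two example distributions are hypothetical):

```python
def jensen_gap(values, probs):
    """Return E[V^2] - (E[V])^2 for a discrete scalar random variable V."""
    mean = sum(p * v for v, p in zip(values, probs))
    second_moment = sum(p * v * v for v, p in zip(values, probs))
    return second_moment - mean ** 2

# Degenerate V (all mass on one value): the gap is zero, Jensen holds with equality.
degenerate_gap = jensen_gap([1.0, 1.0], [0.5, 0.5])
# Non-degenerate V: the gap is positive, so Jensen's inequality is strict.
spread_gap = jensen_gap([-1.0, 1.0], [0.5, 0.5])
print(degenerate_gap, spread_gap)
```

In the proof, $V$ plays the role of $(X - \hat{X}^*) \mid Z_t = z_t$, so a positive gap for every $z_t$ and $t$ turns the inequality in Equation 22 into a strict one.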

A.3 Proof of the results in Example 1

From (Freirich et al., 2021; Blau & Michaeli, 2018), we know that $\hat{X}_0$ in Example 1 attains an MSE that is strictly smaller than that of the posterior sampler (assuming that $\sigma_N > 0$). Specifically, the closed-form solution of $\hat{X}_0$ in Example 1 is given by (Freirich et al., 2021):

\begin{align*}
\hat{X}_0 = \frac{1}{\sqrt{1+\sigma_N^2}} Y. \tag{28}
\end{align*}

Moreover, in this example, it is well known that the posterior mean $\mathbb{E}[X \mid Y]$ is given by

\begin{align*}
\hat{X}^* = \frac{1}{1+\sigma_N^2} Y. \tag{29}
\end{align*}
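To make the comparison concrete, the following sketch evaluates the MSE of the three estimators in closed form. It assumes the second moments $\mathbb{E}[X^2] = \mathbb{E}[XY] = 1$ and $\mathbb{E}[Y^2] = 1 + \sigma_N^2$, which are the values used in the covariance computations of this appendix; the function names are illustrative.

```python
import math

def mse_posterior_mean(sigma_n):
    """MMSE: E[(X - Y / (1 + sigma_n^2))^2] = sigma_n^2 / (1 + sigma_n^2)."""
    return sigma_n ** 2 / (1.0 + sigma_n ** 2)

def mse_posterior_sampler(sigma_n):
    """The posterior sampler attains twice the MMSE (Blau & Michaeli, 2018)."""
    return 2.0 * mse_posterior_mean(sigma_n)

def mse_x0(sigma_n):
    """MSE of X_0 = Y / sqrt(1 + sigma_n^2) from Equation 28.

    Expanding E[(X - aY)^2] = 1 - 2a + a^2 (1 + sigma_n^2) with
    a = 1 / sqrt(1 + sigma_n^2) gives 2 - 2 / sqrt(1 + sigma_n^2).
    """
    return 2.0 - 2.0 / math.sqrt(1.0 + sigma_n ** 2)

for sigma_n in (0.1, 1.0, 3.0):
    print(sigma_n, mse_posterior_mean(sigma_n), mse_x0(sigma_n),
          mse_posterior_sampler(sigma_n))
```

For every $\sigma_N > 0$ the three values are strictly ordered: the MMSE is smallest, $\hat{X}_0$ sits in the middle, and the posterior sampler is largest.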

Next, we will prove that:

  (a) All the assumptions in Proposition 1 hold.

  (b) $\hat{Z}_1 = \hat{X}_0$ almost surely.

Proof of (a).

Since $\sigma_s = 0$, we have $v_{\text{RF}}(Z_t, t) = \mathbb{E}[X - \hat{X}^* \mid Z_t]$ and $Z_t = tX + (1-t)\hat{X}^*$. Below, we show that

\begin{align*}
\text{Cov}(X - \hat{X}^*, Z_t) &= t\frac{\sigma_N^2}{1+\sigma_N^2}, \quad \text{and} \tag{30} \\
\text{Var}(Z_t) &= t^2\frac{\sigma_N^2}{1+\sigma_N^2} + \frac{1}{1+\sigma_N^2}. \tag{31}
\end{align*}

Since $X - \hat{X}^*$ and $Z_t$ are jointly Gaussian (both can be written as a linear transformation of $(X, Y)$, which are jointly Gaussian random variables), we have

\begin{align*}
v_{\text{RF}}(Z_t, t) &= \mathbb{E}[X - \hat{X}^* \mid Z_t] \\
&= \mathbb{E}[X - \hat{X}^*] + \frac{\text{Cov}(X - \hat{X}^*, Z_t)}{\text{Var}(Z_t)}(Z_t - \mathbb{E}[Z_t]) \\
&= \frac{\text{Cov}(X - \hat{X}^*, Z_t)}{\text{Var}(Z_t)} Z_t \tag{32} \\
&= \frac{t\frac{\sigma_N^2}{1+\sigma_N^2}}{t^2\frac{\sigma_N^2}{1+\sigma_N^2} + \frac{1}{1+\sigma_N^2}} Z_t \\
&= \frac{t\sigma_N^2}{1+t^2\sigma_N^2} Z_t, \tag{33}
\end{align*}

where Equation 32 follows from the fact that $\mathbb{E}[X - \hat{X}^*] = 0$ and $\mathbb{E}[Z_t] = 0$. One can verify that the solution of $d\hat{Z}_t = v_{\text{RF}}(\hat{Z}_t, t)\,dt$ for any initial condition $\hat{Z}_0 = c$ is unique and is given by

\begin{align*}
\hat{Z}_t = c\sqrt{1+t^2\sigma_N^2}. \tag{34}
\end{align*}
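The closed-form solution in Equation 34 can be checked numerically. The sketch below integrates the ODE with the vector field of Equation 33 using a classical fourth-order Runge-Kutta scheme and compares the result at $t = 1$ with $c\sqrt{1+\sigma_N^2}$. The solver, the step count, and the values of $c$ and $\sigma_N$ are illustrative choices, not part of the method.

```python
import math

def v_rf(z, t, sigma_n):
    """Vector field of Equation 33: t * sigma_n^2 / (1 + t^2 * sigma_n^2) * z."""
    return t * sigma_n ** 2 / (1.0 + t ** 2 * sigma_n ** 2) * z

def integrate_flow(c, sigma_n, num_steps=100):
    """Integrate dz/dt = v_rf(z, t) from t = 0 to t = 1 with RK4, starting at z = c."""
    z = c
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        k1 = v_rf(z, t, sigma_n)
        k2 = v_rf(z + 0.5 * h * k1, t + 0.5 * h, sigma_n)
        k3 = v_rf(z + 0.5 * h * k2, t + 0.5 * h, sigma_n)
        k4 = v_rf(z + h * k3, t + h, sigma_n)
        z += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return z

def closed_form(c, t, sigma_n):
    """Equation 34: Z_t = c * sqrt(1 + t^2 * sigma_n^2)."""
    return c * math.sqrt(1.0 + t ** 2 * sigma_n ** 2)

numeric = integrate_flow(c=0.7, sigma_n=1.5)
exact = closed_form(c=0.7, t=1.0, sigma_n=1.5)
print(abs(numeric - exact))
```

The printed difference should be close to zero, limited only by the discretization error of the solver.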

To show that the distribution of $(X - \hat{X}^*) \mid Z_t = z_t$ is non-degenerate for almost every $z_t$ and $t \in [0,1]$, note that

\begin{align*}
\text{Var}(X - \hat{X}^*) &= \text{Cov}(X - \hat{X}^*, X - \hat{X}^*) \\
&= \text{Cov}(X, X) - 2\text{Cov}(X, \hat{X}^*) + \text{Cov}(\hat{X}^*, \hat{X}^*) \\
&= 1 - 2\text{Cov}\left(X, \frac{1}{1+\sigma_N^2} Y\right) + \text{Cov}\left(\frac{1}{1+\sigma_N^2} Y, \frac{1}{1+\sigma_N^2} Y\right) \\
&= 1 - \frac{2}{1+\sigma_N^2}\text{Cov}(X, Y) + \frac{1}{(1+\sigma_N^2)^2}\text{Cov}(Y, Y) \\
&= 1 - \frac{2}{1+\sigma_N^2} + \frac{1}{1+\sigma_N^2} \\
&= 1 - \frac{1}{1+\sigma_N^2} \\
&= \frac{\sigma_N^2}{1+\sigma_N^2}. \tag{35}
\end{align*}

Thus, for any $t > 0$, and assuming $\sigma_N > 0$, the correlation between $X - \hat{X}^*$ and $Z_t$ is given by

\begin{align*}
\frac{\text{Cov}(X - \hat{X}^*, Z_t)}{\sqrt{\text{Var}(Z_t)\text{Var}(X - \hat{X}^*)}} &= \frac{t\frac{\sigma_N^2}{1+\sigma_N^2}}{\sqrt{\left(t^2\frac{\sigma_N^2}{1+\sigma_N^2} + \frac{1}{1+\sigma_N^2}\right)\left(\frac{\sigma_N^2}{1+\sigma_N^2}\right)}} \\
&= \frac{t\sigma_N}{\sqrt{t^2\sigma_N^2 + 1}} \\
&= \frac{1}{\sqrt{1 + \frac{1}{t^2\sigma_N^2}}} \\
&< 1. \tag{36}
\end{align*}

Namely, the correlation between $X - \hat{X}^*$ and $Z_t$ is strictly smaller than $1$ for every $t \in (0,1]$. Moreover, for $t = 0$ the correlation between $X - \hat{X}^*$ and $Z_t$ clearly equals zero, so such a correlation is smaller than 1 for every $t \in [0,1]$. This implies that the distribution of $(X - \hat{X}^*) \mid Z_t = z_t$ is non-degenerate for almost every $z_t$ and $t \in [0,1]$, and so all the assumptions in Proposition 1 hold.
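The non-degeneracy claim can also be read off the conditional variance. For jointly Gaussian scalars with correlation $\rho$, the conditional variance of one given the other is the marginal variance times $1 - \rho^2$, so a correlation strictly below 1 leaves a strictly positive conditional variance. The sketch below computes both quantities from Equations 30, 31 and 35; the function names and the grid of $t$ values are illustrative.

```python
import math

def correlation(t, sigma_n):
    """Correlation between X - X* and Z_t, built from Equations 30, 31 and 35."""
    cov = t * sigma_n ** 2 / (1.0 + sigma_n ** 2)                # Equation 30
    var_z = (t ** 2 * sigma_n ** 2 + 1.0) / (1.0 + sigma_n ** 2)  # Equation 31
    var_r = sigma_n ** 2 / (1.0 + sigma_n ** 2)                   # Equation 35
    return cov / math.sqrt(var_z * var_r)

def conditional_variance(t, sigma_n):
    """Var(X - X* | Z_t) = Var(X - X*) * (1 - rho^2) for jointly Gaussian variables."""
    var_r = sigma_n ** 2 / (1.0 + sigma_n ** 2)
    return var_r * (1.0 - correlation(t, sigma_n) ** 2)

for t in (0.0, 0.25, 0.5, 1.0):
    print(t, correlation(t, 2.0), conditional_variance(t, 2.0))
```

Substituting the correlation gives $\text{Var}(X - \hat{X}^* \mid Z_t) = \frac{\sigma_N^2}{(1+\sigma_N^2)(1+t^2\sigma_N^2)}$, which is positive for every $t \in [0,1]$ whenever $\sigma_N > 0$.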

To prove Equations 30 and 31, first note that $\text{Cov}(X, \hat{X}^*) = \text{Cov}(\hat{X}^*, \hat{X}^*) = \frac{1}{1+\sigma_N^2}$, and so $\text{Cov}(X, \hat{X}^*) - \text{Cov}(\hat{X}^*, \hat{X}^*) = 0$. Thus,

\begin{align*}
\text{Cov}(X - \hat{X}^*, Z_t) &= \text{Cov}(X - \hat{X}^*, tX + (1-t)\hat{X}^*) \\
&= t(\text{Cov}(X, X) - \text{Cov}(X, \hat{X}^*)) + (1-t)(\text{Cov}(X, \hat{X}^*) - \text{Cov}(\hat{X}^*, \hat{X}^*)) \\
&= t\left(1 - \frac{1}{1+\sigma_N^2}\right) \\
&= t\frac{\sigma_N^2}{1+\sigma_N^2}, \tag{37}
\end{align*}

and,

\begin{align*}
\text{Var}(Z_t) &= \text{Cov}(Z_t, Z_t) \\
&= \text{Cov}(tX + (1-t)\hat{X}^*, tX + (1-t)\hat{X}^*) \\
&= t^2\text{Cov}(X, X) + 2t(1-t)\text{Cov}(X, \hat{X}^*) + (1-t)^2\text{Cov}(\hat{X}^*, \hat{X}^*) \\
&= t^2 + (2t(1-t) + (1-t)^2)\frac{1}{1+\sigma_N^2} \\
&= t^2 + (2t - 2t^2 + 1 - 2t + t^2)\frac{1}{1+\sigma_N^2} \\
&= t^2 + (1 - t^2)\frac{1}{1+\sigma_N^2} \\
&= t^2\frac{\sigma_N^2}{1+\sigma_N^2} + \frac{1}{1+\sigma_N^2}. \tag{38}
\end{align*}

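Equations 30 and 31 can be cross-checked empirically. The sketch below draws samples under the assumed setup $X \sim \mathcal{N}(0,1)$, $Y = X + N$ with independent $N \sim \mathcal{N}(0,\sigma_N^2)$ (consistent with the covariances used above), forms $Z_t = tX + (1-t)\hat{X}^*$, and compares sample moments with the closed forms. Function names, the sample size, and the chosen $t$ and $\sigma_N$ are illustrative.

```python
import random

def empirical_moments(t, sigma_n, num_samples=100_000, seed=0):
    """Sample estimates of Cov(X - X*, Z_t) and Var(Z_t)."""
    rng = random.Random(seed)
    shrink = 1.0 / (1.0 + sigma_n ** 2)  # posterior mean is shrink * Y (Equation 29)
    cov_sum = var_sum = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, sigma_n)
        x_star = shrink * y
        z_t = t * x + (1.0 - t) * x_star
        # X - X* and Z_t both have zero mean, so raw second moments suffice.
        cov_sum += (x - x_star) * z_t
        var_sum += z_t * z_t
    return cov_sum / num_samples, var_sum / num_samples

def closed_form_moments(t, sigma_n):
    """Right-hand sides of Equations 30 and 31."""
    ratio = sigma_n ** 2 / (1.0 + sigma_n ** 2)
    return t * ratio, t ** 2 * ratio + 1.0 / (1.0 + sigma_n ** 2)

emp_cov, emp_var = empirical_moments(t=0.6, sigma_n=1.5)
ref_cov, ref_var = closed_form_moments(t=0.6, sigma_n=1.5)
print(emp_cov, ref_cov, emp_var, ref_var)
```

The empirical and closed-form values should match up to Monte Carlo error.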
Proof of (b).

The proof follows directly from Equation 34. Specifically, for the initial condition $\hat{Z}_0 = \hat{X}^*$, we have

\begin{align*}
\hat{Z}_1 &= \sqrt{1+\sigma_N^2}\,\hat{X}^* \\
&= \sqrt{1+\sigma_N^2}\,\frac{1}{1+\sigma_N^2} Y \\
&= \frac{1}{\sqrt{1+\sigma_N^2}} Y \tag{39} \\
&= \hat{X}_0. \tag{40}
\end{align*}

Thus, in Example 1, PMRF with $\sigma_s = 0$ coincides with the desired optimal estimator $\hat{X}_0$.

A.4 Reflow (optional)

To potentially improve the MSE of PMRF further, one may conduct a reflow procedure (Liu et al., 2023), where a sequence of flow models is trained, and the flow model at index $k+1$ learns to flow from the source distribution to the distribution generated by the flow model at index $k$. Specifically, let $\hat{Z}_1^{k+1}$ be the random vector generated by PMRF (Section 3), where $\hat{Z}_1^{k}$ replaces the role of $X$ in Section 3 and $\hat{Z}_1^{0} = X$ ($Z_0$ remains unchanged). Thus, from Theorem 3.5 in (Liu et al., 2023), we have $\mathbb{E}[c(\hat{Z}_1^{k+1} - Z_0)] \leq \mathbb{E}[c(\hat{Z}_1^{k} - Z_0)]$, which implies that reflowing may only improve the MSE of PMRF, and hence improve the approximation of the desired optimal transport map (Equation 4). We leave this possibility for future work.

Appendix B Supplementary details and experiments in blind face image restoration

Unfortunately, we do not compare with FlowIE (Zhu et al., 2024), as the checkpoints in the official repository of this method do not seem to work at the moment. Note that FlowIE is a conditional method that utilizes a ControlNet (similarly to DiffBIR), so it is not similar to our PMRF algorithm.

B.1 Implementation details of PMRF

During training, we only use random horizontal flips for data augmentation. We use the SwinIR (Liang et al., 2021) model trained by Yue & Loy (2024) as the posterior mean predictor $f_{\omega^*}$ in Section 3, and use $\sigma_s = 0.1$. This model was trained using the same synthetic degradation as in Equation 12, with the same ranges for $\sigma$, $R$, $\delta$, and $Q$ we mentioned in Section 5.1. The SwinIR model’s weights are kept frozen during the vector field’s training stage, and the same weights are utilized during inference as well. The vector field $v_\theta$ is a HDiT model (Crowson et al., 2024), which we train from scratch. As in (Crowson et al., 2024), we sample $t$ from $U[0,1]$ using a stratified sampling strategy. The vector field is trained for 3850 epochs using the AdamW optimizer (Loshchilov & Hutter, 2019), with a learning rate of $5\cdot 10^{-4}$, $(\beta_1,\beta_2)=(0.9,0.95)$, and a weight decay of $10^{-2}$ (as in (Crowson et al., 2024)). In the last 350 epochs, we reduce the learning rate gradually, multiplying it by $0.98$ at the end of every epoch. The training batch size is set to 256 and is kept fixed. We compute the exponential moving average (EMA) of the model’s weights, using a decay of $0.9999$. The EMA weights of the model are then used in all evaluations. Our model is trained using bfloat16 mixed precision. A summary of the vector field training hyper-parameters is provided in Table 12.
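The learning rate schedule and the EMA described above can be sketched in plain Python as follows. This is only an illustrative sketch and not the official training code: the names `learning_rate` and `ema_update` are hypothetical, the model weights are represented as a flat list of floats, and we assume that the first decay is applied at the end of the first of the last 350 epochs.

```python
def learning_rate(epoch, base_lr=5e-4, total_epochs=3850, decay_epochs=350, gamma=0.98):
    """Learning rate used during `epoch` (0-indexed).

    Assumption: the rate is constant for the first 3500 epochs, and is then
    multiplied by `gamma` at the end of each of the last 350 epochs.
    """
    decay_start = total_epochs - decay_epochs
    # Number of end-of-epoch decays that were applied before this epoch began.
    num_decays = max(0, epoch - decay_start)
    return base_lr * gamma ** num_decays


def ema_update(ema_weights, weights, decay=0.9999):
    """One exponential moving average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Example usage with toy values.
lr_first_epoch = learning_rate(0)
lr_last_epoch = learning_rate(3849)
ema = ema_update([1.0, 2.0], [0.0, 0.0])
```

In an actual training loop, `ema_update` would typically be called after every optimizer step, and the EMA copy (not the raw weights) would be used for evaluation.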

B.2 Varying the number of flow steps $K$ in PMRF

In Tables 2, 3, 4, 5 and 6 we evaluate the performance of PMRF for various choices of $K$ (the number of inference steps in Section 3). As expected, increasing $K$ generally improves the perceptual quality while harming the distortion.

B.3 Details of DOT

We use the official codes of DOT (Adrai et al., 2023) as provided by the authors. This method performs optimal transport between the source and target distributions in latent space, using the closed-form solution for the optimal transport map between two Gaussians. As in (Adrai et al., 2023), we use the VAE (Kingma & Welling, 2014) of stable-diffusion (Rombach et al., 2022). For computing the latent empirical mean and covariance of the target distribution, we provide to the code the first 1000 images from FFHQ, with images of size $512\times 512$ (the default is 100 images, so using 1000 images instead ensures that the performance of DOT is not compromised, as explained by Adrai et al. (2023)). For computing the latent empirical mean and covariance of the source distribution, we randomly synthesize degraded images according to Equation 12 from the first 1000 images in FFHQ, and reconstruct each image using the SwinIR model with the pre-trained weights from (Yue & Loy, 2024) (the same weights we use in PMRF). Given a degraded image $y$ at test time, the code of Adrai et al. (2023) first predicts the posterior mean using the SwinIR model, encodes it to latent space, optimally transports the result using the pre-computed empirical means and covariances, and finally uses the decoder to obtain the reconstructed image.
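For reference, the closed-form optimal transport map between two Gaussians (under the squared Euclidean cost) is a standard result and can be written as below. The symbols $\mu_s, \Sigma_s$ (latent empirical mean and covariance of the source) and $\mu_t, \Sigma_t$ (those of the target) are notation introduced only for this sketch, and we assume that $\Sigma_s$ is positive definite. The exact form implemented in the code of Adrai et al. (2023) may differ in details such as regularization.

```latex
T(z) = \mu_t + \Sigma_s^{-1/2}\left(\Sigma_s^{1/2}\,\Sigma_t\,\Sigma_s^{1/2}\right)^{1/2}\Sigma_s^{-1/2}\,(z-\mu_s)
```

Here $z$ denotes the latent encoding of the posterior mean prediction, and $T(z)$ is the latent that is passed to the decoder.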

B.4 Computation of FID, KID, and Precision

For each data set and algorithm, the FID, KID, and Precision are computed between the entire FFHQ $512\times 512$ training set, and the reconstructed images produced for the degraded images in the test data set (as in previous works). For example, for the evaluations on the CelebA-Test data, this means that the FID is computed between the 70,000 FFHQ images, and the 3,000 CelebA-Test reconstructed images.

Appendix C Supplementary details on Section 5.2

C.1 Degradations

The degraded images in each task in the controlled experiments are synthesized according to the following degradations:

  1. Denoising: We apply additive white Gaussian noise with standard deviation $0.35$.

  2. Super-resolution: We use the $8\times$ bicubic down-sampling operator, and add Gaussian noise with standard deviation $0.05$.

  3. Inpainting: We randomly mask $90\%$ of the pixels in the ground-truth image, and add Gaussian noise with standard deviation $0.1$.

  4. Colorization: We average the color channels in the ground-truth image (with a weight of $\frac{1}{3}$ for each color channel), and add Gaussian noise with standard deviation $0.25$.
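Three of the degradations listed above can be sketched with the Python standard library as follows. This is a toy illustration under several assumptions that are not stated in this section: images are flat lists of pixel values, masked pixels are set to zero, exactly 90% of the pixels are masked (rather than each pixel being masked independently), and the helper names are hypothetical. The bicubic down-sampling used for super-resolution requires an image library and is omitted here.

```python
import random


def add_gaussian_noise(pixels, std, rng):
    """Add i.i.d. zero-mean Gaussian noise with standard deviation `std`."""
    return [p + rng.gauss(0.0, std) for p in pixels]


def to_grayscale(rgb_pixels):
    """Average the three color channels with a weight of 1/3 each."""
    return [(r + g + b) / 3.0 for (r, g, b) in rgb_pixels]


def random_mask(pixels, drop_fraction, rng):
    """Zero out a random subset of the pixels; returns (masked pixels, keep flags)."""
    n = len(pixels)
    dropped = set(rng.sample(range(n), round(drop_fraction * n)))
    keep = [i not in dropped for i in range(n)]
    return [p if k else 0.0 for p, k in zip(pixels, keep)], keep


# Example usage on toy "images" with 100 pixels.
rng = random.Random(0)
rgb_image = [(0.2, 0.4, 0.6)] * 100
gray_image = [0.5] * 100

denoising_input = add_gaussian_noise(gray_image, 0.35, rng)
colorization_input = add_gaussian_noise(to_grayscale(rgb_image), 0.25, rng)
masked, keep = random_mask(gray_image, 0.9, rng)
inpainting_input = add_gaussian_noise(masked, 0.1, rng)
```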

C.2 Implementation details of the flow methods

Training.

For all restoration tasks in Section 5.2, the models are trained on the FFHQ data set with images of size $256\times 256$ (we down-sample the original $1024\times 1024$ images to $256\times 256$). Unlike in the blind face image restoration experiments, where the model is trained on images of size $512\times 512$, here we choose to use a smaller image resolution to save computational resources and achieve shorter training times. During training, we only use random horizontal flips for data augmentation.

Choice of $\sigma_s$.

As expected, we observe that using $\sigma_s = 0$ in both PMRF (Section 3) and the flow from $Y$ method (Appendix D) leads to blurry results with small MSE and large FID. Thus, for a fair comparison, we use the same value of $\sigma_s > 0$ in both methods. For the denoising task we use $\sigma_s = 0.025$, and for the rest of the tasks (inpainting, colorization, and super-resolution), we use $\sigma_s = 0.1$ (footnote 5).

Footnote 5: Note that the “optimal” value of $\sigma_s$ depends on the severity of the restoration task. For example, in a mild image denoising task, the posterior mean $\hat{X}^*$ may already be close to the ground-truth image, so $\sigma_s$ should be smaller compared to a case where the noise is severe.

Posterior mean predictor.

The posterior mean predictor $f_\omega$ is a 4.4M-parameter SwinIR model (footnote 6), which we train from scratch for each task. In all tasks, this model is trained for 1000 epochs, with a fixed batch size of 256, using the AdamW optimizer with a learning rate of $5\times 10^{-4}$, $(\beta_1,\beta_2)=(0.9,0.95)$, without weight decay, and without learning rate scheduling. When utilizing this model in the flow process (e.g., in PMRF), we use the EMA weights computed with a decay of $0.9999$.

Footnote 6: We use the official code for the SwinIR architecture from https://github.com/JingyunLiang/SwinIR. Implementation details and hyper-parameters are provided in our code.

Vector field.

Similarly to Section B.1, the vector field is a HDiT model. The time $t$ in Sections 3, D, D and D is sampled from $U[0,1]$ using a stratified sampling strategy. For all baseline methods and PMRF, we train the vector field for 1000 epochs, use a fixed batch size of 256, adopt the AdamW optimizer with a learning rate of $5\times 10^{-4}$, $(\beta_1,\beta_2)=(0.9,0.95)$, and a weight decay of $10^{-2}$. As in (Crowson et al., 2024), we do not apply learning rate scheduling. Finally, we use the EMA weights for evaluation, using a decay of $0.9999$. A summary of the hyper-parameters is provided in Table 12.
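One common form of stratified sampling of $t$ from $U[0,1]$ is sketched below: the unit interval is split into as many equal strata as there are samples in the batch, and one time step is drawn uniformly inside each stratum. This is an assumption made for illustration; the function name `stratified_uniform` is hypothetical, and the exact scheme of Crowson et al. (2024) is not reproduced here.

```python
import random


def stratified_uniform(batch_size, rng):
    """Draw one t per stratum [i/B, (i+1)/B), so that a batch covers [0, 1) evenly."""
    return [(i + rng.random()) / batch_size for i in range(batch_size)]


# Example usage with the batch size used for training the vector field.
rng = random.Random(0)
timesteps = stratified_uniform(256, rng)
```

Compared to drawing every $t$ independently, each batch is guaranteed to contain time steps spread over the whole interval, which typically reduces the variance of the training loss estimate.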

Evaluation.

We test all models on the CelebA-Test data set, with images of size $256\times 256$. The FID of each method is computed between the entire FFHQ $256\times 256$ training set, and the images produced by the algorithm for the synthesized CelebA-Test degraded images. The MSE is computed between the reconstructed images and the corresponding ground-truth images.

C.3 Details of DOT.

We use DOT (Adrai et al., 2023) similarly to Section B.3, using images of size $256\times 256$ instead of $512\times 512$, and adopting the official codes of the authors. For the source distribution, we randomly synthesize degraded images according to the degradation of each task (Section C.1) from the first 1000 images in FFHQ, reconstruct each image using the SwinIR model we trained for each task (the same weights we use in PMRF), and finally compute the empirical mean and covariance of the reconstructions in latent space.

C.4 Proving that flow from $Y$ is also optimal in Example 1

In Section 5.2 we show that, for the denoising task, PMRF and flow from $Y$ are on par in terms of both perceptual quality and MSE. To provide intuition for this result, we show that flow from $Y$ leads to the desired estimator $\hat{X}_0$ in Example 1 (just like PMRF does).

Specifically, as in Example 1, suppose that $X \sim \mathcal{N}(0,1)$, $N \sim \mathcal{N}(0,\sigma_N^2)$, $\sigma_N > 0$, and $Y = X + N$. In flow from $Y$ with $\sigma_s = 0$ we have $Z_t = tX + (1-t)Y$, and thus $v_{\text{RF}}(Z_t,t) = \mathbb{E}[X - Y | Z_t]$. Below, we show that

\begin{align}
\text{Cov}(X-Y, Z_t) &= (t-1)\sigma_N^2, \quad \text{and} \tag{41}\\
\text{Var}(Z_t) &= \sigma_N^2(t^2-2t+1)+1. \tag{42}
\end{align}

Hence,

\begin{align}
v_{\text{RF}}(Z_t,t) &= \mathbb{E}[X-Y|Z_t] \notag\\
&= \mathbb{E}[X-Y] + \frac{\text{Cov}(X-Y,Z_t)}{\text{Var}(Z_t)}(Z_t - \mathbb{E}[Z_t]) \notag\\
&= \frac{\text{Cov}(X-Y,Z_t)}{\text{Var}(Z_t)} Z_t \tag{43}\\
&= \frac{(t-1)\sigma_N^2}{\sigma_N^2(t^2-2t+1)+1} Z_t, \tag{44}
\end{align}

where the second equality uses the fact that $X-Y$ and $Z_t$ are jointly Gaussian, and Equation 43 holds since $\mathbb{E}[X-Y]=0$ and $\mathbb{E}[Z_t]=0$. One can verify that the solution of $d\hat{Z}_t = v_{\text{RF}}(\hat{Z}_t,t)dt$ for any initial condition $\hat{Z}_0 = c$ is given by

\begin{align}
\hat{Z}_t = c\,\frac{\sqrt{\sigma_N^2(t^2-2t+1)+1}}{\sqrt{1+\sigma_N^2}}. \tag{45}
\end{align}

Namely, setting $c = Y$ and $t = 1$, we have

\begin{align}
\hat{Z}_1 &= \frac{1}{\sqrt{1+\sigma_N^2}} Y \notag\\
&= \hat{X}_0, \tag{46}
\end{align}

where the last equality follows from Equation 28. It follows that flow from $Y$ is also optimal in Example 1, just like PMRF.
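The closed-form solution of the ODE can also be checked numerically. The sketch below integrates the vector field of Equation 44 with a forward Euler solver, starting from $\hat{Z}_0 = Y$, and compares the result at $t=1$ with $Y/\sqrt{1+\sigma_N^2}$. The function names and the chosen values of $Y$ and $\sigma_N$ are hypothetical and serve only as an illustration.

```python
import math


def v_rf(z, t, sigma_n):
    """Vector field of Equation 44 for the scalar Gaussian example."""
    return (t - 1.0) * sigma_n**2 / (sigma_n**2 * (t - 1.0) ** 2 + 1.0) * z


def flow_euler(y, sigma_n, num_steps):
    """Integrate dZ/dt = v_RF(Z, t) from t=0 to t=1 with forward Euler, from Z_0 = y."""
    z = y
    dt = 1.0 / num_steps
    for i in range(num_steps):
        z += v_rf(z, i * dt, sigma_n) * dt
    return z


def optimal_estimate(y, sigma_n):
    """Right-hand side of Equation 46: Y / sqrt(1 + sigma_N^2)."""
    return y / math.sqrt(1.0 + sigma_n**2)


# Example usage: the two values should agree up to the Euler discretization error.
numerical = flow_euler(1.7, 0.5, 10_000)
closed_form = optimal_estimate(1.7, 0.5)
```

With fewer Euler steps the numerical solution deviates more from the closed form, which mirrors the role of the number of flow steps $K$ at inference time.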

Demonstrating Equations 41 and 42 is straightforward. We have

\begin{align}
\text{Cov}(X-Y,Z_t) &= \text{Cov}(X-Y,\, tX+(1-t)Y) \notag\\
&= t\,\text{Cov}(X,X) + (1-t)\text{Cov}(X,Y) - t\,\text{Cov}(X,Y) - (1-t)\text{Cov}(Y,Y) \notag\\
&= t + (1-t) - t - (1-t)(1+\sigma_N^2) \notag\\
&= (t-1)\sigma_N^2, \tag{47}
\end{align}

and

\begin{align}
\text{Var}(Z_t) &= \text{Cov}(tX+(1-t)Y,\, tX+(1-t)Y) \notag\\
&= t^2\,\text{Cov}(X,X) + 2t(1-t)\text{Cov}(X,Y) + (1-t)^2\,\text{Cov}(Y,Y) \notag\\
&= t^2 + 2t(1-t) + (1-t)^2(1+\sigma_N^2) \notag\\
&= t^2 + 2t - 2t^2 + (1-2t+t^2)(1+\sigma_N^2) \notag\\
&= t^2(1-2+1+\sigma_N^2) + 2t(1-1-\sigma_N^2) + 1 + \sigma_N^2 \notag\\
&= t^2\sigma_N^2 - 2t\sigma_N^2 + \sigma_N^2 + 1 \notag\\
&= \sigma_N^2(t^2-2t+1)+1. \tag{48}
\end{align}
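As a sanity check, Equations 41 and 42 can be compared against Monte Carlo estimates. The sketch below assumes that $X$ and $N$ are independent (so that $\text{Cov}(X,Y)=1$ and $\text{Cov}(Y,Y)=1+\sigma_N^2$, as used in the derivation above). The function names, the sample size, and the values of $t$ and $\sigma_N$ are hypothetical choices made for illustration.

```python
import random


def empirical_moments(t, sigma_n, num_samples=100_000, seed=0):
    """Estimate Cov(X - Y, Z_t) and Var(Z_t) by sampling X and N."""
    rng = random.Random(seed)
    diffs, zs = [], []
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)            # X ~ N(0, 1)
        n = rng.gauss(0.0, sigma_n)        # N ~ N(0, sigma_N^2), independent of X
        y = x + n                          # Y = X + N
        diffs.append(x - y)
        zs.append(t * x + (1.0 - t) * y)   # Z_t = tX + (1 - t)Y
    mean_d = sum(diffs) / num_samples
    mean_z = sum(zs) / num_samples
    cov = sum((d - mean_d) * (z - mean_z) for d, z in zip(diffs, zs)) / num_samples
    var = sum((z - mean_z) ** 2 for z in zs) / num_samples
    return cov, var


def closed_form_moments(t, sigma_n):
    """Right-hand sides of Equations 41 and 42."""
    return (t - 1.0) * sigma_n**2, sigma_n**2 * (t**2 - 2.0 * t + 1.0) + 1.0


# Example usage: the estimates should be close to the closed-form values.
estimated = empirical_moments(0.3, 0.8)
exact = closed_form_moments(0.3, 0.8)
```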

Appendix D Indicator RMSE (IndRMSE) derivation

The MSE of any estimator $\hat{X}$ can always be written as

\begin{align}
\mathbb{E}[\lVert X-\hat{X}\rVert^2] &= \mathbb{E}[\lVert \hat{X}-\hat{X}^*\rVert^2] + \mathbb{E}[\lVert X-\hat{X}^*\rVert^2] \tag{49}\\
&= \mathbb{E}[\lVert \hat{X}-\hat{X}^*\rVert^2] + m, \tag{50}
\end{align}

where $\hat{X}^* = \mathbb{E}[X|Y]$ is the MMSE estimator, Equation 49 follows from Lemma 2 in (Freirich et al., 2021) (Appendix B.1), and $m = \mathbb{E}[\lVert X-\hat{X}^*\rVert^2]$ is a constant (the MMSE) that does not depend on $\hat{X}$. Thus, if $f(Y) \approx \hat{X}^*$, we have

\begin{align}
\mathbb{E}[\lVert X-\hat{X}\rVert^2] \approx \mathbb{E}[\lVert \hat{X}-f(Y)\rVert^2] + m, \tag{51}
\end{align}

so $\sqrt{\mathbb{E}[\lVert \hat{X}-f(Y)\rVert^2]}$ may be used as an indicator for $\sqrt{\mathbb{E}[\lVert X-\hat{X}\rVert^2]}$. Future work should investigate the effectiveness of this measure.
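A minimal sketch of how this indicator could be computed on a data set is given below, replacing the expectation with an average over the test images. The function name `ind_rmse` is hypothetical, images are represented as flat lists of pixel values, and the normalization (a sum over pixels rather than a mean, and the pixel value range) is an assumption, since it is not specified in this section.

```python
import math


def ind_rmse(reconstructions, posterior_mean_estimates):
    """Square root of the average squared L2 distance between X_hat and f(Y)."""
    total = 0.0
    for x_hat, f_y in zip(reconstructions, posterior_mean_estimates):
        total += sum((a - b) ** 2 for a, b in zip(x_hat, f_y))
    return math.sqrt(total / len(reconstructions))


# Example usage on two toy "images" with two pixels each.
f_of_y = [[0.0, 0.0], [1.0, 1.0]]
x_hat = [[3.0, 4.0], [1.0, 1.0]]
value = ind_rmse(x_hat, f_of_y)
value_for_f_itself = ind_rmse(f_of_y, f_of_y)
```

By construction, the indicator is exactly zero when the evaluated estimator is $f$ itself, which is consistent with the IndRMSE of $0$ reported for SwinIR in Tables 7 to 10.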

Table 2: Varying the number of flow steps $K$ in PMRF (Section 3) on the CelebA-Test blind face image restoration benchmark. Red, blue and green indicate the best, the second best, and the third best scores, respectively. Increasing the number of steps improves the perceptual quality while hindering the distortion. These results are expected due to the distortion-perception tradeoff.
Perceptual Quality (FID, KID, NIQE, Precision); Distortion (PSNR, SSIM, LPIPS, Deg, LMD)
$K$ FID↓ KID↓ NIQE↓ Precision↑ PSNR↑ SSIM↑ LPIPS↓ Deg↓ LMD↓
3 81.81 0.0811 8.9012 0.2820 27.668 0.7669 0.3582 31.41 2.0340
5 63.77 0.0581 7.4568 0.4563 27.498 0.7601 0.3401 30.80 2.0294
10 44.39 0.0342 5.2648 0.6427 27.017 0.7388 0.3314 30.49 2.0215
25 37.46 0.0257 4.1179 0.7073 26.373 0.7073 0.3470 30.67 2.0303
50 36.63 0.0244 3.8492 0.7050 26.028 0.6896 0.3591 30.89 2.0409
100 36.57 0.0240 3.7311 0.7010 25.810 0.6787 0.3662 31.06 2.0409
Table 3: Varying the number of flow steps $K$ in PMRF (Section 3) on the LFW-Test blind face image restoration benchmark. Red, blue and green indicate the best, the second best, and the third best scores, respectively. Increasing the number of steps generally improves the perceptual quality while hindering the IndRMSE. These results are expected due to the distortion-perception tradeoff.
$K$ FID↓ KID↓ NIQE↓ Precision↑ IndRMSE↓
3 78.2331 0.0692 8.2315 0.3477 3.3934
5 64.3121 0.0524 6.8733 0.5143 3.8008
10 51.9845 0.0387 4.9896 0.6546 4.8648
25 49.3151 0.0366 4.0028 0.6692 6.1382
50 49.5581 0.0375 3.7126 0.6826 6.7960
100 49.6561 0.0377 3.6242 0.6710 7.2004
Table 4: Varying the number of flow steps $K$ in PMRF (Section 3) on the WIDER-Test blind face image restoration benchmark. Red, blue and green indicate the best, the second best, and the third best scores, respectively. Increasing the number of steps generally improves the perceptual quality while hindering the IndRMSE. These results are expected due to the distortion-perception tradeoff.
$K$ FID↓ KID↓ NIQE↓ Precision↑ IndRMSE↓
3 85.0361 0.0704 9.9988 0.2742 5.3486
5 65.2563 0.0451 8.4650 0.5381 5.7665
10 42.5002 0.0179 5.5677 0.7144 7.1134
25 41.2685 0.0160 4.0726 0.7144 9.2164
50 41.4446 0.0174 3.6953 0.6845 10.3403
100 42.9437 0.0183 3.5704 0.6907 11.0674
Table 5: Varying the number of flow steps $K$ in PMRF (Section 3) on the WebPhoto-Test blind face image restoration benchmark. Red, blue and green indicate the best, the second best, and the third best scores, respectively. Increasing the number of steps generally improves the perceptual quality while hindering the IndRMSE. These results are expected due to the distortion-perception tradeoff.
$K$ FID↓ KID↓ NIQE↓ Precision↑ IndRMSE↓
3 128.7858 0.0996 9.1626 0.3907 3.2961
5 113.4734 0.0782 7.5893 0.5553 3.7371
10 91.3677 0.0484 5.4199 0.6413 4.8369
25 81.0642 0.0347 4.2402 0.6462 6.3098
50 78.7174 0.0324 3.9512 0.6265 7.0159
100 79.1239 0.0313 3.7990 0.5602 7.6887
Table 6: Varying the number of flow steps $K$ in PMRF (Section 3) on the CelebAdult-Test blind face image restoration benchmark. Red, blue and green indicate the best, the second best, and the third best scores, respectively. Increasing the number of steps generally improves the perceptual quality while hindering the IndRMSE. These results are expected due to the distortion-perception tradeoff.
$K$ FID↓ KID↓ NIQE↓ Precision↑ IndRMSE↓
3 122.8780 0.0551 6.6818 0.3944 3.7339
5 113.7837 0.0426 5.5810 0.4444 4.3313
10 105.7426 0.0319 4.4119 0.6111 5.4908
25 102.8914 0.0293 3.7367 0.5500 6.7145
50 102.1454 0.0276 3.5609 0.6278 7.3004
100 102.0568 0.0279 3.4878 0.5944 7.7286
Table 7: Quantitative evaluation of blind face restoration algorithms on the LFW-Test data set.
Method FID↓ KID↓ NIQE↓ Precision↑ IndRMSE↓
SwinIR (≈ Posterior mean) 87.34 0.0808 8.595 0.2513 0
DOT 97.09 0.0891 5.705 0.1806 26.24
RestoreFormer++ 50.80 0.0386 3.911 0.6330 9.429
RestoreFormer 49.04 0.0355 4.168 0.6674 12.21
CodeFormer 52.82 0.0387 4.484 0.6756 9.534
VQFRv1 51.31 0.0399 3.590 0.6014 11.26
VQFRv2 51.16 0.0378 3.761 0.6154 16.15
GFPGAN 47.59 0.0308 4.554 0.6400 9.842
DiffBIR 40.97 0.0234 5.738 0.5804 9.105
DifFace 46.48 0.0329 4.024 0.7411 11.33
BFRffusion 50.93 0.0377 4.963 0.6850 7.210
PMRF (Ours) 49.32 0.0366 4.003 0.6692 6.138
Table 8: Quantitative evaluation of blind face restoration algorithms on the WIDER-Test data set.
Method FID↓ KID↓ NIQE↓ Precision↑ IndRMSE↓
SwinIR (≈ Posterior mean) 91.96 0.0780 10.16 0.1649 0
DOT 82.15 0.0618 7.633 0.4082 14.900
RestoreFormer++ 45.41 0.0209 3.759 0.6505 14.466
RestoreFormer 50.23 0.0251 3.894 0.6505 14.200
CodeFormer 39.27 0.0138 4.164 0.7227 12.185
VQFRv1 44.21 0.0192 3.055 0.5959 17.042
VQFRv2 38.70 0.0157 3.995 0.6381 16.368
GFPGAN 41.28 0.0182 4.450 0.7876 11.840
DiffBIR 35.87 0.0114 5.659 0.6361 11.106
DifFace 37.38 0.0131 4.383 0.7856 10.418
BFRffusion 56.82 0.0307 4.647 0.5825 11.759
PMRF (Ours) 41.27 0.0160 4.073 0.7144 9.2164
Table 9: Quantitative evaluation of blind face restoration algorithms on the WebPhoto-Test data set.
Method FID↓ KID↓ NIQE↓ Precision↑ IndRMSE↓
SwinIR (≈ Posterior mean) 132.1 0.1022 9.638 0.2383 0
DOT 125.6 0.0865 7.397 0.3071 20.69
RestoreFormer++ 75.60 0.0291 4.080 0.6143 18.43
RestoreFormer 77.80 0.0334 4.460 0.6265 11.55
CodeFormer 84.17 0.0406 4.709 0.6830 8.952
VQFRv1 75.57 0.0312 3.608 0.5774 12.53
VQFRv2 83.52 0.0411 4.620 0.5848 14.48
GFPGAN 88.43 0.0494 4.941 0.6781 9.240
DiffBIR 92.82 0.0541 6.069 0.5307 9.152
DifFace 80.05 0.0341 4.405 0.7273 10.31
BFRffusion 84.83 0.0388 5.612 0.5872 7.222
PMRF (Ours) 81.06 0.0347 4.240 0.6462 6.310
Table 10: Quantitative evaluation of blind face restoration algorithms on the CelebAdult-Test data set.
Method FID↓ KID↓ NIQE↓ Precision↑ IndRMSE↓
SwinIR (≈ Posterior mean) 143.80 0.0811 7.477 0.4222 0
DOT 208.54 0.1634 6.018 0.0444 44.24
RestoreFormer++ 103.81 0.0313 4.006 0.5167 11.43
RestoreFormer 103.96 0.0315 4.320 0.5556 14.97
CodeFormer 111.62 0.0427 4.544 0.5722 10.49
VQFRv1 105.59 0.0336 3.756 0.5944 11.14
VQFRv2 104.72 0.0337 3.999 0.6056 18.51
GFPGAN 109.19 0.0395 4.423 0.5111 11.90
DiffBIR 109.74 0.0411 5.650 0.5000 9.853
DifFace 98.780 0.0243 3.901 0.6833 12.66
BFRffusion 103.06 0.0290 4.702 0.6056 8.037
PMRF (Ours) 102.89 0.0293 3.737 0.5500 6.715
Figure 5: Comparison with state-of-the-art blind face restoration methods on inputs from the CelebA-Test data set. Our method produces high perceptual quality while achieving lower distortion overall. Zoom in for best view.
Figure 6: Qualitative results on the real-world LFW-Test data set. Our algorithm produces reconstructions with either better or on-par perceptual quality compared to the state-of-the-art, while maintaining very high consistency with the input measurements. Zoom in for best view.
Figure 7: Qualitative results on the real-world WebPhoto-Test data set. Our algorithm produces reconstructions with either better or on-par perceptual quality compared to the state-of-the-art, while maintaining very high consistency with the input measurements. Zoom in for best view.
Figure 8: Qualitative results on the real-world CelebAdult-Test data set. Our algorithm produces reconstructions with either better or on-par perceptual quality compared to the state-of-the-art, while maintaining very high consistency with the input measurements. Zoom in for best view.
Figure 9: A controlled experiment comparing PMRF with previous methodologies, where we vary the number of steps $K$ in each algorithm (Appendices D, 3, D and D). Specifically, we use $K \in \{5, 10, 20, 50, 100\}$, where a larger marker size corresponds to a larger value of $K$. See Section 5.2 for more details.
Table 11: A comparison of the forward process and training loss of PMRF and the baseline methods from Section 5.2. For the flow from $Y$ algorithm, we have $Y^{\dagger} = Y$ for all tasks besides super-resolution. For the super-resolution task, $Y^{\dagger}$ is obtained by up-scaling $Y$ with nearest-neighbor interpolation so that it matches the dimensionality of $X$.
Method | Forward process | Flow training loss
PMRF (Ours) | $Z_t = tX + (1-t)Z_0$, with $Z_0 = f_{\omega^{*}}(Y) + \sigma_s\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$ | $\min_{\theta} \int_{0}^{1} \mathbb{E}\left[\lVert (X - Z_0) - v_{\theta}(Z_t, t) \rVert^{2}\right] dt$
Flow cond. on $Y$ | $Z_t = tX + (1-t)Z_0$, with $Z_0 \sim \mathcal{N}(0, I)$ | $\min_{\theta} \int_{0}^{1} \mathbb{E}\left[\lVert (X - Z_0) - v_{\theta}(Z_t, t, Y) \rVert^{2}\right] dt$
Flow cond. on $\hat{X}^{*}$ | $Z_t = tX + (1-t)Z_0$, with $Z_0 \sim \mathcal{N}(0, I)$ | $\min_{\theta} \int_{0}^{1} \mathbb{E}\left[\lVert (X - Z_0) - v_{\theta}(Z_t, t, f_{\omega^{*}}(Y)) \rVert^{2}\right] dt$
Flow from $Y$ | $Z_t = tX + (1-t)Z_0$, with $Z_0 = Y^{\dagger} + \sigma_s\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$ | $\min_{\theta} \int_{0}^{1} \mathbb{E}\left[\lVert (X - Z_0) - v_{\theta}(Z_t, t) \rVert^{2}\right] dt$
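The four methods in Table 11 share the same linear forward process and differ only in how the source sample $Z_0$ is drawn and in what the vector field is conditioned on. A minimal pure-Python sketch of that difference, operating on flat lists of floats, might look as follows. The function names (`sample_z0`, `interpolate`, `flow_loss`), the method labels, and the callable `v` standing in for $v_{\theta}$ are illustrative assumptions and are not taken from the authors' code; `x_mmse` stands for a precomputed $f_{\omega^{*}}(Y)$ and `y_up` for $Y^{\dagger}$.

```python
import random

def sample_z0(method, y_up, x_mmse, sigma_s, rng):
    """Draw the source sample Z_0 for one of the four methods in Table 11."""
    noise = [rng.gauss(0.0, 1.0) for _ in y_up]
    if method == "pmrf":                  # Z_0 = f_{omega*}(Y) + sigma_s * eps
        return [m + sigma_s * e for m, e in zip(x_mmse, noise)]
    if method == "flow_from_y":           # Z_0 = Y^dagger + sigma_s * eps
        return [y + sigma_s * e for y, e in zip(y_up, noise)]
    if method in ("flow_cond_y", "flow_cond_xhat"):   # Z_0 ~ N(0, I)
        return noise
    raise ValueError(f"unknown method: {method}")

def interpolate(x, z0, t):
    """Forward process Z_t = t * X + (1 - t) * Z_0."""
    return [t * xi + (1.0 - t) * zi for xi, zi in zip(x, z0)]

def flow_loss(v, x, z0, t, cond=None):
    """Single-sample estimate of ||(X - Z_0) - v(Z_t, t[, cond])||^2."""
    zt = interpolate(x, z0, t)
    pred = v(zt, t) if cond is None else v(zt, t, cond)
    return sum(((xi - zi) - pi) ** 2 for xi, zi, pi in zip(x, z0, pred))
```

In this sketch, the two posterior sampling baselines would pass `cond=y` or `cond=x_mmse` to `flow_loss`, whereas PMRF and flow from $Y$ leave `cond` unset and carry the measurement only through $Z_0$. Averaging `flow_loss` over samples of $(X, Y)$, $\epsilon$, and $t \sim U[0,1]$ gives a Monte Carlo estimate of the integral in the table.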

Algorithm: Flow conditioned on $Y$
Training
    Solve $\theta^{*} \leftarrow \arg\min_{\theta} \mathbb{E}\left[\lVert (X - Z_0) - v_{\theta}(Z_t, t, Y) \rVert^{2}\right]$
        // $Z_t \coloneqq tX + (1-t)Z_0$, $Z_0 \sim \mathcal{N}(0, I)$. $t$ is sampled uniformly from $U[0,1]$.
Inference (using Euler’s method with $K$ steps to solve the ODE)
    Initialize $\hat{x} \sim \mathcal{N}(0, I)$
    for $i \leftarrow 0, \ldots, K-1$ do
        $\hat{x} \leftarrow \hat{x} + \frac{1}{K} v_{\theta^{*}}(\hat{x}, \frac{i}{K}, y)$
            // $y$ is the given degraded measurement
    Return $\hat{x}$
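The inference loop of the flow conditioned on $Y$ can be sketched in a few lines of pure Python. This is only a toy illustration: `sample_conditional` and `toy_field` are hypothetical names, vectors are flat lists of floats, and the measurement is assumed to have the same dimension as the image. `toy_field` stands in for a trained $v_{\theta^{*}}$; it is the velocity field whose trajectories are straight lines ending at `y`, chosen only because its Euler solution is easy to check.

```python
import random

def sample_conditional(v, y, num_steps, rng):
    """Euler integration of dx/dt = v(x, t, y) from t = 0 to t = 1, starting at noise."""
    x_hat = [rng.gauss(0.0, 1.0) for _ in y]          # x_hat ~ N(0, I)
    for i in range(num_steps):
        t = i / num_steps                             # current time i / K
        velocity = v(x_hat, t, y)
        # Euler update: x_hat <- x_hat + (1 / K) * v(x_hat, i / K, y)
        x_hat = [xi + vi / num_steps for xi, vi in zip(x_hat, velocity)]
    return x_hat

def toy_field(x, t, y):
    """Hypothetical stand-in for v_theta*: straight-line flow that ends at y."""
    return [(yi - xi) / (1.0 - t) for xi, yi in zip(x, y)]
```

With `toy_field`, the last Euler step lands on `y` for any number of steps, since $t = i/K$ never reaches 1 inside the loop. A learned vector field would instead produce a sample that depends on both the initial noise and the conditioning input.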

Algorithm: Flow conditioned on $\hat{X}^{*}$
Training
    Stage 1: Solve $\omega^{*} \leftarrow \arg\min_{\omega} \mathbb{E}\left[\lVert X - f_{\omega}(Y) \rVert^{2}\right]$
    Stage 2: Solve $\theta^{*} \leftarrow \arg\min_{\theta} \mathbb{E}\left[\lVert (X - Z_0) - v_{\theta}(Z_t, t, f_{\omega^{*}}(Y)) \rVert^{2}\right]$
        // $Z_t \coloneqq tX + (1-t)Z_0$, $Z_0 \sim \mathcal{N}(0, I)$. $t$ is sampled uniformly from $U[0,1]$.
Inference (using Euler’s method with $K$ steps to solve the ODE)
    Initialize $\hat{x} \sim \mathcal{N}(0, I)$
    for $i \leftarrow 0, \ldots, K-1$ do
        $\hat{x} \leftarrow \hat{x} + \frac{1}{K} v_{\theta^{*}}(\hat{x}, \frac{i}{K}, f_{\omega^{*}}(y))$
            // $y$ is the given degraded measurement
    Return $\hat{x}$
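Stage 1 of this algorithm is a plain MSE regression, whose minimizer over all functions of $Y$ is the posterior mean $\mathbb{E}[X \mid Y]$. A toy scalar example, assuming $X \sim \mathcal{N}(0, 1)$ and $Y = X + \sigma_n N$ with $N \sim \mathcal{N}(0, 1)$, makes this concrete: the posterior mean is then $Y / (1 + \sigma_n^{2})$, so a linear model $f_{\omega}(y) = \omega y$ fitted by least squares should recover $\omega \approx 1 / (1 + \sigma_n^{2})$. The Gaussian setup and the names `make_pairs` and `fit_linear_mmse` are assumptions made for illustration; the paper's $f_{\omega}$ is a neural network trained on images.

```python
import random

def make_pairs(n, noise_std, rng):
    """Sample n pairs (x, y) with x ~ N(0, 1) and y = x + noise_std * N(0, 1)."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x + noise_std * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def fit_linear_mmse(xs, ys):
    """Closed-form minimizer of sum_i (x_i - w * y_i)^2 over the scalar w."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(y * y for y in ys)

rng = random.Random(0)
xs, ys = make_pairs(20000, 1.0, rng)
w = fit_linear_mmse(xs, ys)   # should be close to 1 / (1 + 1.0 ** 2) = 0.5
```

In Stage 2, the fitted predictor would be frozen and its output $f_{\omega^{*}}(Y)$ passed to the vector field as the conditioning input.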

Algorithm: Flow from $Y$
Training
    Solve $\theta^{*} \leftarrow \arg\min_{\theta} \mathbb{E}\left[\lVert (X - Z_0) - v_{\theta}(Z_t, t) \rVert^{2}\right]$
        // $Z_t \coloneqq tX + (1-t)Z_0$, $Z_0 = Y^{\dagger} + \sigma_s\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, and $Y^{\dagger}$ is the up-scaled version of $Y$ that matches the dimensionality of $X$. $t$ is sampled uniformly from $U[0,1]$.
Inference (using Euler’s method with $K$ steps to solve the ODE)
    Initialize $\hat{x} \sim \mathcal{N}(y^{\dagger}, I\sigma_s^{2})$
        // $y^{\dagger}$ is the up-scaled version of the degraded measurement $y$
    for $i \leftarrow 0, \ldots, K-1$ do
        $\hat{x} \leftarrow \hat{x} + \frac{1}{K} v_{\theta^{*}}(\hat{x}, \frac{i}{K})$
    Return $\hat{x}$
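Unlike the two conditional baselines, flow from $Y$ starts the ODE at a noisy copy of the up-scaled measurement rather than at pure noise. The initialization step can be sketched as below for a single-channel image stored as a list of rows. The helper names `upscale_nearest` and `init_from_measurement`, and the restriction to an integer scale factor, are assumptions made for this sketch.

```python
import random

def upscale_nearest(img, factor):
    """Nearest-neighbor up-scaling of a 2-D list by an integer factor."""
    return [[row[c // factor] for c in range(len(row) * factor)]
            for row in img
            for _ in range(factor)]          # repeat every row `factor` times

def init_from_measurement(y, factor, sigma_s, rng):
    """Draw x_hat ~ N(y_dagger, sigma_s^2 * I), where y_dagger is the up-scaled y."""
    y_up = upscale_nearest(y, factor)
    return [[p + sigma_s * rng.gauss(0.0, 1.0) for p in row] for row in y_up]
```

For tasks other than super-resolution, where $Y^{\dagger} = Y$, calling `init_from_measurement` with `factor=1` reduces to adding Gaussian noise of standard deviation $\sigma_s$ to the measurement.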
Figure 10: Visual results on the image colorization task from Section 5.2. Our method outperforms all baselines for any number of inference steps $K$. Zoom in for best view.
Figure 11: Visual results on the image denoising task from Section 5.2. Our method is on-par with flow from $Y$, and outperforms the posterior sampling methods for any number of inference steps $K$. Zoom in for best view.
Figure 12: Visual results on the image inpainting task from Section 5.2. Our method outperforms all baselines for any number of inference steps $K$. Zoom in for best view.
Figure 13: Visual results from Section 5.2 on the image super-resolution task. Our method is on-par with flow from $Y$, and outperforms the posterior sampling methods for any number of inference steps $K$. Zoom in for best view.
Table 12: Training hyper-parameters of the HDiT architecture (Crowson et al., 2024). We use this architecture as the vector field $v_{\theta}$ in PMRF (Section 3), and also in the baseline methods described in Section 5.2 (Appendices D, D and D).
Hyper-parameter | Blind face restoration (Section 5.1) | Controlled experiments (Section 5.2)
Parameters | 160M | 121M
Training Epochs | 3850 | 1000
Batch Size | 256 | 256
Image Size | 512$\times$512 | 256$\times$256
Precision | bfloat16 mixed | bfloat16 mixed
Training Hardware | 16 A100 40GiB | 4 L40 48GiB
Training Time | 12 days | 2.5 days
Patch Size | 4 | 4
Levels (Local + Global Attention) | 2 + 1 | 1 + 1
Depth | (2, 2, 8) | (2, 11)
Widths | (256, 512, 1024) | (384, 768)
Attention Heads (Width / Head Dim) | (4, 8, 16) | (6, 12)
Attention Head Dim | 64 | 64
Neighborhood Kernel Size | 7 | 7
Mapping Depth | 1 | 1
Mapping Width | 768 | 768
Optimizer | AdamW | AdamW
Learning Rate | $5 \cdot 10^{-4}$ | $5 \cdot 10^{-4}$
Learning Rate Scheduler | Multi-step, last 350 epochs | Not applied
AdamW betas | (0.9, 0.95) | (0.9, 0.95)
AdamW eps | $10^{-8}$ | $10^{-8}$
Weight Decay | $10^{-2}$ | $10^{-2}$
EMA Decay | 0.9999 | 0.9999
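Several rows of Table 12 are tied together: the number of attention heads at each level is the level width divided by the head dimension, and the depth and width tuples have one entry per level. The sketch below records a few of the table's values in a small configuration object and derives the heads from the widths. The class `HDiTConfig`, its field names, and the `ema_update` helper are hypothetical and do not mirror the HDiT code base; the EMA rule shown is the common exponential moving average form, assumed here because the table gives only the decay value. The keys of `ADAMW_KWARGS` follow the keyword arguments of `torch.optim.AdamW`, although PyTorch itself is not imported.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HDiTConfig:
    widths: tuple            # channel width per level
    depths: tuple            # number of transformer blocks per level
    head_dim: int = 64       # "Attention Head Dim" row
    patch_size: int = 4
    kernel_size: int = 7     # neighborhood attention kernel size

    def attention_heads(self):
        """Heads per level, computed as width / head_dim."""
        return tuple(w // self.head_dim for w in self.widths)

# Values copied from the two columns of Table 12.
BLIND_FACE_RESTORATION = HDiTConfig(widths=(256, 512, 1024), depths=(2, 2, 8))
CONTROLLED_EXPERIMENTS = HDiTConfig(widths=(384, 768), depths=(2, 11))

# Optimizer settings shared by both columns.
ADAMW_KWARGS = dict(lr=5e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=1e-2)

def ema_update(ema_params, params, decay=0.9999):
    """Assumed EMA rule: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

Calling `attention_heads()` on the two configurations reproduces the "Attention Heads" row, which is a quick consistency check when transcribing the table into a training script.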