Abstract
We characterize proximity operators, that is to say functions that map a vector to a solution of a penalized least-squares optimization problem. Proximity operators of convex penalties have been widely studied and fully characterized by Moreau. They are also widely used in practice with nonconvex penalties such as the \(\ell ^0\) pseudo-norm, yet the extension of Moreau’s characterization to this setting seemed to be a missing element of the literature. We characterize proximity operators of (convex or nonconvex) penalties as functions that are the subdifferential of some convex potential. This is proved as a consequence of a more general characterization of the so-called Bregman proximity operators of possibly nonconvex penalties in terms of certain convex potentials. As a side effect of our analysis, we obtain a test to verify whether a given function is the proximity operator of some penalty, or not. Many well-known shrinkage operators are indeed confirmed to be proximity operators. However, we prove that windowed Group-LASSO and persistent empirical Wiener shrinkage—two forms of a so-called social sparsity shrinkage—are generally not the proximity operator of any penalty; the exception is when they are simply weighted versions of group-sparse shrinkage with non-overlapping groups.
Notes
See Sect. 2.1 for detailed notations and reminders on convex analysis and differentiability in Hilbert spaces.
See Sect. 2.1 for brief reminders on the notion of continuity/differentiability in Hilbert spaces.
A continuous linear operator \(L: {{\mathcal {H}}}\rightarrow {{\mathcal {H}}}\) is symmetric if \({\langle }x, Ly{\rangle } = {\langle }Lx, y{\rangle }\) for each \(x,y \in {{\mathcal {H}}}\). A symmetric continuous linear operator is positive semi-definite if \({\langle }x,Lx{\rangle } \geqslant 0\) for each \(x \in {{\mathcal {H}}}\). This is denoted \(L \succeq 0\). It is positive definite if \({\langle }x,Lx{\rangle } >0\) for each nonzero \(x \in {{\mathcal {H}}}\). This is denoted \(L \succ 0\).
See “Appendix 1” for some reminders on Fréchet derivatives in Hilbert spaces.
For the sake of simplicity, we use the same notation \({\langle }\cdot ,\cdot {\rangle }\) for the inner products \({\langle }x,A(y){\rangle }\) (between elements of \({{\mathcal {H}}}\)) and \({\langle }B(x),y{\rangle }\) (between elements of \({{\mathcal {H}}}'\)). The reader can inspect the proof of Theorem 3 to check that the result still holds if we consider Banach spaces \({{\mathcal {H}}}\) and \({{\mathcal {H}}}'\), \({{\mathcal {H}}}^\star \) and \(({{\mathcal {H}}}')^\star \) their duals, and \(A: {\mathcal {Y}}\rightarrow {{\mathcal {H}}}^\star \), \(B: {{\mathcal {H}}}\rightarrow ({{\mathcal {H}}}')^\star \).
That are explicitly constructed as the proximity operator of a convex l.s.c. penalty, e.g., soft-thresholding.
For a proof, see, e.g., (in French) https://fr.wikipedia.org/wiki/Lemme_de_Cousin section 4.9, version from 13/01/2019.
In general, we may have \(g\ne g_1\) as there is no connectedness assumption on \({\text {dom}}(\theta )\).
The inclusion (29) is true even if f is not a proximity operator.
References
Advani, M., Ganguli, S.: An equivalence between high dimensional Bayes optimal inference and M-estimation. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, pp. 3378–3386. Curran Associates Inc., New York (2016)
Antoniadis, A.: Wavelet methods in statistics: some recent developments and their applications. Stat. Surv. 1, 16–55 (2007)
Bach, F.: Optimization with sparsity-inducing penalties. FNT Mach. Learn. 4(1), 1–106 (2011)
Bakin, S.: Adaptive regression and model selection in data mining problems. Ph.D. thesis, School of Mathematical Sciences, Australian National University (1999)
Bauschke, H.H., Borwein, J.M., Combettes, P.L.: Bregman monotone optimization algorithms. SIAM J. Control Optim. 42(2), 596–636 (2003)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Cham (2017)
Blumensath, T., Davies, M.E.: Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27(3), 265–274 (2009)
Bredies, K., Lorenz, D.A., Reiterer, S.: Minimization of non-smooth, non-convex functionals by iterative thresholding. J. Optim. Theory Appl. 165(1), 78–112 (2014)
Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)
Cai, T.T., Silverman, B.W.: Incorporating information on neighbouring coefficients into wavelet estimation. Sankhyà Indian J. Stat. Ser. B 63, 127–148 (2001). (Special Issue on Wavelets)
Cartan, H.: Cours de calcul différentiel. Collection Méthodes. Editions Hermann (1977)
Censor, Y., Zenios, S.A.: Proximal minimization algorithm with \(d\)-functions. J. Optim. Theory Appl. 73(3), 451–464 (1992)
Combettes, P.L., Pesquet, J.C.: Proximal thresholding algorithm for minimization over orthonormal bases. SIAM J. Optim. 18, 1351–1376 (2007)
Ekeland, I., Turnbull, T.: Infinite-Dimensional Optimization and Convexity. Chicago Lectures in Mathematics. The University of Chicago Press, Chicago (1983)
Févotte, C., Kowalski, M.: Hybrid sparse and low-rank time-frequency signal decomposition. EUSIPCO, pp. 464–468 (2015)
Galbis, A., Maestre, M.: Vector Analysis Versus Vector Calculus. Universitext. Springer, Boston (2012)
Gribonval, R.: Should penalized least squares regression be interpreted as maximum a posteriori estimation? IEEE Trans. Signal Process. 59(5), 2405–2410 (2011)
Gribonval, R., Machart, P.: Reconciling “priors” and “priors” without prejudice? In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26 (NIPS), pp. 2193–2201 (2013). https://papers.nips.cc/paper/4868-reconciling-priors-priors-without-prejudice
Gribonval, R., Nikolova, M.: On Bayesian estimation and proximity operators. Appl. Comput. Harmon. Anal. (2019). https://doi.org/10.1016/j.acha.2019.07.002
Hall, P., Penev, S.I., Kerkyacharian, G., Picard, D.: Numerical performance of block thresholded wavelet estimators. Stat. Comput. 7, 115–124 (1997)
Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms, vol. I. Springer, Berlin (1996)
Kowalski, M., Siedenburg, K., Dörfler, M.: Social sparsity! Neighborhood systems enrich structured shrinkage operators. IEEE Trans. Signal Process. 61, 2498–2511 (2013)
Kowalski, M., Torrésani, B.: Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients. Signal Image Video Process. 3(3), 251–264 (2009)
Kowalski, M., Torrésani, B.: Structured sparsity: from mixed norms to structured shrinkage. In: Gribonval, R. (ed.) SPARS’09: Signal Processing with Adaptive Sparse Structured Representations. Inria Rennes - Bretagne Atlantique, Saint Malo (2009)
Louchet, C., Moisan, L.: Posterior expectation of the total variation model: properties and experiments. SIAM J. Imaging Sci. 6(4), 2640–2684 (2013)
Moreau, J.J.: Proximité et dualité dans un espace Hilbertien. Bull. Soc. Math. France 93, 273–299 (1965)
Nikolova, M.: Estimation of binary images by minimizing convex criteria. In: Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269), pp. 108–112. IEEE Comput. Soc. (1998)
Parekh, A., Selesnick, I.W.: Convex denoising using non-convex tight frame regularization. IEEE Signal Process. Lett. 22, 1786–1790 (2015)
Rockafellar, R.: On the maximal monotonicity of subdifferential mappings. Pac. J. Math. 33(1), 209–216 (1970)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Grundlehren der mathematischen Wissenschaften, vol. 317. Springer, Berlin (1998)
Selesnick, I.W.: Sparse regularization via convex analysis. IEEE Trans. Signal Process. 65(17), 4481–4494 (2017)
Siedenburg, K., Dörfler, M.: Structured sparsity for audio signals. In: Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), Paris (2011)
Siedenburg, K., Kowalski, M., Dörfler, M.: Audio declipping with social sparsity. In: ICASSP 2014: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1577–1581. IEEE (2014)
Thomson, B.S.: Rethinking the elementary real analysis course. Am. Math. Mon. 114, 469–490 (2007)
Varoquaux, G., Kowalski, M., Thirion, B.: Social-sparsity brain decoders: faster spatial sparsity. In: International Workshop on Pattern Recognition in Neuroimaging, Trento (2016)
Villani, C.: Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften: A Series of Comprehensive Studies in Mathematics, vol. 338. Springer, Berlin (2009)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(1), 49–67 (2006)
Acknowledgements
The first author wishes to thank Laurent Condat, Jean-Christophe Pesquet and Patrick-Louis Combettes for their feedback that helped improve an early version of this paper, as well as the anonymous reviewers for many insightful comments that improved it much further.
Additional information
This work and the companion paper [19] are dedicated to the memory of Mila Nikolova, who passed away prematurely in June 2018. Mila dedicated much of her energy to bring the technical content to completion during the spring of 2018. The first author did his best to finalize the papers as Mila would have wished. He should be held responsible for any possible imperfection in the final manuscript.
Appendix: Proofs
The proofs of technical results of Sect. 2 are provided in “Appendix 4” (Theorem 3), “Appendix 5” (Lemma 1), “Appendix 6” (Corollary 3) and “Appendix 7” (Lemma 2). As a preliminary, we give brief reminders on some useful but classical notions in “Appendix 1” to “Appendix 3”.
1.1 Appendix 1 Brief Reminders on (Fréchet) Differentials and Gradients in Hilbert Spaces
Consider \({{\mathcal {H}}},{{\mathcal {H}}}'\) two Hilbert spaces. A function \(\theta : {\mathcal {X}}\rightarrow {{\mathcal {H}}}'\) where \({\mathcal {X}}\subset {{\mathcal {H}}}\) is an open domain is (Fréchet) differentiable at x if there exists a continuous linear operator \(L:{{\mathcal {H}}}\rightarrow {{\mathcal {H}}}'\) such that \(\lim _{h \rightarrow 0}\Vert \theta (x+h)-\theta (x)-L(h)\Vert _{{{\mathcal {H}}}'}/\Vert h\Vert _{{{\mathcal {H}}}} = 0\). The linear operator L is called the differential of \(\theta \) at x and denoted \(D\theta (x)\). When \({{\mathcal {H}}}' = {\mathbb {R}}\), L belongs to the dual of \({{\mathcal {H}}}\), hence there is \(u \in {{\mathcal {H}}}\)—called the gradient of \(\theta \) at x and denoted \(\nabla \theta (x)\)—such that \(L(h) = {\langle }u,h{\rangle },\ \forall \;h \in {{\mathcal {H}}}\).
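For instance (a standard illustration, not part of the original text), the quadratic data-fidelity term used throughout the paper is differentiable everywhere: for a fixed \(y \in {{\mathcal {H}}}\) and \(\theta (x) := \tfrac{1}{2}\Vert y-x\Vert ^2\), expanding the square gives
$$\begin{aligned} \theta (x+h)-\theta (x) = {\langle }x-y,h{\rangle } + \tfrac{1}{2}\Vert h\Vert ^2, \end{aligned}$$
so that \(D\theta (x)(h) = {\langle }x-y,h{\rangle }\) and \(\nabla \theta (x) = x-y\).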
1.2 Appendix 2 Subgradients and Subdifferentials for Possibly Nonconvex Functions
We adopt a gentle definition that is familiar when \(\theta \) is a convex function. Although perhaps less well known to nonexperts, the same definition remains valid when \(\theta \) is nonconvex, see, e.g., [6, Definition 16.1].
Definition 5
Let \(\theta :{{\mathcal {H}}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a proper function. The subdifferential \(\partial \theta (x)\) of \(\theta \) at x is the set of all \(u\in {{\mathcal {H}}}\), called subgradients of \(\theta \) at x, such that
$$\begin{aligned} \theta (x')-\theta (x) \geqslant {\langle }u,x'-x{\rangle }, \quad \forall \;x' \in {{\mathcal {H}}}. \end{aligned}$$
(16)
If \(x\not \in {\text {dom}}(\theta )\), then \(\partial \theta (x)=\varnothing \). The function \(\theta \) is subdifferentiable at \(x \in {{\mathcal {H}}}\) if \(\partial \theta (x) \ne \varnothing \). The domain of \(\partial \theta \) is \({\text {dom}}(\partial \theta ) := \{x \in {{\mathcal {H}}}, \partial \theta (x) \ne \varnothing \}\). It satisfies \({\text {dom}}(\partial \theta ) \subset {\text {dom}}(\theta )\).
Fact 1
When \(\partial \theta (x)\ne \varnothing \), the inequality in (16) is trivial for each \(x'\not \in {\text {dom}}(\theta )\) since it amounts to \(+\infty = \theta (x')-\theta (x) \geqslant {\langle } u,x'-x{\rangle }\).
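As an elementary illustration (added here, not part of the original text), take \({{\mathcal {H}}}={\mathbb {R}}\) and \(\theta (x)=\lambda |x|\) with \(\lambda >0\). The inequality (16) gives
$$\begin{aligned} \partial \theta (x) = {\left\{ \begin{array}{ll} \{\lambda \, \mathrm {sign}(x)\} &{} \text {if } x \ne 0,\\ {[}-\lambda ,\lambda {]} &{} \text {if } x = 0, \end{array}\right. } \end{aligned}$$
so a nonsmooth convex penalty can be subdifferentiable everywhere while having a set-valued subdifferential at the origin.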
Definition 5 leads to the well-known Fermat’s rule [6, Theorem 16.3]
Theorem 5
Let \(\theta :{{\mathcal {H}}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a proper function. A point \(x\in {\text {dom}}(\theta )\) is a global minimizer of \(\theta \) if and only if
$$\begin{aligned} 0 \in \partial \theta (x). \end{aligned}$$
If \(\theta \) has a global minimizer at x, then by Theorem 5 the set \(\partial \theta (x)\) is non-empty. However, \(\partial \theta (x)\) can be empty, e.g., at local minimizers that are not the global minimizer:
Example 7
Let \(\theta (x)=\frac{1}{2}x^2-\cos (\pi x)\). The global minimum of \(\theta \) is reached at \(x=0\), where \(\partial \theta (x)= \{\theta '(x)\}=\{0\}\). At \(x \approx \pm 1.8\), \(\theta \) has local minimizers where \(\partial \theta (x)=\varnothing \) (even though \(\theta \) is \({{\mathcal {C}}}^\infty \)). For \(|x|<0.53\), one has \(\partial \theta (x)=\{\nabla \theta (x)\}\) with \(\theta ''(x)\geqslant 0\), and for \(0.54< |x| < 1.91\), \(\partial \theta (x)=\varnothing \).
The proof of the following lemma is a standard exercise in convex analysis [6, Exercise 16.8].
Lemma 3
Let \(\theta :{{\mathcal {H}}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a proper function such that (a) \({\text {dom}}(\theta )\) is convex and (b) \(\partial \theta (x)\ne \varnothing \) for each \(x \in {\text {dom}}(\theta )\). Then, \(\theta \) is a convex function.
Definition 6
(Lower convex envelope of a function) Let \(\theta : {{\mathcal {H}}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be proper with \({\text {dom}}(\partial \theta ) \ne \varnothing \). Its lower convex envelope (see footnote 7), denoted \(\breve{\theta }\), is the pointwise supremum of all the convex lower-semicontinuous functions minorizing \(\theta \):
$$\begin{aligned} \breve{\theta }(x) := \sup \left\{ g(x):\ g \text { convex l.s.c.},\ g \leqslant \theta \right\} , \quad x \in {{\mathcal {H}}}. \end{aligned}$$
The function \(\breve{\theta }\) is proper, convex and lower-semicontinuous. It satisfies \(\breve{\theta } \leqslant \theta \).
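As a concrete example (added here for illustration, not part of the original text), take \({{\mathcal {H}}}={\mathbb {R}}\) and the double-well function \(\theta (x)=(x^2-1)^2\). Its lower convex envelope is
$$\begin{aligned} \breve{\theta }(x) = {\left\{ \begin{array}{ll} 0 &{} \text {if } |x| \leqslant 1,\\ (x^2-1)^2 &{} \text {if } |x| > 1. \end{array}\right. } \end{aligned}$$
Here \(\partial \theta (1/2) = \varnothing \) since \(\theta (1/2) > \breve{\theta }(1/2) = 0\), whereas \(\theta \) and \(\breve{\theta }\) coincide, with the same subdifferential, at every \(x_0\) with \(|x_0| \geqslant 1\), in accordance with Proposition 3 below.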
Proposition 3
Let \(\theta : {{\mathcal {H}}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be proper with \({\text {dom}}(\partial \theta ) \ne \varnothing \). For any \(x_{0} \in {\text {dom}}(\partial \theta )\), we have \(\breve{\theta }(x_0) = \theta (x_0)\), \(\partial \theta (x_0) = \partial \breve{\theta }(x_0)\).
Proof
As \(\partial \theta (x_0) \ne \varnothing \), by [6, Proposition 13.45], \(\breve{\theta }\) is the so-called biconjugate \(\theta ^{**}\) of \(\theta \) [6, Definition 13.1]. Moreover, [6, Proposition 16.5] yields \(\theta ^{**}(x_{0}) = \theta (x_{0})\) and \(\partial \theta ^{**}(x_{0}) = \partial \theta (x_{0})\).
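For the reader's convenience (standard definitions recalled here, not part of the original text), the conjugate and biconjugate of a proper function \(\theta \) are
$$\begin{aligned} \theta ^{*}(u) := \sup _{x \in {{\mathcal {H}}}}\left\{ {\langle }u,x{\rangle } - \theta (x)\right\} , \qquad \theta ^{**}(x) := \sup _{u \in {{\mathcal {H}}}}\left\{ {\langle }u,x{\rangle } - \theta ^{*}(u)\right\} , \end{aligned}$$
so that \(\theta ^{**}\) is convex and l.s.c. as a pointwise supremum of continuous affine functions, and \(\theta ^{**} \leqslant \theta \).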
We need to adapt [6, Proposition 17.31] to the case where \(\theta \) is proper but possibly nonconvex, with a stronger assumption of Fréchet (instead of Gâteaux) differentiability.
Proposition 4
If \(\partial \theta (x) \ne \varnothing \) and \(\theta \) is (Fréchet) differentiable at x, then \(\partial \theta (x) = \{\nabla \theta (x)\}\).
Proof
Consider \(u \in \partial \theta (x)\). As \(\theta \) is differentiable at x, there is an open ball \({\mathcal {B}}\) centered at 0 such that \(x+h \in {\text {dom}}(\theta )\) for each \(h \in {\mathcal {B}}\). For each \(h \in {\mathcal {B}}\), Definition 5 yields
$$\begin{aligned} \theta (x+h)-\theta (x) \geqslant {\langle }u,h{\rangle } \quad \text {and}\quad \theta (x-h)-\theta (x) \geqslant -{\langle }u,h{\rangle }, \end{aligned}$$
hence \(-(\theta (x-h)-\theta (x)) \leqslant {\langle }u,h{\rangle } \leqslant \theta (x+h)-\theta (x)\). Since \(\theta \) is Fréchet differentiable at x, \(\theta (x\pm h)-\theta (x) = \pm {\langle }\nabla \theta (x),h{\rangle } + o(\Vert h\Vert )\), so that
$$\begin{aligned} {\langle }\nabla \theta (x),h{\rangle } - o(\Vert h\Vert ) \leqslant {\langle }u,h{\rangle } \leqslant {\langle }\nabla \theta (x),h{\rangle } + o(\Vert h\Vert ), \end{aligned}$$
hence \({\langle }u-\nabla \theta (x),h{\rangle } = o(\Vert h\Vert )\), \(\forall \;h\in {\mathcal {B}}\). This shows that \(u = \nabla \theta (x)\).
1.3 Appendix 3 Characterizing Functions with a Given Subdifferential
Corollary 9 below generalizes a result of Moreau [26, Proposition 8.b] characterizing functions by their subdifferential. It shows that one only needs the subdifferentials to intersect. We begin in dimension one.
Lemma 4
Consider \(a_0,a_1: {\mathbb {R}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) convex functions such that \({\text {dom}}(a_i) = {\text {dom}}(\partial a_i) = [0,1]\) and \(\partial a_0(t) \cap \partial a_1(t) \ne \varnothing \) on [0, 1]. Then, there exists a constant \(K \in {\mathbb {R}}\) such that \(a_1(t)-a_0(t)=K\) on [0, 1].
Proof
As \(a_i\) is convex, it is continuous on (0, 1) [21, Theorem 3.1.1, p16]. Moreover, by [21, Proposition 3.1.2] we have \(a_{i}(0) \geqslant \lim _{t \rightarrow 0, t>0} a_{i}(t) =:a_{i}(0_+)\), and since \(\partial a_{i}(0) \ne \varnothing \), there is \(u_{i} \in \partial a_{i}(0)\) such that \(a_{i}(t) \geqslant a_{i}(0) + u_{i}(t-0)\) for each \(t \in [0,1]\) hence \(a_{i}(0_+) \geqslant a_{i}(0)\). This shows that \(a_{i}(0_+) = a_{i}(0)\), and similarly \( \lim _{t \rightarrow 1, t<1} a_{i}(t) = a_{i}(1)\), hence \(a_{i}\) is continuous on [0, 1] relative to [0, 1]. In addition, \(a_{i}\) is differentiable on [0, 1] except on a countable set \(B_i \subset [0,1]\) [21, Theorem 4.2.1 (ii)].
For \(t \in [0,1] \backslash (B_{0} \cup B_{1})\) and \(i \in \{0,1\}\), Proposition 4 yields \(\partial a_i(t) = \{a'_i(t)\}\), hence the function \(\delta := a_1-a_0\) is continuous on [0, 1] and differentiable on \([0,1] \backslash (B_0 \cup B_1)\). For \(t \in [0,1] \backslash (B_0 \cup B_1)\), \(\{a'_0(t)\} \cap \{a'_1(t)\} = \partial a_0(t) \cap \partial a_1(t) \ne \varnothing \), hence \(a'_0(t)=a'_1(t)\) and \(\delta '(t) = 0\). A classical exercise in real analysis (see footnote 8) [34, Example 4] is to show that if a function f is continuous on an interval, and differentiable with zero derivative except on a countable set, then f is constant. As \(B_0 \cup B_1\) is countable, it follows that \(\delta \) is constant on (0, 1). As it is continuous on [0, 1], it is constant on [0, 1].
Corollary 9
Let \(\theta _0,\theta _1: {{\mathcal {H}}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be proper and \({{\mathcal {C}}}\subset {{\mathcal {H}}}\) a non-empty polygonally connected set. Assume that for each \(z \in {{\mathcal {C}}}\), \(\partial \theta _0(z) \cap \partial \theta _1(z) \ne \varnothing \); then, there is a constant \(K \in {\mathbb {R}}\) such that \(\theta _1(x) -\theta _0(x) = K\), \(\forall \;x \in {{\mathcal {C}}}\).
Remark 4
Note that the functions \(\theta _{i}\) and the set \({{\mathcal {C}}}\) are not assumed to be convex.
Proof
The proof is in two parts.
(i) Assume that \({{\mathcal {C}}}\) is convex and fix some \(x^* \in {{\mathcal {C}}}\). Consider \(x \in {{\mathcal {C}}}\), and define \(a_i(t):= \theta _i(x^*+t(x-x^*))\), for \(i=0,1\) and each \(t \in [0,1]\), and \(a_i(t)=+\infty \) if \(t\not \in [0,1]\). As \({{\mathcal {C}}}\) is convex, \(z_t := x^*+t(x-x^*) \in {{\mathcal {C}}}\), hence for each \(t \in [0,1]\) there exists \(u_t \in \partial \theta _0(z_t) \cap \partial \theta _1(z_t)\). By Definition 5, for each \(t,t' \in [0,1]\),
$$\begin{aligned}&a_i(t')-a_i(t) = \theta _i(x^*+t'(x-x^*))-\theta _i(x^*+t(x-x^*)) \\&\quad \geqslant {\langle }u_t,(t'-t)(x-x^*){\rangle } ={\langle }u_t,x-x^*{\rangle }(t'-t). \end{aligned}$$
For \(t \in [0,1]\) and \(t' \in {\mathbb {R}}\backslash [0,1]\), since \(a_{i}(t') = +\infty \) the inequality \(a_i(t')-a_i(t) \geqslant {\langle }u_t,x-x^*{\rangle }(t'-t)\) also obviously holds, hence \({\langle }u_t,x-x^*{\rangle } \in \partial a_i(t)\), \(i=0,1\). Thus, \(\partial a_i(t)\ne \varnothing \) for each \(t \in [0,1]\), so by Lemma 3, \(a_i\) is convex on [0, 1] for \(i=0,1\), and \({\langle }u_t,x-x^*{\rangle } \in \partial a_0(t) \cap \partial a_1(t)\) for each \(t \in [0,1]\). By Lemma 4, there exists \(K \in {\mathbb {R}}\) such that \(a_1(t)-a_0(t)= K\) for each \(t \in [0,1]\). Therefore,
$$\begin{aligned} \theta _1(x)-\theta _0(x) = a_1(1)-a_0(1) = a_1(0) -a_0(0) = \theta _1(x^*) -\theta _0(x^*) = K. \end{aligned}$$
As this holds for each \(x \in {{\mathcal {C}}}\), the result is established whenever \({{\mathcal {C}}}\) is convex.
(ii) Now, we prove the result when \({{\mathcal {C}}}\) is polygonally connected. Fix some \(x^* \in {{\mathcal {C}}}\) and define \(K:= \theta _1(x^*)-\theta _0(x^*)\). Consider \(x \in {{\mathcal {C}}}\): by the definition of polygonal connectedness, there exist an integer \(n \geqslant 1\) and \(x_j \in {{\mathcal {C}}}\), \(0 \leqslant j \leqslant n\), with \(x_0 = x^*\) and \(x_n = x\), such that the (convex) segments \({{\mathcal {C}}}_j = [x_j,x_{j+1}] = \{t x_j + (1-t) x_{j+1}, t \in [0,1]\}\) satisfy \({{\mathcal {C}}}_j \subset {{\mathcal {C}}}\). Since each \({{\mathcal {C}}}_j\) is convex, the result established in (i) implies that \(\theta _1(x_{j+1})-\theta _0(x_{j+1}) = \theta _1(x_j)-\theta _0(x_j)\) for \(0 \leqslant j < n\). This shows that \(\theta _1(x)-\theta _0(x) = \theta _1(x_{n})-\theta _0(x_{n}) = \cdots = \theta _1(x_{0})-\theta _0(x_{0}) = \theta _1(x^*)-\theta _0(x^*) =K\).
1.4 Appendix 4 Proof of Theorem 3
The indicator function of a set \({\mathcal {S}}\) is denoted \(\chi _{{\mathcal {S}}}\) and defined by
$$\begin{aligned} \chi _{{\mathcal {S}}}(x) := {\left\{ \begin{array}{ll} 0 &{} \text {if } x \in {\mathcal {S}},\\ +\infty &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(ai) \(\Rightarrow \) (aii) We introduce the function \(\theta :{{\mathcal {H}}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) by
Consider \(x \in {\text {Im}}(f)\). By definition, \(x= f(y)\) where \(y \in {\mathcal {Y}}\), hence by (ai) x is a global minimizer of \(x'\mapsto \left\{ D(x',y)+\varphi (x')\right\} \). Therefore, we have
which is equivalent to
meaning that \(A(y)\in \partial \theta \left( f(y)\right) \). As this holds for each \(y \in {\mathcal {Y}}\) such that \(f(y)=x\), we get \(A(f^{-1}(x))\subset \partial \theta (x)\). Consider \(g_1 := \breve{\theta } \) according to Definition 6. Since \(g_1\) is convex l.s.c. and
$$\begin{aligned} \varnothing \ne A(f^{-1}(x)) \subset \partial \theta (x), \quad \forall \;x \in {\text {Im}}(f), \end{aligned}$$
(22)
by Proposition 3, \(\partial \theta (x) = \partial g_1(x)\) and \(\theta (x) = g_{1}(x)\) for each \(x \in {\text {Im}}(f)\). This establishes (aii) with \(g := g_1 = \breve{\theta }\).
(aii) \(\Rightarrow \) (ai) Set \(\theta _1: = g+\chi _{{\text {Im}}(f)}\). By (aii), \(\partial g(x) \ne \varnothing \) for each \(x \in {\text {Im}}(f)\). Since \({\text {dom}}(\partial g) \subset {\text {dom}}(g)\), it follows that \({\text {Im}}(f) \subset {\text {dom}}(g)\) and consequently
Consider \(y \in {\mathcal {Y}}\) and \(x:=f(y)\) so that \(x \in {\text {Im}}(f)\), hence \(\theta _1(x)=g(x)\) and \(A(y)\in A(f^{-1}(x)) \subset \partial g(x)\) where the inclusion comes from (aii). It follows that for each \((x,x')\in {\text {Im}}(f) \times {{\mathcal {H}}}\), one has
showing that \(A(y)\in \partial \theta _1(x)\). This is equivalent to (21) with \(\theta := \theta _1\), and since \({\text {dom}}(\theta _1) = {\text {Im}}(f)\), the inequality in (20) holds with \(\varphi (x) := \theta _1(x)-b(x)\), i.e., x is a global minimizer of \(D(x',y)+\varphi (x')\). Since this holds for each \(y \in {\mathcal {Y}}\), this establishes (ai) with \(\varphi := \theta _1-b = g-b+\chi _{{\text {Im}}(f)}\).
(b) Consider \(\varphi \) and g satisfying (ai) and (aii), respectively. Let \(g_1 := \breve{\theta }\) (see footnote 9), with \(\theta \) defined in (19). Following the arguments of (ai) \(\Rightarrow \) (aii), we obtain that \(g_1\) (just as g) satisfies (aii). For each \(x \in {{\mathcal {C}}}\), we thus have \(\partial g(x) \cap \partial g_1(x) \supset A(f^{-1}(x)) \ne \varnothing \) with \(g,g_1\) convex l.s.c. functions. Hence, by Corollary 9, since \({{\mathcal {C}}}\) is polygonally connected, there is a constant K such that \(g(x) = g_1(x)+K\), \(\forall \;x \in {{\mathcal {C}}}\). To establish the relation (2) between g and \(\varphi \), we now show that \(g_1(x) = b(x) + \varphi (x)\) on \({{\mathcal {C}}}\). By (22) and Proposition 3, we have \(\breve{\theta }(x)=\theta (x)\) for each \(x \in {\text {Im}}(f)\), hence as \({{\mathcal {C}}}\subset {\text {Im}}(f)\) we obtain \(g_1(x) := \breve{\theta }(x) = \theta (x) = b(x)+\varphi (x)\) for each \(x \in {{\mathcal {C}}}\). This establishes (2).
(ci) \(\Rightarrow \) (cii) Define
Consider \(y \in {\mathcal {Y}}\). From (ci), for each \(y'\) the global minimizer of \(x \mapsto {\widetilde{D}}(x,y')+\varphi (x)\) is reached at \(x'=f(y')\). Hence, for \(x = f(y)\) we have
Using this inequality, we obtain that
This shows that
Set \(\psi _1 := \breve{\varrho } \) according to Definition 6. Then, the function \(\psi _1\) is convex l.s.c. and for each \(y \in {\mathcal {Y}}\) the function B(f(y)) is well-defined, so \(\partial \varrho (y) \ne \varnothing \). Hence, by Proposition 3, \(\partial \varrho (y) =\partial \breve{\varrho }(y)= \partial \psi _1(y)\) and \(\varrho (y)=\breve{\varrho }(y) = \psi _{1}(y)\) for each \(y \in {\mathcal {Y}}\). This establishes (cii) with \(\psi := \psi _1 =\breve{\varrho }\).
(cii) \(\Rightarrow \) (ci) Define \(h:{\mathcal {Y}}\rightarrow {\mathbb {R}}\) by
Since \(B(f(y')) \in \partial \psi (y')\) with \(\psi \) convex by (cii), applying Definition 5 to \(\partial \psi \) yields \(\psi (y) - \psi (y') \geqslant {\langle }y-y',B(f(y')){\rangle }\). Using this inequality, one has
Noticing that for each \(x \in {\text {Im}}(f)\) there is \(y \in {\mathcal {Y}}\) such that \(x=f(y)\), we can define \(\theta :{{\mathcal {H}}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) obeying \({\text {dom}}(\theta )={\text {Im}}(f)\) by
For \(x \in {\text {Im}}(f)\), as \(f(y)=f(y')=x\) for each \(y,y' \in f^{-1}(x)\), applying (25) yields \(h(y')-h(y) \geqslant 0\). By symmetry \(h(y')=h(y)\), hence the definition of \(\theta (x)\) does not depend on which \(y \in f^{-1}(x)\) is chosen.
For \(x'\in {\text {Im}}(f) \), we write \(x'=f(y')\). Using (25) and the definition of \(\theta \) yields
that is to say
This also trivially holds for \(x' \notin {\text {Im}}(f)\). Setting \(\varphi (x):= \theta (x)-b(x)\) for each \(x \in {{\mathcal {H}}}\), and replacing \(\theta \) by \(b+\varphi \) in the inequality above yields
showing that \(f(y) \in \arg \min _{x'} \{{\widetilde{D}}(x',y)+\varphi (x')\}\). As this holds for each \(y \in {\mathcal {Y}}\), \(\varphi \) satisfies (ci).
(d) Consider \(\varphi \) and \(\psi \) satisfying (ci) and (cii), respectively. Using the arguments of (ci) \(\Rightarrow \) (cii), the function \(\psi _1 := \breve{\varrho }\) with \(\varrho \) defined in (23) satisfies (cii). As \(\psi \) and \(\psi _1\) both satisfy (cii), for each \(y \in {{\mathcal {C}}}'\) we have \(B(f(y)) \in \partial \psi (y) \cap \partial \psi _1(y)\), so this intersection is non-empty, with \(\psi ,\psi _1\) convex l.s.c. functions. Hence, by Corollary 9, since \({{\mathcal {C}}}'\) is polygonally connected, there is a constant \(K'\) such that \(\psi (y) = \psi _1(y)+K'\), \(\forall \;y \in {{\mathcal {C}}}'\). By (24), \(\partial \varrho (y) \ne \varnothing \) for each \(y \in {\mathcal {Y}}\), hence by Proposition 3 we have \(\breve{\varrho }(y) = \varrho (y)\) for each \(y \in {\mathcal {Y}}\). As \({{\mathcal {C}}}' \subset {\mathcal {Y}}\), it follows that \(\psi _1(y) = \breve{\varrho }(y) = \varrho (y)\) for each \(y \in {{\mathcal {C}}}'\). This establishes (3).
1.5 Appendix 5 Proof of Lemma 1
Proof
Without loss of generality, we prove the equivalence for the convex envelope \(\breve{\theta }\) instead of \(\theta \): Indeed by Proposition 3, since \(\partial \theta (x) \ne \varnothing \) on \({\mathcal {X}}\) we have \(\breve{\theta }(x) = \theta (x)\) and \(\partial \breve{\theta }(x) = \partial \theta (x)\) on \({\mathcal {X}}\).
(a) \(\Rightarrow \) (b). By [6, Prop 17.41(iii)\(\Rightarrow \)(i)], as \(\breve{\theta }\) is convex l.s.c. and \(\varrho \) is a selection of its subdifferential which is continuous at each \(x \in {\mathcal {X}}\), \(\breve{\theta }\) is (Fréchet) differentiable at each \(x \in {\mathcal {X}}\). By Proposition 4, we get \(\partial \breve{\theta }(x) = \{\nabla \breve{\theta }(x)\} = \{\varrho (x)\}\) on \({\mathcal {X}}\). Since \(\varrho \) is continuous, \(x \mapsto \nabla \breve{\theta }(x)\) is continuous on \({\mathcal {X}}\).
(b) \(\Rightarrow \) (a). Since \(\breve{\theta }\) is differentiable on \({\mathcal {X}}\), by Proposition 4 we have \(\partial \breve{\theta }(x) = \{\nabla \breve{\theta }(x)\}\) on \({\mathcal {X}}\). By (9), it follows that \(\varrho (x) = \nabla \breve{\theta }(x)\) on \({\mathcal {X}}\). Since \(\nabla \breve{\theta }\) is continuous on \({\mathcal {X}}\), so is \(\varrho \).
1.6 Appendix 6 Proof of Corollary 3
By Theorem 2, as \({\mathcal {Y}}\) is open and convex and f is \(C^1({\mathcal {Y}})\) with Df(y) symmetric positive semi-definite for each \(y \in {\mathcal {Y}}\), there exist a function \(\varphi _0\) and a convex l.s.c. function \(\psi \in C^2({\mathcal {Y}})\) such that \(\nabla \psi (y) = f(y) \in \text {prox}_{\varphi _0}(y)\) for each \(y \in {\mathcal {Y}}\). We define \(\varphi (x) := \varphi _0(x) + \chi _{{\text {Im}}(f)}(x)\) and leave it to the reader to check that \(f(y) \in \text {prox}_\varphi (y)\) for each \(y \in {\mathcal {Y}}\). By construction, \({\text {dom}}(\varphi ) = {\text {Im}}(f)\).
Uniqueness of the Global Minimizer. Consider any function \(\widetilde{f}\) such that \(\widetilde{f}(y) \in \text {prox}_{\varphi }(y)\) for each y. This implies
By Corollary 1, there is a convex l.s.c. function \(\widetilde{\psi }\) such that \(\widetilde{f}(y) \in \partial \widetilde{\psi }(y)\) for each \(y \in {\mathcal {Y}}\). Since \({\mathcal {Y}}\) is convex, it is polygonally connected hence by Theorem 4(b) and (26) there are \(K,K' \in {\mathbb {R}}\) such that
Thus, \(\widetilde{\psi }\) is also \(C^2({\mathcal {Y}})\) and \(\widetilde{f}(y) \in \partial \widetilde{\psi }(y) = \{\nabla \psi (y)\} = \{f(y)\}\) for each \(y \in {\mathcal {Y}}\). This shows that \(\widetilde{f}(y)=f(y)\) for each y, hence f(y) is the unique global minimizer on \({{\mathcal {H}}}\) of \(x \mapsto \tfrac{1}{2}\Vert y-x\Vert ^2+\varphi (x)\), i.e., \(\text {prox}_\varphi (y) = \{f(y)\}\).
Injectivity of f. The proof follows that of [17, Lemma 1]. Given \(y \ne y'\), define \(v := y'-y \ne 0\) and \(\theta (t) := {\langle }f(y+tv),v{\rangle }\) for \(t \in [0,1]\). As \({\mathcal {Y}}\) is convex, this is well-defined. As \(f \in {{\mathcal {C}}}^1({\mathcal {Y}})\) and \(Df(y+tv) \succ 0\), the function \(\theta \) is \(C^1([0,1])\) with \(\theta '(t) = {\langle } Df(y+tv)\ v,v{\rangle } > 0\) for each t. If we had \(f(y) = f(y')\), then \(\theta (0)=\theta (1)\) and Rolle’s theorem would yield \(t \in (0,1)\) such that \(\theta '(t)=0\), contradicting the fact that \(\theta '(t)>0\).
Differentiability of \(\varvec{\varphi }\). If Df(y) is boundedly invertible for each \(y \in {\mathcal {Y}}\), then by the inverse function theorem \({\text {Im}}(f)\) is open and \(f^{-1}: {\text {Im}}(f) \rightarrow {\mathcal {Y}}\) is \(C^{1}\). Given \(x \in {\text {Im}}(f)\), denoting \(u := f^{-1}(x)\), (27) yields
Since \(\psi \) is \(C^{2}\) and \(f^{-1}\) is \(C^{1}\), it follows that \(\varphi \) is \(C^{1}\).
Global Minimum is the Unique Critical Point. The proof is inspired by that of [17, Theorem 1]. Consider x a critical point of \(\theta : x \mapsto \tfrac{1}{2}\Vert y-x\Vert ^2+\varphi (x)\), i.e., since \(\varphi \) is \(C^{1}\), a point where \(\nabla \theta (x)=0\). Since \({\text {dom}}(\varphi ) = {\text {Im}}(f)\), there is some \(v \in {\mathcal {Y}}\) such that \(x = f(v)\). Moreover, as \(\varphi \) is \(C^{1}\) on the open set \({\text {Im}}(f)\), the gradient \(\nabla \theta (x)\) is well-defined and \(\nabla \theta (x)=0\). On the one hand, denoting \(\varrho (u):= (\theta \circ f)(u) = \tfrac{1}{2}\Vert y-f(u)\Vert ^2+\varphi (f(u))\) we have \(\nabla \varrho (u) = Df(u) \nabla \theta (f(u))\) for each \(u \in {\mathcal {Y}}\). On the other hand, for each \(u \in {\mathcal {Y}}\), as \(f(u) = \nabla \psi (u)\) we also have
For \(u=v\), we get \(Df(v)\ (v-y) = \nabla \varrho (v) = Df(v) \nabla \theta (f(v)) = Df(v)\ \nabla \theta (x) = 0\). As \(Df(v) \succ 0\), this implies \(v=y\), hence \(x=f(y)\).
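As a simple sanity check (a one-dimensional illustration added here, not part of the original text), take \({{\mathcal {H}}}={\mathcal {Y}}={\mathbb {R}}\), \(\lambda > 0\) and \(f(y) = y/(1+\lambda )\). Then \(f \in C^1({\mathbb {R}})\) with \(Df(y) = 1/(1+\lambda ) \succ 0\), the convex potential \(\psi (y) = y^2/(2(1+\lambda ))\) satisfies \(\nabla \psi = f\), and minimizing \(x \mapsto \tfrac{1}{2}(y-x)^2 + \tfrac{\lambda }{2}x^2\) yields the unique critical point \(x = y/(1+\lambda ) = f(y)\), so that \(\text {prox}_{\varphi }(y) = \{f(y)\}\) with \(\varphi (x) = \tfrac{\lambda }{2}x^2\), \({\text {Im}}(f) = {\mathbb {R}}\), and all conclusions of Corollary 3 can be verified by hand.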
1.7 Appendix 7 Proof of Lemma 2
As a preliminary, let us compute the entries of the \(n \times n\) matrix associated to Df(y):
Note that if \(\Vert \text {diag}(w^{i})y\Vert _{2}=\lambda \), then f may not be differentiable at y; this case will not be useful below.
The proof exploits Corollary 2 which shows that if f is a proximity operator, then Df(y) is symmetric in each open set where it is well-defined.
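Before turning to the formal argument, the following minimal numerical sketch (ours, not part of the paper) probes this symmetry test with a finite-difference Jacobian; the shrinkage formula \(f_i(y) = y_i\,(1-\lambda /\Vert \text {diag}(w^i)y\Vert _{2})_{+}\) used below is only an assumed windowed-group-LASSO-type instance chosen for illustration. Overlapping neighborhoods with distinct weight vectors produce a visibly asymmetric Jacobian, whereas non-overlapping groups sharing a common weight vector do not.

```python
# Minimal numerical sketch (not from the paper) of the symmetry test behind
# Corollary 2: if f is a proximity operator, then Df(y) must be symmetric
# wherever f is differentiable. The shrinkage below, with the assumed form
# f_i(y) = y_i * (1 - lam / ||diag(w^i) y||_2)_+, is used only as an example.
import numpy as np

def shrink(y, W, lam):
    """Candidate shrinkage; row i of W holds the weights w^i (with w^i_i > 0)."""
    norms = np.sqrt((W ** 2 * y ** 2).sum(axis=1))   # ||diag(w^i) y||_2 for each i
    gains = np.maximum(0.0, 1.0 - lam / norms)       # all norms stay above lam here
    return y * gains

def jacobian(f, y, eps=1e-6):
    """Central finite-difference Jacobian of f at y."""
    n = y.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(y + e) - f(y - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
y = 5.0 + rng.random(4)        # entries in [5, 6): keeps every norm well above lam
lam = 1.0

# Overlapping neighborhoods: w^i is supported on {i, (i+1) mod 4}, with unequal weights.
W_overlap = np.eye(4) + 0.5 * np.roll(np.eye(4), 1, axis=1)
# Non-overlapping groups {0,1} and {2,3}, one shared weight vector per group.
W_groups = np.zeros((4, 4))
W_groups[:2, :2] = 1.0
W_groups[2:, 2:] = 1.0

for name, W in [("overlapping", W_overlap), ("non-overlapping", W_groups)]:
    J = jacobian(lambda z: shrink(z, W, lam), y)
    print(name, "max |Df - Df^T| =", np.abs(J - J.T).max())
# Expected outcome: a clearly nonzero asymmetry in the overlapping case (so this f
# cannot be a proximity operator), and asymmetry only at finite-difference accuracy
# in the non-overlapping (group-LASSO) case.
```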
Let f be a generalized social shrinkage operator as described in Lemma 2, and let \({\mathcal {G}} = \{G_{1},\ldots ,G_{p}\}\) be the partition of \(\llbracket 1,n \rrbracket \) into the disjoint groups given by the equivalence classes of the relation: for \(i,j \in \llbracket 1,n \rrbracket \), \(i \sim j\) if and only if \(w^i = w^j\). Given \(G \in {\mathcal {G}}\), denote by \(w^G\) the weight vector shared by all \(i \in G\). We show that if f is a proximity operator, then \(\text {supp}(w^{G}) = G\) for each \(G \in {\mathcal {G}}\).
For \(i \in G\), by Definition 4 we have \(i \in N_i = \text {supp}(w^i) = \text {supp}(w^G)\), establishing that (see footnote 10)
$$\begin{aligned} G \subset \text {supp}(w^{G}). \end{aligned}$$
(29)
From now on, we assume that f is a proximity operator, and consider a group \(G \in {\mathcal {G}}\). To prove that \(G = \text {supp}(w^G)\), we will establish that for each \(i,j \in \llbracket 1,n \rrbracket \)
$$\begin{aligned} \text {either}\quad w^{i}_{j} = w^{j}_{i} = 0, \quad \text {or}\quad \Vert \text {diag}(w^{i})y\Vert _{2} = \Vert \text {diag}(w^{j})y\Vert _{2}\ \text { for every } y \in {\mathbb {R}}^{n}. \end{aligned}$$
(30)
To see why this allows us to conclude, consider \(j \in \text {supp}(w^G)\) and \(i \in G\). As \(N_i := \text {supp}(w^i) = \text {supp}(w^G)\), we obtain that \(j \in N_i\), i.e., \(w^i_j \ne 0\). By (30), it follows that \( \Vert \text {diag}(w^{j})y\Vert _{2} = \Vert \text {diag}(w^{i})y\Vert _{2}\) for each y. As \(w^i,w^j\) have nonnegative entries, this means that \(w^{i} = w^{j}\). As \(i \in G\), this implies \(j \in G\) by the very definition of G as an equivalence class. This shows \(\text {supp}(w^G) \subset G\). Using also (29), we conclude that \(\text {supp}(w^G)=G\).
Let us now prove (30). Consider a given pair \(i,j \in \llbracket 1,n \rrbracket \). Assume that \(\Vert \text {diag}(w^{j})y\Vert _{2} \ne \Vert \text {diag}(w^{i})y\Vert _{2}\) for at least one vector y. Without loss of generality, assume that \(a := \Vert \text {diag}(w^{j})y\Vert _{2} < \Vert \text {diag}(w^{i})y\Vert _{2} =:b\). Rescaling y by a factor \(c = 2\lambda /(a+b)\) yields the existence of y such that for the considered pair i, j
$$\begin{aligned} \Vert \text {diag}(w^{j})y\Vert _{2}< \lambda < \Vert \text {diag}(w^{i})y\Vert _{2}. \end{aligned}$$
(31)
By continuity, perturbing y if needed we can also assume that for this pair i, j we have \(y_{i}y_{j} \ne 0\).
By (28), as (31) holds in a neighborhood of y, f is \(C^1\) at y and its partial derivatives for the considered pair i, j satisfy
$$\begin{aligned} \frac{\partial f_{i}}{\partial y_{j}}(y) = 2(w^{i}_{j})^{2}\, y_{i} y_{j}\, h'_i\left( \Vert \text {diag}(w^i)y\Vert _2^2\right) \quad \text {and}\quad \frac{\partial f_{j}}{\partial y_{i}}(y) = 0. \end{aligned}$$
Since f is a proximity operator, by Corollary 2 we have \(\tfrac{\partial f_{i}}{\partial y_{j}}(y) = \tfrac{\partial f_{j}}{\partial y_{i}}(y)\). It follows that for the considered pair i, j
$$\begin{aligned} 2(w^{i}_{j})^{2}\, y_{i} y_{j}\, h'_i\left( \Vert \text {diag}(w^i)y\Vert _2^2\right) = 0. \end{aligned}$$
As \(y_iy_j \ne 0\) and \(h'_i(t) \ne 0\) for \(t \ne 0\), we obtain \(w^{i}_{j} = 0\).
To conclude, we now show that \(w^j_i = 0\). As \(w^i_j=0\), \(f_i\) is in fact independent of \(y_j\) and \(\frac{\partial f_{i}}{\partial y_{j}}\) is identically zero on \({\mathbb {R}}^{n}\). By scaling y as needed, we get a vector \(y'\) such that \(y'_{i}y'_{j} \ne 0\) and
Reasoning as above yields \(2(w^j_i)^2 y'_j y'_i h'_j\left( \Vert \text {diag}(w^j)y'\Vert _2^2\right) = \frac{\partial f_{j}}{\partial y_{i}}(y') = \frac{\partial f_{i}}{\partial y_{j}}(y') =0\), hence \(w^{j}_{i} = 0\). We thus obtain that \(w^{i}_{j}=w^{j}_{i}=0\) as claimed, establishing (30) and therefore \(G = \text {supp}(w^G)\).
Cite this article
Gribonval, R., Nikolova, M. A Characterization of Proximity Operators. J Math Imaging Vis 62, 773–789 (2020). https://doi.org/10.1007/s10851-020-00951-y