
Sharp MSE Bounds for Proximal Denoising

Published in: Foundations of Computational Mathematics

Abstract

Denoising refers to estimating a signal \(\mathbf {x}_0\) from its noisy observations \(\mathbf {y}=\mathbf {x}_0+\mathbf {z}\). In this paper, we focus on the “structured denoising problem,” where the signal \(\mathbf {x}_0\) possesses a certain structure and \(\mathbf {z}\) has independent normally distributed entries with mean zero and variance \(\sigma ^2\). We employ a structure-inducing convex function \(f(\cdot )\) and solve \(\min _\mathbf {x}\{\frac{1}{2}\Vert \mathbf {y}-\mathbf {x}\Vert _2^2+\sigma {\lambda }f(\mathbf {x})\}\) to estimate \(\mathbf {x}_0\), for some \(\lambda >0\). Common choices for \(f(\cdot )\) include the \(\ell _1\) norm for sparse vectors, the \(\ell _1-\ell _2\) norm for block-sparse signals, and the nuclear norm for low-rank matrices. The metric we use to evaluate the performance of an estimate \(\mathbf {x}^*\) is the normalized mean-squared error \(\text {NMSE}(\sigma )=\frac{{\mathbb {E}}\Vert \mathbf {x}^*-\mathbf {x}_0\Vert _2^2}{\sigma ^2}\). We show that the NMSE is maximized as \(\sigma \rightarrow 0\), and we find the exact worst-case NMSE, which has a simple geometric interpretation: the mean-squared distance of a standard normal vector to the \({\lambda }\)-scaled subdifferential \({\lambda }\partial f(\mathbf {x}_0)\). When \({\lambda }\) is optimally tuned to minimize the worst-case NMSE, our results can be related to the constrained denoising problem \(\min _{f(\mathbf {x})\le f(\mathbf {x}_0)}\{\Vert \mathbf {y}-\mathbf {x}\Vert _2\}\). The paper also connects these results to the generalized LASSO problem, in which one solves \(\min _{f(\mathbf {x})\le f(\mathbf {x}_0)}\{\Vert \mathbf {y}-{\mathbf {A}}\mathbf {x}\Vert _2\}\) to estimate \(\mathbf {x}_0\) from noisy linear observations \(\mathbf {y}={\mathbf {A}}\mathbf {x}_0+\mathbf {z}\). We show that certain properties of the LASSO problem are closely related to the denoising problem. In particular, we characterize the normalized LASSO cost and show that it exhibits a “phase transition” as a function of the number of observations. We also provide an order-optimal bound for the LASSO error in terms of the mean-squared distance. Our results are significant in two ways. First, we find a simple formula for the performance of a general convex estimator. Second, we establish a connection between the denoising and linear inverse problems.
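For concreteness, the following minimal numerical sketch (added for this write-up, not part of the paper; all parameter choices are illustrative) instantiates the result for \(f(\cdot )=\Vert \cdot \Vert _1\): the estimator reduces to entrywise soft thresholding, and for small \(\sigma \) its empirical NMSE can be compared against the mean-squared distance of a standard normal vector to \({\lambda }\partial \Vert \mathbf {x}_0\Vert _1\).

```python
# Minimal sketch (illustrative parameters): for f = the l1 norm, the proximal
# estimator is entrywise soft thresholding, and for small sigma its NMSE should
# match the mean-squared distance of g ~ N(0, I) to lambda * (subdifferential
# of ||.||_1 at x0).
import numpy as np

rng = np.random.default_rng(0)
n, k, lam, sigma, trials = 500, 25, 2.0, 1e-3, 2000

x0 = np.zeros(n)
x0[:k] = rng.uniform(1.0, 2.0, size=k)   # k-sparse signal, nonzeros well away from zero

def soft_threshold(y, t):
    """Proximal map of t*||.||_1: argmin_x 0.5*||y - x||_2^2 + t*||x||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

# Empirical NMSE of the proximal denoiser at noise level sigma.
errs = []
for _ in range(trials):
    y = x0 + sigma * rng.standard_normal(n)
    x_star = soft_threshold(y, sigma * lam)
    errs.append(np.sum((x_star - x0) ** 2))
nmse = np.mean(errs) / sigma**2

# Mean-squared distance of g ~ N(0, I) to lambda*(subdifferential of ||.||_1 at x0):
# on the support the subgradient entry is fixed at lambda*sign(x0_i); off the
# support it ranges over [-lambda, lambda], so the contribution is max(|g_i|-lambda, 0).
g = rng.standard_normal((trials, n))
on = x0 != 0
msd = np.mean(np.sum((g[:, on] - lam * np.sign(x0[on])) ** 2, axis=1)
              + np.sum(np.maximum(np.abs(g[:, ~on]) - lam, 0.0) ** 2, axis=1))

print(f"empirical NMSE ~ {nmse:.1f}, mean-squared distance ~ {msd:.1f}")
```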


Notes

  1. Observe that if \(\mathbf {z}\) has independent entries with variance \(\sigma ^2\), \( \Vert \mathbf {z}\Vert _2 ^2\) will concentrate around \(\sigma ^2m\).

  2. These works appeared after the initial submission of this manuscript.

References

  1. A. Agarwal, S. Negahban, M. J. Wainwright, et al. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012.


  2. D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: Phase transitions in convex programs with random data. Inform. Inference, 2014.

  3. F. R. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, pages 118–126, 2010.

  4. A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In Advances in Neural Information Processing Systems, pages 1556–1564, 2014.

  5. R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Information Theory, IEEE Transactions on, 56(4):1982–2001, 2010.


  6. M. Bayati, M. Lelarge, and A. Montanari. Universality in polytope phase transitions and message passing algorithms. arXiv preprint  arXiv:1207.7321, 2012.

  7. M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. Information Theory, IEEE Transactions on, 57(2):764–785, 2011.


  8. M. Bayati and A. Montanari. The lasso risk for gaussian matrices. Information Theory, IEEE Transactions on, 58(4):1997–2017, 2012.


  9. A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.


  10. D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex analysis and optimization. Athena Scientific, Belmont, 2003.


  11. B. N. Bhaskar, G. Tang, and B. Recht. Atomic norm denoising with applications to line spectral estimation. Signal Processing, IEEE Transactions on, 61(23):5987–5999, 2013.


  12. P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.

  13. V. I. Bogachev. Gaussian measures. American Mathematical Society, Providence 1998.


  14. S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge 2009.


  15. F. Bunea, A. Tsybakov, M. Wegkamp, et al. Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1:169–194, 2007.


  16. J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.


  17. J.-F. Cai and W. Xu. Guarantees of total variation minimization for signal recovery. arXiv preprint  arXiv:1301.6791, 2013.

  18. T. T. Cai, T. Liang, and A. Rakhlin. Geometrizing local rates of convergence for linear inverse problems. arXiv preprint  arXiv:1404.4408, 2014.

  19. E. Candès and B. Recht. Simple bounds for recovering low-complexity models. Mathematical Programming, 141(1-2):577–589, 2013.


  20. E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.


  21. E. J. Candes and Y. Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. Information Theory, IEEE Transactions on, 57(4):2342–2359, 2011.


  22. E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009.


  23. E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. Information Theory, IEEE Transactions on, 52(2):489–509, 2006.


  24. E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on pure and applied mathematics, 59(8):1207–1223, 2006.


  25. E. J. Candes, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted \(\ell _1\) minimization. Journal of Fourier analysis and applications, 14(5-6):877–905, 2008.


  26. V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 110(13):E1181–E1190, 2013.


  27. V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.


  28. S. Chen and D. L. Donoho. Examples of basis pursuit. In SPIE’s 1995 International Symposium on Optical Science, Engineering, and Instrumentation, pages 564–574. International Society for Optics and Photonics, 1995.

  29. P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer, Berlin 2011.

  30. P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.


  31. D. Donoho, I. Johnstone, and A. Montanari. Accurate prediction of phase transitions in compressed sensing via a connection to minimax denoising. arXiv preprint  arXiv:1111.1041, 2011.

  32. D. Donoho and J. Tanner. Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4273–4293, 2009.


  33. D. L. Donoho. De-noising by soft-thresholding. Information Theory, IEEE Transactions on, 41(3):613–627, 1995.


  34. D. L. Donoho. Compressed sensing. Information Theory, IEEE Transactions on, 52(4):1289–1306, 2006.


  35. D. L. Donoho. High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension. Discrete & Computational Geometry, 35(4):617–652, 2006.


  36. D. L. Donoho and M. Gavish. Minimax risk of matrix denoising by singular value thresholding. arXiv preprint  arXiv:1304.2085, 2013.

  37. D. L. Donoho, M. Gavish, and A. Montanari. The phase transition of matrix recovery from gaussian measurements matches the minimax mse of matrix denoising. Proceedings of the National Academy of Sciences, 110(21):8405–8410, 2013.


  38. D. L. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.


  39. D. L. Donoho, A. Maleki, and A. Montanari. The noise-sensitivity phase transition in compressed sensing. Information Theory, IEEE Transactions on, 57(10):6920–6941, 2011.


  40. D. L. Donoho and J. Tanner. Neighborliness of randomly projected simplices in high dimensions. Proceedings of the National Academy of Sciences of the United States of America, 102(27):9452–9457, 2005.


  41. D. L. Donoho and J. Tanner. Thresholds for the recovery of sparse solutions via l1 minimization. In Information Sciences and Systems, 2006 40th Annual Conference on, pages 202–206. IEEE, 2006.

  42. Y. C. Eldar, P. Kuppinger, and H. Bolcskei. Block-sparse signals: Uncertainty relations and efficient recovery. Signal Processing, IEEE Transactions on, 58(6):3042–3054, 2010.


  43. M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.

  44. R. Foygel and L. Mackey. Corrupted sensing: Novel guarantees for separating structured signals. Information Theory, IEEE Transactions on, 60(2):1223–1247, 2014.


  45. Y. Gordon. On Milman’s inequality and random subspaces which escape through a mesh in \({\mathbb{R}}^n\). Springer, Berlin 1988.

  46. O. Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):403–419, 1991.


  47. E. T. Hale, W. Yin, and Y. Zhang. A fixed-point continuation method for l1-regularized minimization with applications to compressed sensing. CAAM TR07-07, Rice University, 2007.

  48. J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I: Part 1: Fundamentals, volume 305. Springer, Berlin 1996.


  49. R. Jenatton, J. Mairal, F. R. Bach, and G. R. Obozinski. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 487–494, 2010.

  50. M. A. Khajehnejad, A. G. Dimakis, W. Xu, and B. Hassibi. Sparse recovery of nonnegative signals with minimal expansion. Signal Processing, IEEE Transactions on, 59(1):196–208, 2011.


  51. M. A. Khajehnejad, W. Xu, A. S. Avestimehr, and B. Hassibi. Weighted \(\ell _1\) minimization for sparse recovery with prior information. In Information Theory, 2009. ISIT 2009. IEEE International Symposium on, pages 483–487. IEEE, 2009.

  52. V. Koltchinskii, K. Lounici, A. B. Tsybakov, et al. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.


  53. M. Ledoux. The concentration of measure phenomenon, volume 89. American Mathematical Society, Providence, 2005.


  54. M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer, Berlin 1991.


  55. J.-J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math, 255:2897–2899, 1962.


  56. D. Needell and R. Ward. Stable image reconstruction using total variation minimization. SIAM Journal on Imaging Sciences, 6(2):1035–1058, 2013.


  57. S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-dimensional analysis of \( m \)-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.

  58. S. Oymak and B. Hassibi. New null space results and recovery thresholds for matrix rank minimization. arXiv preprint  arXiv:1011.6326, 2010.

  59. S. Oymak and B. Hassibi. Tight recovery thresholds and robustness analysis for nuclear norm minimization. In Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, pages 2323–2327. IEEE, 2011.

  60. S. Oymak, A. Jalali, M. Fazel, Y. C. Eldar, and B. Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. arXiv preprint  arXiv:1212.3753, 2012.

  61. S. Oymak, C. Thrampoulidis, and B. Hassibi. The squared-error of generalized lasso: A precise analysis. arXiv preprint  arXiv:1311.0830, 2013.

  62. N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):123–231, 2013.


  63. N. Rao, B. Recht, and R. Nowak. Tight measurement bounds for exact recovery of structured sparse signals. arXiv preprint  arXiv:1106.4355, 2011.

  64. B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.


  65. B. Recht, W. Xu, and B. Hassibi. Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pages 3065–3070. IEEE, 2008.

  66. E. Richard, F. Bach, and J.-P. Vert. Intersecting singularities for multi-structured estimation. In ICML 2013-30th International Conference on Machine Learning, 2013.

  67. E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. arXiv preprint  arXiv:1206.6474, 2012.

  68. R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM journal on control and optimization, 14(5):877–898, 1976.


  69. R. T. Rockafellar. Convex analysis. Princeton University Press, Princeton 1997.


  70. L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992.


  71. A. A. Shabalin and A. B. Nobel. Reconstruction of a low-rank matrix in the presence of gaussian noise. Journal of Multivariate Analysis, 118:67–76, 2013.


  72. M. Stojnic. Various thresholds for \(\ell _1\)-optimization in compressed sensing. arXiv preprint  arXiv:0907.3666, 2009.

  73. M. Stojnic. A framework to characterize performance of lasso algorithms. arXiv preprint  arXiv:1303.7291, 2013.

  74. M. Stojnic. A performance analysis framework for socp algorithms in noisy compressed sensing. arXiv preprint  arXiv:1304.0002, 2013.

  75. M. Stojnic, F. Parvaresh, and B. Hassibi. On the reconstruction of block-sparse signals with an optimal number of measurements. Signal Processing, IEEE Transactions on, 57(8):3075–3085, 2009.


  76. M. Tao and X. Yuan. Recovering low-rank and sparse components of matrices from incomplete and noisy observations. SIAM Journal on Optimization, 21(1):57–81, 2011.


  77. C. Thrampoulidis, A. Panahi, and B. Hassibi. Asymptotically exact error analysis for the generalized \(\ell _2^2\)-lasso. arXiv preprint  arXiv:1502.06287, 2015.

  78. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

  79. N. Vaswani and W. Lu. Modified-cs: Modifying compressive sensing for problems with partially known support. Signal Processing, IEEE Transactions on, 58(9):4595–4607, 2010.


  80. J. Wright, A. Ganesh, K. Min, and Y. Ma. Compressive principal component pursuit. Information and Inference, 2(1):32–68, 2013.


  81. Z. Zhou, X. Li, J. Wright, E. Candes, and Y. Ma. Stable principal component pursuit. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1518–1522. IEEE, 2010.


Acknowledgments

This work was supported in part by the National Science Foundation under Grants CCF-0729203, CNS-0932428 and CIF-1018927, by the Office of Naval Research under the MURI Grant N00014-08-1-0747, and by a Grant from Qualcomm Inc. The authors would like to thank Michael McCoy and Joel Tropp for stimulating discussions and helpful comments. Michael McCoy pointed out Lemma 12.1 and informed us of various recent results, most importantly Theorem 7.1. S.O. would also like to thank his colleagues Kishore Jaganathan and Christos Thrampoulidis for their support, and the anonymous reviewers for their valuable suggestions.

Author information


Corresponding author

Correspondence to Samet Oymak.

Additional information

Communicated by Michael Todd.

Appendices

Auxiliary Results

Fact 10.1

(Hyperplane separation theorem [10]) Assume \({\mathcal {C}}_1,{\mathcal {C}}_2\subseteq {\mathbb {R}}^n\) are disjoint, closed and convex sets, at least one of which is compact. Then, there exists a hyperplane H such that \({\mathcal {C}}_1\) and \({\mathcal {C}}_2\) lie in different open half-spaces induced by H.

Fact 10.2

(Properties of the projection [10, 14]) Assume \({\mathcal {C}}\subseteq {\mathbb {R}}^n\) is a nonempty, closed and convex set and \(\mathbf {a},\mathbf {b}\in {\mathbb {R}}^n\) are arbitrary points. Then,

$$\begin{aligned} \Vert \text {Proj}(\mathbf {a},{\mathcal {C}})-\text {Proj}(\mathbf {b},{\mathcal {C}})\Vert _2\le \Vert \mathbf {a}-\mathbf {b}\Vert _2. \end{aligned}$$

The projection \(\text {Proj}(\mathbf {a},{\mathcal {C}})\) is the unique vector satisfying,

$$\begin{aligned} \text {Proj}(\mathbf {a},{\mathcal {C}})=\arg \min _{\mathbf {v}\in {\mathcal {C}}}\Vert \mathbf {a}-\mathbf {v}\Vert _2. \end{aligned}$$
(86)

The projection \(\text {Proj}(\mathbf {a},{\mathcal {C}})\) is also the unique vector \(\mathbf {s}_0\) that satisfies,

$$\begin{aligned} \left<\mathbf {s}_0,\mathbf {a}-\mathbf {s}_0\right>=\sup _{\mathbf {s}\in {\mathcal {C}}} \left<\mathbf {s},\mathbf {a}-\mathbf {s}_0\right>. \end{aligned}$$
(87)

In other words, \(\mathbf {a}\) and \({\mathcal {C}}\) lie in different half-spaces induced by the hyperplane that goes through \(\text {Proj}(\mathbf {a},{\mathcal {C}})\) and that is orthogonal to \(\mathbf {a}-\text {Proj}(\mathbf {a},{\mathcal {C}})\).

Fact 10.3

(Moreau’s decomposition theorem [55]) Let \({\mathcal {C}}\) be a closed and convex cone in \({\mathbb {R}}^n\). For any \(\mathbf {v}\in {\mathbb {R}}^n\), the following two statements are equivalent:

  • \(\mathbf {v}=\mathbf {a}+\mathbf {b}\), \(\mathbf {a}\in {\mathcal {C}},\mathbf {b}\in {\mathcal {C}}^*\) and \(\mathbf {a}^T\mathbf {b}=0\).

  • \(\mathbf {a}=\text {Proj}(\mathbf {v},{\mathcal {C}}),\mathbf {b}=\text {Proj}(\mathbf {v},{\mathcal {C}}^*)\).
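As a quick numerical illustration of Fact 10.3 (a sketch added here, not from the paper), one may take \({\mathcal {C}}\) to be the nonnegative orthant, whose polar cone, the \({\mathcal {C}}^*\) appearing in the decomposition above, is the nonpositive orthant; the two projections are then the entrywise positive and negative parts of \(\mathbf {v}\).

```python
# Hypothetical check of Moreau's decomposition (Fact 10.3) for C = the
# nonnegative orthant in R^10; its polar cone C* is the nonpositive orthant.
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(10)

proj_C = np.maximum(v, 0.0)       # Proj(v, C): entrywise positive part
proj_Cstar = np.minimum(v, 0.0)   # Proj(v, C*): entrywise negative part

assert np.allclose(v, proj_C + proj_Cstar)       # v = a + b with a in C, b in C*
assert abs(np.dot(proj_C, proj_Cstar)) < 1e-12   # a and b are orthogonal
```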

Definition 10.1

(Lipschitz function) \(h(\cdot ):{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is called L-Lipschitz if for all \(\mathbf {x},\mathbf {y}\in {\mathbb {R}}^n\), \(|h(\mathbf {x})-h(\mathbf {y})|\le L\Vert \mathbf {x}-\mathbf {y}\Vert _2\).

The next fact provides a concentration inequality for Lipschitz functions of Gaussian vectors [54].

Fact 10.4

Let \(\mathbf {g}\sim {\mathcal {N}}(0,I)\) and \(h(\cdot ):{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be an L-Lipschitz function. Then for all \(t\ge 0\):

$$\begin{aligned} {\mathbb {P}}(|h(\mathbf {g})-{\mathbb {E}}[h(\mathbf {g})]|\ge t)\le 2\exp \left( -\frac{t^2}{2L^2}\right) . \end{aligned}$$
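A short Monte Carlo sketch (illustrative only; the choice \(h(\cdot )=\Vert \cdot \Vert _2\), which is 1-Lipschitz, and all constants are ours) checking this bound:

```python
# Illustrative Monte Carlo check of Fact 10.4 for the 1-Lipschitz function
# h(g) = ||g||_2 (so L = 1); the empirical tail should sit below 2*exp(-t^2/2).
import numpy as np

rng = np.random.default_rng(2)
n, samples = 100, 100_000
h = np.linalg.norm(rng.standard_normal((samples, n)), axis=1)
mean_h = h.mean()   # empirical proxy for E[h(g)]

for t in (0.5, 1.0, 2.0):
    empirical = np.mean(np.abs(h - mean_h) >= t)
    bound = 2 * np.exp(-t ** 2 / 2)
    print(f"t={t}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```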

Lemma 10.1

For any \(\mathbf {g}\sim {\mathcal {N}}(0,I)\), \(c>1\), we have that

$$\begin{aligned} {\mathbb {P}}(\Vert \mathbf {g}\Vert _2\ge c\sqrt{n})\le 2\exp \left( -\frac{(c-1)^2n}{2}\right) . \end{aligned}$$

Proof

First, \({\mathbb {E}}[\Vert \mathbf {g}\Vert _2]\le \sqrt{{\mathbb {E}}[\Vert \mathbf {g}\Vert _2^2]}=\sqrt{n}\) by Jensen’s inequality. Second, the \(\ell _2\) norm is a 1-Lipschitz function due to the triangle inequality. Hence, applying Fact 10.4,

$$\begin{aligned} {\mathbb {P}}(\Vert \mathbf {g}\Vert _2\ge c\sqrt{n})\le {\mathbb {P}}(\Vert \mathbf {g}\Vert _2\ge (c-1)\sqrt{n}+{\mathbb {E}}[\Vert \mathbf {g}\Vert _2])\le 2\exp \left( -\frac{(c-1)^2n}{2}\right) . \end{aligned}$$

\(\square \)

Lemma 10.2

Let \({\mathcal {C}}\) be a closed and convex cone in \({\mathbb {R}}^n\). Then, \({\mathbf{D}}({\mathcal {C}})+{\mathbf{D}}({\mathcal {C}}^*)=n\).

Proof

Using Fact 10.3, any \(\mathbf {v}\in {\mathbb {R}}^n\) can be written as \(\mathbf {v}=\text {Proj}(\mathbf {v},{\mathcal {C}})+\text {Proj}(\mathbf {v},{\mathcal {C}}^*)\) with \(\left<\text {Proj}(\mathbf {v},{\mathcal {C}}),\text {Proj}(\mathbf {v},{\mathcal {C}}^*)\right>=0\). Hence,

$$\begin{aligned} \Vert \mathbf {v}\Vert ^2=\Vert \text {Proj}(\mathbf {v},{\mathcal {C}}^*)\Vert ^2+\Vert \text {Proj}(\mathbf {v},{\mathcal {C}})\Vert ^2=\text {dist}(\mathbf {v},{\mathcal {C}})^2+\text {dist}(\mathbf {v},{\mathcal {C}}^*)^2. \end{aligned}$$

Letting \(\mathbf {v}\sim {\mathcal {N}}(0,{\mathbf{I}})\) and taking the expectations, we can conclude. \(\square \)
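The identity of Lemma 10.2 is easy to check numerically. The sketch below (added for illustration, not part of the paper) uses the nonnegative orthant, for which each expected squared distance equals \(n/2\); here \({\mathbf{D}}(K)\) is taken, as in the proof above, to be \({\mathbb {E}}[\text {dist}(\mathbf {g},K)^2]\) for \(\mathbf {g}\sim {\mathcal {N}}(0,{\mathbf{I}})\).

```python
# Monte Carlo check of Lemma 10.2 for C = the nonnegative orthant in R^n,
# with D(K) = E[dist(g, K)^2] and C* the polar cone (the nonpositive orthant).
import numpy as np

rng = np.random.default_rng(3)
n, samples = 50, 100_000
g = rng.standard_normal((samples, n))

dist2_C = np.sum(np.minimum(g, 0.0) ** 2, axis=1)      # dist(g, C)^2
dist2_Cstar = np.sum(np.maximum(g, 0.0) ** 2, axis=1)  # dist(g, C*)^2

print(dist2_C.mean() + dist2_Cstar.mean(), "vs n =", n)   # each term is ~ n/2
```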

Subdifferential of the Approximation

Proof (of Lemma 3.3)

Recall that \(\hat{f}_{\mathbf {x}_0}(\mathbf {x}_0+\mathbf {v})-f(\mathbf {x}_0)\) is equal to the directional derivative \(f'({\mathbf {x}_0},\mathbf {v})=\sup _{\mathbf {s}\in \partial f(\mathbf {x}_0)} \left<\mathbf {s},\mathbf {v}\right>\). Also recall the “set of maximizing subgradients” from (34). Clearly, \(\partial f'({\mathbf {x}_0},\mathbf {v})=\partial \hat{f}_{\mathbf {x}_0}(\mathbf {x}_0+\mathbf {v})\). We will let \(\mathbf {x}={\mathbf {w}}+\mathbf {x}_0\) and investigate \(\partial f'({\mathbf {x}_0},{\mathbf {w}})\) as a function of \({\mathbf {w}}\).

If \({\mathbf {w}}=0\): For any \(\mathbf {s}\in \partial f(\mathbf {x}_0)\) and any \(\mathbf {v}\), by definition we have:

$$\begin{aligned} f'({\mathbf {x}_0},\mathbf {v})-f'({\mathbf {x}_0},0)=f'({\mathbf {x}_0},\mathbf {v})=\sup _{\mathbf {s}'\in \partial f(\mathbf {x}_0)} \left<\mathbf {v},\mathbf {s}'\right>\ge \left<\mathbf {v},\mathbf {s}\right>\end{aligned}$$

hence \(\mathbf {s}\in \partial f'({\mathbf {x}_0},0)\). Conversely, assume \(\mathbf {s}\not \in \partial f(\mathbf {x}_0)\). Then, there exists \(\mathbf {v}\) such that:

$$\begin{aligned} f(\mathbf {v}+\mathbf {x}_0)<f(\mathbf {x}_0)+\left<\mathbf {v},\mathbf {s}\right>. \end{aligned}$$

By convexity for any \(\epsilon >0\):

$$\begin{aligned} \frac{f(\epsilon \mathbf {v}+\mathbf {x}_0)-f(\mathbf {x}_0)}{\epsilon }\le f(\mathbf {v}+\mathbf {x}_0)-f(\mathbf {x}_0)<\left<\mathbf {v},\mathbf {s}\right>. \end{aligned}$$

Taking \(\epsilon \rightarrow 0\) on the left-hand side, we find:

$$\begin{aligned} f'({\mathbf {x}_0},\mathbf {v})-f'({\mathbf {x}_0},0)=f'({\mathbf {x}_0},\mathbf {v})<\left<\mathbf {v},\mathbf {s}\right>\end{aligned}$$

which implies \(\mathbf {s}\not \in \partial f'({\mathbf {x}_0},0)\).

If \({\mathbf {w}}\ne 0\): Assume \(\mathbf {s}\in \partial f(\mathbf {x}_0,{\mathbf {w}})\). Then, for any \(\mathbf {v}\), we have:

$$\begin{aligned} f'({\mathbf {x}_0},{\mathbf {w}}+\mathbf {v})-f'({\mathbf {x}_0},{\mathbf {w}})&=\sup _{\mathbf {s}_1\in \partial f(\mathbf {x}_0)} \left<{\mathbf {w}}+\mathbf {v},\mathbf {s}_1\right>-\sup _{\mathbf {s}_2\in \partial f(\mathbf {x}_0)} \left<{\mathbf {w}},\mathbf {s}_2\right>\\&=\sup _{\mathbf {s}_1\in \partial f(\mathbf {x}_0)} \left<{\mathbf {w}}+\mathbf {v},\mathbf {s}_1\right>- \left<{\mathbf {w}},\mathbf {s}\right>\ge \left<\mathbf {v},\mathbf {s}\right>. \end{aligned}$$

Hence, \(\mathbf {s}\in \partial f'({\mathbf {x}_0},{\mathbf {w}})\). Conversely, assume \(\mathbf {s}\not \in \partial f(\mathbf {x}_0,{\mathbf {w}})\). Then, we will argue that \(\mathbf {s}\not \in \partial f'({\mathbf {x}_0},{\mathbf {w}})\).

Write \(f'({\mathbf {x}_0},{\mathbf {w}})=c\Vert {\mathbf {w}}\Vert _2^2\) for the scalar \(c=c({\mathbf {w}})\) determined by this equality. We can write \(\mathbf {s}=a{\mathbf {w}}+\mathbf {u}\) where \(\mathbf {u}^T{\mathbf {w}}=0\). Choose \(\mathbf {v}=\epsilon {\mathbf {w}}\) with \(|\epsilon |<1\). We end up with:

$$\begin{aligned} f'({\mathbf {x}_0},{\mathbf {w}}+\mathbf {v})-f'({\mathbf {x}_0},{\mathbf {w}})=\epsilon \sup _{\mathbf {s}_1\in \partial f(\mathbf {x}_0)} \left<{\mathbf {w}},\mathbf {s}_1\right>=c\epsilon \Vert {\mathbf {w}}\Vert _2^2\ge \left<\mathbf {s},\mathbf {v}\right>=a\epsilon \Vert {\mathbf {w}}\Vert _2^2. \end{aligned}$$

Consequently, we have \(c\epsilon \ge a\epsilon \) for all \(|\epsilon |<1\), which implies \(a=c\). Hence, \(\mathbf {s}\) can be written as \(c{\mathbf {w}}+\mathbf {u}\). Now, if \(\mathbf {s}\in \partial f(\mathbf {x}_0)\), then \(\mathbf {s}\in \partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)\), as it maximizes \(\left<\mathbf {s}',{\mathbf {w}}\right>\) over \(\mathbf {s}'\in \partial f(\mathbf {x}_0)\). However, we assumed \(\mathbf {s}\not \in \partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)\). Observe that \(\mathbf {u}=\mathbf {s}-c{\mathbf {w}}\) and \(\partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)-c{\mathbf {w}}\) lie in the \((n-1)\)-dimensional subspace H that is orthogonal to \({\mathbf {w}}\). By assumption, \(\mathbf {u}\not \in \partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)-c{\mathbf {w}}\). We will argue that this leads to a contradiction. Making use of the convexity of \(\partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)-c{\mathbf {w}}\) and invoking the hyperplane separation theorem (Fact 10.1), we can find a direction \(\mathbf {h}\in H\) such that:

$$\begin{aligned} \left<\mathbf {h},\mathbf {u}\right>>\sup _{\mathbf {s}'\in \partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)-c{\mathbf {w}}}\left<\mathbf {h},\mathbf {s}'\right>. \end{aligned}$$
(88)

Next, considering \(\epsilon \mathbf {h}\) perturbation, we have:

$$\begin{aligned} f'({\mathbf {x}_0},{\mathbf {w}}+\epsilon \mathbf {h})-f'({\mathbf {x}_0},{\mathbf {w}})=\sup _{\mathbf {s}_1\in \partial f(\mathbf {x}_0)}( \epsilon \left<\mathbf {h},\mathbf {s}_1\right>-\sup _{\mathbf {s}_2\in \partial f(\mathbf {x}_0)} \left<{\mathbf {w}},\mathbf {s}_2-\mathbf {s}_1\right>). \end{aligned}$$

Denote by \(\mathbf {s}_1^*\) an \(\mathbf {s}_1\) that achieves the supremum.

Claim As \(\epsilon \rightarrow 0\), \(\left<\mathbf {s}_1^*,{\mathbf {w}}\right>\rightarrow c\Vert {\mathbf {w}}\Vert _2^2\).

Proof

Recall that \(\partial f(\mathbf {x}_0)\) is bounded. Let \(R=\sup _{\mathbf {s}'\in \partial f(\mathbf {x}_0)}\Vert \mathbf {s}'\Vert _2\). Choosing \(\mathbf {s}_1\in \partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)\), we always have:

$$\begin{aligned} f'({\mathbf {x}_0},{\mathbf {w}}+\epsilon \mathbf {h})-f'({\mathbf {x}_0},{\mathbf {w}})\ge \epsilon \left<\mathbf {s}_1,\mathbf {h}\right>\ge -\epsilon R\Vert \mathbf {h}\Vert _2. \end{aligned}$$

On the other hand, for any \(\mathbf {s}_1\) we may write:

$$\begin{aligned} \epsilon \left<\mathbf {h},\mathbf {s}_1\right>-\sup _{\mathbf {s}_2\in \partial f(\mathbf {x}_0)} \left<{\mathbf {w}},\mathbf {s}_2-\mathbf {s}_1\right>\le \epsilon R\Vert \mathbf {h}\Vert _2+\left<\mathbf {s}_1,{\mathbf {w}}\right>-c\Vert {\mathbf {w}}\Vert _2^2. \end{aligned}$$

Hence, for \(\mathbf {s}_1^*\), we obtain:

$$\begin{aligned} \epsilon R\Vert \mathbf {h}\Vert _2+\left<\mathbf {s}_1^*,{\mathbf {w}}\right>-c\Vert {\mathbf {w}}\Vert _2^2\ge -\epsilon R\Vert \mathbf {h}\Vert _2\implies \left<\mathbf {s}_1^*,{\mathbf {w}}\right>\ge c\Vert {\mathbf {w}}\Vert _2^2-2\epsilon R\Vert \mathbf {h}\Vert _2. \end{aligned}$$

Letting \(\epsilon \rightarrow 0\), we obtain the desired result. \(\square \)

Claim Given \(\partial f(\mathbf {x}_0)\), for any \(\epsilon '>0\) there exists a \(\delta >0\) such that for all \(\mathbf {s}_1\in \partial f(\mathbf {x}_0)\) satisfying \(\left<\mathbf {s}_1,{\mathbf {w}}\right>>c\Vert {\mathbf {w}}\Vert _2^2-\delta \) we have \(\text {dist}(\mathbf {s}_1,\partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0))<\epsilon '\).

Proof

Assume the claim is false for some \(\epsilon '>0\). Then, we can construct a sequence \(\mathbf {s}(i)\) such that \(\text {dist}(\mathbf {s}(i),\partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0))\ge \epsilon '\) but \(\left<\mathbf {s}(i),{\mathbf {w}}\right>\rightarrow c\Vert {\mathbf {w}}\Vert _2^2\). By the Bolzano–Weierstrass theorem and the compactness of \(\partial f(\mathbf {x}_0)\subseteq {\mathbb {R}}^n\), \(\mathbf {s}(i)\) has a convergent subsequence whose limit \(\mathbf {s}(\infty )\) lies inside \(\partial f(\mathbf {x}_0)\) and satisfies \(\left<\mathbf {s}(\infty ),{\mathbf {w}}\right>=c\Vert {\mathbf {w}}\Vert _2^2=f'({\mathbf {x}_0},{\mathbf {w}})\). On the other hand, \(\text {dist}(\mathbf {s}(\infty ),\partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0))\ge \epsilon '\implies \mathbf {s}(\infty )\not \in \partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)\), which is a contradiction.

Going back to the main argument: using the first claim, as \(\epsilon \rightarrow 0\), \(\left<\mathbf {s}_1^*,{\mathbf {w}}\right>\rightarrow c\Vert {\mathbf {w}}\Vert _2^2\). Using the second claim, this implies that, for some \(\delta \) which approaches 0 as \(\epsilon \rightarrow 0\), we have:

$$\begin{aligned} \sup _{\mathbf {s}_1\in \partial f(\mathbf {x}_0)}( \epsilon \left<\mathbf {h},\mathbf {s}_1\right>-\sup _{\mathbf {s}_2\in \partial f(\mathbf {x}_0)} \left<{\mathbf {w}},\mathbf {s}_2-\mathbf {s}_1\right>)\le \epsilon (\delta \Vert \mathbf {h}\Vert _2+\sup _{\mathbf {s}'\in \partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)-c{\mathbf {w}}}\left<\mathbf {s}',\mathbf {h}\right>). \end{aligned}$$

Finally, based on (88), whenever \(\epsilon \) is chosen to ensure \(\delta \Vert \mathbf {h}\Vert _2<\left<\mathbf {h},\mathbf {u}\right>-\sup _{\mathbf {s}'\in \partial f(\mathbf {x}_0,\mathbf {x}-\mathbf {x}_0)-c{\mathbf {w}}}\left<\mathbf {s}',\mathbf {h}\right>\) we have,

$$\begin{aligned} f'({\mathbf {x}_0},{\mathbf {w}}+\epsilon \mathbf {h})-f'({\mathbf {x}_0},{\mathbf {w}})<\epsilon \left<\mathbf {h},\mathbf {u}\right>, \end{aligned}$$

which contradicts the initial assumption that \(\mathbf {s}\) is a subgradient of \(f'({\mathbf {x}_0},\cdot )\) at \({\mathbf {w}}\), since,

$$\begin{aligned} f'({\mathbf {x}_0},{\mathbf {w}}+\epsilon \mathbf {h})-f'({\mathbf {x}_0},{\mathbf {w}})\ge \left<\mathbf {s},\epsilon \mathbf {h}\right>=\epsilon \left<\mathbf {u},\mathbf {h}\right>. \end{aligned}$$

\(\square \)

Lemma 11.1

\(\hat{f}_{\mathbf {x}_0}(\mathbf {x})\) is a convex function of \(\mathbf {x}\).

Proof

To show convexity, we need to argue that the function \(f'({\mathbf {x}_0},{\mathbf {w}})\) is a convex function of \({\mathbf {w}}=\mathbf {x}-\mathbf {x}_0\).

Observe that \(g({\mathbf {w}})=f(\mathbf {x}_0+{\mathbf {w}})-f(\mathbf {x}_0)\) is a convex function of \({\mathbf {w}}\) and behaves the same as the directional derivative \(f'({\mathbf {x}_0},{\mathbf {w}})\) for sufficiently small \({\mathbf {w}}\). More rigorously, from (32), for any \({\mathbf {w}}_1,{\mathbf {w}}_2\in {\mathbb {R}}^n\) and \(\delta >0\), there exists \(\epsilon >0\) such that:

$$\begin{aligned} g(\epsilon {\mathbf {w}}_1)\le f'({\mathbf {x}_0},\epsilon {\mathbf {w}}_1)+\delta \epsilon , ~g(\epsilon {\mathbf {w}}_2)\le f'({\mathbf {x}_0},\epsilon {\mathbf {w}}_2)+\delta \epsilon . \end{aligned}$$

Hence, for any \(0\le c\le 1\):

$$\begin{aligned} f'({\mathbf {x}_0},\epsilon (c{\mathbf {w}}_1+(1-c){\mathbf {w}}_2))&\le g(\epsilon (c{\mathbf {w}}_1+(1-c){\mathbf {w}}_2))\\&\le cg(\epsilon {\mathbf {w}}_1)+(1-c)g(\epsilon {\mathbf {w}}_2)\\&\le cf'({\mathbf {x}_0},\epsilon {\mathbf {w}}_1)+(1-c)f'({\mathbf {x}_0},\epsilon {\mathbf {w}}_2)+\epsilon \delta \end{aligned}$$

Making use of the fact that \(f'({\mathbf {x}_0},\epsilon \mathbf {s})=\epsilon f'({\mathbf {x}_0},\mathbf {s})\) for any direction \(\mathbf {s}\), we obtain:

$$\begin{aligned} f'({\mathbf {x}_0},c{\mathbf {w}}_1+(1-c){\mathbf {w}}_2)\le cf'({\mathbf {x}_0},{\mathbf {w}}_1)+(1-c)f'({\mathbf {x}_0},{\mathbf {w}}_2)+ \delta . \end{aligned}$$

Letting \(\delta \rightarrow 0\), we may conclude with the convexity of \(f'({\mathbf {x}_0},\cdot )\) and problem (37). \(\square \)

Swapping the Minimization over \(\tau \) and the Expectation

Lemma 12.1

([13, 53]) Assume \(\mathbf {g}\sim {\mathcal {N}}(0,{\mathbf{I}}_n)\) and let \(h(\cdot ):{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be an L-Lipschitz function. Then, we have:

$$\begin{aligned} \text {Var}(h(\mathbf {g}))\le L^2 \end{aligned}$$

We next show a closely related result.

Lemma 12.2

Assume \(\mathbf {g}\sim {\mathcal {N}}(0,{\mathbf{I}}_n)\) and let \(h(\cdot ):{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be an L-Lipschitz function. Then, we have:

$$\begin{aligned} {\mathbb {E}}\left[ |h(\mathbf {g})-{\mathbb {E}}[h(\mathbf {g})]|\right] \le \sqrt{2\pi }L \end{aligned}$$

Proof

From the Lipschitzness of \(h(\cdot )\), letting \(\mathbf {a}=h(\mathbf {g})-{\mathbb {E}}[h(\mathbf {g})]\) and invoking Fact 10.4, for all \(t\ge 0\) we have:

$$\begin{aligned} {\mathbb {P}}(|\mathbf {a}-{\mathbb {E}}[\mathbf {a}]|\ge t)= {\mathbb {P}}(|\mathbf {a}|\ge t)\le 2\exp \left( -\frac{t^2}{2L^2}\right) \end{aligned}$$

Denote the probability density function of \(|\mathbf {a}|\) by \(p(\cdot )\) and let \(Q(u)={\mathbb {P}}(|\mathbf {a}|\ge u)\). We may write:

$$\begin{aligned} {\mathbb {E}}[|\mathbf {a}|]=\int _{0}^{\infty }u p(u)du=\int _{\infty }^{0}udQ(u)=[uQ(u)]_{\infty }^0+\int _{0}^{\infty }Q(u)du \end{aligned}$$

Using \(Q(u)\le 2\exp \left( -\frac{u^2}{2L^2}\right) \) for \(u\ge 0\), we have:

$$\begin{aligned}{}[uQ(u)]_{\infty }^0=\left[ 2u\exp \left( -\frac{u^2}{2L^2}\right) \right] _{\infty }^0=0 \end{aligned}$$

Next,

$$\begin{aligned} \int _{0}^{\infty }Q(u)du\le \int _{0}^{\infty }2\exp \left( -\frac{u^2}{2L^2}\right) du=\sqrt{2\pi }L \end{aligned}$$

\(\square \)

Lemma 12.3

Suppose Assumption 6.1 holds. Recall that \(\tau (\mathbf {v})=\arg \min _{\tau \ge 0}\text {dist}(\mathbf {v},\tau \partial f(\mathbf {x}_0))\). Then, for all \(\mathbf {v}_1,\mathbf {v}_2\),

$$\begin{aligned} |\tau (\mathbf {v}_1)-\tau (\mathbf {v}_2)|\le \frac{\Vert \mathbf {v}_1-\mathbf {v}_2\Vert _2}{\Vert \mathbf {e}\Vert _2} \end{aligned}$$
(89)

Hence, \(\tau (\mathbf {v})\) is an \(\Vert \mathbf {e}\Vert _2^{-1}\)-Lipschitz function of \(\mathbf {v}\).

Proof

Let \(\mathbf {a}_i=\text {Proj}(\mathbf {v}_i,\text {cone}(\partial f(\mathbf {x}_0)))\) for \(1\le i\le 2\). Using Fact 10.2, we have \(\Vert \mathbf {a}_1-\mathbf {a}_2\Vert _2\le \Vert \mathbf {v}_1-\mathbf {v}_2\Vert _2\) as \(\text {cone}(\partial f(\mathbf {x}_0))\) is convex. Now, we will further lower bound \(\Vert \mathbf {v}_1-\mathbf {v}_2\Vert _2\) as follows:

$$\begin{aligned} \Vert \text {Proj}(\mathbf {a}_1-\mathbf {a}_2,T)\Vert _2\le \Vert \mathbf {a}_1-\mathbf {a}_2\Vert _2\le \Vert \mathbf {v}_1-\mathbf {v}_2\Vert _2 \end{aligned}$$

Now, observe that \(\Vert \text {Proj}(\mathbf {a}_1-\mathbf {a}_2,T)\Vert _2=\Vert \tau (\mathbf {v}_1)\mathbf {e}-\tau (\mathbf {v}_2)\mathbf {e}\Vert _2\). Hence, we may conclude with (89). \(\square \)

Lemma 12.4

Let \({\mathcal {C}}\) be a convex and closed set. Define the set of \(\tau \) that minimizes \(\text {dist}(\mathbf {v},\tau {\mathcal {C}})\),

$$\begin{aligned} {\mathbf{T}}(\mathbf {v})=\left\{ \tau \ge 0\,\big |\,\text {dist}(\mathbf {v},\tau {\mathcal {C}})=\min _{\tau '\ge 0}\text {dist}(\mathbf {v},\tau '{\mathcal {C}})\right\} \end{aligned}$$

and let \(\tau (\mathbf {v})=\inf _{\tau \in {\mathbf{T}}(\mathbf {v})}\tau \). \(\tau (\mathbf {v})\) is uniquely determined, given \({\mathcal {C}}\) and \(\mathbf {v}\). Further, assume \(\tau (\mathbf {v})\) is an L-Lipschitz function of \(\mathbf {v}\) and let \(R:=R({\mathcal {C}})=\max _{\mathbf {u}\in {\mathcal {C}}}\Vert \mathbf {u}\Vert _2\). Then,

$$\begin{aligned} \min _{\tau \ge 0}{\mathbb {E}}[\text {dist}(\mathbf {g},\tau {\mathcal {C}})^2]\le {\mathbf {D}}(\text {cone}({\mathcal {C}}))+2\pi (R^2L^2+RL\sqrt{{\mathbf {D}}(\text {cone}({\mathcal {C}}))}+1) \end{aligned}$$

Proof

Let \(\mathbf {g}\sim {\mathcal {N}}(0,{\mathbf{I}})\) and let \(\tau ^*={\mathbb {E}}[\tau (\mathbf {g})]\). Now, from triangle inequality:

$$\begin{aligned} |\tau (\mathbf {v})-\tau ^*|\le t\implies \text {dist}(\mathbf {v},\tau ^* {\mathcal {C}})\le \text {dist}(\mathbf {v},\tau (\mathbf {v}) {\mathcal {C}})+Rt \end{aligned}$$

Consequently,

$$\begin{aligned} {\mathbb {E}}[\text {dist}(\mathbf {g},\tau (\mathbf {g}){\mathcal {C}})]\le & {} \min _{\tau \ge 0}{\mathbb {E}}[\text {dist}(\mathbf {g},\tau {\mathcal {C}})]\le {\mathbb {E}}[\text {dist}(\mathbf {g},\tau ^*{\mathcal {C}})]\\&\le {\mathbb {E}}[\text {dist}(\mathbf {g},\tau (\mathbf {g}){\mathcal {C}})+R|\tau (\mathbf {g})-\tau ^*|] \end{aligned}$$

This gives:

$$\begin{aligned} {\mathbb {E}}[\text {dist}(\mathbf {g},\tau ^*{\mathcal {C}})]-{\mathbb {E}}[\text {dist}(\mathbf {g},\tau (\mathbf {g}){\mathcal {C}})]\le R{\mathbb {E}}[|\tau (\mathbf {g})-\tau ^*|] \end{aligned}$$

Observing \({\mathbb {E}}[\text {dist}(\mathbf {g},\tau (\mathbf {g}){\mathcal {C}})]={\mathbb {E}}[\text {dist}(\mathbf {g},\text {cone}({\mathcal {C}}))]\le \sqrt{{\mathbf {D}}(\text {cone}({\mathcal {C}}))}\), and using Lemma 12.2 we find:

$$\begin{aligned} {\mathbb {E}}[\text {dist}(\mathbf {g},\tau ^*{\mathcal {C}})]-\sqrt{{\mathbf {D}}(\text {cone}({\mathcal {C}}))}\le \sqrt{2\pi }RL \end{aligned}$$

This yields:

$$\begin{aligned} {\mathbb {E}}[\text {dist}(\mathbf {g},\tau ^*{\mathcal {C}})]^2-{\mathbf {D}}(\text {cone}({\mathcal {C}}))\le \sqrt{2\pi }RL(2\sqrt{{\mathbf {D}}(\text {cone}({\mathcal {C}}))}+\sqrt{2\pi }RL) \end{aligned}$$

Using Lemma 12.1 and the fact that \(\text {dist}(\cdot ,\tau ^*{\mathcal {C}})\) is 1-Lipschitz, we have \({\mathbb {E}}[\text {dist}(\mathbf {g},\tau ^*{\mathcal {C}})]^2\ge {\mathbb {E}}[\text {dist}(\mathbf {g},\tau ^*{\mathcal {C}})^2]-1\), which gives:

$$\begin{aligned} \min _{\tau \ge 0}{\mathbb {E}}[\text {dist}(\mathbf {g},\tau {\mathcal {C}})^2]-{\mathbf {D}}(\text {cone}({\mathcal {C}}))\le 2\pi R^2L^2+2\sqrt{2\pi }RL\sqrt{{\mathbf {D}}(\text {cone}({\mathcal {C}}))}+1 \end{aligned}$$

\(\square \)

Intersection of a Cone and a Subspace

1.1 Intersections of Randomly Oriented Cones

Based on the kinematic formula (Theorem 7.1), one may derive the following result on the intersection of two cones. We first consider the scenario in which one of the cones is a subspace.

Proposition 13.1

(Intersection with a subspace) Let A be a closed and convex cone and let B be a linear subspace. Denote \(\delta (A)+\delta (B)-n\) by \(\delta (A,B)\). Assume the unitary \({\mathbf {U}}\) is generated uniformly at random. Given \(\epsilon >0\), we have the following:

  • If \(\delta (A)+\delta (B)+\epsilon \sqrt{n}>n\),

    $$\begin{aligned} {\mathbb {P}}(\delta (A\cap {\mathbf {U}}B)\ge \delta (A,B)+\epsilon \sqrt{n})\le 8\exp \left( -\frac{\epsilon ^2}{64}\right) . \end{aligned}$$
  • \({\mathbb {P}}(\delta (A\cap {\mathbf {U}}B)\le \delta (A,B)-\epsilon \sqrt{n})\le 8\exp \left( -\frac{\epsilon ^2}{64}\right) \).

Proof

Denote \(A\cap {\mathbf {U}}B\) by C. Let H be a subspace with dimension \(n-d\) chosen uniformly at random, independently of \({\mathbf {U}}\). Observe that \({\mathbf {U}}B\cap H\) is a \((\delta (B)-d)\)-dimensional random subspace for \(d<\delta (B)\). Hence, using Theorem 7.1 with A and \({\mathbf {U}}B\cap H\) yields:

$$\begin{aligned}&\delta (A)+\delta (B)-d\le n-t \sqrt{n}\implies {\mathbb {P}}(A\cap {\mathbf {U}}B\cap H=\{0\})\ge 1-4\exp \left( -\frac{t^2}{16}\right) \end{aligned}$$
(90)
$$\begin{aligned}&\delta (A)+\delta (B)-d\ge n+t \sqrt{n}\implies {\mathbb {P}}(A\cap {\mathbf {U}}B\cap H=\{0\})\le 4\exp \left( -\frac{t^2}{16}\right) . \end{aligned}$$
(91)

Observe that (90) is true even when \(d\ge \delta (B)\) since if \(d\ge \delta (B)\), \({\mathbf {U}}B\cap H=\{0\}\) with probability 1.

Proving the first statement: Let \(\gamma =\delta (A)+\delta (B)-n\), \(\gamma _{\epsilon }=\gamma +\epsilon \sqrt{n}\) and \(\gamma _{\epsilon /2}=\gamma +\frac{\epsilon }{2}\sqrt{n}\). We assume \(\gamma _{\epsilon }>0\). Observing \(A\cap {\mathbf {U}}B\cap H=C\cap H\), we may write:

$$\begin{aligned}&{\mathbb {P}}(C\cap H=\{0\})\le {\mathbb {P}}(C\cap H=\{0\}\big |\delta (C)\ge \gamma _{\epsilon })+{\mathbb {P}}(\delta (C)\le \gamma _{\epsilon })\nonumber \\&\text {and}~~~{\mathbb {P}}(\delta (C)\le \gamma _{\epsilon })\ge {\mathbb {P}}(C\cap H=\{0\})- {\mathbb {P}}(C\cap H=\{0\}\big |\delta (C)\ge \gamma _{\epsilon }) \end{aligned}$$
(92)

If \(\gamma _{\epsilon }>n\), \({\mathbb {P}}(\delta (C)\le \gamma _{\epsilon })=1\). Otherwise, choose \(d=\max \{\gamma _{\epsilon /2},0\}\).

Case 1 If \(d=0\), then \(\gamma _{\epsilon /2}\le 0\) and \(H={\mathbb {R}}^n\). This gives,

$$\begin{aligned} {\mathbb {P}}(C\cap H=\{0\}\big |\delta (C)\ge \gamma _{\epsilon })={\mathbb {P}}(C=\{0\}\big |\delta (C)\ge \gamma _{\epsilon })=0. \end{aligned}$$
(93)

Also, choosing \(t=\frac{\epsilon }{2}\sqrt{n}\) in (90) and using \(\gamma \le -\frac{\epsilon }{2}\sqrt{n}\), we obtain:

$$\begin{aligned} {\mathbb {P}}(C\cap H=\{0\})={\mathbb {P}}(C=\{0\})\ge 1-4\exp \left( -\frac{\epsilon ^2}{64}\right) . \end{aligned}$$
(94)

Case 2 Otherwise, \(d=\gamma _{\epsilon /2}>0\). Applying Theorem 7.1, we find:

$$\begin{aligned} {\mathbb {P}}(C\cap H=\{0\}\big |\delta (C)\ge \gamma _{\epsilon })\le 4\exp \left( -\frac{\epsilon ^2}{64}\right) . \end{aligned}$$
(95)

Next, choosing \(t=\frac{\epsilon }{2}\sqrt{n}\) in (90), we obtain:

$$\begin{aligned} {\mathbb {P}}(C\cap H=\{0\})\ge 1-4\exp \left( -\frac{\epsilon ^2}{64}\right) . \end{aligned}$$
(96)

Overall, combining (92), (93), (94), (95) and (96), we obtain:

$$\begin{aligned} {\mathbb {P}}(\delta (C)\le \gamma _{\epsilon })\ge 1-8\exp \left( -\frac{\epsilon ^2}{64}\right) . \end{aligned}$$

Proving the second statement: In the exact same manner, this time, let \(\gamma _{-\epsilon }=\gamma -\epsilon \sqrt{n}\), \(\gamma _{-\epsilon /2}=\gamma -\frac{\epsilon }{2}\sqrt{n}\). If \(\gamma _{-\epsilon }<0\),

$$\begin{aligned} {\mathbb {P}}(\delta (C)\le \gamma _{-\epsilon })\le {\mathbb {P}}(\delta (C)<0)=0. \end{aligned}$$

Otherwise, letting \(d=\gamma _{-\epsilon /2}\), we may write,

$$\begin{aligned} {\mathbb {P}}(\delta (C)\ge \gamma _{-\epsilon })\ge {\mathbb {P}}(C\cap H\ne \{0\})- {\mathbb {P}}(C\cap H\ne \{0\}\big |\delta (C)\le \gamma _{-\epsilon }) \end{aligned}$$
(97)

in an identical way to (92). Repeating the previous argument and using (91), we may first obtain,

$$\begin{aligned} {\mathbb {P}}(C\cap H\ne \{0\})\ge 1-4\exp \left( -\frac{\epsilon ^2}{64}\right) . \end{aligned}$$

and using Theorem 7.1,

$$\begin{aligned} {\mathbb {P}}(C\cap H\ne \{0\}\big |\delta (C)\le \gamma _{-\epsilon })\le 4\exp \left( -\frac{\epsilon ^2}{64}\right) . \end{aligned}$$

Combining these gives the desired result.

$$\begin{aligned} {\mathbb {P}}(\delta (C)\ge \gamma _{-\epsilon })\ge 1-8\exp \left( -\frac{\epsilon ^2}{64}\right) . \end{aligned}$$

\(\square \)

Proof of Theorem 2.1: Lower Bound

Theorem 14.1

Let \({\mathcal {C}}\) be a closed and convex set, \(\mathbf {v}\sim {\mathcal {N}}(0,{\mathbf{I}})\) and let \(\mathbf {x}^*(\sigma \mathbf {v})=\arg \min _{\mathbf {x}\in {\mathcal {C}}} \Vert \mathbf {x}_0+\sigma \mathbf {v}-\mathbf {x}\Vert _2\). Then, we have,

$$\begin{aligned} \lim _{\sigma \rightarrow 0}\frac{{\mathbb {E}}\left[ \Vert \mathbf {x}^*(\sigma \mathbf {v})-\mathbf {x}_0\Vert _2^2\right] }{\sigma ^2}={\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*). \end{aligned}$$

Proof

Let \(1\ge \alpha ,\epsilon > 0\) be numbers to be determined. Denote the probability density function of a \({\mathcal {N}}(0,c{\mathbf{I}})\) distributed vector by \(p_c(\cdot )\). From Property 15.1, the expected error \({\mathbb {E}}[\Vert \mathbf {x}^*-\mathbf {x}_0\Vert _2^2]\) is simply,

$$\begin{aligned} \int _{\mathbf {v}\in {\mathbb {R}}^n}\Vert \text {Proj}(\sigma \mathbf {v},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2^2p_1(\mathbf {v})d\mathbf {v}. \end{aligned}$$

Let \(S_{\alpha }\) be the set satisfying:

$$\begin{aligned} S_{\alpha }=\left\{ \mathbf {u}\in {\mathbb {R}}^n|\frac{\Vert \text {Proj}(\mathbf {u},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \mathbf {u}\Vert _2}\ge \alpha \right\} . \end{aligned}$$

Let \(\bar{S}_\alpha ={\mathbb {R}}^n-S_\alpha \). Using Proposition 15.1, given \(\epsilon >0\), choose \(\epsilon _0>0\) such that for all \(\Vert \mathbf {u}\Vert _2\le \epsilon _0\) and \(\mathbf {u}\in S_{\alpha }\), we have,

$$\begin{aligned} \Vert \text {Proj}(\mathbf {u},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2\ge (1-\epsilon )\Vert \text {Proj}(\mathbf {u},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2. \end{aligned}$$
(98)

Now, let \(\mathbf {z}=\sigma \mathbf {v}\). Split the error into three groups, namely

  • \(F_1=\int _{\Vert \mathbf {z}\Vert _2\le \epsilon _0,\mathbf {z}\in S_{\alpha }}\Vert \text {Proj}(\mathbf {z},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2^2p_\sigma (\mathbf {z})d\mathbf {z}\),   \(T_1=\int _{\Vert \mathbf {v}\Vert _2\le \frac{\epsilon _0}{\sigma },\mathbf {v}\in S_{\alpha }}\Vert \text {Proj}(\mathbf {v},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2^2p_1(\mathbf {v})d\mathbf {v}\).

  • \(F_2=\int _{\Vert \mathbf {z}\Vert _2\ge \epsilon _0,\mathbf {z}\in S_{\alpha }}\Vert \text {Proj}(\mathbf {z},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2^2p_\sigma (\mathbf {z})d\mathbf {z}\),   \(T_2=\int _{\Vert \mathbf {v}\Vert _2\ge \frac{\epsilon _0}{\sigma },\mathbf {v}\in S_{\alpha }}\Vert \text {Proj}(\mathbf {v},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2^2p_1(\mathbf {v})d\mathbf {v}\).

  • \(F_3=\int _{\mathbf {z}\in \bar{S}_{\alpha }}\Vert \text {Proj}(\mathbf {z},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2^2p_\sigma (\mathbf {z})d\mathbf {z}\),            \(T_3=\int _{\mathbf {v}\in \bar{S}_{\alpha }}\Vert \text {Proj}(\mathbf {v},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2^2p_1(\mathbf {v})d\mathbf {v}\).

The rest of the argument will be very similar to the proof of Proposition 4.2. We know the following from Proposition 3.1:

$$\begin{aligned}&T_1+T_2+T_3={\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)\\&F_1+F_2+F_3={\mathbb {E}}[\Vert \mathbf {x}^*-\mathbf {x}_0\Vert _2^2]\le \sigma ^2(T_1+T_2+T_3). \end{aligned}$$

To proceed, we will argue that the contributions of the second and third terms are small for sufficiently small \(\sigma ,\alpha ,\epsilon >0\). Observe that:

$$\begin{aligned} T_3\le \int _{\mathbf {v}\in \bar{S}_{\alpha }}\alpha ^2\Vert \mathbf {v}\Vert _2^2p_1(\mathbf {v})d\mathbf {v}\le \alpha ^2n. \end{aligned}$$

For \(T_2\), we have:

$$\begin{aligned} T_2\le \int _{\Vert \mathbf {v}\Vert _2\ge \frac{\epsilon _0}{\sigma }}\Vert \mathbf {v}\Vert _2^2p_1(\mathbf {v})d\mathbf {v}=C\left( \frac{\epsilon _0}{\sigma }\right) . \end{aligned}$$

Since \(\Vert \mathbf {v}\Vert _2\) has finite second moment, fixing \(\epsilon _0>0\) and letting \(\sigma \rightarrow 0\), we have \(C\left( \frac{\epsilon _0}{\sigma }\right) \rightarrow 0\). For \(T_1\), from (98), we have:

$$\begin{aligned} F_1\ge (1-\epsilon )^2\sigma ^2T_1. \end{aligned}$$

Overall, we found:

$$\begin{aligned} \frac{{\mathbb {E}}\left[ \Vert \mathbf {x}^*-\mathbf {x}_0\Vert _2^2\right] }{\sigma ^2}\ge \frac{F_1}{\sigma ^2}\ge (1-\epsilon )^2\frac{T_1}{T_1+T_2+T_3}{\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*). \end{aligned}$$

Writing \(T_1={\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)-T_2-T_3\ge {\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)-\alpha ^2n-C\left( \frac{\epsilon _0}{\sigma }\right) \), we have:

$$\begin{aligned} \frac{T_1}{T_1+T_2+T_3}\ge \frac{{\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)-\alpha ^2n-C\left( \frac{\epsilon _0}{\sigma }\right) }{{\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)}. \end{aligned}$$

Letting \(\sigma \rightarrow 0\) for fixed \(\alpha ,\epsilon _0,\epsilon \), we obtain

$$\begin{aligned} \lim _{\sigma \rightarrow 0}\frac{{\mathbb {E}}\left[ \Vert \mathbf {x}^*-\mathbf {x}_0\Vert _2^2\right] }{\sigma ^2{\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)}\ge (1-\epsilon )^2\frac{{\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)-\alpha ^2n}{{\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)}. \end{aligned}$$

Since \(\alpha ,\epsilon \) can be made arbitrarily small, and since \({\mathbb {E}}[\Vert \mathbf {x}^*-\mathbf {x}_0\Vert _2^2]\le \sigma ^2(T_1+T_2+T_3)=\sigma ^2{\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)\) from above, we obtain \(\lim _{\sigma \rightarrow 0}\frac{{\mathbb {E}}[\Vert \mathbf {x}^*-\mathbf {x}_0\Vert _2^2]}{\sigma ^2{\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)}=1\). \(\square \)
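To illustrate Theorem 14.1 numerically, the following sketch (added here for illustration; the choice of \({\mathcal {C}}\) as the nonnegative orthant and all sizes are assumptions of the example) uses a point \(\mathbf {x}_0\) with \(k\) strictly positive entries and \(n-k\) entries on the boundary, for which \({\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)=k+(n-k)/2\).

```python
# Illustrative Monte Carlo check of Theorem 14.1 (not from the paper; sizes are
# arbitrary). For C = the nonnegative orthant and x0 with k strictly positive
# entries, Proj(y, C) = max(y, 0) and D(T_C(x0)^*) = k + (n - k)/2.
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma, trials = 200, 40, 1e-3, 20_000

x0 = np.zeros(n)
x0[:k] = 1.0          # k coordinates strictly inside C, the rest on the boundary

errs = np.empty(trials)
for t in range(trials):
    y = x0 + sigma * rng.standard_normal(n)
    errs[t] = np.sum((np.maximum(y, 0.0) - x0) ** 2)   # x*(sigma*v) = Proj(y, C)

print(errs.mean() / sigma**2, "vs", k + (n - k) / 2)   # close for small sigma
```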

The next result shows that, as \(\sigma \rightarrow 0\), we can exactly predict the cost of the constrained problem.

Proposition 14.1

Consider the setup in Theorem 14.1. Let \({\mathbf {w}}^*(\sigma \mathbf {v})=\mathbf {x}^*(\sigma \mathbf {v})-\mathbf {x}_0\). Then,

$$\begin{aligned} \lim _{\sigma \rightarrow 0}\frac{{\mathbb {E}}\left[ \Vert \sigma \mathbf {v}-{\mathbf {w}}^*(\sigma \mathbf {v})\Vert _2^2\right] }{\sigma ^2}={\mathbf{D}}(T_{\mathcal {C}}(\mathbf {x}_0)). \end{aligned}$$

Proof

Let \({\mathbf {w}}^*={\mathbf {w}}^*(\sigma \mathbf {v})\) and \(\mathbf {z}=\sigma \mathbf {v}\). \(\mathbf {z}-{\mathbf {w}}^*\) satisfies two conditions.

  • From Lemma 15.1, \(\Vert \mathbf {z}-{\mathbf {w}}^*\Vert _2=\text {dist}(\mathbf {z},F_{\mathcal {C}}(\mathbf {x}_0))\ge \text {dist}(\mathbf {z},T_{\mathcal {C}}(\mathbf {x}_0))\).

  • Using Lemma 15.4, \(\Vert \mathbf {z}-{\mathbf {w}}^*\Vert _2^2+\Vert {\mathbf {w}}^*\Vert _2^2\le \Vert \mathbf {z}\Vert _2^2\).

Consequently, when \(\mathbf {v}\sim {\mathcal {N}}(0,{\mathbf{I}})\), we find:

$$\begin{aligned} n\sigma ^2= & {} {\mathbb {E}}\left[ \Vert \mathbf {z}\Vert _2^2\right] \ge {\mathbb {E}}\left[ \Vert \mathbf {z}-{\mathbf {w}}^*\Vert _2^2\right] +{\mathbb {E}}\left[ \Vert {\mathbf {w}}^*\Vert _2^2\right] \\&\ge \sigma ^2{\mathbb {E}}\left[ \Vert \text {Proj}(\mathbf {v},T_{\mathcal {C}}(\mathbf {x}_0)^*)\Vert _2^2\right] +{\mathbb {E}}\left[ \Vert {\mathbf {w}}^*\Vert _2^2\right] . \end{aligned}$$

Normalizing by \(\sigma ^2\) and subtracting \({\mathbf{D}}(T_{\mathcal {C}}(\mathbf {x}_0))={\mathbb {E}}[\Vert \text {Proj}(\mathbf {v},T_{\mathcal {C}}(\mathbf {x}_0)^*)\Vert _2^2]\) and \({\mathbb {E}}[\Vert {\mathbf {w}}^*\Vert _2^2]/\sigma ^2\) from both sides, we find:

$$\begin{aligned} {\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)-\frac{{\mathbb {E}}\left[ \Vert {\mathbf {w}}^*\Vert _2^2\right] }{\sigma ^2}\ge \frac{{\mathbb {E}}\left[ \Vert \mathbf {z}-{\mathbf {w}}^*\Vert _2^2\right] }{\sigma ^2}-{\mathbf{D}}(T_{\mathcal {C}}(\mathbf {x}_0))\ge 0 \end{aligned}$$

where we used Lemma 10.2. Now, letting \(\sigma \rightarrow 0\) and using the fact that \(\lim _{\sigma \rightarrow 0}\frac{{\mathbb {E}}[\Vert {\mathbf {w}}^*\Vert _2^2]}{\sigma ^2}={\mathbf {D}}(T_{\mathcal {C}}(\mathbf {x}_0)^*)\), we find the desired result. \(\square \)

Approximation Results on Convex Cones

Remark

Throughout the section, \({\mathcal {C}}\) is assumed to be a nonempty, closed and convex set in \({\mathbb {R}}^n\).

1.1 Standard Observations

Property 15.1

Let \(\mathbf {x}_0\in {\mathcal {C}}\) and \(\mathbf {y}=\mathbf {x}_0+\mathbf {z}\in {\mathbb {R}}^n\). From (86) in Fact 10.2, recall that \(\text {Proj}(\mathbf {y},{\mathcal {C}})\) is the unique vector equal to \(\arg \min _{\mathbf {u}\in {\mathcal {C}}}\Vert \mathbf {y}-\mathbf {u}\Vert _2\). By the definition of the feasible set \(F_{\mathcal {C}}(\mathbf {x}_0)\), we also have \(\text {Proj}(\mathbf {y},{\mathcal {C}})-\mathbf {x}_0=\text {Proj}(\mathbf {z},F_{\mathcal {C}}(\mathbf {x}_0))\).

Lemma 15.1

For all \(\mathbf {z}\in {\mathbb {R}}^n\) and \(\mathbf {x}_0\in {\mathcal {C}}\), we have

$$\begin{aligned} \Vert \text {Proj}(\mathbf {z},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2\le \Vert \text {Proj}(\mathbf {z},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2. \end{aligned}$$

Proof

Setting \(f(\cdot )=0\) and \(\mathbf {y}=\mathbf {x}_0+\mathbf {z}\) in Lemma 3.2, we have

$$\begin{aligned} \Vert \text {Proj}(\mathbf {z},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2=\Vert \mathbf {x}^*-\mathbf {x}_0\Vert _2\le \text {dist}(\mathbf {z},T_{\mathcal {C}}(\mathbf {x}_0)^*)=\Vert \text {Proj}(\mathbf {z},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2. \end{aligned}$$

\(\square \)

The following lemma shows that projection onto the feasible cone is arbitrarily close to the projection onto the tangent cone as we scale down the vector. This is due to Proposition 5.3.5 of Chapter III of [48].

Lemma 15.2

Assume \(\mathbf {x}_0\in {\mathcal {C}}\). Then, for any \({\mathbf {w}}\in {\mathbb {R}}^n\),

$$\begin{aligned} \lim _{\epsilon \rightarrow 0} \frac{\text {Proj}(\epsilon {\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))}{\epsilon }= \text {Proj}({\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0)). \end{aligned}$$

Hence,

  • If \(\text {Proj}({\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))=0\), then, using Lemma 15.1, \(\text {Proj}(\epsilon {\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))=0\) for all \(\epsilon >0\).

  • If \(\text {Proj}({\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\ne 0\),

    $$\begin{aligned} \lim _{\epsilon \rightarrow 0}\frac{\Vert \text {Proj}(\epsilon {\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(\epsilon {\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}=1. \end{aligned}$$

1.2 Uniform Approximation to the Tangent Cone

Proposition 15.1

Let \({\mathcal {C}}\) be a closed and convex set containing \(\mathbf {x}_0\). Denote the unit \(\ell _2\)-sphere in \({\mathbb {R}}^n\) by \({\mathcal {S}}^{n-1}\) and let \(1\ge \alpha >0\) be arbitrary. Given \(\alpha ,\epsilon >0\), there exists an \(\epsilon _0>0\) such that for all \({\mathbf {w}}\in {\mathcal {S}}^{n-1}\) with \(\Vert \text {Proj}({\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2\ge \alpha \) and all \(0<t\le \epsilon _0\), we have:

$$\begin{aligned} \frac{\Vert \text {Proj}(t{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{t\Vert \text {Proj}({\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}\ge 1-\epsilon . \end{aligned}$$
(99)

In particular, setting \(\alpha =1\), given \(\epsilon >0\), there exists \(\epsilon _0>0\) such that, for all \(t\le \epsilon _0\) and all \({\mathbf {w}}\in T_{\mathcal {C}}(\mathbf {x}_0)\cap {\mathcal {S}}^{n-1}\), \(\Vert \text {Proj}(t{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2\ge (1-\epsilon )t\).

Remark

Note that the statements of Propositions 15.1 and 4.1 are quite similar.

Proof

Given \(\alpha >0\), consider the following set:

$$\begin{aligned} S=\left\{ {\mathbf {w}}\in {\mathcal {S}}^{n-1}\Big |\Vert \text {Proj}({\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2\ge \alpha \right\} . \end{aligned}$$

This set is closed and bounded and hence compact. Define the following function on this set

$$\begin{aligned} c({\mathbf {w}})=\max \left\{ c>0 \left| \frac{\Vert \text {Proj}(c{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c{\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}\ge 1-\epsilon \right. \right\} . \end{aligned}$$

\(c({\mathbf {w}})\) is strictly positive due to Lemma 15.2, and it may be infinite. Furthermore, from Lemma 15.3, we know that whenever \(c<c({\mathbf {w}})\),

$$\begin{aligned} \frac{\Vert \text {Proj}(c{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c{\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}\ge 1-\epsilon \end{aligned}$$

as well. Let \(s({\mathbf {w}})=\min \{1,c({\mathbf {w}})\}\). If \(s({\mathbf {w}})\) is continuous, then, since \(S\) is compact, \(s({\mathbf {w}})\) attains its minimum over \(S\), which implies \(c({\mathbf {w}})\ge s({\mathbf {w}})\ge \epsilon _0>0\) for some \(\epsilon _0>0\). This, in turn, implies that for all \({\mathbf {w}}\in S\) and \(0<t\le \epsilon _0\),

$$\begin{aligned} \frac{\Vert \text {Proj}(t{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{t\Vert \text {Proj}({\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}\ge 1-\epsilon . \end{aligned}$$

To end the proof, we will show continuity of \(s({\mathbf {w}})\).

Claim \(s({\mathbf {w}})\) is continuous.

Proof

We will show that \(\lim _{{\mathbf {w}}_2\rightarrow {\mathbf {w}}_1}s({\mathbf {w}}_2)=s({\mathbf {w}}_1)\). To do this, we will make use of the continuity of the functions \(\Vert \text {Proj}(c_1{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2\), \(\Vert \text {Proj}(c_1{\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2\) and \(\frac{\Vert \text {Proj}(c_1{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c_1{\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}\) when the denominator is nonzero. Given \({\mathbf {w}}_1\), let \(c_1=\min \{2,c({\mathbf {w}}_1)\}\).

Case 1 If \(\frac{\Vert \text {Proj}(c_1{\mathbf {w}}_1,F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c_1{\mathbf {w}}_1,T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}>1-\epsilon \), then \(c({\mathbf {w}}_1)>2\), and for all \({\mathbf {w}}_2\) sufficiently close to \({\mathbf {w}}_1\), \(\frac{\Vert \text {Proj}(c_1{\mathbf {w}}_2,F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c_1{\mathbf {w}}_2,T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}\) is more than \(1-\epsilon \); hence \(c({\mathbf {w}}_2)\ge 2>1\). Hence, \(s({\mathbf {w}}_2)=s({\mathbf {w}}_1)=1\).

Case 2 Now, assume \(\frac{\Vert \text {Proj}(c_1{\mathbf {w}}_1,F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c_1{\mathbf {w}}_1,T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}=1-\epsilon \), which implies \(c_1=c({\mathbf {w}}_1)\). Using the “strict decrease” part of Lemma 15.3, for any \(\epsilon '>0\) and \(c'=c_1-\epsilon '\), \(\frac{\Vert \text {Proj}(c'{\mathbf {w}}_1,F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c'{\mathbf {w}}_1,T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}>1-\epsilon \). Then, for \({\mathbf {w}}_2\) sufficiently close to \({\mathbf {w}}_1\), \(\frac{\Vert \text {Proj}(c'{\mathbf {w}}_2,F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c'{\mathbf {w}}_2,T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}>1-\epsilon \), which implies \(c({\mathbf {w}}_2)\ge c'\). Hence, \(c({\mathbf {w}}_2)\ge c_1-\epsilon '\) for arbitrarily small \(\epsilon '>0\). Conversely, for any \(\epsilon '>0\) and \(c'=c_1+\epsilon '\), \(\frac{\Vert \text {Proj}(c'{\mathbf {w}}_1,F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c'{\mathbf {w}}_1,T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}<1-\epsilon \). Then, for \({\mathbf {w}}_2\) sufficiently close to \({\mathbf {w}}_1\), \(\frac{\Vert \text {Proj}(c'{\mathbf {w}}_2,F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\Vert \text {Proj}(c'{\mathbf {w}}_2,T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}<1-\epsilon \), which implies \(c({\mathbf {w}}_2)\le c'\). Hence, \(c({\mathbf {w}}_2)\le c_1+\epsilon '\) for arbitrarily small \(\epsilon '>0\). Combining these, we obtain \(c({\mathbf {w}}_2)\rightarrow c({\mathbf {w}}_1)\) as \({\mathbf {w}}_2\rightarrow {\mathbf {w}}_1\). This also implies \(s({\mathbf {w}}_2)\rightarrow s({\mathbf {w}}_1)\). \(\square \)

This finishes the proof of the main statement (99). For the \(\alpha =1\) case, observe that \(\Vert {\mathbf {w}}\Vert _2=1\) and \(\Vert \text {Proj}({\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2=1\) imply \({\mathbf {w}}\in T_{\mathcal {C}}(\mathbf {x}_0)\). \(\square \)

Lemma 15.3

Let \(\mathbf {x}_0\in {\mathcal {C}}\), let \({\mathbf {w}}\) have unit \(\ell _2\)-norm, and set \(l_T=\Vert \text {Proj}({\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\Vert _2\). Define the function,

$$\begin{aligned} g(t)={\left\{ \begin{array}{ll} \frac{\Vert \text {Proj}(t{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{t}~\text {for}~t>0\\ l_T~\text {for}~t=0\end{array}\right. }. \end{aligned}$$

Then, \(g(\cdot )\) is continuous and nonincreasing on \([0,\infty )\). Furthermore, it is strictly decreasing on the interval \([t_0,\infty )\) where \(t_0=\sup _{t} \{t>0\big |g(t)=l_T\}\).

Proof

Due to Lemma 15.1, \(g(t)\le l_T\) and from Lemma 15.2, the function is continuous at 0. Continuity at \(t\ne 0\) follows from the continuity of the projection (see Fact 10.2). Next, if \(g(t)=l_T\), using the fact that \(F_{\mathcal {C}}(\mathbf {x}_0)\) contains 0, the second statement of Lemma 15.4 gives,

$$\begin{aligned} \text {Proj}(t{\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))=\text {Proj}(t{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\in F_{\mathcal {C}}(\mathbf {x}_0). \end{aligned}$$

From convexity, \(\text {Proj}(t'{\mathbf {w}},T_{\mathcal {C}}(\mathbf {x}_0))\in F_{\mathcal {C}}(\mathbf {x}_0)\) for all \(0\le t'\le t\). Hence, \(g(t')=l_T\). This implies \(g(t)=l_T\) for \(t\le t_0\).

Now, assume \(t_1>t_0\) and \(0<t_2<t_1\). Then, \(g(t_1)<l_T\), and hence, the third statement of Lemma 15.4 applies. Setting \(\alpha =\frac{t_2}{t_1}\) in Lemma 15.4, we find,

$$\begin{aligned} \Vert \text {Proj}(t_1{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2<\frac{\Vert \text {Proj}(t_2{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{\frac{t_2}{t_1}}, \end{aligned}$$

which implies the strict decrease of \(\frac{\Vert \text {Proj}(t{\mathbf {w}},F_{\mathcal {C}}(\mathbf {x}_0))\Vert _2}{t}\) over \(t\ge t_0\). \(\square \)
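
As a further sanity check (illustrative only, under the same assumption \(F_{\mathcal {C}}(\mathbf {x}_0)={\mathcal {C}}-\mathbf {x}_0\) used in the earlier sketches), the monotone behavior of \(g(t)\) can be observed numerically when \({\mathcal {C}}\) is the unit ball and \(\mathbf {x}_0\) lies on its boundary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
x0 = rng.standard_normal(n)
x0 /= np.linalg.norm(x0)                 # boundary point of the unit ball C

def proj_ball(v):
    nv = np.linalg.norm(v)
    return v if nv <= 1.0 else v / nv

def proj_F(z):                           # Proj(z, C - x0), assuming F_C(x0) = C - x0
    return proj_ball(z + x0) - x0

w = rng.standard_normal(n)
w /= np.linalg.norm(w)
ts = np.linspace(0.05, 5.0, 25)
g = np.array([np.linalg.norm(proj_F(t * w)) / t for t in ts])
assert np.all(np.diff(g) <= 1e-9), "g(t) should be nonincreasing in t"
print(np.round(g, 4))
```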

For the rest of the discussion, given three points \(A\), \(B\), \(C\) in \({\mathbb {R}}^n\), the angle induced by the lines \(AB\) and \(BC\) will be denoted by \(A\hat{B}C\).

Lemma 15.4

Let \({\mathcal {K}}\) be a convex and closed set in \({\mathbb {R}}^n\) that includes 0. Let \(\mathbf {z}\in {\mathbb {R}}^n\) and \(0<\alpha <1\) be arbitrary, and let \(\mathbf {p}_1=\text {Proj}(\mathbf {z},{\mathcal {K}})\) and \(\mathbf {p}_2=\text {Proj}(\alpha \mathbf {z},{\mathcal {K}})\). Denote the points whose coordinates are given by \(0,\mathbf {p}_1,\mathbf {p}_2,\mathbf {z}\) by \(O, P_1, P_2\) and \(Z\), respectively. Then,

  • \(Z\hat{P_1}O\) is either a wide or a right angle.

  • If \(Z\hat{P_1}O\) is a right angle, then \(\mathbf {p}_1=\frac{\mathbf {p}_2}{\alpha }=\text {Proj}(\mathbf {z},T_{\mathcal {K}}(0))\).

  • If \(Z\hat{P_1}O\) is a wide angle, then \(\Vert \mathbf {p}_1\Vert _2<\frac{\Vert \mathbf {p}_2\Vert _2}{\alpha }\le \Vert \text {Proj}(\mathbf {z},T_{\mathcal {K}}(0))\Vert _2\).

Proof

Acute angle: Assume \(Z\hat{P_1}O\) is an acute angle. If \(Z\hat{O}P_1\) is a right or wide angle, then \(O\) is closer to \(\mathbf {z}\) than \(\mathbf {p}_1\), which is a contradiction. If \(Z\hat{O}P_1\) is an acute angle, then draw the perpendicular from \(Z\) to the line \(OP_1\). Its foot lies on the segment \(OP_1\subseteq {\mathcal {K}}\) by convexity and is closer to \(\mathbf {z}\) than \(\mathbf {p}_1\), which again is a contradiction.

Right angle: Now, assume \(Z\hat{P_1}O\) is a right angle. Using Fact 10.2, there exists a hyperplane \(H\) passing through \(P_1\), perpendicular to \(\mathbf {z}-\mathbf {p}_1\), that separates \(\mathbf {z}\) and \({\mathcal {K}}\). The line \(P_1O\) lies on \(H\). Consequently, for any \(\alpha \in [0,1]\), the closest point to \(\alpha \mathbf {z}\) over \({\mathcal {K}}\) is simply \(\alpha \mathbf {p}_1\). Hence, \(\mathbf {p}_2=\alpha \mathbf {p}_1\). Now, let \(\mathbf {q}_1:=\text {Proj}(\mathbf {z},T_{\mathcal {K}}(0))\). Then, \(\text {Proj}(\alpha \mathbf {z},T_{\mathcal {K}}(0))=\alpha \mathbf {q}_1\). If \(\mathbf {q}_1\ne \mathbf {p}_1\), then \(\Vert \mathbf {q}_1\Vert _2>\Vert \mathbf {p}_1\Vert _2\), since \(\Vert \mathbf {z}-\mathbf {q}_1\Vert _2<\Vert \mathbf {z}-\mathbf {p}_1\Vert _2\) and:

$$\begin{aligned} \Vert \mathbf {q}_1\Vert _2^2=\Vert \mathbf {z}\Vert _2^2-\Vert \mathbf {z}-\mathbf {q}_1\Vert _2^2> \Vert \mathbf {z}\Vert _2^2-\Vert \mathbf {z}-\mathbf {p}_1\Vert _2^2\ge \Vert \mathbf {p}_1\Vert _2^2 \end{aligned}$$

where the last inequality follows from the fact that \(Z\hat{P_1}O\) is not acute. Then,

$$\begin{aligned} \lim _{\alpha \rightarrow 0}\frac{\Vert \text {Proj}(\alpha \mathbf {z},T_{\mathcal {K}}(0))\Vert _2}{\Vert \text {Proj}(\alpha \mathbf {z},{\mathcal {K}})\Vert _2}=\frac{\Vert \mathbf {q}_1\Vert _2}{\Vert \mathbf {p}_1\Vert _2}>1, \end{aligned}$$

which contradicts Lemma 15.2. Hence, \(\mathbf {q}_1=\mathbf {p}_1\).

Wide angle: Finally, assume \(Z\hat{P_1}O\) is a wide angle. We start by reducing the problem to a two-dimensional one. Obtain \({\mathcal {K}}'\) by projecting the set \({\mathcal {K}}\) onto the 2D plane induced by the points \(Z, P_1\) and \(O\). Now, let \(\mathbf {p}_2'=\text {Proj}(\alpha \mathbf {z},{\mathcal {K}}')\). Due to the projection, we still have

$$\begin{aligned} \Vert \alpha \mathbf {z}-\mathbf {p}_2'\Vert _2\le \Vert \alpha \mathbf {z}-\mathbf {p}_2\Vert _2\le \Vert \alpha \mathbf {z}-\alpha \mathbf {p}_1\Vert _2 \end{aligned}$$

and \(\Vert \mathbf {p}_2'\Vert _2\le \Vert \mathbf {p}_2\Vert _2\). Next, we will prove that \(\Vert \mathbf {p}_2'\Vert _2>\Vert \alpha \mathbf {p}_1\Vert _2\) to conclude. Figure 7 will help us explain our approach. Let the line \(UP_1\) be perpendicular to \(ZP_1\), and assume it crosses \(ZO\) at \(S\). Let \(P'Z'\) be parallel to \(P_1Z\). Observe that \(P'\) corresponds to \(\alpha \mathbf {p}_1\) and \(Z'\) corresponds to \(\alpha \mathbf {z}\). \(H\) is the intersection of \(P'Z'\) and \(P_1U\). Denote the point corresponding to \(\mathbf {p}_2'\) by \(P_2'\). Observe that \(P_2'\) satisfies the following:

  • \(P_1\) is the closest point to \(Z\) in \({\mathcal {K}}\); hence, \(P_2'\) lies to the left of \(P_1U\) (on the same side as \(O\)).

  • \(P_2\) is the closest point to \(Z'\) in \({\mathcal {K}}\). Hence, \(Z'\hat{P_2}P_1\) is not an acute angle; otherwise, we could draw a perpendicular from \(Z'\) to \(P_2P_1\) and end up with a shorter distance. This also implies that \(Z'\hat{P_2'}P_1\) is not acute. The reason is that, due to the projection, \(|Z'P_2'|\le |Z'P_2|\) and \(|P_2'P_1|\le |P_2P_1|\), and hence,

    $$\begin{aligned} |Z'P_1|^2\ge |Z'P_2|^2+|P_2P_1|^2\ge |Z'P_2'|^2+|P_2'P_1|^2. \end{aligned}$$
    (100)
  • \(P_2'\) has to lie below or on the line \(OP_1\); otherwise, the perpendicular from \(Z'\) to \(OP_1\) would yield a shorter distance than \(|P_2'Z'|\).

  • \(\mathbf {p}_2\ne \alpha \mathbf {p}_1\). To see this, note that \(Z'\hat{P'}O\) is a wide angle. Let \(\mathbf {q}\in {\mathbb {R}}^n\) be the projection of \(\alpha \mathbf {z}\) onto the line \(\{c\mathbf {p}_1\big |c\in {\mathbb {R}}\}\), and let the point \(Q\) denote the vector \(\mathbf {q}\). If \(Q\) lies between \(O\) and \(P_1\), then \(\mathbf {q}\in {\mathcal {K}}\) and \(|QZ'|<|P'Z'|\). Otherwise, \(P_1\) lies between \(Q\) and \(P'\), and hence \(|P_1Z'|<|P'Z'|\) and \(\mathbf {p}_1\in {\mathcal {K}}\). This implies \(P_2,P_2'\ne P'\).

Based on these observations, we investigate the problem in two cases illustrated in Fig. 7.

Fig. 7 Possible configurations of the points in Lemma 15.4

Case 1 (S lies on \(Z'Z\)): Consider the left-hand side of Fig. 7. If \(P_2'\) lies on the right-hand side of \(P'U\), this implies \(|P_2'O|> |P'O|\) which is what we wanted.

If \(P_2'\) lies in the region induced by \(OP'TT'\), then \(P_1\hat{P}_2'Z'\) is an acute angle (since \(P_1\hat{Z}'P_2'>P_1\hat{Z}'P'\), which is wide), contradicting (100).

If \(P_2'\) lies in the remaining region \(T'TU\), then \(Z'\hat{P}_2'P_1\) is acute, which again contradicts (100). The reason is that \(P'_2\hat{Z}'P_1\) is wide, as follows:

$$\begin{aligned} P'_2\hat{Z}'P_1\ge P'_2\hat{T}P_1\ge U\hat{T}P_1>U\hat{P}'P_1=\frac{\pi }{2}. \end{aligned}$$

Case 2 (S lies on \(OZ'\)): Consider the right-hand side of Fig. 7. Due to location restrictions, \(P_2'\) lies either in the triangle \(P_1P'H\) or in the region induced by \(OP'HU\). If it lies in \(P_1P'H\), then \(O\hat{P'}P_2'\ge O\hat{P'}H\) (thus wide), which implies \(|OP_2'|>|OP'|\), since \(O\hat{P}'P_2'\) is a wide angle and \(P'\ne P_2'\).

If \(P_2'\) lies in \(OP'HU\), then \(P_1\hat{P}_2'Z'<P_1\hat{H}Z'=\frac{\pi }{2}\); hence, \(P_1\hat{P}'_2Z'\) is an acute angle, which contradicts (100).

In all admissible cases, we end up with \(|OP_2'|>|OP'|\), which implies \(\Vert \mathbf {p}_2\Vert _2\ge \Vert \mathbf {p}_2'\Vert _2>\alpha \Vert \mathbf {p}_1\Vert _2\), as desired.

Finally, apply Lemma 15.1 to \(\alpha \mathbf {z}\) (with \(\mathbf {x}_0=0\) and \({\mathcal {C}}={\mathcal {K}}\)) to upper-bound \(\Vert \mathbf {p}_2\Vert _2\) by \(\Vert \text {Proj}(\alpha \mathbf {z},T_{\mathcal {K}}(0))\Vert _2=\alpha \Vert \text {Proj}(\mathbf {z},T_{\mathcal {K}}(0))\Vert _2\). \(\square \)
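
The chain of inequalities asserted by Lemma 15.4 can also be verified numerically for a convex set whose projections are explicit. The sketch below is an illustration only: it takes \({\mathcal {K}}\) to be a unit ball whose boundary passes through the origin, so that \(T_{\mathcal {K}}(0)\) is the halfspace \(\{{\mathbf {w}}:\langle {\mathbf {w}},\mathbf {c}\rangle \ge 0\}\), where \(\mathbf {c}\) is the center; the sample sizes and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
c = rng.standard_normal(n)
c /= np.linalg.norm(c)                   # K = unit ball centered at c; 0 lies on its boundary

def proj_K(v):                           # Proj(v, K)
    d = v - c
    nd = np.linalg.norm(d)
    return v if nd <= 1.0 else c + d / nd

def proj_T(v):                           # Proj(v, T_K(0)), with T_K(0) = {w : <w, c> >= 0}
    return v - min(np.dot(v, c), 0.0) * c

for _ in range(1000):
    z = 3.0 * rng.standard_normal(n)
    alpha = rng.uniform(0.05, 0.95)
    p1, p2 = proj_K(z), proj_K(alpha * z)
    lhs, mid, rhs = np.linalg.norm(p1), np.linalg.norm(p2) / alpha, np.linalg.norm(proj_T(z))
    assert lhs <= mid + 1e-9 and mid <= rhs + 1e-9
print("||p1|| <= ||p2||/alpha <= ||Proj(z, T_K(0))|| held for all sampled (z, alpha)")
```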
