High-Probability Complexity Bounds for Non-smooth Stochastic Convex Optimization with Heavy-Tailed Noise


Abstract

Stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide small objective residuals with high probability. For existing methods for non-smooth stochastic convex optimization, the dependence of the complexity bounds on the confidence level is either negative-power, or logarithmic but only under an additional assumption of sub-Gaussian (light-tailed) noise, which may not hold in practice. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with Hölder-continuous gradients, and for both methods, we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all the regimes, and the second one is optimal in the non-smooth setting.


Data Availability Statement

The code for the numerical experiments conducted in this paper is publicly available at https://github.com/ClippedStochasticMethods/clipped-SSTM.

Notes

  1. Our proofs work for any \(x^*\). In particular, one can choose \(x^*\) to be the projection of \(x^0\) onto the solution set.

  2. By default, we always write “gradients”, though our analysis also works for non-differentiable convex functions (when \(\nu = 0\)): wherever a gradient is used, it suffices to take any subgradient at that point. This remark also applies to Definition 1.1.

  3. It is also worth mentioning that some functions have Hölder continuous gradients for multiple \(\nu \) simultaneously [31]. Therefore, if constants \(M_\nu \) are available, one can choose the best \(\nu \) in terms of the iteration/oracle complexity of a method.

  4. Our proofs are valid for any solution \(x^*\) and, for example, one can take as \(x^*\) the closest solution to the starting point \(x^0\).

  5. Our proofs are valid for any solution \(x^*\) and, for example, one can take \(x^*\) as the closest solution to the starting point \(x^0\).

  6. The choice of the parameters (in this and the following results) is dictated by the need to estimate and control the stochastic error in the proofs. If some of the parameters (such as \(\nu , R_0, M_\nu , \sigma \)) are unknown, one can directly tune the parameters \(\alpha , a, m_k\). To satisfy (26) and (27), it is sufficient to choose a sufficiently large a (or, alternatively, a sufficiently small \(\varepsilon \)).

  7. To achieve \(f(\bar{x}^N) - f(x^*) \le \varepsilon \) it is sufficient to take N such that \(\frac{9 C^2 R_0 \sigma \sqrt{\ln \tfrac{4N}{\beta }}}{\sqrt{N}} \le \varepsilon \). Solving this inequality w.r.t. N, we get that it is sufficient to take N such that \(N \ge \frac{81C^4\sigma ^2 R_0^2 \ln \frac{4N}{\beta }}{\varepsilon ^2}\), e.g., \(N = \Bigg \lceil \frac{162C^4\sigma ^2 R_0^2 \ln \left( \frac{648C^4\sigma ^2 R_0^2}{\varepsilon ^2\beta }\right) }{\varepsilon ^2}\Bigg \rceil \) satisfies this inequality.

  8. For \(p \in (1,2]\), the function \(f_{i,p}(x)\) is differentiable with \(\nabla f_{i,p}(x) = p|a_i^\top x - y_i|^{p-1} \textrm{sign}(a_i^\top x - y_i)a_i \), and for \(p = 1\) it has the subdifferential \(\partial f_{i,p}(x) = \left\{ \begin{array}{lll} a_i,& \text {if } a_i^\top x - y_i > 0,\\ {[}-a_i, a_i],& \text {if } a_i^\top x - y_i = 0,\\ -a_i,& \text {if } a_i^\top x - y_i < 0. \end{array}\right. \) A small code sketch of this formula is given right after these notes.

  9. We conduct these experiments to illustrate that clipped-SSTM and clipped-SGD might be useful even for problems that are not theoretically studied in this paper. Since [16] does not provide numerical experiments with clipped-SSTM on the training of neural networks, our experiments are the first to show the behavior of clipped-SSTM on the considered tasks.

  10. Following standard practice, we use coordinate-wise clipping in clipped-SGD [44]. In preliminary experiments, we also tried norm clipping for clipped-SGD, but it showed worse results than the coordinate-wise variant. Our analysis can be generalized to the case of coordinate-wise clipping if we assume boundedness of the coordinate-wise variance \(\sigma _{c}^2\) of the stochastic gradients. Then, the result of Lemma 4.2 holds with \(\sigma ^2 = n\sigma _c^2\), and the norm of the clipped vector is bounded by \(\sqrt{n}\lambda \). These changes lead to an explicit dependence on the dimension in the complexity bounds, similarly to [44].

  11. When f is not differentiable, we use subgradients. In this case, 0 belongs to the subdifferential of f at the point \(x^*\), and we take it as \(\nabla f(x^*)\).
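As a concrete illustration of the (sub)gradient formula in Footnote 8, the following minimal sketch computes a (sub)gradient of \(f_{i,p}(x) = |a_i^\top x - y_i|^p\); the function name and the NumPy-based setup are ours, not taken from the authors' repository.

```python
import numpy as np

def grad_f_ip(a_i, y_i, x, p):
    """Return a (sub)gradient of f_{i,p}(x) = |a_i^T x - y_i|^p.

    For p in (1, 2] this is the gradient p |a_i^T x - y_i|^{p-1} sign(.) a_i;
    for p = 1 and a_i^T x = y_i any element of the segment [-a_i, a_i] is a
    valid subgradient, and we return the zero vector for simplicity.
    """
    r = float(a_i @ x) - y_i
    if p > 1:
        return p * abs(r) ** (p - 1) * np.sign(r) * a_i
    return np.sign(r) * a_i  # p = 1: sign(0) = 0 picks 0 from [-a_i, a_i]
```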

References

  1. Bennett, G.: Probability inequalities for the sum of independent random variables. J. Am. Stat. Assoc. 57(297), 33–45 (1962)


  2. Chaux, C., Combettes, P.L., Pesquet, J.-C., Wajs, V.R.: A variational formulation for frame-based inverse problems. Inverse Prob. 23(4), 1495–1518 (2007)


  3. Davis, D., Drusvyatskiy, D., Xiao, L., Zhang, J.: From low probability to high confidence in stochastic convex optimization. J. Mach. Learn. Res. 22(49), 1–38 (2021)


  4. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics (2019)

  5. Devolder, O.: Exactness, inexactness and stochasticity in first-order methods for large-scale convex optimization. PhD thesis, UCLouvain (2013)

  6. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1), 37–75 (2014)


  7. Dvurechensky, P., Gasnikov, A.: Stochastic intermediate gradient method for convex problems with stochastic inexact oracle. J. Optim. Theory Appl. 171(1), 121–145 (2016)


  8. Dzhaparidze, K., Van Zanten, J.H.: On Bernstein-type inequalities for martingales. Stoch. Process. Appl. 93(1), 109–117 (2001)

  9. Freedman, D.A.: On tail probabilities for martingales. Ann. Probab. 3(1), 100–118 (1975)

  10. Gasnikov, A.V., Nesterov, Y.E.: Universal method for stochastic composite optimization problems. Comput. Math. Math. Phys. 58, 48–64 (2018)


  11. Gasnikov, A.V., Nesterov, Y.E., Spokoiny, V.G.: On the efficiency of a randomized mirror descent algorithm in online optimization problems. Comput. Math. Math. Phys. 55(4), 580–596 (2015)


  12. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1243–1252 (2017)

  13. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012)


  14. Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)


  15. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016). http://www.deeplearningbook.org

  16. Gorbunov, E., Danilova, M., Gasnikov, A.: Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 15042–15053. Curran Associates Inc (2020)


  17. Gorbunov, E., Danilova, M., Shibaev, I., Dvurechensky, P., Gasnikov, A.: High probability complexity bounds for non-smooth stochastic optimization with heavy-tailed noise. arXiv preprint arXiv:2106.05958 (2021)

  18. Gower, R.M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., Richtárik, P.: SGD: general analysis and improved rates. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5200–5209. PMLR (2019)

  19. Guigues, V., Juditsky, A., Nemirovski, A.: Non-asymptotic confidence bounds for the optimal value of a stochastic program. Optim. Methods Softw. 32(5), 1033–1058 (2017)


  20. Guzmán, C., Nemirovski, A.: On lower complexity bounds for large-scale smooth convex optimization. J. Complex. 31(1), 1–14 (2015)


  21. Hazan, E., Levy, K., Shalev-Shwartz, S.: Beyond convexity: stochastic quasi-convex optimization. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc., Red Hook (2015)


  22. Juditsky, A., Nemirovski, A.: First order methods for nonsmooth convex large-scale optimization, I: general purpose methods. In: Optimization for Machine Learning, pp. 121–148 (2011)

  23. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

  24. Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1–2), 365–397 (2012)


  25. Mai, V.V., Johansson, M.: Stability and convergence of stochastic gradient clipping: beyond Lipschitz continuity and smoothness. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7325–7335. PMLR (2021)

  26. Menon, A.K., Rawat, A.S., Reddi, S.J., Kumar, S.: Can gradient clipping mitigate label noise? In: International Conference on Learning Representations (2020)

  27. Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 451–459. Curran Associates, Inc. (2011)


  28. Nazin, A.V., Nemirovsky, A.S., Tsybakov, A.B., Juditsky, A.B.: Algorithms of robust stochastic optimization based on mirror descent method. Autom. Remote. Control. 80(9), 1607–1627 (2019)


  29. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)


  30. Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication, Wiley, New York (1983)

  31. Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152(1–2), 381–404 (2015)


  32. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate O\((1/k^2)\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)


  33. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1310–1318. PMLR, Atlanta (2013)

  34. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019)

  35. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  36. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015)

  37. Sadiev, A., Danilova, M., Gorbunov, E., Horváth, S., Gidel, G., Dvurechensky, P., Gasnikov, A., Richtárik, P.: High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. In: International Conference on Machine Learning, pages 29563–29648. PMLR (2023)

  38. Şimşekli, U., Gürbüzbalaban, M., Nguyen, T.H., Richard, G., Sagun, L.: On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018 (2019)

  39. Simsekli, U., Sagun, L., Gurbuzbalaban, M.: A tail-index analysis of stochastic gradient noise in deep neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5827–5837. PMLR (2019)

  40. Spokoiny, V.: Parametric estimation. Finite sample theory. Ann. Stat. 40(6), 2877–2909 (2012)


  41. Warstadt, A., Singh, A., Bowman, S.R.: Neural network acceptability judgments. Trans. Assoc. Comput. Linguist. 7, 625–641 (2019)


  42. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics (2020)

  43. Zhang, J., He, T., Sra, S., Jadbabaie, A.: Why gradient clipping accelerates training: a theoretical justification for adaptivity. In: International Conference on Learning Representations (2020)

  44. Zhang, J., Karimireddy, S.P., Veit, A., Kim, S., Reddi, S., Kumar, S., Sra, S.: Why are adaptive methods good for attention models? In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 15383–15393. Curran Associates, Inc. (2020)



Acknowledgements

This work was supported by a Grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (Agreement identifier 000000D730324P540002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.

Author information

Correspondence to Eduard Gorbunov.

Additional information

Communicated by Akhtar A. Khan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is a shortened version of the paper. The full version is available on arXiv [17].

Appendices

Basic Facts, Technical Lemmas, and Auxiliary Results

1.1 Useful Inequalities

For all \(a,b\in {\mathbb {R}}^n\)

$$\begin{aligned} & \Vert a+b\Vert _2^2 \le 2\Vert a\Vert _2^2 + 2\Vert b\Vert _2^2, \end{aligned}$$
(76)
$$\begin{aligned} & \langle a, b\rangle = \frac{1}{2}\left( \Vert a+b\Vert _2^2 - \Vert a\Vert _2^2 - \Vert b\Vert _2^2\right) . \end{aligned}$$
(77)

1.2 Auxiliary Lemmas

The following lemma is a standard result about functions with \((\nu , M_\nu )\)-Hölder continuous gradient [6, 31].

Lemma A.1

Let f have a \((\nu , M_\nu )\)-Hölder continuous gradient on \(Q\subseteq {\mathbb {R}}^n\). Then, for all \(x,y\in Q\) and all \(\delta > 0\),

$$\begin{aligned} & f(y) \le f(x) + \langle \nabla f(x), y-x \rangle + \frac{M_\nu }{1+\nu } \Vert x-y\Vert _2^{1+\nu }, \end{aligned}$$
(78)
$$\begin{aligned} & f(y) \le f(x) + \langle \nabla f(x), y-x \rangle + \frac{L(\delta ,\nu )}{2} \Vert x-y\Vert _2^{2} + \frac{\delta }{2}, \nonumber \\ & L(\delta ,\nu ) = \left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu }}M_\nu ^{\frac{2}{1+\nu }}. \end{aligned}$$
(79)

The next result is known as the Bernstein inequality for martingale differences [1, 8, 9].

Lemma A.2

Let the sequence of random variables \(\{X_i\}_{i\ge 1}\) form a martingale difference sequence, i.e., \({\mathbb {E}}\left[ X_i\mid X_{i-1},\ldots , X_1\right] = 0\) for all \(i \ge 1\). Assume that the conditional variances \(\sigma _i^2{\mathop {=}\limits ^{\text {def}}}{\mathbb {E}}\left[ X_i^2\mid X_{i-1},\ldots , X_1\right] \) exist and are bounded, and that there exists a deterministic constant \(c>0\) such that \(|X_i| \le c\) almost surely for all \(i\ge 1\). Then for all \(b > 0\), \(F > 0\), and \(n\ge 1\),

$$\begin{aligned} {\mathbb {P}}\left\{ \Big |\sum \limits _{i=1}^nX_i\Big | > b \text { and } \sum \limits _{i=1}^n\sigma _i^2 \le F\right\} \le 2\exp \left( -\frac{b^2}{2F + \nicefrac {2cb}{3}}\right) . \end{aligned}$$
(80)
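As a quick sanity check of (80), the following sketch estimates the left-hand side by Monte Carlo for i.i.d. bounded zero-mean variables (a special case of a martingale difference sequence with deterministic conditional variances); all numerical values are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, trials = 100, 1.0, 50_000

# X_i i.i.d. uniform on [-c, c]: zero mean, |X_i| <= c, conditional variance c^2/3.
X = rng.uniform(-c, c, size=(trials, n))
sums = X.sum(axis=1)

F = n * c**2 / 3           # sum of the conditional variances (deterministic here)
b = 3.0 * np.sqrt(F)       # deviation level to test

empirical = np.mean(np.abs(sums) > b)
bound = 2 * np.exp(-b**2 / (2 * F + 2 * c * b / 3))
print(empirical, bound, empirical <= bound)  # the empirical tail obeys (80)
```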

1.3 Technical Lemmas

Lemma A.3

Let sequences \(\{\alpha _k\}_{k\ge 0}\) and \(\{A_k\}_{k\ge 0}\) satisfy

$$\begin{aligned} \alpha _{0}= & A_0 = 0,\quad \alpha _{k+1} = \frac{(k+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}},\nonumber \\ A_{k+1}= & A_k + \alpha _{k+1},\quad a,\varepsilon , M_{\nu } > 0,\; \nu \in [0,1] \end{aligned}$$
(81)

for all \(k\ge 0\). Then for all \(k\ge 0\) we have

$$\begin{aligned} A_{k} \ge a L_{k} \alpha _{k}^2,\quad A_{k} \ge \frac{k^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}, \end{aligned}$$
(82)

where \(L_0 = 0\) and for \(k > 0\)

$$\begin{aligned} L_{k} = \left( \frac{2A_{k}}{\alpha _{k}\varepsilon }\right) ^{\frac{1-\nu }{1+\nu }} M_\nu ^{\frac{2}{1+\nu }}. \end{aligned}$$
(83)

Moreover, for all \(k \ge 0\)

$$\begin{aligned} A_k \le \frac{k^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}. \end{aligned}$$
(84)

Proof

We start by deriving the second inequality in (82). The proof goes by induction. For \(k = 0\), the inequality trivially holds. Next, we assume that it holds for all \(k \le K\). Then,

$$\begin{aligned} A_{K+1} = A_{K} + \alpha _{K+1} \ge \frac{K^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} + \frac{(K+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}. \end{aligned}$$

Let us estimate the right-hand side of the previous inequality. We want to show that

$$\begin{aligned} \frac{K^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} + \frac{(K+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}\ge & \frac{(K+1)^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} \end{aligned}$$

that is equivalent to the inequality:

$$\begin{aligned} \frac{K^{\frac{1+3\nu }{1+\nu }}}{2} + (K+1)^{\frac{2\nu }{1+\nu }} \ge \frac{(K+1)^{\frac{1+3\nu }{1+\nu }}}{2} \Longleftrightarrow \frac{K^{\frac{1+3\nu }{1+\nu }}}{2} \ge \frac{(K+1)^{\frac{2\nu }{1+\nu }}(K-1)}{2}. \end{aligned}$$

If \(K = 1\), it trivially holds. If \(K > 1\), it is equivalent to

$$\begin{aligned} \frac{K}{K-1} \ge \left( \frac{K+1}{K}\right) ^{2 - \frac{2}{1+\nu }}. \end{aligned}$$

Since \(2 - \frac{2}{1+\nu }\) is a monotonically increasing function of \(\nu \) on \([0,1]\) and hence does not exceed 1, we have that

$$\begin{aligned} \left( \frac{K+1}{K}\right) ^{2 - \frac{2}{1+\nu }} \le \frac{K+1}{K} \le \frac{K}{K-1}. \end{aligned}$$

That is, the second inequality in (82) holds for \(k = K+1\), and, as a consequence, it holds for all \(k \ge 0\). Next, we derive the first part of (82). For \(k = 0\), it trivially holds. For \(k > 0\) we consider cases \(\nu = 0\) and \(\nu > 0\) separately. When \(\nu = 0\) the inequality is equivalent to

$$\begin{aligned} 1 \ge \frac{2a\alpha _k M_0^2}{\varepsilon }, \text { where } \frac{2a\alpha _k M_0^2}{\varepsilon } \overset{(81)}{=} 1, \end{aligned}$$

i.e., we have \(A_k = aL_k\alpha _k^2\) for all \(k\ge 0\). When \(\nu > 0\) the first inequality in (82) is equivalent to

$$\begin{aligned} A_{k} \ge a^{\frac{1+\nu }{2\nu }}\alpha _{k}^{\frac{1+3\nu }{2\nu }}(\nicefrac {\varepsilon }{2})^{-\frac{1-\nu }{2\nu }}M_\nu ^{\frac{1}{\nu }} \overset{(81)}{\Longleftrightarrow } A_{k} \ge \frac{k^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}, \end{aligned}$$

where the last inequality coincides with the second inequality from (82) that we derived earlier in the proof.

To finish the proof, it remains to derive (84). Again, the proof goes by induction. For \(k=0\) inequality (84) is trivial. Next, we assume that it holds for all \(k \le K\). Then,

$$\begin{aligned} A_{K+1} = A_{K} + \alpha _{K+1} \le \frac{K^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} + \frac{(K+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}. \end{aligned}$$

Let us estimate the right-hand side of the previous inequality. We want to show that

$$\begin{aligned} \frac{K^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} + \frac{(K+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}\le & \frac{(K+1)^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} \end{aligned}$$

that is equivalent to the inequality:

$$\begin{aligned} K^{\frac{1+3\nu }{1+\nu }} + (K+1)^{\frac{2\nu }{1+\nu }} \le (K+1)^{\frac{1+3\nu }{1+\nu }}. \end{aligned}$$

This inequality holds because

$$\begin{aligned} K^{\frac{1+3\nu }{1+\nu }} = K^{\frac{2\nu }{1+\nu }}\cdot K \le (K+1)^{\frac{2\nu }{1+\nu }}K, \end{aligned}$$

so that \(K^{\frac{1+3\nu }{1+\nu }} + (K+1)^{\frac{2\nu }{1+\nu }} \le (K+1)^{\frac{2\nu }{1+\nu }}(K+1) = (K+1)^{\frac{1+3\nu }{1+\nu }}\).

That is, (84) holds for \(k = K+1\), and, as a consequence, it holds for all \(k \ge 0\). \(\square \)
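The recursion (81) and the bounds (82), (84) are also easy to check numerically; the sketch below does so for one illustrative choice of \(a, \varepsilon , M_\nu , \nu \) (the values themselves are arbitrary and not taken from the paper).

```python
# Numerical sanity check of Lemma A.3 for illustrative parameter values.
a, eps, M_nu, nu = 2.0, 0.1, 3.0, 0.5
q = (1 - nu) / (1 + nu)          # exponent of eps/2 in (81) and (83)
p = 2 * nu / (1 + nu)            # power of k in alpha_k
s = (1 + 3 * nu) / (1 + nu)      # power of k in the bounds on A_k

A = 0.0
for k in range(1, 200):
    alpha = k**p * (eps / 2)**q / (2**p * a * M_nu**(2 / (1 + nu)))
    A += alpha
    L = (2 * A / (alpha * eps))**q * M_nu**(2 / (1 + nu))
    lower = k**s * (eps / 2)**q / (2**s * a * M_nu**(2 / (1 + nu)))
    upper = k**s * (eps / 2)**q / (2**p * a * M_nu**(2 / (1 + nu)))
    assert A >= a * L * alpha**2 - 1e-12         # first part of (82)
    assert lower - 1e-12 <= A <= upper + 1e-12   # second part of (82) and (84)
print("Lemma A.3 bounds hold for k = 1..199")
```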

Lemma A.4

Let f be convex, have a Hölder continuous gradient on \({\mathbb {R}}^n\) for some \(\nu \in [0,1]\) with constant \(M_\nu > 0\), and let \(x^*\) be a minimizer of f on \({\mathbb {R}}^n\). Then, for all \(x\in {\mathbb {R}}^n\),

$$\begin{aligned} \Vert \nabla f(x)\Vert _2 \le \left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }}M_\nu ^{\frac{1}{1+\nu }} \left( f(x) - f(x^*)\right) ^{\frac{\nu }{1+\nu }}, \end{aligned}$$
(85)

where for \(\nu = 0\) we use \(\left[ \left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }}\right] _{\nu =0}:= \lim _{\nu \rightarrow 0}\left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }} = 1\).

Proof

For \(\nu = 0\), inequality (85) follows from (3) and \(\nabla f(x^*) = 0\) (see Footnote 11). When \(\nu > 0\), for an arbitrary point \(x\in {\mathbb {R}}^n\) we consider the point \(y = x - \alpha \nabla f(x)\), where \(\alpha = \left( \frac{\Vert \nabla f(x)\Vert _2^{1-\nu }}{M_\nu }\right) ^{\frac{1}{\nu }}\).

For the pair of points \(x, y\) we apply (78) and get

$$\begin{aligned} f(y)\le & f(x) + \langle \nabla f(x), y-x\rangle + \frac{M_\nu }{1+\nu }\Vert x-y\Vert _2^{1+\nu }\\= & f(x) - \alpha \Vert \nabla f(x)\Vert _2^2 + \frac{\alpha ^{\nu +1}M_\nu }{1+\nu }\Vert \nabla f(x)\Vert _2^{1+\nu }\\= & f(x) - \frac{\Vert \nabla f(x)\Vert _2^{\frac{1+\nu }{\nu }}}{M_\nu ^{\frac{1}{\nu }}} + \frac{\Vert \nabla f(x)\Vert _2^{\frac{1+\nu }{\nu }}}{(1+\nu )M_\nu ^{\frac{1}{\nu }}} = f(x) - \frac{\nu \Vert \nabla f(x)\Vert _2^{\frac{1+\nu }{\nu }}}{(1+\nu )M_\nu ^{\frac{1}{\nu }}} \end{aligned}$$

implying

$$\begin{aligned} \Vert \nabla f(x)\Vert _2\le & \left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }}M_\nu ^{\frac{1}{1+\nu }} \left( f(x) - f(y)\right) ^{\frac{\nu }{1+\nu }} \\\le & \left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }}M_\nu ^{\frac{1}{1+\nu }} \left( f(x) - f(x^*)\right) ^{\frac{\nu }{1+\nu }}. \end{aligned}$$

\(\square \)
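For \(\nu = 1\), inequality (85) reduces to the classical bound \(\Vert \nabla f(x)\Vert _2 \le \sqrt{2M_1(f(x)-f(x^*))}\); the sketch below checks it numerically on a random quadratic, where \(M_1\) is the largest eigenvalue of the Hessian (an illustrative example of ours, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)
# Quadratic f(x) = 0.5 * x^T A x with A symmetric PSD; grad f is Lipschitz
# with constant M_1 = lambda_max(A), and x* = 0, f(x*) = 0.
B = rng.standard_normal((5, 5))
A = B @ B.T
M1 = np.linalg.eigvalsh(A).max()

for _ in range(1000):
    x = rng.standard_normal(5)
    f_gap = 0.5 * x @ A @ x            # f(x) - f(x*)
    grad_norm = np.linalg.norm(A @ x)  # ||grad f(x)||_2
    # Inequality (85) with nu = 1: ||grad f(x)||_2 <= sqrt(2 M_1 (f(x) - f(x*)))
    assert grad_norm <= np.sqrt(2 * M1 * f_gap) + 1e-9
print("Inequality (85) verified on a random quadratic (nu = 1)")
```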

Lemma A.5

Let f be convex, have a Hölder continuous gradient on \({\mathbb {R}}^n\) for some \(\nu \in [0,1]\) with constant \(M_\nu > 0\), and let \(x^*\) be a minimizer of f on \({\mathbb {R}}^n\). Then, for all \(x\in {\mathbb {R}}^n\) and all \(\delta >0\),

$$\begin{aligned} \Vert \nabla f(x)\Vert _2^2 \le 2\left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu }}M_{\nu }^{\frac{2}{1+\nu }}\left( f(x)-f(x^*)\right) + \delta ^{\frac{2\nu }{1+\nu }} M_{\nu }^{\frac{2}{1+\nu }}. \end{aligned}$$
(86)

Proof

For a given \(\delta > 0\), we consider an arbitrary point \(x\in {\mathbb {R}}^n\) and \(y = x - \frac{1}{L(\delta ,\nu )}\nabla f(x)\), where \(L(\delta ,\nu ) = \left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu }}M_\nu ^{\frac{2}{1+\nu }}\).

For the pair of points \(x, y\) we apply (79) and get

$$\begin{aligned} f(y)\le & f(x) + \langle \nabla f(x), y-x \rangle + \frac{L(\delta ,\nu )}{2} \Vert x-y\Vert _2^{2} + \frac{\delta }{2}\\= & f(x) - \frac{1}{2L(\delta ,\nu )}\Vert {\nabla f(x)}\Vert _2^2 + \frac{\delta }{2} \end{aligned}$$

implying

$$\begin{aligned} \Vert \nabla f(x)\Vert _2^2\le & 2L(\delta ,\nu )\left( f(x) - f(y)\right) + \delta L(\delta , \nu )\\\le & 2\left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu }}M_{\nu }^{\frac{2}{1+\nu }}\left( f(x)-f(x^*)\right) + \delta ^{\frac{2\nu }{1+\nu }} M_{\nu }^{\frac{2}{1+\nu }}. \end{aligned}$$

\(\square \)

Additional Experimental Details and Results

1.1 Experiments on Synthetic Data

1.1.1 Hyper-parameters Tuning

We grid-searched the hyper-parameters for each method. For all methods, we considered batch sizes from \({\{5, 10, 20, 50, 100, 200\}}\) and stepsizes \(lr\in [1\textrm{e}{-5},1\textrm{e}{-1}]\). As for method-specific parameters:

  • for Adam, we grid-searched over \(betas\in (\{0.8, 0.9, 0.95, 0.99\}, \{0.9, 0.99, 0.999\})\),

  • for SGD — over \(momentum\in \{0.8, 0.9, 0.99, 0.999\}\),

  • for clipped-SSTM — over the clipping parameter \(B\in {\{1\textrm{e}{-0},1\textrm{e}{-1}, 1\textrm{e}{-2}, 1\textrm{e}{-3}\}}\),

  • for clipped-SGD — over \(momentum\in \{0.8, 0.9, 0.99, 0.999\}\) and the clipping parameter \(B\in {\{1\textrm{e}{-0},1\textrm{e}{-1}, 1\textrm{e}{-2}, 1\textrm{e}{-3}\}}\).

For clipped-SSTM, we additionally used \(\nu =1\) and norm clipping (we did not grid-search over these choices extensively; however, in our experiments on real data, they were the best). For clipped-SGD, we used coordinate-wise clipping; the two clipping variants are sketched below.
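For reference, the two clipping variants mentioned above can be written as follows. This is a minimal sketch of the standard operators (our own illustration, not the implementation from the linked repository).

```python
import numpy as np

def clip_norm(g, lam):
    """Norm clipping: rescale g so that its l2-norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def clip_coordwise(g, lam):
    """Coordinate-wise clipping: clamp every coordinate of g to [-lam, lam].
    The clipped vector then has l2-norm at most sqrt(n) * lam, which is the
    source of the dimension factor mentioned in footnote 10."""
    return np.clip(g, -lam, lam)
```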

For Adam, clipped-SSTM and clipped-SGD the best parameters for each p were approximately the same:

  • Adam: \(lr=1\textrm{e}{-3}\), \(betas=(0.9, 0.9)\) and batch size of 10

  • clipped-SSTM: \(lr=1\textrm{e}{-3}\), \(\nu = 1\), \(B=1\textrm{e}{-2}\), norm clipping and a batch size of 5

  • clipped-SGD: \(lr=1\textrm{e}{-3}\) and \(B=1\textrm{e}{-1}\) or \(lr=1\textrm{e}{-2}\) and \(B=1\textrm{e}{-2}\), \(momentum=0.8\), coordinate-wise clipping and a batch size of 5

1.1.2 Comparison w.r.t. Certain Relative Train Loss Level

In Fig. 2, we reported the performance of the methods in terms of the best models w.r.t. the achieved train loss. However, it is also interesting to compare the methods w.r.t. how quickly they reach a certain level (2.0) of the relative train loss \(f_p(x_{\text {pred}})/f_p(x_{\text {true}})\). This is a valid metric, since \(f_p(x_{\text {true}})\) is non-zero after adding noise to the train part of the dataset, and \(x_{\text {true}}\) is still a good approximation of the optimal solution. The results are presented in Fig. 5. As in the previous set of experiments, one can see that clipped-SSTM outperforms the other algorithms and reaches this 2.0 level of relative loss much faster, though it later loses to Adam/clipped-SGD.
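This comparison can be scripted directly from the recorded loss curves; a small illustrative helper (the name and the example trajectory are ours, not from the authors' code) is sketched below.

```python
def first_epoch_at_level(relative_losses, level=2.0):
    """Return the first (1-based) epoch at which the relative train loss
    f_p(x_pred) / f_p(x_true) drops to `level` or below, or None if it never
    does.  Illustrative helper, not taken from the authors' code."""
    for epoch, rel in enumerate(relative_losses, start=1):
        if rel <= level:
            return epoch
    return None

print(first_epoch_at_level([10.3, 4.7, 2.4, 1.9, 1.6]))  # -> 4
```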

Fig. 5: Results for different p: for each method, the first epoch at which the relative train loss reached the 2.0 level (i.e., twice the loss at \(x_{\text {true}}\))

1.2 Neural Networks Training

1.2.1 Hyper-parameters

In our experiments with the training of neural networks, we used the standard implementations of Adam and SGD from PyTorch [34]; we list only the parameters that we changed from the defaults.

To conduct these experiments, we used Nvidia RTX 2070s. The longest experiment (evolution of the noise distribution for the image classification task) took 53 h (we iterated several times over the train dataset to build a better histogram; see Appendix B.2.3).

Image Classification. For ResNet-18 + ImageNet-100 the parameters of the methods were chosen as follows:

  • Adam: \(lr=1e-3\) and a batch size of \(4\times 32\)

  • SGD: \(lr=1e-2\), \(momentum=0.9\) and a batch size of 32

  • clipped-SGD: \(lr=5e-2\), \(momentum=0.9\), coordinate-wise clipping with clipping parameter \(B=0.1\) and a batch size of 32

  • clipped-SSTM: \(\nu = 1\), stepsize parameter \(\alpha = 1e-3\) (in code we use separately \(lr=1e-2\) and \(L=10\) and \(\alpha = \frac{lr}{L}\)), norm clipping with clipping parameter \(B=1\) and a batch size of \(2\times 32\). We also upper bounded the ratio \(\nicefrac {A_k}{A_{k+1}}\) by 0.99 (see \(a\_k\_ratio\_upper\_bound\) parameter in code)

Fig. 6: Train and validation loss and accuracy for clipped-SSTM with different parameters. Here \(\alpha _0 = 0.000125\) and bs denotes the batch size. As the plots show, increasing \(\alpha \) by a factor of 4 and the batch size by a factor of 2 leaves the method's behavior almost unchanged

Fig. 7: Evolution of the noise distribution for the BERT + CoLA task

Fig. 8: Evolution of the noise distribution for the ResNet-18 + ImageNet-100 task

The two main parameters that we grid-searched were lr and the batch size. For both of them, we used a logarithmic grid (e.g., for the lr of Adam we used \(1e-5,2e-5,5e-5,1e-4,\ldots ,1e-2,2e-2,5e-2\)). The batch size was chosen from \(32, 2\cdot 32, 4\cdot 32\), and \(8\cdot 32\). For SGD, we also tried various momentum parameters.

For clipped-SSTM and clipped-SGD, we used clipping levels of 1 and 0.1, respectively. Too small a choice of the clipping level, e.g. 0.01, slows down the convergence significantly.

Another important parameter for clipped-SSTM here was \(a\_k\_ratio\_upper\_bound\), which we used to upper bound the ratio \(\nicefrac {A_k}{A_{k+1}}\). Without this modification, the method is too conservative: e.g., after \(10^4\) steps, \(\nicefrac {A_k}{A_{k+1}}\approx 0.9999\). Effectively, this ratio plays the role of the momentum parameter of SGD.
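To make the role of this parameter concrete, here is a schematic of the averaging step of an SSTM-style method with the ratio cap. This is an illustrative sketch under our reading of the algorithm; the actual implementation is in the linked repository and may differ in details.

```python
def capped_averaging_step(y_k, z_next, A_k, A_next, ratio_cap=0.99):
    """Schematic averaging y^{k+1} = r * y^k + (1 - r) * z^{k+1} with the
    coefficient r = A_k / A_{k+1} capped at `ratio_cap`.  Without the cap,
    r approaches 1 (about 0.9999 after 1e4 steps), so new information enters
    very slowly; capping r at 0.99 makes it act like a momentum parameter."""
    r = min(A_k / A_next, ratio_cap)
    return r * y_k + (1.0 - r) * z_next
```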

Text Classification, CoLA. For BERT + CoLA the parameters of the methods were chosen as follows:

  • Adam: \(lr=5e-5\), \(weight\_decay=5e-4\) and a batch size of 32

  • SGD: \(lr=1e-3\), \(momentum=0.9\) and a batch size of 32

  • clipped-SSTM: \(\nu = 1\), stepsize parameter \(\alpha = 8e-3\), norm clipping with clipping parameter \(B=1\) and a batch size of \(8\times 32\)

  • clipped-SGD: \(lr=2e-3\), \(momentum=0.9\), coordinate-wise clipping with clipping parameter \(B=0.1\) and a batch size of 32

There, we used the same grid as in the previous task. The main difference is that we did not bound the \(A_k/A_{k+1}\) ratio for clipped-SSTM: since the batch size is \(8\cdot 32\), the method performs only \(\approx 300\) steps and thus does not become too conservative.

1.2.2 On the Relation Between the Stepsize Parameter \(\alpha \) and the Batch Size

In our experiments, we noticed that clipped-SSTM shows similar results when the ratio \(\nicefrac {bs^2}{\alpha }\) is kept unchanged, where bs is batch size (see Fig. 6). We compare the performance of clipped-SSTM with 4 different choices of \(\alpha \) and the batch size.

Theorem 4.1 explains this phenomenon in the convex case. For the case of \(\nu = 1\) we have (from (24) and (30)):

$$\begin{aligned} \alpha \sim \frac{1}{aM_1},\quad \alpha _k \sim k\alpha ,\quad m_k \sim \frac{N a \sigma ^2 \alpha _{k+1}^2}{C^2R_0^2\ln \frac{4N}{\beta }},\quad N \sim \frac{a^{\frac{1}{2}}CR_0M_1^{\frac{1}{2}}}{\varepsilon ^{\frac{1}{2}}}\sim \frac{CR_0}{\alpha ^{\frac{1}{2}}\varepsilon ^{\frac{1}{2}}}, \end{aligned}$$

whence

$$\begin{aligned} m_k \sim \frac{CR_0 a \sigma ^2 \alpha ^2(k+1)^2}{\alpha ^{\frac{1}{2}}\varepsilon ^{\frac{1}{2}}C^2R_0^2\ln \frac{4N}{\beta }} \sim \frac{\sigma ^2 \alpha ^2(k+1)^2}{\alpha ^{\frac{1}{2}}\alpha M_1\varepsilon ^{\frac{1}{2}}CR_0\ln \frac{4N}{\beta }}\sim \alpha ^{\frac{1}{2}}, \end{aligned}$$

where the dependencies on numerical constants and logarithmic factors are omitted. Therefore, the observed empirical relation between batch size (\(m_k\)) and \(\alpha \) correlates well with the established theoretical results for clipped-SSTM.
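Setting all omitted constants and logarithmic factors to 1, the proportionalities above predict \(m_k \sim \alpha ^{1/2}\), so multiplying \(\alpha \) by 4 multiplies the predicted batch size by 2, which matches the pattern in Fig. 6. The snippet below performs only this bookkeeping; all constants are placeholders of ours.

```python
def predicted_batch_size(alpha, k=10, sigma=1.0, M1=1.0, eps=1.0, C=1.0, R0=1.0):
    """m_k up to numerical constants and log factors, using a ~ 1/(alpha * M1)
    and the proportionalities displayed above (placeholder constants)."""
    return sigma**2 * alpha**2 * (k + 1)**2 / (
        alpha**0.5 * alpha * M1 * eps**0.5 * C * R0)

print(predicted_batch_size(4e-3) / predicted_batch_size(1e-3))  # -> ~2.0
```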

1.2.3 Evolution of the Noise Distribution

In this section, we provide our empirical study of the evolution of the noise distribution along the trajectories of different optimizers. As one can see from the plots in Figs. 7 and 8, the noise distribution for the ResNet-18 + ImageNet-100 task is always close to a Gaussian distribution, whereas for the BERT + CoLA task it is significantly heavy-tailed.
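For readers who want to reproduce such histograms, one simple recipe is to fix the current iterate, compute the full-batch gradient, and then record the norms of the deviations of mini-batch gradients from it. The sketch below (PyTorch-style, with all names ours) follows this recipe; the authors' measurement procedure may differ in details.

```python
import torch

def gradient_noise_norms(model, loss_fn, full_loader, batch_loader):
    """Collect ||g_batch - g_full||_2 at the current iterate of `model`.
    Sketch only: `model`, `loss_fn` and the two data loaders are assumed to be
    defined elsewhere."""
    def flat_grad(loader):
        model.zero_grad()
        for xb, yb in loader:
            # Average of per-batch losses approximates the full-batch loss.
            (loss_fn(model(xb), yb) / len(loader)).backward()
        return torch.cat([p.grad.detach().flatten()
                          for p in model.parameters() if p.grad is not None])

    g_full = flat_grad(full_loader)
    norms = []
    for xb, yb in batch_loader:
        model.zero_grad()
        loss_fn(model(xb), yb).backward()
        g = torch.cat([p.grad.detach().flatten()
                       for p in model.parameters() if p.grad is not None])
        norms.append(torch.norm(g - g_full).item())
    return norms  # histogram these values to visualize the noise distribution
```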

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gorbunov, E., Danilova, M., Shibaev, I. et al. High-Probability Complexity Bounds for Non-smooth Stochastic Convex Optimization with Heavy-Tailed Noise. J Optim Theory Appl 203, 2679–2738 (2024). https://doi.org/10.1007/s10957-024-02533-z
