High-Probability Complexity Bounds for Non-smooth Stochastic Convex Optimization with Heavy-Tailed Noise


Abstract

Stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide small objective residuals with high probability. For existing methods for non-smooth stochastic convex optimization, the dependence of the complexity bounds on the confidence level is either negative-power, or logarithmic but only under an additional assumption of sub-Gaussian (light-tailed) noise, which may not hold in practice. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with Hölder-continuous gradients, and for both methods, we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all the regimes, and the second one is optimal in the non-smooth setting.


Data Availability Statement

The code for the numerical experiments conducted in this paper is publicly available at https://github.com/ClippedStochasticMethods/clipped-SSTM.

Notes

  1. Our proofs work for any \(x^*\). In particular, one can choose \(x^*\) to be the projection of \(x^0\) onto the solution set.

  2. By default, we always write “gradients”, though our analysis also works for non-differentiable convex functions (when \(\nu = 0\)): wherever a gradient is used, it suffices to take any subgradient at that point. This remark also applies to Definition 1.1.

  3. It is also worth mentioning that some functions have Hölder continuous gradients for multiple \(\nu \) simultaneously [31]. Therefore, if constants \(M_\nu \) are available, one can choose the best \(\nu \) in terms of the iteration/oracle complexity of a method.

  4. Our proofs are valid for any solution \(x^*\) and, for example, one can take as \(x^*\) the closest solution to the starting point \(x^0\).

  5. Our proofs are valid for any solution \(x^*\) and, for example, one can take \(x^*\) as the closest solution to the starting point \(x^0\).

  6. The choice of the parameters (in this and the following results) is dictated by the need to estimate and control the stochastic error in the proofs. If some of the parameters (such as \(\nu , R_0, M_\nu , \sigma \)) are unknown, one can directly tune the parameters \(\alpha , a, m_k\). To satisfy (26) and (27), it is sufficient to choose a sufficiently large a (or, alternatively, a sufficiently small \(\varepsilon \)).

  7. To achieve \(f(\bar{x}^N) - f(x^*) \le \varepsilon \) it is sufficient to take N such that \(\frac{9 C^2 R_0 \sigma \sqrt{\ln \tfrac{4N}{\beta }}}{\sqrt{N}} \le \varepsilon \). Solving this inequality w.r.t. N, we get that it is sufficient to take N such that \(N \ge \frac{81C^4\sigma ^2 R_0^2 \ln \frac{4N}{\beta }}{\varepsilon ^2}\), e.g., \(N = \Bigg \lceil \frac{162C^4\sigma ^2 R_0^2 \ln \left( \frac{648C^4\sigma ^2 R_0^2}{\varepsilon ^2\beta }\right) }{\varepsilon ^2}\Bigg \rceil \) satisfies this inequality.

  8. For \(p \in (1,2]\), the function \(f_{i,p}(x)\) is differentiable with \(\nabla f_{i,p}(x) = p|a_i^\top x - y_i|^{p-1} \textrm{sign}(a_i^\top x - y_i)a_i \), and for \(p = 1\) it has the subdifferential \(\partial f_{i,p}(x) = \left\{ \begin{array}{lll} a_i,& \text {if } a_i^\top x - y_i > 0,\\ {[}-a_i, a_i],& \text {if } a_i^\top x - y_i = 0,\\ -a_i,& \text {if } a_i^\top x - y_i < 0. \end{array}\right. \) A small code sketch of this formula is given right after these notes.

  9. We conduct these experiments to illustrate that clipped-SSTM and clipped-SGD might be useful even for problems that are not theoretically studied in this paper. Since [16] does not provide numerical experiments with clipped-SSTM on the training of neural networks, our experiments are the first to show the behavior of clipped-SSTM on the considered tasks.

  10. Following standard practice, we use coordinate-wise clipping in clipped-SGD [44]. In preliminary experiments, we also tried norm clipping for clipped-SGD, but it showed worse results than the coordinate-wise variant. Our analysis can be generalized to the case of coordinate-wise clipping if we assume boundedness of the coordinate-wise variance \(\sigma _{c}^2\) of the stochastic gradients. Then, the result of Lemma 4.2 holds with \(\sigma ^2 = n\sigma _c^2\), and the norm of the clipped vector is bounded by \(\sqrt{n}\lambda \). These changes lead to an explicit dependence on the dimension in the complexity bounds, similarly to [44].

  11. When f is not differentiable, we use subgradients. In this case, 0 belongs to the subdifferential of f at the point \(x^*\), and we take it as \(\nabla f(x^*)\).
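As a concrete illustration of the (sub)gradient formula in Footnote 8, the following minimal sketch computes a (sub)gradient of \(f_{i,p}(x) = |a_i^\top x - y_i|^p\); the function name and the NumPy-based setup are ours, not taken from the authors' repository.

```python
import numpy as np

def grad_f_ip(a_i, y_i, x, p):
    """Return a (sub)gradient of f_{i,p}(x) = |a_i^T x - y_i|^p.

    For p in (1, 2] this is the gradient p |a_i^T x - y_i|^{p-1} sign(.) a_i;
    for p = 1 and a_i^T x = y_i any element of the segment [-a_i, a_i] is a
    valid subgradient, and we return the zero vector for simplicity.
    """
    r = float(a_i @ x) - y_i
    if p > 1:
        return p * abs(r) ** (p - 1) * np.sign(r) * a_i
    return np.sign(r) * a_i  # p = 1: sign(0) = 0 picks 0 from [-a_i, a_i]
```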

References

  1. Bennett, G.: Probability inequalities for the sum of independent random variables. J. Am. Stat. Assoc. 57(297), 33–45 (1962)


  2. Chaux, C., Combettes, P.L., Pesquet, J.-C., Wajs, V.R.: A variational formulation for frame-based inverse problems. Inverse Prob. 23(4), 1495–1518 (2007)


  3. Davis, D., Drusvyatskiy, D., Xiao, L., Zhang, J.: From low probability to high confidence in stochastic convex optimization. J. Mach. Learn. Res. 22(49), 1–38 (2021)


  4. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics (2019)

  5. Devolder, O.: Exactness, inexactness and stochasticity in first-order methods for large-scale convex optimization. PhD thesis, UCLouvain (2013)

  6. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1), 37–75 (2014)


  7. Dvurechensky, P., Gasnikov, A.: Stochastic intermediate gradient method for convex problems with stochastic inexact oracle. J. Optim. Theory Appl. 171(1), 121–145 (2016)


  8. Dzhaparidze, K., Van Zanten, J.H.: On Bernstein-type inequalities for martingales. Stoch. Process. Appl. 93(1), 109–117 (2001)

  9. Freedman, D.A.: On tail probabilities for martingales. Ann. Probab. 3(1), 100–118 (1975)

  10. Gasnikov, A.V., Nesterov, Y.E.: Universal method for stochastic composite optimization problems. Comput. Math. Math. Phys. 58, 48–64 (2018)


  11. Gasnikov, A.V., Nesterov, Y.E., Spokoiny, V.G.: On the efficiency of a randomized mirror descent algorithm in online optimization problems. Comput. Math. Math. Phys. 55(4), 580–596 (2015)


  12. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1243–1252 (2017)

  13. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012)


  14. Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)


  15. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016). http://www.deeplearningbook.org

  16. Gorbunov, E., Danilova, M., Gasnikov, A.: Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 15042–15053. Curran Associates Inc (2020)


  17. Gorbunov, E., Danilova, M., Shibaev, I., Dvurechensky, P., Gasnikov, A.: High probability complexity bounds for non-smooth stochastic optimization with heavy-tailed noise. arXiv preprint arXiv:2106.05958 (2021)

  18. Gower, R.M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., Richtárik, P.: SGD: general analysis and improved rates. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5200–5209. PMLR (2019)

  19. Guigues, V., Juditsky, A., Nemirovski, A.: Non-asymptotic confidence bounds for the optimal value of a stochastic program. Optim. Methods Softw. 32(5), 1033–1058 (2017)


  20. Guzmán, C., Nemirovski, A.: On lower complexity bounds for large-scale smooth convex optimization. J. Complex. 31(1), 1–14 (2015)


  21. Hazan, E., Levy, K., Shalev-Shwartz, S.: Beyond convexity: stochastic quasi-convex optimization. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc., Red Hook (2015)


  22. Juditsky, A., Nemirovski, A.: First order methods for nonsmooth convex large-scale optimization, I: general purpose methods. In: Optimization for Machine Learning, pp. 121–148 (2011)

  23. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)

  24. Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1–2), 365–397 (2012)


  25. Mai, V.V., Johansson, M.: Stability and convergence of stochastic gradient clipping: beyond Lipschitz continuity and smoothness. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7325–7335. PMLR (2021)

  26. Menon, A.K., Rawat, A.S., Reddi, S.J., Kumar, S.: Can gradient clipping mitigate label noise? In: International Conference on Learning Representations (2020)

  27. Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 451–459. Curran Associates, Inc. (2011)


  28. Nazin, A.V., Nemirovsky, A.S., Tsybakov, A.B., Juditsky, A.B.: Algorithms of robust stochastic optimization based on mirror descent method. Autom. Remote. Control. 80(9), 1607–1627 (2019)


  29. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)


  30. Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication, Wiley, New York (1983)

  31. Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152(1–2), 381–404 (2015)


  32. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate O\((1/k^2)\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)


  33. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1310–1318. PMLR, Atlanta (2013)

  34. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019)

  35. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  36. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015)

  37. Sadiev, A., Danilova, M., Gorbunov, E., Horváth, S., Gidel, G., Dvurechensky, P., Gasnikov, A., Richtárik, P.: High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. In: International Conference on Machine Learning, pages 29563–29648. PMLR (2023)

  38. Şimşekli, U., Gürbüzbalaban, M., Nguyen, T.H., Richard, G., Sagun, L.: On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018 (2019)

  39. Simsekli, U., Sagun, L., Gurbuzbalaban, M.: A tail-index analysis of stochastic gradient noise in deep neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5827–5837. PMLR (2019)

  40. Spokoiny, V.: Parametric estimation. Finite sample theory. Ann. Stat. 40(6), 2877–2909 (2012)


  41. Warstadt, A., Singh, A., Bowman, S.R.: Neural network acceptability judgments. Trans. Assoc. Comput. Linguist. 7, 625–641 (2019)


  42. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics (2020)

  43. Zhang, J., He, T., Sra, S., Jadbabaie, A.: Why gradient clipping accelerates training: a theoretical justification for adaptivity. In: International Conference on Learning Representations (2020)

  44. Zhang, J., Karimireddy, S.P., Veit, A., Kim, S., Reddi, S., Kumar, S., Sra, S.: Why are adaptive methods good for attention models? In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 15383–15393. Curran Associates, Inc. (2020)



Acknowledgements

This work was supported by a Grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (Agreement identifier 000000D730324P540002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.

Author information

Correspondence to Eduard Gorbunov.

Additional information

Communicated by Akhtar A. Khan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is a shortened version of the paper. The full version is available on arXiv [17].

Appendices

Basic Facts, Technical Lemmas, and Auxiliary Results

1.1 Useful Inequalities

For all \(a,b\in {\mathbb {R}}^n\)

$$\begin{aligned} & \Vert a+b\Vert _2^2 \le 2\Vert a\Vert _2^2 + 2\Vert b\Vert _2^2, \end{aligned}$$
(76)
$$\begin{aligned} & \langle a, b\rangle = \frac{1}{2}\left( \Vert a+b\Vert _2^2 - \Vert a\Vert _2^2 - \Vert b\Vert _2^2\right) . \end{aligned}$$
(77)

1.2 Auxiliary Lemmas

The following lemma is a standard result about functions with \((\nu , M_\nu )\)-Hölder continuous gradient [6, 31].

Lemma A.1

Let f have a \((\nu , M_\nu )\)-Hölder continuous gradient on \(Q\subseteq {\mathbb {R}}^n\). Then, for all \(x,y\in Q\) and all \(\delta > 0\),

$$\begin{aligned} & f(y) \le f(x) + \langle \nabla f(x), y-x \rangle + \frac{M_\nu }{1+\nu } \Vert x-y\Vert _2^{1+\nu }, \end{aligned}$$
(78)
$$\begin{aligned} & f(y) \le f(x) + \langle \nabla f(x), y-x \rangle + \frac{L(\delta ,\nu )}{2} \Vert x-y\Vert _2^{2} + \frac{\delta }{2}, \nonumber \\ & L(\delta ,\nu ) = \left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu }}M_\nu ^{\frac{2}{1+\nu }}. \end{aligned}$$
(79)

The next result is known as the Bernstein inequality for martingale differences [1, 8, 9].

Lemma A.2

Let the sequence of random variables \(\{X_i\}_{i\ge 1}\) form a martingale difference sequence, i.e., \({\mathbb {E}}\left[ X_i\mid X_{i-1},\ldots , X_1\right] = 0\) for all \(i \ge 1\). Assume that the conditional variances \(\sigma _i^2{\mathop {=}\limits ^{\text {def}}}{\mathbb {E}}\left[ X_i^2\mid X_{i-1},\ldots , X_1\right] \) exist and are bounded, and that there exists a deterministic constant \(c>0\) such that \(|X_i| \le c\) almost surely for all \(i\ge 1\). Then for all \(b > 0\), \(F > 0\), and \(n\ge 1\),

$$\begin{aligned} {\mathbb {P}}\left\{ \Big |\sum \limits _{i=1}^nX_i\Big | > b \text { and } \sum \limits _{i=1}^n\sigma _i^2 \le F\right\} \le 2\exp \left( -\frac{b^2}{2F + \nicefrac {2cb}{3}}\right) . \end{aligned}$$
(80)
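As a quick sanity check of (80), the following sketch estimates the left-hand side by Monte Carlo for i.i.d. bounded zero-mean variables (a special case of a martingale difference sequence with deterministic conditional variances); all numerical values are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, trials = 100, 1.0, 50_000

# X_i i.i.d. uniform on [-c, c]: zero mean, |X_i| <= c, conditional variance c^2/3.
X = rng.uniform(-c, c, size=(trials, n))
sums = X.sum(axis=1)

F = n * c**2 / 3           # sum of the conditional variances (deterministic here)
b = 3.0 * np.sqrt(F)       # deviation level to test

empirical = np.mean(np.abs(sums) > b)
bound = 2 * np.exp(-b**2 / (2 * F + 2 * c * b / 3))
print(empirical, bound, empirical <= bound)  # the empirical tail obeys (80)
```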

1.3 Technical Lemmas

Lemma A.3

Let sequences \(\{\alpha _k\}_{k\ge 0}\) and \(\{A_k\}_{k\ge 0}\) satisfy

$$\begin{aligned} \alpha _{0}= & A_0 = 0,\quad \alpha _{k+1} = \frac{(k+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}},\nonumber \\ A_{k+1}= & A_k + \alpha _{k+1},\quad a,\varepsilon , M_{\nu } > 0,\; \nu \in [0,1] \end{aligned}$$
(81)

for all \(k\ge 0\). Then for all \(k\ge 0\) we have

$$\begin{aligned} A_{k} \ge a L_{k} \alpha _{k}^2,\quad A_{k} \ge \frac{k^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}, \end{aligned}$$
(82)

where \(L_0 = 0\) and for \(k > 0\)

$$\begin{aligned} L_{k} = \left( \frac{2A_{k}}{\alpha _{k}\varepsilon }\right) ^{\frac{1-\nu }{1+\nu }} M_\nu ^{\frac{2}{1+\nu }}. \end{aligned}$$
(83)

Moreover, for all \(k \ge 0\)

$$\begin{aligned} A_k \le \frac{k^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}. \end{aligned}$$
(84)

Proof

We start by deriving the second inequality in (82). The proof goes by induction. For \(k = 0\), the inequality trivially holds. Next, we assume that it holds for all \(k \le K\). Then,

$$\begin{aligned} A_{K+1} = A_{K} + \alpha _{K+1} \ge \frac{K^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} + \frac{(K+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}. \end{aligned}$$

Let us estimate the right-hand side of the previous inequality. We want to show that

$$\begin{aligned} \frac{K^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} + \frac{(K+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}\ge & \frac{(K+1)^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} \end{aligned}$$

that is equivalent to the inequality:

$$\begin{aligned} \frac{K^{\frac{1+3\nu }{1+\nu }}}{2} + (K+1)^{\frac{2\nu }{1+\nu }} \ge \frac{(K+1)^{\frac{1+3\nu }{1+\nu }}}{2} \Longleftrightarrow \frac{K^{\frac{1+3\nu }{1+\nu }}}{2} \ge \frac{(K+1)^{\frac{2\nu }{1+\nu }}(K-1)}{2}. \end{aligned}$$

If \(K = 1\), it trivially holds. If \(K > 1\), it is equivalent to

$$\begin{aligned} \frac{K}{K-1} \ge \left( \frac{K+1}{K}\right) ^{2 - \frac{2}{1+\nu }}. \end{aligned}$$

Since \(2 - \frac{2}{1+\nu }\) is a monotonically increasing function of \(\nu \) on \([0,1]\) and hence does not exceed 1, we have that

$$\begin{aligned} \left( \frac{K+1}{K}\right) ^{2 - \frac{2}{1+\nu }} \le \frac{K+1}{K} \le \frac{K}{K-1}. \end{aligned}$$

That is, the second inequality in (82) holds for \(k = K+1\), and, as a consequence, it holds for all \(k \ge 0\). Next, we derive the first part of (82). For \(k = 0\), it trivially holds. For \(k > 0\) we consider cases \(\nu = 0\) and \(\nu > 0\) separately. When \(\nu = 0\) the inequality is equivalent to

$$\begin{aligned} 1 \ge \frac{2a\alpha _k M_0^2}{\varepsilon }, \text { where } \frac{2a\alpha _k M_0^2}{\varepsilon } \overset{(81)}{=} 1, \end{aligned}$$

i.e., we have \(A_k = aL_k\alpha _k^2\) for all \(k\ge 0\). When \(\nu > 0\) the first inequality in (82) is equivalent to

$$\begin{aligned} A_{k} \ge a^{\frac{1+\nu }{2\nu }}\alpha _{k}^{\frac{1+3\nu }{2\nu }}(\nicefrac {\varepsilon }{2})^{-\frac{1-\nu }{2\nu }}M_\nu ^{\frac{1}{\nu }} \overset{(81)}{\Longleftrightarrow } A_{k} \ge \frac{k^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{1+3\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}, \end{aligned}$$

where the last inequality coincides with the second inequality from (82) that we derived earlier in the proof.

To finish the proof, it remains to derive (84). Again, the proof goes by induction. For \(k=0\) inequality (84) is trivial. Next, we assume that it holds for all \(k \le K\). Then,

$$\begin{aligned} A_{K+1} = A_{K} + \alpha _{K+1} \le \frac{K^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} + \frac{(K+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}. \end{aligned}$$

Let us estimate the right-hand side of the previous inequality. We want to show that

$$\begin{aligned} \frac{K^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} + \frac{(K+1)^{\frac{2\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}}\le & \frac{(K+1)^{\frac{1+3\nu }{1+\nu }}(\nicefrac {\varepsilon }{2})^{\frac{1-\nu }{1+\nu }}}{2^{\frac{2\nu }{1+\nu }}aM_\nu ^{\frac{2}{1+\nu }}} \end{aligned}$$

that is equivalent to the inequality:

$$\begin{aligned} K^{\frac{1+3\nu }{1+\nu }} + (K+1)^{\frac{2\nu }{1+\nu }} \le (K+1)^{\frac{1+3\nu }{1+\nu }}. \end{aligned}$$

This inequality holds because

$$\begin{aligned} K^{\frac{1+3\nu }{1+\nu }} = K^{\frac{2\nu }{1+\nu }}\cdot K \le (K+1)^{\frac{2\nu }{1+\nu }}K, \end{aligned}$$

so that \(K^{\frac{1+3\nu }{1+\nu }} + (K+1)^{\frac{2\nu }{1+\nu }} \le (K+1)^{\frac{2\nu }{1+\nu }}(K+1) = (K+1)^{\frac{1+3\nu }{1+\nu }}\).

That is, (84) holds for \(k = K+1\), and, as a consequence, it holds for all \(k \ge 0\). \(\square \)
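The recursion (81) and the bounds (82), (84) are also easy to check numerically; the sketch below does so for one illustrative choice of \(a, \varepsilon , M_\nu , \nu \) (the values themselves are arbitrary and not taken from the paper).

```python
# Numerical sanity check of Lemma A.3 for illustrative parameter values.
a, eps, M_nu, nu = 2.0, 0.1, 3.0, 0.5
q = (1 - nu) / (1 + nu)          # exponent of eps/2 in (81) and (83)
p = 2 * nu / (1 + nu)            # power of k in alpha_k
s = (1 + 3 * nu) / (1 + nu)      # power of k in the bounds on A_k

A = 0.0
for k in range(1, 200):
    alpha = k**p * (eps / 2)**q / (2**p * a * M_nu**(2 / (1 + nu)))
    A += alpha
    L = (2 * A / (alpha * eps))**q * M_nu**(2 / (1 + nu))
    lower = k**s * (eps / 2)**q / (2**s * a * M_nu**(2 / (1 + nu)))
    upper = k**s * (eps / 2)**q / (2**p * a * M_nu**(2 / (1 + nu)))
    assert A >= a * L * alpha**2 - 1e-12         # first part of (82)
    assert lower - 1e-12 <= A <= upper + 1e-12   # second part of (82) and (84)
print("Lemma A.3 bounds hold for k = 1..199")
```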

Lemma A.4

Let f be convex, have a Hölder continuous gradient on \({\mathbb {R}}^n\) for some \(\nu \in [0,1]\) with constant \(M_\nu > 0\), and let \(x^*\) be a minimizer of f on \({\mathbb {R}}^n\). Then, for all \(x\in {\mathbb {R}}^n\),

$$\begin{aligned} \Vert \nabla f(x)\Vert _2 \le \left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }}M_\nu ^{\frac{1}{1+\nu }} \left( f(x) - f(x^*)\right) ^{\frac{\nu }{1+\nu }}, \end{aligned}$$
(85)

where for \(\nu = 0\) we use \(\left[ \left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }}\right] _{\nu =0}:= \lim _{\nu \rightarrow 0}\left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }} = 1\).

Proof

For \(\nu = 0\), inequality (85) follows from (3) and \(\nabla f(x^*) = 0\) (see Footnote 11). When \(\nu > 0\), for an arbitrary point \(x\in {\mathbb {R}}^n\) we consider the point \(y = x - \alpha \nabla f(x)\), where \(\alpha = \left( \frac{\Vert \nabla f(x)\Vert _2^{1-\nu }}{M_\nu }\right) ^{\frac{1}{\nu }}\).

For the pair of points \(x, y\) we apply (78) and get

$$\begin{aligned} f(y)\le & f(x) + \langle \nabla f(x), y-x\rangle + \frac{M_\nu }{1+\nu }\Vert x-y\Vert _2^{1+\nu }\\= & f(x) - \alpha \Vert \nabla f(x)\Vert _2^2 + \frac{\alpha ^{\nu +1}M_\nu }{1+\nu }\Vert \nabla f(x)\Vert _2^{1+\nu }\\= & f(x) - \frac{\Vert \nabla f(x)\Vert _2^{\frac{1+\nu }{\nu }}}{M_\nu ^{\frac{1}{\nu }}} + \frac{\Vert \nabla f(x)\Vert _2^{\frac{1+\nu }{\nu }}}{(1+\nu )M_\nu ^{\frac{1}{\nu }}} = f(x) - \frac{\nu \Vert \nabla f(x)\Vert _2^{\frac{1+\nu }{\nu }}}{(1+\nu )M_\nu ^{\frac{1}{\nu }}} \end{aligned}$$

implying

$$\begin{aligned} \Vert \nabla f(x)\Vert _2\le & \left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }}M_\nu ^{\frac{1}{1+\nu }} \left( f(x) - f(y)\right) ^{\frac{\nu }{1+\nu }} \\\le & \left( \frac{1+\nu }{\nu }\right) ^{\frac{\nu }{1+\nu }}M_\nu ^{\frac{1}{1+\nu }} \left( f(x) - f(x^*)\right) ^{\frac{\nu }{1+\nu }}. \end{aligned}$$

\(\square \)
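For \(\nu = 1\), inequality (85) reduces to the classical bound \(\Vert \nabla f(x)\Vert _2 \le \sqrt{2M_1(f(x)-f(x^*))}\); the sketch below checks it numerically on a random quadratic, where \(M_1\) is the largest eigenvalue of the Hessian (an illustrative example of ours, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)
# Quadratic f(x) = 0.5 * x^T A x with A symmetric PSD; grad f is Lipschitz
# with constant M_1 = lambda_max(A), and x* = 0, f(x*) = 0.
B = rng.standard_normal((5, 5))
A = B @ B.T
M1 = np.linalg.eigvalsh(A).max()

for _ in range(1000):
    x = rng.standard_normal(5)
    f_gap = 0.5 * x @ A @ x            # f(x) - f(x*)
    grad_norm = np.linalg.norm(A @ x)  # ||grad f(x)||_2
    # Inequality (85) with nu = 1: ||grad f(x)||_2 <= sqrt(2 M_1 (f(x) - f(x*)))
    assert grad_norm <= np.sqrt(2 * M1 * f_gap) + 1e-9
print("Inequality (85) verified on a random quadratic (nu = 1)")
```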

Lemma A.5

Let f be convex, have a Hölder continuous gradient on \({\mathbb {R}}^n\) for some \(\nu \in [0,1]\) with constant \(M_\nu > 0\), and let \(x^*\) be a minimizer of f on \({\mathbb {R}}^n\). Then, for all \(x\in {\mathbb {R}}^n\) and all \(\delta >0\),

$$\begin{aligned} \Vert \nabla f(x)\Vert _2^2 \le 2\left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu }}M_{\nu }^{\frac{2}{1+\nu }}\left( f(x)-f(x^*)\right) + \delta ^{\frac{2\nu }{1+\nu }} M_{\nu }^{\frac{2}{1+\nu }}. \end{aligned}$$
(86)

Proof

For a given \(\delta > 0\), we consider an arbitrary point \(x\in {\mathbb {R}}^n\) and \(y = x - \frac{1}{L(\delta ,\nu )}\nabla f(x)\), where \(L(\delta ,\nu ) = \left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu }}M_\nu ^{\frac{2}{1+\nu }}\).

For the pair of points \(x, y\) we apply (79) and get

$$\begin{aligned} f(y)\le & f(x) + \langle \nabla f(x), y-x \rangle + \frac{L(\delta ,\nu )}{2} \Vert x-y\Vert _2^{2} + \frac{\delta }{2}\\= & f(x) - \frac{1}{2L(\delta ,\nu )}\Vert {\nabla f(x)}\Vert _2^2 + \frac{\delta }{2} \end{aligned}$$

implying

$$\begin{aligned} \Vert \nabla f(x)\Vert _2^2\le & 2L(\delta ,\nu )\left( f(x) - f(y)\right) + \delta L(\delta , \nu )\\\le & 2\left( \frac{1}{\delta }\right) ^{\frac{1-\nu }{1+\nu }}M_{\nu }^{\frac{2}{1+\nu }}\left( f(x)-f(x^*)\right) + \delta ^{\frac{2\nu }{1+\nu }} M_{\nu }^{\frac{2}{1+\nu }}. \end{aligned}$$

\(\square \)

Additional Experimental Details and Results

1.1 Experiments on Synthetic Data

1.1.1 Hyper-parameters Tuning

We grid-searched the hyper-parameters for each method. For all methods, we considered batch sizes from \({\{5, 10, 20, 50, 100, 200\}}\) and stepsizes \(lr\in [1\textrm{e}{-5},1\textrm{e}{-1}]\). As for method-specific parameters:

  • for Adam, we grid-searched over \(betas\in (\{0.8, 0.9, 0.95, 0.99\}, \{0.9, 0.99, 0.999\})\),

  • for SGD — over \(momentum\in \{0.8, 0.9, 0.99, 0.999\}\),

  • for clipped-SSTM — over the clipping parameter \(B\in {\{1\textrm{e}{-0},1\textrm{e}{-1}, 1\textrm{e}{-2}, 1\textrm{e}{-3}\}}\),

  • for clipped-SGD — over \(momentum\in \{0.8, 0.9, 0.99, 0.999\}\) and the clipping parameter \(B\in {\{1\textrm{e}{-0},1\textrm{e}{-1}, 1\textrm{e}{-2}, 1\textrm{e}{-3}\}}\).

For clipped-SSTM, we additionally used \(\nu =1\) and norm clipping (we did not grid-search over these choices extensively; however, in our experiments on real data, they were the best). For clipped-SGD, we used coordinate-wise clipping; the two clipping variants are sketched below.
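For reference, the two clipping variants mentioned above can be written as follows. This is a minimal sketch of the standard operators (our own illustration, not the implementation from the linked repository).

```python
import numpy as np

def clip_norm(g, lam):
    """Norm clipping: rescale g so that its l2-norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def clip_coordwise(g, lam):
    """Coordinate-wise clipping: clamp every coordinate of g to [-lam, lam].
    The clipped vector then has l2-norm at most sqrt(n) * lam, which is the
    source of the dimension factor mentioned in footnote 10."""
    return np.clip(g, -lam, lam)
```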

For Adam, clipped-SSTM and clipped-SGD the best parameters for each p were approximately the same:

  • Adam: \(lr=1\textrm{e}{-3}\), \(betas=(0.9, 0.9)\) and batch size of 10

  • clipped-SSTM: \(lr=1\textrm{e}{-3}\), \(\nu = 1\), \(B=1\textrm{e}{-2}\), norm clipping and a batch size of 5

  • clipped-SGD: \(lr=1\textrm{e}{-3}\) and \(B=1\textrm{e}{-1}\) or \(lr=1\textrm{e}{-2}\) and \(B=1\textrm{e}{-2}\), \(momentum=0.8\), coordinate-wise clipping and a batch size of 5

1.1.2 Comparison w.r.t. Certain Relative Train Loss Level

In Fig. 2, we reported the performance of the methods in terms of the best models w.r.t. the achieved train loss. However, it is also interesting to compare the methods w.r.t. how quickly they reach a certain level (2.0) of the relative train loss \(f_p(x_{\text {pred}})/f_p(x_{\text {true}})\). This is a valid metric, since \(f_p(x_{\text {true}})\) is non-zero after adding noise to the train part of the dataset, and \(x_{\text {true}}\) is still a good approximation of the optimal solution. The results are presented in Fig. 5. As in the previous set of experiments, one can see that clipped-SSTM outperforms the other algorithms and reaches this 2.0 level of relative loss much faster, though it later loses to Adam/clipped-SGD.
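This comparison can be scripted directly from the recorded loss curves; a small illustrative helper (the name and the example trajectory are ours, not from the authors' code) is sketched below.

```python
def first_epoch_at_level(relative_losses, level=2.0):
    """Return the first (1-based) epoch at which the relative train loss
    f_p(x_pred) / f_p(x_true) drops to `level` or below, or None if it never
    does.  Illustrative helper, not taken from the authors' code."""
    for epoch, rel in enumerate(relative_losses, start=1):
        if rel <= level:
            return epoch
    return None

print(first_epoch_at_level([10.3, 4.7, 2.4, 1.9, 1.6]))  # -> 4
```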

Fig. 5: Results for different p: for each method, the first epoch at which the relative train loss reached the 2.0 level (i.e., twice the loss at \(x_{\text {true}}\))

1.2 Neural Networks Training

1.2.1 Hyper-parameters

In our experiments with the training of neural networks, we used the standard implementations of Adam and SGD from PyTorch [34]; we list only the parameters that we changed from the defaults.

To conduct these experiments, we used Nvidia RTX 2070s. The longest experiment (evolution of the noise distribution for the image classification task) took 53 h (we iterated several times over the train dataset to build a better histogram; see Appendix B.2.3).

Image Classification. For ResNet-18 + ImageNet-100 the parameters of the methods were chosen as follows:

  • Adam: \(lr=1e-3\) and a batch size of \(4\times 32\)

  • SGD: \(lr=1e-2\), \(momentum=0.9\) and a batch size of 32

  • clipped-SGD: \(lr=5e-2\), \(momentum=0.9\), coordinate-wise clipping with clipping parameter \(B=0.1\) and a batch size of 32

  • clipped-SSTM: \(\nu = 1\), stepsize parameter \(\alpha = 1e-3\) (in code we use separately \(lr=1e-2\) and \(L=10\) and \(\alpha = \frac{lr}{L}\)), norm clipping with clipping parameter \(B=1\) and a batch size of \(2\times 32\). We also upper bounded the ratio \(\nicefrac {A_k}{A_{k+1}}\) by 0.99 (see \(a\_k\_ratio\_upper\_bound\) parameter in code)

Fig. 6: Train and validation loss and accuracy for clipped-SSTM with different parameters. Here \(\alpha _0 = 0.000125\) and bs denotes the batch size. As the plots show, increasing \(\alpha \) by a factor of 4 and the batch size by a factor of 2 leaves the method's behavior almost unchanged

Fig. 7: Evolution of the noise distribution for the BERT + CoLA task

Fig. 8: Evolution of the noise distribution for the ResNet-18 + ImageNet-100 task

The two main parameters that we grid-searched were lr and the batch size. For both of them, we used a logarithmic grid (e.g., for the lr of Adam we used \(1e-5,2e-5,5e-5,1e-4,\ldots ,1e-2,2e-2,5e-2\)). The batch size was chosen from \(32, 2\cdot 32, 4\cdot 32\), and \(8\cdot 32\). For SGD, we also tried various momentum parameters.

For clipped-SSTM and clipped-SGD, we used clipping levels of 1 and 0.1, respectively. Too small a choice of the clipping level, e.g. 0.01, slows down the convergence significantly.

Another important parameter for clipped-SSTM here was \(a\_k\_ratio\_upper\_bound\), which we used to upper bound the ratio \(\nicefrac {A_k}{A_{k+1}}\). Without this modification, the method is too conservative: e.g., after \(10^4\) steps, \(\nicefrac {A_k}{A_{k+1}}\approx 0.9999\). Effectively, this ratio plays the role of the momentum parameter of SGD.
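To make the role of this parameter concrete, here is a schematic of the averaging step of an SSTM-style method with the ratio cap. This is an illustrative sketch under our reading of the algorithm; the actual implementation is in the linked repository and may differ in details.

```python
def capped_averaging_step(y_k, z_next, A_k, A_next, ratio_cap=0.99):
    """Schematic averaging y^{k+1} = r * y^k + (1 - r) * z^{k+1} with the
    coefficient r = A_k / A_{k+1} capped at `ratio_cap`.  Without the cap,
    r approaches 1 (about 0.9999 after 1e4 steps), so new information enters
    very slowly; capping r at 0.99 makes it act like a momentum parameter."""
    r = min(A_k / A_next, ratio_cap)
    return r * y_k + (1.0 - r) * z_next
```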

Text Classification, CoLA. For BERT + CoLA the parameters of the methods were chosen as follows:

  • Adam: \(lr=5e-5\), \(weight\_decay=5e-4\) and a batch size of 32

  • SGD: \(lr=1e-3\), \(momentum=0.9\) and a batch size of 32

  • clipped-SSTM: \(\nu = 1\), stepsize parameter \(\alpha = 8e-3\), norm clipping with clipping parameter \(B=1\) and a batch size of \(8\times 32\)

  • clipped-SGD: \(lr=2e-3\), \(momentum=0.9\), coordinate-wise clipping with clipping parameter \(B=0.1\) and a batch size of 32

There, we used the same grid as in the previous task. The main difference is that we did not bound the \(A_k/A_{k+1}\) ratio for clipped-SSTM: since the batch size is \(8\cdot 32\), the method performs only \(\approx 300\) steps and thus does not become too conservative.

1.2.2 On the Relation Between the Stepsize Parameter \(\alpha \) and the Batch Size

In our experiments, we noticed that clipped-SSTM shows similar results when the ratio \(\nicefrac {bs^2}{\alpha }\) is kept unchanged, where bs is batch size (see Fig. 6). We compare the performance of clipped-SSTM with 4 different choices of \(\alpha \) and the batch size.

Theorem 4.1 explains this phenomenon in the convex case. For the case of \(\nu = 1\) we have (from (24) and (30)):

$$\begin{aligned} \alpha \sim \frac{1}{aM_1},\quad \alpha _k \sim k\alpha ,\quad m_k \sim \frac{N a \sigma ^2 \alpha _{k+1}^2}{C^2R_0^2\ln \frac{4N}{\beta }},\quad N \sim \frac{a^{\frac{1}{2}}CR_0M_1^{\frac{1}{2}}}{\varepsilon ^{\frac{1}{2}}}\sim \frac{CR_0}{\alpha ^{\frac{1}{2}}\varepsilon ^{\frac{1}{2}}}, \end{aligned}$$

whence

$$\begin{aligned} m_k \sim \frac{CR_0 a \sigma ^2 \alpha ^2(k+1)^2}{\alpha ^{\frac{1}{2}}\varepsilon ^{\frac{1}{2}}C^2R_0^2\ln \frac{4N}{\beta }} \sim \frac{\sigma ^2 \alpha ^2(k+1)^2}{\alpha ^{\frac{1}{2}}\alpha M_1\varepsilon ^{\frac{1}{2}}CR_0\ln \frac{4N}{\beta }}\sim \alpha ^{\frac{1}{2}}, \end{aligned}$$

where the dependencies on numerical constants and logarithmic factors are omitted. Therefore, the observed empirical relation between batch size (\(m_k\)) and \(\alpha \) correlates well with the established theoretical results for clipped-SSTM.
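Setting all omitted constants and logarithmic factors to 1, the proportionalities above predict \(m_k \sim \alpha ^{1/2}\), so multiplying \(\alpha \) by 4 multiplies the predicted batch size by 2, which matches the pattern in Fig. 6. The snippet below performs only this bookkeeping; all constants are placeholders of ours.

```python
def predicted_batch_size(alpha, k=10, sigma=1.0, M1=1.0, eps=1.0, C=1.0, R0=1.0):
    """m_k up to numerical constants and log factors, using a ~ 1/(alpha * M1)
    and the proportionalities displayed above (placeholder constants)."""
    return sigma**2 * alpha**2 * (k + 1)**2 / (
        alpha**0.5 * alpha * M1 * eps**0.5 * C * R0)

print(predicted_batch_size(4e-3) / predicted_batch_size(1e-3))  # -> ~2.0
```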

1.2.3 Evolution of the Noise Distribution

In this section, we provide our empirical study of the evolution of the noise distribution along the trajectories of different optimizers. As one can see from the plots in Figs. 7 and 8, the noise distribution for the ResNet-18 + ImageNet-100 task is always close to a Gaussian distribution, whereas for the BERT + CoLA task it is significantly heavy-tailed.
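For readers who want to reproduce such histograms, one simple recipe is to fix the current iterate, compute the full-batch gradient, and then record the norms of the deviations of mini-batch gradients from it. The sketch below (PyTorch-style, with all names ours) follows this recipe; the authors' measurement procedure may differ in details.

```python
import torch

def gradient_noise_norms(model, loss_fn, full_loader, batch_loader):
    """Collect ||g_batch - g_full||_2 at the current iterate of `model`.
    Sketch only: `model`, `loss_fn` and the two data loaders are assumed to be
    defined elsewhere."""
    def flat_grad(loader):
        model.zero_grad()
        for xb, yb in loader:
            # Average of per-batch losses approximates the full-batch loss.
            (loss_fn(model(xb), yb) / len(loader)).backward()
        return torch.cat([p.grad.detach().flatten()
                          for p in model.parameters() if p.grad is not None])

    g_full = flat_grad(full_loader)
    norms = []
    for xb, yb in batch_loader:
        model.zero_grad()
        loss_fn(model(xb), yb).backward()
        g = torch.cat([p.grad.detach().flatten()
                       for p in model.parameters() if p.grad is not None])
        norms.append(torch.norm(g - g_full).item())
    return norms  # histogram these values to visualize the noise distribution
```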

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gorbunov, E., Danilova, M., Shibaev, I. et al. High-Probability Complexity Bounds for Non-smooth Stochastic Convex Optimization with Heavy-Tailed Noise. J Optim Theory Appl 203, 2679–2738 (2024). https://doi.org/10.1007/s10957-024-02533-z
