
Convergence of gradient algorithms for nonconvex \(C^{1+\alpha}\) cost functions. (English) Zbl 07708673

Summary: This paper is concerned with the convergence of stochastic gradient algorithms with momentum terms in the nonconvex setting. A class of stochastic momentum methods, including stochastic gradient descent, the heavy ball method and Nesterov’s accelerated gradient, is analyzed in a general framework under mild assumptions. Building on the convergence of the expected gradients, the authors prove almost sure convergence by a detailed analysis of the effects of the momentum term and of the number of upcrossings. It is worth noting that no additional restrictions are imposed on the objective function or the stepsize. Another improvement over previous results is that the usual Lipschitz condition on the gradient is relaxed to Hölder continuity. As a byproduct, the authors apply a localization procedure to extend the results to stochastic stepsizes.
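For orientation, the class of methods analyzed can be written schematically as the momentum recursion below; the notation is chosen here for illustration and is not necessarily that of the paper.

\[
y_k = x_k + \nu\,(x_k - x_{k-1}), \qquad x_{k+1} = x_k - \gamma_k\, g(y_k, \xi_k) + \mu\,(x_k - x_{k-1}),
\]

where \(g(y,\xi)\) is an unbiased stochastic estimate of \(\nabla f(y)\), \(\gamma_k\) is the stepsize and \(\mu, \nu \in [0,1)\) are momentum parameters: \(\mu = \nu = 0\) recovers stochastic gradient descent, \(\nu = 0\) the stochastic heavy ball method, and \(\nu = \mu\) Nesterov’s accelerated gradient. The relaxed smoothness assumption is \(\alpha\)-Hölder continuity of the gradient, \(\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|^{\alpha}\) for some \(\alpha \in (0,1]\), which reduces to the usual Lipschitz condition when \(\alpha = 1\).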

MSC:

62L20 Stochastic approximation
90C26 Nonconvex programming, global optimization

Software:

AdaGrad; Adam
