
Stochastic optimization with momentum: convergence, fluctuations, and traps avoidance. (English) Zbl 1471.62442

Summary: In this paper, a general stochastic optimization procedure is studied, unifying several variants of stochastic gradient descent such as, among others, the stochastic heavy ball method, the Stochastic Nesterov Accelerated Gradient algorithm (S-NAG), and the widely used Adam algorithm. The algorithm is viewed as a noisy Euler discretization of a non-autonomous ordinary differential equation, recently introduced by Belotto da Silva and Gazeau, which is analyzed in depth. Assuming that the objective function is non-convex and differentiable, the stability and the almost sure convergence of the iterates to the set of critical points are established. A noteworthy special case is the convergence proof of S-NAG in a non-convex setting. Under additional assumptions, a convergence rate is provided in the form of a Central Limit Theorem. Finally, the non-convergence of the algorithm to undesired critical points, such as local maxima or saddle points, is established. Here, the main ingredient is a new avoidance-of-traps result for non-autonomous settings, which is of independent interest.
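
To illustrate the kind of scheme the paper unifies, the following minimal sketch (in NumPy) performs one noisy Euler step of a generic momentum method with an adaptive preconditioner. The function name generic_momentum_step, the coefficient functions h, r, p, q, the step-size schedule, and the toy objective are illustrative assumptions, not the authors' exact parameterization of the underlying ODE.

```python
import numpy as np

def generic_momentum_step(theta, m, v, grad, gamma, k,
                          h=lambda k: 1.0, r=lambda k: 1.0,
                          p=lambda k: 1.0, q=lambda k: 1.0,
                          eps=1e-8):
    """One noisy Euler step of a generic momentum/adaptive scheme (illustrative).

    theta : current iterate; m, v : first- and second-moment estimates;
    grad  : a stochastic gradient evaluated at theta; gamma : step size;
    h, r, p, q : time-dependent coefficients (hypothetical names) whose
    particular choices yield heavy-ball-, Nesterov-, or Adam-type updates.
    """
    m_new = m + gamma * h(k) * (grad - r(k) * m)               # momentum (first-moment) update
    v_new = v + gamma * p(k) * (grad ** 2 - q(k) * v)          # second-moment (preconditioner) update
    theta_new = theta - gamma * m_new / np.sqrt(v_new + eps)   # Euler step on the iterate
    return theta_new, m_new, v_new

# Toy usage: noisy gradients of the non-convex objective f(x) = (x^2 - 1)^2.
rng = np.random.default_rng(0)
theta, m, v = np.array([2.0]), np.zeros(1), np.zeros(1)
for k in range(1, 5001):
    g = 4.0 * theta * (theta ** 2 - 1.0) + 0.1 * rng.standard_normal(1)
    theta, m, v = generic_momentum_step(theta, m, v, g, gamma=k ** -0.6, k=k)
print(theta)  # the iterate should settle near a critical point of f (here +1 or -1)
```

Different choices of the coefficients h, r, p, q move this update between heavy-ball-type and Adam-type behavior, in line with the unifying viewpoint described in the summary.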

MSC:

62L20 Stochastic approximation
34A12 Initial value problems, existence, uniqueness, continuous dependence and continuation of solutions to ordinary differential equations
60F99 Limit theorems in probability theory
68T99 Artificial intelligence

Software:

RMSprop; Adam; AdaGrad

References:

[1] Alacaoglu, A., Malitsky, Y. and Cevher, V. (2020). Convergence of adaptive algorithms for weakly convex constrained optimization. arXiv preprint arXiv:2006.06650.
[2] Alacaoglu, A., Malitsky, Y., Mertikopoulos, P. and Cevher, V. (2020). A new regret analysis for Adam-type algorithms. In Proceedings of the 37th International Conference on Machine Learning (H. Daumé III and A. Singh, eds.). Proceedings of Machine Learning Research 119 202-210.
[3] Alvarez, F. (2000). On the minimizing property of a second order dissipative system in Hilbert spaces. SIAM Journal on Control and Optimization 38 1102-1119. · Zbl 0954.34053
[4] Assran, M. and Rabbat, M. (2020). On the convergence of Nesterov’s accelerated gradient method in stochastic settings. In Proceedings of the 37th International Conference on Machine Learning (H. Daumé III and A. Singh, eds.). Proceedings of Machine Learning Research 119 410-420.
[5] Attouch, H., Goudou, X. and Redont, P. (2000). The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics 2 1-34. · Zbl 0983.37016
[6] Attouch, H., Chbani, Z., Peypouquet, J. and Redont, P. (2018). Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming 168 123-175. · Zbl 1395.34068
[7] Aujol, J. F., Dossal, C. and Rondepierre, A. (2019). Optimal convergence rates for Nesterov acceleration. SIAM Journal on Optimization 29 3131-3153. · Zbl 1453.90117 · doi:10.1137/18M1186757
[8] Barakat, A. and Bianchi, P. (2021). Convergence and dynamical behavior of the ADAM algorithm for nonconvex stochastic optimization. SIAM Journal on Optimization 31 244-274. · Zbl 1465.90050
[9] Belotto da Silva, A. and Gazeau, M. (2020). A general system of differential equations to model first-order adaptive algorithms. Journal of Machine Learning Research 21 1-42. · Zbl 1517.65053
[10] Benaïm, M. (1999). Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités, XXXIII. Lecture Notes in Math. 1709 1-68. Springer, Berlin. · Zbl 0955.62085 · doi:10.1007/BFb0096509
[11] Benaïm, M. and Hirsch, M. W. (1996). Asymptotic pseudotrajectories and chain recurrent flows, with applications. J. Dynam. Differential Equations 8 141-176. · Zbl 0878.58053 · doi:10.1007/BF02218617
[12] Brandière, O. and Duflo, M. (1996). Les algorithmes stochastiques contournent-ils les pièges? Ann. Inst. H. Poincaré Probab. Statist. 32 395-427. · Zbl 0849.62043
[13] Cabot, A., Engler, H. and Gadat, S. (2009). On the long time behavior of second order differential equations with asymptotically small dissipation. Transactions of the American Mathematical Society 361 5983-6017. · Zbl 1191.34078
[14] Chen, J., Zhou, D., Tang, Y., Yang, Z. and Gu, Q. (2018). Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763.
[15] Chen, X., Liu, S., Sun, R. and Hong, M. (2019). On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations.
[16] Dalec′kiĭ, J. L. and Kreĭn, M. G. (1974). Stability of Solutions of Differential Equations in Banach Space. American Mathematical Society, Providence, R.I. Translated from the Russian by S. Smith, Translations of Mathematical Monographs, Vol. 43. · Zbl 0286.34094
[17] De, S., Mukherjee, A. and Ullah, E. (2018). Convergence guarantees for RMSProp and ADAM in non-convex optimization and their comparison to Nesterov acceleration on autoencoders. arXiv preprint arXiv:1807.06766.
[18] Défossez, A., Bottou, L., Bach, F. and Usunier, N. (2020). A simple convergence proof of Adam and Adagrad. arXiv preprint arXiv:2003.02395.
[19] Delyon, B., Lavielle, M. and Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics 27 94-128. · Zbl 0932.62094
[20] Du, S. S., Jin, C., Lee, J. D., Jordan, M. I., Singh, A. and Poczos, B. (2017). Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett, eds.) 1067-1077. Curran Associates, Inc.
[21] Duchi, J., Hazan, E. and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 2121-2159. · Zbl 1280.68164
[22] Gadat, S. and Gavra, I. (2020). Asymptotic study of stochastic adaptive algorithm in non-convex landscape. arXiv preprint arXiv:2012.05640.
[23] Gadat, S., Panloup, F. and Saadane, S. (2018). Stochastic heavy ball. Electron. J. Stat. 12 461-529. · Zbl 1392.62244 · doi:10.1214/18-EJS1395
[24] Haraux, A. (1991). Systèmes Dynamiques Dissipatifs et Applications 17. Masson.
[25] Hartman, P. (2002). Ordinary Differential Equations, Second ed. Society for Industrial and Applied Mathematics. · Zbl 1009.34001 · doi:10.1137/1.9780898719222
[26] Horn, R. A. and Johnson, C. R. (1994). Topics in Matrix Analysis. Cambridge University Press, Cambridge. Corrected reprint of the 1991 original. · Zbl 0729.15001
[27] Jin, C., Ge, R., Netrapalli, P., Kakade, S. M. and Jordan, M. I. (2017). How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning (D. Precup and Y. W. Teh, eds.). Proceedings of Machine Learning Research 70 1724-1732. PMLR.
[28] Karatzas, I. and Shreve, S. E. (1991). Brownian Motion and Stochastic Calculus, second ed. Springer-Verlag. · Zbl 0734.60060
[29] Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[30] Kloeden, P. E. and Rasmussen, M. (2011). Nonautonomous Dynamical Systems. Mathematical Surveys and Monographs 176. American Mathematical Society, Providence, RI. · Zbl 1244.37001 · doi:10.1090/surv/176
[31] Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I. and Recht, B. (2019). First-order methods almost always avoid strict saddle points. Math. Program. 176 311-337. · Zbl 1415.90089 · doi:10.1007/s10107-019-01374-3
[32] Mai, V. V. and Johansson, M. (2020). Convergence of a stochastic gradient method with momentum for nonsmooth nonconvex optimization. In Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research 119. PMLR.
[33] Mertikopoulos, P., Hallak, N., Kavis, A. and Cevher, V. (2020). On the almost sure convergence of stochastic gradient descent in non-convex problems. In Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan and H. Lin, eds.) 33 1117-1128. Curran Associates, Inc.
[34] Métivier, M. and Priouret, P. (1987). Théorèmes de convergence presque sûre pour une classe d’algorithmes stochastiques à pas décroissant. Probability Theory and Related Fields 74 403-428. · Zbl 0588.62153 · doi:10.1007/BF00699098
[35] Panageas, I. and Piliouras, G. (2017). Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. In ITCS. · Zbl 1402.90210
[36] Panageas, I., Piliouras, G. and Wang, X. (2019). First-order methods almost always avoid saddle points: The case of vanishing step-sizes. In Advances in Neural Information Processing Systems 32 6474-6483.
[37] Pelletier, M. (1998). Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability 8 10-44. · Zbl 0965.62065
[38] Pemantle, R. (1990). Nonconvergence to unstable points in urn models and stochastic approximations. Ann. Probab. 18 698-712. · Zbl 0709.60054
[39] Pötzsche, C. and Rasmussen, M. (2006). Taylor approximation of integral manifolds. J. Dynam. Differential Equations 18 427-460. · Zbl 1106.34029 · doi:10.1007/s10884-006-9011-8
[40] Robbins, H. and Siegmund, D. (1971). A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics 233-257. Academic Press, New York. · Zbl 0286.60025
[41] Su, W., Boyd, S. and Candès, E. J. (2016). A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17 Paper No. 153, 43. · Zbl 1391.90667
[42] Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning 26-31.
[43] Yan, Y., Yang, T., Li, Z., Lin, Q. and Yang, Y. (2018). A unified analysis of stochastic momentum methods for deep learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence 2955-2961.
[44] Zaheer, M., Reddi, S., Sachan, D., Kale, S. and Kumar, S. (2018). Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems 9793-9803.
[45] Zhou, D., Tang, Y., Yang, Z., Cao, Y. and Gu, Q. (2018). On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671.
[46] Zou, F., Shen, L., Jie, Z., Zhang, W. and Liu, W. (2019). A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 11127-11135.