
Convergence and dynamical behavior of the ADAM algorithm for nonconvex stochastic optimization. (English) Zbl 1465.90050

Summary: Adam is a popular variant of stochastic gradient descent for finding a local minimizer of a function. In the constant stepsize regime, assuming that the objective function is differentiable and nonconvex, we establish the long-run convergence of the iterates to a stationary point under a stability condition. The key ingredient is the introduction of a continuous-time version of Adam, in the form of a nonautonomous ordinary differential equation. This continuous-time system is a relevant approximation of the Adam iterates, in the sense that the interpolated Adam process converges weakly toward the solution to the ODE. The existence and uniqueness of this solution are established. We further show the convergence of the solution toward the critical points of the objective function and quantify its convergence rate under a Łojasiewicz assumption. Then, we introduce a novel decreasing stepsize version of Adam. Under mild assumptions, it is shown that the iterates are almost surely bounded and converge almost surely to critical points of the objective function. Finally, we analyze the fluctuations of the algorithm by means of a conditional central limit theorem.
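
For reference, the discrete recursion analyzed in the constant stepsize regime is the standard Adam update of Kingma and Ba. The following minimal Python sketch spells it out; the function name adam_constant_step, the default hyperparameters, and the grad oracle are illustrative choices rather than the paper's notation, and the paper's decreasing stepsize variant and continuous-time ODE are not reproduced here.

import numpy as np

def adam_constant_step(grad, x0, alpha=1e-3, beta1=0.9, beta2=0.999,
                       eps=1e-8, n_iter=1000):
    # Constant-stepsize Adam recursion (Kingma & Ba, 2015).
    # `grad(x)` is assumed to return a stochastic gradient estimate
    # of the objective at x (a noisy oracle, as in the paper's setting).
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first-moment (momentum) estimate
    v = np.zeros_like(x)   # second-moment estimate
    for t in range(1, n_iter + 1):
        g = grad(x)                            # stochastic gradient sample
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)           # bias correction of the moments
        v_hat = v / (1.0 - beta2**t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

The decreasing stepsize version studied in the paper replaces the constant alpha by a suitable vanishing sequence; the precise schedule and assumptions are given in the paper itself.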

MSC:

90C15 Stochastic programming
90C26 Nonconvex programming, global optimization
62L20 Stochastic approximation
60F05 Central limit and other weak theorems
65K05 Numerical mathematical programming methods
34A12 Initial value problems, existence, uniqueness, continuous dependence and continuation of solutions to ordinary differential equations
37C60 Nonautonomous smooth dynamical systems

Software:

Adam; RMSprop; AdaGrad
