Adaptive step size rules for stochastic optimization in large-scale learning. (English) Zbl 1516.62033

Summary: The importance of the step size in stochastic optimization has been confirmed both theoretically and empirically over the past few decades, and it has been reconsidered in recent years, especially for large-scale learning. Various rules for selecting the step size have been discussed since the advent of stochastic approximation methods. The first part of this work reviews several representative techniques for setting the step size, covering heuristic rules, meta-learning procedures, adaptive step size techniques and line search techniques. The second part proposes a novel class of accelerated stochastic optimization methods based on the Barzilai-Borwein (BB) technique with a diagonal selection rule for the metric, termed DBB. We first explore the theoretical and empirical properties of variance-reduced stochastic optimization algorithms with DBB; in particular, we study the resulting method in the strongly convex and the non-convex case, respectively. To further demonstrate the efficacy of the DBB step size schedule, we extend it to more general stochastic optimization methods and develop the theoretical and empirical properties of this setting as well. Extensive numerical results on machine learning problems are provided, suggesting that the proposed algorithms show much promise.
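For orientation, the scalar Barzilai-Borwein step sizes of [5], which underlie the diagonal rule studied here, are computed from successive iterates and gradients. A minimal sketch (stated with deterministic gradients for simplicity, and not necessarily the exact DBB rule of the paper) is
\[ s_{k-1} = x_k - x_{k-1}, \qquad y_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1}), \]
\[ \eta_k^{\mathrm{BB1}} = \frac{s_{k-1}^{\top} s_{k-1}}{s_{k-1}^{\top} y_{k-1}}, \qquad \eta_k^{\mathrm{BB2}} = \frac{s_{k-1}^{\top} y_{k-1}}{y_{k-1}^{\top} y_{k-1}}. \]
In stochastic variants such as [52], the gradients are replaced by (variance-reduced) stochastic estimates, while diagonal-metric versions, as in [40] and the DBB rule proposed here, replace the scalar \(\eta_k\) by a diagonal matrix determined from an analogous secant-type condition.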

MSC:

62-08 Computational methods for problems pertaining to statistics
Full Text: DOI

References:

[1] Antoniadis, A.; Gijbels, I.; Nikolova, M., Penalized likelihood regression for generalized linear models with non-quadratic penalties, Ann. Inst. Stat. Math., 63, 585-615 (2011) · Zbl 1333.62113 · doi:10.1007/s10463-009-0242-4
[2] Asi, H.; Duchi, JC, Stochastic (approximate) proximal point methods: convergence, optimality, and adaptivity, SIAM J. Optim., 29, 2257-2290 (2019) · Zbl 07105236 · doi:10.1137/18M1230323
[3] Bach, F., Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression, J. Mach. Learn. Res., 15, 595-627 (2014) · Zbl 1318.62224
[4] Baker, J.; Fearnhead, P.; Fox, EB; Nemeth, C., Control variates for stochastic gradient mcmc, Stat. Comput., 29, 599-615 (2019) · Zbl 1430.62265 · doi:10.1007/s11222-018-9826-2
[5] Barzilai, J.; Borwein, JM, Two-point step size gradient methods, IMA J. Numer. Anal., 8, 141-148 (1988) · Zbl 0638.65055 · doi:10.1093/imanum/8.1.141
[6] Baydin, A.G., Cornish, R., Rubio, D.M., Schmidt, M., Wood, F.: Online learning rate adaptation with hypergradient descent. In: International Conference on Learning Representations (2018)
[7] Benveniste, A.; Métivier, M.; Priouret, P., Adaptive Algorithms and Stochastic Approximations (2012), Berlin: Springer, Berlin
[8] Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., Anandkumar, A.: signsgd: compressed optimisation for non-convex problems. In: International Conference on Machine Learning, pp. 560-569 (2018)
[9] Borkar, VS, Stochastic Approximation: A Dynamical Systems Viewpoint (2009), Berlin: Springer, Berlin
[10] Byrd, RH; Hansen, SL; Nocedal, J.; Singer, Y., A stochastic quasi-newton method for large-scale optimization, SIAM J. Optim., 26, 1008-1031 (2016) · Zbl 1382.65166 · doi:10.1137/140954362
[11] Chowdhury, AKR; Chellappa, R., Stochastic approximation and rate-distortion analysis for robust structure and motion estimation, Int. J. Comput. Vis., 55, 27-53 (2003) · doi:10.1023/A:1024488407740
[12] Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 1647-1655 (2011)
[13] Crisci, S.; Porta, F.; Ruggiero, V.; Zanni, L., Spectral properties of Barzilai-Borwein rules in solving singly linearly constrained optimization problems subject to lower and upper bounds, SIAM J. Optim., 30, 1300-1326 (2020) · Zbl 1461.65139 · doi:10.1137/19M1268641
[14] Csiba, D., Qu, Z., Richtárik, P.: Stochastic dual coordinate ascent with adaptive probabilities. In: International Conference on Machine Learning, pp. 674-683 (2015)
[15] Delyon, B.; Juditsky, A., Accelerated stochastic approximation, SIAM J. Optim., 3, 868-881 (1993) · Zbl 0801.62071 · doi:10.1137/0803045
[16] Duchi, J.; Hazan, E.; Singer, Y., Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., 12, 2121-2159 (2011) · Zbl 1280.68164
[17] Ekblom, J.; Blomvall, J., Importance sampling in stochastic optimization: an application to intertemporal portfolio choice, Eur. J. Oper. Res., 285, 106-119 (2020) · Zbl 1441.90104 · doi:10.1016/j.ejor.2019.01.013
[18] Ermoliev, Y.: Stochastic quasigradient methods. In: Numerical Techniques for Stochastic Optimization, pp. 141-185. Springer (1988) · Zbl 0666.90072
[19] Fang, C., Li, C.J., Lin, Z., Zhang, T.: SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 687-697 (2018)
[20] Ghadimi, S.; Lan, G., Stochastic first-and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23, 2341-2368 (2013) · Zbl 1295.90026 · doi:10.1137/120880811
[21] Huang, Y.; Liu, H., Smoothing projected Barzilai-Borwein method for constrained non-lipschitz optimization, Comput. Optim. Appl., 65, 671-698 (2016) · Zbl 1357.90117 · doi:10.1007/s10589-016-9854-9
[22] Jacobs, RA, Increased rates of convergence through learning rate adaptation, Neural Netw., 1, 295-307 (1988) · doi:10.1016/0893-6080(88)90003-2
[23] Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315-323 (2013)
[24] Karimireddy, S.P., Rebjock, Q., Stich, S., Jaggi, M.: Error feedback fixes signsgd and other gradient compression schemes. In: International Conference on Machine Learning, pp. 3252-3261 (2019)
[25] Kesten, H., Accelerated stochastic approximation, Ann. Math. Stat., 29, 1, 41-59 (1958) · Zbl 0087.13404 · doi:10.1214/aoms/1177706705
[26] Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
[27] Klein, S.; Pluim, JP; Staring, M.; Viergever, MA, Adaptive stochastic gradient descent optimisation for image registration, Int. J. Comput. Vis., 81, 227 (2009) · Zbl 1477.68513 · doi:10.1007/s11263-008-0168-y
[28] Konečnỳ, J.; Liu, J.; Richtárik, P.; Takáč, M., Mini-batch semi-stochastic gradient descent in the proximal setting, IEEE J. Sel. Top. Signal Process., 10, 242-255 (2016) · doi:10.1109/JSTSP.2015.2505682
[29] Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and katyusha are better without the outer loop. In: Proceedings of the 31st International Conference on Algorithmic Learning Theory, PMLR, vol. 117, pp. 451-467 (2020)
[30] Lei, Y.; Tang, K., Learning rates for stochastic gradient descent with nonconvex objectives, IEEE Trans. Pattern Anal. Mach. Intell., 43, 4505-4511 (2021) · doi:10.1109/TPAMI.2021.3068154
[31] Liang, J.; Xu, Y.; Bao, C.; Quan, Y.; Ji, H., Barzilai-Borwein-based adaptive learning rate for deep learning, Pattern Recogn. Lett., 128, 197-203 (2019) · doi:10.1016/j.patrec.2019.08.029
[32] Loizou, N., Vaswani, S., Laradji, I.H., Lacoste-Julien, S.: Stochastic polyak step-size for sgd: an adaptive learning rate for fast convergence. In: International Conference on Artificial Intelligence and Statistics, pp. 1306-1314. PMLR (2021)
[33] Mahsereci, M.; Hennig, P., Probabilistic line searches for stochastic optimization, J. Mach. Learn. Res., 18, 4262-4320 (2017) · Zbl 1441.90110
[34] Mokhtari, A., Ribeiro, A.: Stochastic quasi-Newton methods. In: Proceedings of the IEEE (2020)
[35] Nesterov, Y., Introductory Lectures on Convex Optimization: Basic Course (2004), Dordrecht: Kluwer Academic, Dordrecht · Zbl 1086.90045 · doi:10.1007/978-1-4419-8853-9
[36] Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning-Volume 70, pp. 2613-2621 (2017)
[37] Nguyen, LM; Tran-Dinh, Q.; Phan, DT; Nguyen, PH; Van Dijk, M., A unified convergence analysis for shuffling-type gradient methods, J. Mach. Learn. Res., 22, 9397-9440 (2021) · Zbl 07626722
[38] Nitanda, A.: Stochastic proximal gradient descent with acceleration techniques. In: Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 1, pp. 1574-1582 (2014)
[39] Paquette, C.; Scheinberg, K., A stochastic line search method with expected complexity analysis, SIAM J. Optim., 30, 349-376 (2020) · Zbl 1431.90153 · doi:10.1137/18M1216250
[40] Park, Y., Dhar, S., Boyd, S., Shah, M.: Variable metric proximal gradient method with diagonal Barzilai-Borwein stepsize. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3597-3601. IEEE (2020)
[41] Plagianakos, V.P., Magoulas, G.D., Vrahatis, M.N.: Learning rate adaptation in stochastic gradient descent. In: Advances in Convex Analysis and Global Optimization: Honoring the Memory of C. Caratheodory (1873-1950), pp. 433-444 (2001) · Zbl 1049.90139
[42] Reddi, S.J., Kale, S., Kumar, S.: On the convergence of ADAM and beyond. In: International Conference on Learning Representations (2018)
[43] Robbins, H.; Monro, S., A stochastic approximation method, Ann. Math. Stat., 22, 3, 400-407 (1951) · Zbl 0054.05901 · doi:10.1214/aoms/1177729586
[44] Roux, N. L., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 2, pp. 2663-2671 (2012)
[45] Saridis, GN, Learning applied to successive approximation algorithms, IEEE Trans. Syst. Sci. Cybern., 6, 97-103 (1970) · Zbl 0198.49101 · doi:10.1109/TSSC.1970.300282
[46] Schaul, T., Zhang, S., LeCun, Y.: No more pesky learning rates. In: International Conference on Machine Learning, pp. 343-351 (2013)
[47] Schmidt, M., Babanezhad, R., Ahemd, M., Clifton, A., Sarkar, A.: Non-uniform stochastic average gradient method for training conditional random fields. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR, vol. 38, pp. 819-828 (2015)
[48] Schraudolph, N.: Local gain adaptation in stochastic gradient descent. In: Proceedings of ICANN, pp. 569-574. IEE (1999)
[49] Sebag, A., Schoenauer, M., Sebag, M.: Stochastic gradient descent: going as fast as possible but not faster. In: OPTML 2017: 10th NIPS Workshop on Optimization for Machine Learning, pp. 1-8 (2017)
[50] Shao, S.; Yip, PP, Rates of convergence of adaptive step-size of stochastic approximation algorithms, J. Math. Anal. Appl., 244, 333-347 (2000) · Zbl 0960.65016 · doi:10.1006/jmaa.2000.6703
[51] Sopyła, K.; Drozda, P., Stochastic gradient descent with Barzilai-Borwein update step for SVM, Inf. Sci., 316, 218-233 (2015) · Zbl 1390.68555 · doi:10.1016/j.ins.2015.03.073
[52] Tan, C., Ma, S., Dai, Y. H., Qian, Y.: Barzilai-Borwein step size for stochastic gradient descent. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 685-693 (2016)
[53] Tieleman, T., Hinton, G.: Rmsprop: Divide the gradient by a running average of its recent magnitude. In: COURSERA: Neural Networks for Machine Learning, vol. 4(2), pp. 26-31 (2012)
[54] Toulis, P.; Airoldi, EM, Scalable estimation strategies based on stochastic approximations: classical results and new insights, Stat. Comput., 25, 781-795 (2015) · Zbl 1332.62291 · doi:10.1007/s11222-015-9560-y
[55] Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., Lacoste-Julien, S.: Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3732-3745 (2019)
[56] Wang, Z., Zhou, Y., Liang, Y., Lan, G.: Cubic regularization with momentum for nonconvex optimization. In: Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, PMLR, vol. 115, pp. 313-322 (2020)
[57] Ward, R., Wu, X., Bottou, L.: Adagrad stepsizes: sharp convergence over nonconvex landscapes. In: International Conference on Machine Learning, pp. 6677-6686. PMLR (2019) · Zbl 1531.68104
[58] Wei, F.; Bao, C.; Liu, Y., Stochastic Anderson mixing for nonconvex stochastic optimization, Adv. Neural. Inf. Process. Syst., 34, 22995-23008 (2021)
[59] Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4151-4161 (2017)
[60] Yang, Z., On the step size selection in variance-reduced algorithm for nonconvex optimization, Expert Syst. Appl., 169 (2020) · doi:10.1016/j.eswa.2020.114336
[61] Yang, Z., Wang, C., Zang, Y., Li, J.: Mini-batch algorithms with Barzilai-Borwein update step. Neurocomputing 314, 177-185 (2018a)
[62] Yang, Z., Wang, C., Zhang, Z., Li, J.: Random Barzilai-Borwein step size for mini-batch algorithms. Eng. Appl. Artif. Intell. 72, 124-135 (2018b)
[63] Yang, Z., Wang, C., Zhang, Z., Li, J.: Accelerated stochastic gradient descent with step size selection rules. Signal Process. 159, 171-186 (2019a)
[64] Yang, Z., Wang, C., Zhang, Z., Li, J.: Mini-batch algorithms with online step size. Knowl. Based Syst. 165, 228-240 (2019b)
[65] Yu, T., Liu, X.-W., Dai, Y.-H., Sun, J.: A variable metric mini-batch proximal stochastic recursive gradient algorithm with diagonal Barzilai-Borwein stepsize (2020). arXiv:2010.00817
[66] Zeiler, M.D.: Adadelta: an adaptive learning rate method (2012). arXiv:1212.5701
[67] Zhao, P., Zhang, T.: Stochastic optimization with importance sampling for regularized loss minimization. In: International Conference on Machine Learning, pp. 1-9. PMLR (2015)
[68] Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925-3936 (2018)