
A non-monotone trust-region method with noisy oracles and additional sampling. (English) Zbl 07914769

Summary: In this work, we introduce a novel stochastic second-order method, within the framework of a non-monotone trust-region approach, for solving the unconstrained, nonlinear, and non-convex optimization problems arising in the training of deep neural networks. The proposed algorithm uses subsampling strategies that yield noisy approximations of the finite-sum objective function and its gradient. We introduce an adaptive sample size strategy based on inexpensive additional sampling to control the resulting approximation error. Depending on the estimated progress of the algorithm, this can yield sample size scenarios ranging from mini-batch to full-sample functions. We provide a convergence analysis for all possible scenarios and show that the proposed method achieves almost sure convergence under standard assumptions for the trust-region framework. We report numerical experiments showing that the proposed algorithm outperforms its state-of-the-art counterpart in deep neural network training for image classification and regression tasks while requiring a significantly smaller number of gradient evaluations.
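
The core ingredients described in the summary — a subsampled trust-region step, a non-monotone acceptance test against the worst of the last few sampled function values, and an independent "additional sample" that triggers batch-size growth when progress is doubtful — can be illustrated with a minimal sketch. The Python code below is not the authors' algorithm or implementation: the toy least-squares finite sum, the identity-Hessian model (the paper uses a genuine second-order model), the helper names f_batch/g_batch, and all parameter values (batch size, window length M, threshold eta, radius bounds) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic finite-sum problem: f(w) = (1/N) * sum_i 0.5 * (a_i^T w - b_i)^2.
N, d = 1000, 20
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

def f_batch(w, idx):
    """Subsampled objective over the index set idx."""
    r = A[idx] @ w - b[idx]
    return 0.5 * np.mean(r ** 2)

def g_batch(w, idx):
    """Subsampled gradient over the index set idx."""
    r = A[idx] @ w - b[idx]
    return A[idx].T @ r / len(idx)

def train(w, batch=64, delta=1.0, M=5, eta=0.1, iters=200):
    hist = []                                        # last M subsampled f-values
    for _ in range(iters):
        S = rng.choice(N, size=batch, replace=False) # current mini-batch
        fS, gS = f_batch(w, S), g_batch(w, S)
        hist = (hist + [fS])[-M:]

        # Trust-region step for the model m(p) = fS + gS^T p + 0.5*||p||^2;
        # the identity Hessian keeps the sketch short.
        p = -gS
        if np.linalg.norm(p) > delta:
            p = delta * p / np.linalg.norm(p)
        pred = -(gS @ p + 0.5 * p @ p)               # predicted decrease (>= 0)

        # Non-monotone ratio: actual decrease is measured against the worst of
        # the last M subsampled values instead of fS alone.
        rho = (max(hist) - f_batch(w + p, S)) / max(pred, 1e-12)

        if rho >= eta:                               # accept, enlarge the radius
            w = w + p
            delta = min(2.0 * delta, 10.0)
        else:                                        # reject, shrink the radius
            delta = 0.5 * delta
            # Additional sampling: an independent index set D gives a cheap second
            # opinion on the trial step; if it also shows no decrease, the batch
            # size is doubled (up to the full sample).
            D = rng.choice(N, size=batch, replace=False)
            if f_batch(w + p, D) >= f_batch(w, D):
                batch = min(2 * batch, N)
    return w, batch

w_hat, final_batch = train(np.zeros(d))
print("full-sample loss:", f_batch(w_hat, np.arange(N)), "| final batch:", final_batch)

The sketch only mirrors the control flow; in the paper the quadratic model carries curvature information and the acceptance and sampling rules are the ones for which almost sure convergence is proved.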

MSC:

90Cxx Mathematical programming
90C30 Nonlinear programming
90C06 Large-scale problems in mathematical programming
90C53 Methods of quasi-Newton type
90C90 Applications of mathematical programming
65K05 Numerical mathematical programming methods

Software:

MNIST; Adam; CIFAR

References:

[1] Nocedal, J., Wright, S.: Numerical Optimization. Springer, New York, NY (2006). doi:10.1007/978-0-387-40065-5 · Zbl 1104.65059
[2] Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust-Region Methods. SIAM, Philadelphia, PA (2000). doi:10.1137/1.9780898719857 · Zbl 0958.65071
[3] Ahookhosh, M.; Amini, K.; Peyghami, MR, A non-monotone trust-region line search method for large-scale unconstrained optimization, Appl. Math. Model., 36, 1, 478-487, 2012 · Zbl 1236.90077 · doi:10.1016/j.apm.2011.07.021
[4] Di Serafino, D.; Krejić, N.; Krklec Jerinkić, N.; Viola, M., LSOS: line-search second-order stochastic optimization methods for nonconvex finite sums, Math. Comput., 92, 341, 1273-1299, 2023 · Zbl 1511.65048 · doi:10.1090/mcom/3802
[5] Robbins, H.; Monro, S., A stochastic approximation method, Ann. Math. Stat., 22, 400-407, 1951 · Zbl 0054.05901 · doi:10.1214/aoms/1177729586
[6] Bottou, L., LeCun, Y.: Large scale online learning. In: Advances in Neural Information Processing Systems, vol. 16, pp. 217-224 (2004). Available at: https://proceedings.neurips.cc/paper_files/paper/2003
[7] Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646-1654 (2014). Available at: https://proceedings.neurips.cc/paper_files/paper/2014
[8] Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, vol. 26, pp. 315-323 (2013). Available at: https://proceedings.neurips.cc/paper_files/paper/2013
[9] Schmidt, M.; Le Roux, N.; Bach, F., Minimizing finite sums with the stochastic average gradient, Math. Program., 162, 1-2, 83-112, 2017 · Zbl 1358.90073 · doi:10.1007/s10107-016-1030-6
[10] Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: A novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning, pp. 2613-2621 (2017). PMLR. Available at: https://proceedings.mlr.press/v70/
[11] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(7) (2011). Available at: https://www.jmlr.org/papers/v12/ · Zbl 1280.68164
[12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2015). Available at: http://arxiv.org/abs/1412.6980
[13] Bottou, L.; Curtis, FE; Nocedal, J., Optimization methods for large-scale machine learning, SIAM Rev., 60, 2, 223-311, 2018 · Zbl 1397.65085 · doi:10.1137/16M1080173
[14] Kylasa, S., Roosta, F., Mahoney, M.W., Grama, A.: GPU accelerated sub-sampled Newton’s method for convex classification problems. In: Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 702-710. SIAM (2019). doi:10.1137/1.9781611975673.79
[15] Martens, J.: Deep learning via Hessian-free optimization. In: Proceedings of the 27th International Conference on Machine Learning, pp. 735-742 (2010). Available at: https://www.icml2010.org/abstracts.html
[16] Martens, J., Sutskever, I.: Training deep and recurrent networks with Hessian-free optimization. In: Neural Networks: Tricks of the Trade, pp. 479-535. Springer, Heidelberg (2012). doi:10.1007/978-3-642-35289-8_27
[17] Bollapragada, R.; Byrd, RH; Nocedal, J., Exact and inexact subsampled Newton methods for optimization, IMA J. Numer. Anal., 39, 2, 545-578, 2019 · Zbl 1462.65077 · doi:10.1093/imanum/dry009
[18] Xu, P., Roosta, F., Mahoney, M.W.: Second-order optimization for non-convex machine learning: An empirical study. In: Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 199-207. SIAM (2020). doi:10.1137/1.9781611976236.23
[19] Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-factored approximate curvature. In: International Conference on Machine Learning, pp. 2408-2417 (2015). PMLR. Available at: https://proceedings.mlr.press/v37/
[20] Goldfarb, D., Ren, Y., Bahamou, A.: Practical quasi-Newton methods for training deep neural networks. In: Advances in Neural Information Processing Systems, vol. 33, pp. 2386-2396 (2020). Available at: https://proceedings.neurips.cc/paper_files/paper/2020
[21] Mokhtari, A.; Ribeiro, A., Global convergence of online limited memory BFGS, J. Mach. Learn. Res., 16, 1, 3151-3181, 2015 · Zbl 1351.90124
[22] Gower, R., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: Squeezing more curvature out of data. In: International Conference on Machine Learning, pp. 1869-1878 (2016). PMLR. Available at: https://proceedings.mlr.press/v48/
[23] Wang, X.; Ma, S.; Goldfarb, D.; Liu, W., Stochastic quasi-Newton methods for nonconvex stochastic optimization, SIAM J. Optim., 27, 2, 927-956, 2017 · Zbl 1365.90182 · doi:10.1137/15M1053141
[24] Berahas, AS; Takáč, M., A robust multi-batch L-BFGS method for machine learning, Optim. Methods Softw., 35, 1, 191-219, 2020 · Zbl 1430.90523 · doi:10.1080/10556788.2019.1658107
[25] Bollapragada, R., Nocedal, J., Mudigere, D., Shi, H.-J., Tang, P.T.P.: A progressive batching L-BFGS method for machine learning. In: International Conference on Machine Learning, pp. 620-629 (2018). PMLR. Available at: https://proceedings.mlr.press/v80/
[26] Jahani, M., Nazari, M., Rusakov, S., Berahas, A.S., Takáč, M.: Scaling up quasi-Newton algorithms: communication efficient distributed SR1. In: Machine Learning, Optimization, and Data Science. LOD 2020. Lecture Notes in Computer Science, vol. 12565, pp. 41-54. Springer, Cham (2020). doi:10.1007/978-3-030-64583-0_5
[27] Berahas, AS; Jahani, M.; Richtárik, P.; Takáč, M., Quasi-Newton methods for machine learning: forget the past, just sample, Optim. Methods Softw., 37, 5, 1668-1704, 2022 · Zbl 1509.90221 · doi:10.1080/10556788.2021.1977806
[28] Rafati, J., Marcia, R.F.: Improving L-BFGS initialization for trust-region methods in deep learning. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 501-508. IEEE (2018). doi:10.1109/ICMLA.2018.00081
[29] Yousefi, M., Martínez Calomardo, Á.: A stochastic modified limited memory BFGS for training deep neural networks. In: Intelligent Computing: Proceedings of the 2022 Computing Conference, Volume 2, pp. 9-28 (2022). Springer. doi:10.1007/978-3-031-10464-0_2
[30] Erway, JB; Griffin, J.; Marcia, RF; Omheni, R., Trust-region algorithms for training responses: machine learning methods using indefinite Hessian approximations, Optim. Methods Softw., 35, 3, 460-487, 2020 · Zbl 1440.90092 · doi:10.1080/10556788.2019.1624747
[31] Grippo, L.; Lampariello, F.; Lucidi, S., A non-monotone line search technique for Newton’s method, SIAM J. Numer. Anal., 23, 4, 707-716, 1986 · Zbl 0616.65067 · doi:10.1137/0723046
[32] Deng, N.; Xiao, Y.; Zhou, F., Nonmonotonic trust-region algorithm, J. Optim. Theory Appl., 76, 2, 259-285, 1993 · Zbl 0797.90088 · doi:10.1007/BF00939608
[33] Cui, Z.; Wu, B.; Qu, S., Combining non-monotone conic trust-region and line search techniques for unconstrained optimization, J. Comput. Appl. Math., 235, 8, 2432-2441, 2011 · Zbl 1215.65107 · doi:10.1016/j.cam.2010.10.044
[34] Krejić, N.; Krklec Jerinkić, N., Non-monotone line search methods with variable sample size, Numer. Algor., 68, 4, 711-739, 2015 · Zbl 1316.65062 · doi:10.1007/s11075-014-9869-1
[35] Yousefi, M., Martínez Calomardo, Á.: A stochastic nonmonotone trust-region training algorithm for image classification. In: 2022 16th International Conference on Signal-Image Technology and Internet-Based Systems (SITIS), pp. 522-529 (2022). IEEE. doi:10.1109/SITIS57111.2022.00084
[36] Sun, S.; Nocedal, J., A trust-region method for noisy unconstrained optimization, Math. Program., 2023 · Zbl 1526.65023 · doi:10.1007/s10107-023-01941-9
[37] Cao, L.; Berahas, AS; Scheinberg, K., First- and second-order high probability complexity bounds for trust-region methods with noisy oracles, Math. Program., 2023 · Zbl 07915911 · doi:10.1007/s10107-023-01999-5
[38] Iusem, AN; Jofré, A.; Oliveira, RI; Thompson, P., Variance-based extra gradient methods with line search for stochastic variational inequalities, SIAM J. Optim., 29, 1, 175-206, 2019 · Zbl 1415.65145 · doi:10.1137/17M1144799
[39] Krejić, N.; Lužanin, Z.; Ovcin, Z.; Stojkovska, I., Descent direction method with line search for unconstrained optimization in noisy environment, Optim. Methods Softw., 30, 6, 1164-1184, 2015 · Zbl 1328.90091 · doi:10.1080/10556788.2015.1025403
[40] Blanchet, J.; Cartis, C.; Menickelly, M.; Scheinberg, K., Convergence rate analysis of a stochastic trust-region method via supermartingales, INFORMS J. Optim., 1, 2, 92-119, 2019 · doi:10.1287/ijoo.2019.0016
[41] Chen, R.; Menickelly, M.; Scheinberg, K., Stochastic optimization using a trust-region method and random models, Math. Program., 169, 2, 447-487, 2018 · Zbl 1401.90136 · doi:10.1007/s10107-017-1141-8
[42] Bellavia, S.; Krejić, N.; Morini, B.; Rebegoldi, S., A stochastic first-order trust-region method with inexact restoration for finite-sum minimization, Comput. Optim. Appl., 84, 1, 53-84, 2023 · Zbl 1510.90257 · doi:10.1007/s10589-022-00430-7
[43] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, pp. 249-256 (2010). Available at: https://proceedings.mlr.press/v9/
[44] Brust, J.; Erway, JB; Marcia, RF, On solving L-SR1 trust-region subproblems, Comput. Optim. Appl., 66, 2, 245-266, 2017 · Zbl 1364.90239 · doi:10.1007/s10589-016-9868-3
[45] Goodfellow, I.; Bengio, Y.; Courville, A., Deep Learning, 2016, Cambridge, MA: MIT Press, Cambridge, MA · Zbl 1373.68009
[46] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778 (2016). doi:10.1109/CVPR.2016.90
[47] Yousefi, M.; Martínez, Á., Deep neural networks training by stochastic quasi-Newton trust-region methods, Algorithms, 16, 10, 490, 2023 · doi:10.3390/a16100490
[48] LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P., Gradient-based learning applied to document recognition, Proc. IEEE, 86, 11, 2278-2324, 1998 · doi:10.1109/5.726791
[49] LeCun, Y.: The MNIST Database of Handwritten Digits (1998). Available at: https://www.kaggle.com/datasets/hojjatk/mnist-dataset
[50] Krizhevsky, A.: Learning multiple layers of features from tiny images (2009). Available at: https://api.semanticscholar.org/CorpusID:18268744