
A mini-batch stochastic conjugate gradient algorithm with variance reduction. (English) Zbl 1528.90191

Summary: The stochastic gradient descent method is popular for large-scale optimization but suffers from slow asymptotic convergence due to the inherent variance. To remedy this problem, many explicit variance reduction methods for stochastic gradient descent have been proposed, such as SVRG [R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction”, in: Proceedings of the 26th international conference on neural information processing systems, NIPS 2013. New York, NY: Association for Computing Machinery (ACM). 315–323 (2013; doi:10.5555/2999611.2999647)], SAG [N. Le Roux et al., “A stochastic gradient method with an exponential convergence rate for finite training sets”, in: Proceedings of the 25th international conference on neural information processing systems, NIPS 2012. New York, NY: Association for Computing Machinery (ACM). 2663–2671 (2012; doi:10.5555/2999325.2999432)], SAGA [A. Defazio et al., “A fast incremental gradient method with support for non-strongly convex composite objectives”, in: Proceedings of the 27th international conference on neural information processing systems, NIPS 2014. New York, NY: Association for Computing Machinery (ACM). 1646–1654 (2014; doi:10.5555/2968826.2969010)] and so on. The conjugate gradient method, which has the same computational cost as the gradient descent method, is considered here. In this paper, in the spirit of SAGA, we propose a stochastic conjugate gradient algorithm which we call SCGA. With Fletcher-Reeves type choices, we prove a linear convergence rate for smooth and strongly convex functions. We experimentally demonstrate that SCGA converges faster than the popular SGD-type algorithms on four machine learning models, which may be convex, nonconvex or nonsmooth. On regression problems, SCGA is competitive with CGVR, which, to our knowledge, is so far the only other stochastic conjugate gradient algorithm with variance reduction.
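To make the idea concrete, the following is a minimal Python sketch (not the authors' exact SCGA) of how a SAGA-style variance-reduced gradient estimator can be combined with a Fletcher-Reeves type conjugate direction. The least-squares objective, the fixed step size, and the per-epoch restart of the direction are illustrative assumptions; the algorithm analyzed in the paper covers general smooth strongly convex losses and its own step-size rules, which are not reproduced here.

import numpy as np

def scga_sketch(X, y, step=0.02, batch=8, epochs=20, seed=0):
    # Illustrative SAGA-style variance-reduced conjugate gradient loop
    # with a Fletcher-Reeves type beta (hypothetical reconstruction).
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)

    # per-sample least-squares gradients: grad_i(w) = (x_i^T w - y_i) x_i
    def grads(idx, w):
        r = X[idx] @ w - y[idx]
        return r[:, None] * X[idx]                  # shape (len(idx), p)

    table = grads(np.arange(n), w)                  # SAGA gradient table
    table_mean = table.mean(axis=0)

    for _ in range(epochs):
        d, prev_norm2 = None, None                  # restart direction each epoch
        for _ in range(n // batch):
            idx = rng.choice(n, size=batch, replace=False)
            g_new = grads(idx, w)
            # SAGA variance-reduced gradient estimator on the mini-batch
            v = (g_new - table[idx]).mean(axis=0) + table_mean
            # update the gradient table and its running mean
            table_mean += (g_new - table[idx]).sum(axis=0) / n
            table[idx] = g_new
            # Fletcher-Reeves type conjugate direction
            norm2 = v @ v
            d = -v if d is None else -v + (norm2 / prev_norm2) * d
            prev_norm2 = norm2
            w += step * d
    return w

# tiny synthetic usage example
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.01 * rng.normal(size=200)
    print(np.linalg.norm(scga_sketch(X, y) - w_true))

The key point of the sketch is that the search direction is built from the variance-reduced estimator v rather than the raw stochastic gradient, so the Fletcher-Reeves ratio ||v_k||^2 / ||v_{k-1}||^2 is computed on estimates whose variance shrinks as the iterates converge.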

MSC:

90C25 Convex programming
65K10 Numerical optimization and variational techniques
90C15 Stochastic programming
90C52 Methods of reduced gradient type

References:

[1] Krizhevsky, A., Sutskever, I., Hinton, G. E.: ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097-1105 (2012)
[2] Dahl, GE; Yu, D.; Deng, L.; Acero, A., Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Transactions on audio, speech, and language processing, 20, 1, 30-42 (2011) · doi:10.1109/TASL.2011.2134090
[3] Hinton, G.; Deng, L.; Yu, D.; Dahl, GE; Mohamed, A-R; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, TN, Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal processing magazine, 29, 6, 82-97 (2012) · doi:10.1109/MSP.2012.2205597
[4] Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th international conference on Machine learning, pp. 160-167 (2008)
[5] Dahl, G. E., Stokes, J. W., Deng, L., Yu, D.: Large-scale malware classification using random projections and neural networks, In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 3422-3426. (2013)
[6] Cauchy, A., Méthode générale pour la résolution des systèmes d’équations simultanées, Comp. Rend. Sci. Paris, 25, 536-538 (1847)
[7] Robbins, H., Monro, S.: A stochastic approximation method, The annals of mathematical statistics, pp. 400-407, (1951) · Zbl 0054.05901
[8] Bottou, L.: Large-scale machine learning with stochastic gradient descent, Proc. COMPSTAT, pp. 177-186, (2010) · Zbl 1436.68293
[9] Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep learning. MIT press Cambridge, 1, (2016) · Zbl 1373.68009
[10] Polyak, BT, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, 4, 5, 1-17 (1964) · Zbl 0147.35301 · doi:10.1016/0041-5553(64)90137-5
[11] Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k^2). In: Soviet Mathematics Doklady, (1983)
[12] Qian, N., On the momentum term in gradient descent learning algorithms, Neural Netw, 12, 1, 145-151 (1999) · doi:10.1016/S0893-6080(98)00116-6
[13] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Machine Learning Research, 12(7), (2011) · Zbl 1280.68164
[14] Zeiler, M. D.: Adadelta: an adaptive learning rate method, arXiv preprint arXiv:1212.5701, (2012)
[15] Tieleman, T.; Hinton, G., Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,, COURSERA: Neural networks for machine learning, 4, 2, 26-31 (2012)
[16] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. Computer Science, (2014)
[17] Hager, W.W., Zhang, H.: A survey of nonlinear conjugate gradient methods. Pacific Journal of Optimization, 2(1), 35-58 (2006) · Zbl 1117.90048
[18] Roux, N. L., Schmidt, M., Bach, F. R.: A stochastic gradient method with an exponential convergence rate for finite training sets, in Advances in neural information processing systems, pp. 2663-2671, (2012)
[19] Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction, In Advances in neural information processing systems, pp. 315-323. (2013)
[20] Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in neural information processing systems, pp. 1646-1654. (2014)
[21] Nguyen, L. M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: A novel method for machine learning problems using stochastic recursive gradient, (2017)
[22] Gilbert, JC; Nocedal, J., Global convergence properties of conjugate gradient methods for optimization, SIAM J. Optim., 2, 21-42 (1992) · Zbl 0767.90082 · doi:10.1137/0802003
[23] Nocedal, J., Wright, S.: Numerical optimization. Springer Science & Business Media, (2006) · Zbl 1104.65059
[24] Fletcher, R.; Reeves, CM, Function minimization by conjugate gradients, The computer journal, 7, 2, 149-154 (1964) · Zbl 0132.11701 · doi:10.1093/comjnl/7.2.149
[25] Polak, E., Ribiere, G.: Note sur la convergence de méthodes de directions conjuguées. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique, 3(R1), pp. 35-43, (1969) · Zbl 0174.48001
[26] Polyak, BT, The conjugate gradient method in extremal problems, USSR Comp. Math. Math. Phys., 9, 4, 94-112 (1969) · Zbl 0229.49023 · doi:10.1016/0041-5553(69)90035-4
[27] Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6), 409 (1952) · Zbl 0048.09901
[28] Dai, YH; Yuan, Y., A nonlinear conjugate gradient method with a strong global convergence property, Siam Journal on Optimization, 10, 1, 177-182 (1999) · Zbl 0957.65061 · doi:10.1137/S1052623497318992
[29] Hager, WW; Zhang, H., A new conjugate gradient method with guaranteed descent and an efficient line search, SIAM Journal on Optimization, 16, 1, 170-192 (2005) · Zbl 1093.90085 · doi:10.1137/030601880
[30] Dai, YH; Kou, CX, A nonlinear conjugate gradient algorithm with an optimal property and an improved wolfe line search, Siam J Optim, 23, 1, 296-320 (2013) · Zbl 1266.49065 · doi:10.1137/100813026
[31] Dai, Y.H., Yuan, Y.: Nonlinear conjugate gradient methods. Shanghai Science and Technology Publisher, (2000)
[32] Møller, MF, A scaled conjugate gradient algorithm for fast supervised learning, Neural networks, 6, 4, 525-533 (1993) · doi:10.1016/S0893-6080(05)80056-5
[33] Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A. Y.: On optimization methods for deep learning, In ICML, (2011)
[34] Moritz, P., Nishihara, R., Jordan, M. I.: A linearly convergent stochastic l-bfgs algorithm, Mathematics, (2015)
[35] Jin, X.B., Zhang, X.Y., Huang, K., Geng, G.G.: Stochastic conjugate gradient algorithm with variance reduction. IEEE transactions on neural networks and learning systems 30(5), 1360-1369 (2018)