
A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions. (English) Zbl 07575573

Summary: In this article, we study the stochastic gradient descent (SGD) optimization method in the training of fully connected feedforward artificial neural networks with ReLU activation. The main result of this work proves that the risk of the SGD process converges to zero if the target function under consideration is constant. In the established convergence result, the considered artificial neural networks consist of one input layer, one hidden layer, and one output layer (with \(d \in \mathbb{N}\) neurons on the input layer, \(H \in \mathbb{N}\) neurons on the hidden layer, and one neuron on the output layer). The learning rates of the SGD process are assumed to be sufficiently small, and the input data used in the SGD process to train the artificial neural networks are assumed to be independent and identically distributed.
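To make the described setting concrete, the following sketch trains such a shallow ReLU network (one hidden layer, one output neuron) on a constant target function with plain SGD and then estimates the risk by Monte Carlo. It is only an illustration of the framework summarized above, not the paper's proof setting: the width \(H\), input dimension \(d\), learning rate, input distribution, initialization scheme, and the absence of an output bias are assumptions made here for the example.

```python
# Minimal sketch (illustrative assumptions, not values from the article):
# SGD training of a one-hidden-layer ReLU network on a constant target.
import numpy as np

rng = np.random.default_rng(0)

d, H = 3, 64          # input dimension and hidden width (assumed)
c = 2.0               # constant target function f(x) = c (assumed)
gamma = 1e-2          # small constant learning rate (assumed)
n_steps = 20_000

# Parameters of the realization  x -> v^T ReLU(W x + b)  (no output bias here).
W = rng.normal(scale=1.0 / np.sqrt(d), size=(H, d))
b = np.zeros(H)
v = rng.normal(scale=1.0 / np.sqrt(H), size=H)

for n in range(n_steps):
    x = rng.uniform(-1.0, 1.0, size=d)   # i.i.d. training input (assumed distribution)
    pre = W @ x + b
    h = np.maximum(pre, 0.0)
    err = v @ h - c                      # residual of the sample loss 0.5 * err**2
    # Gradients of the sample loss with respect to v, b, and W.
    grad_v = err * h
    grad_pre = err * v * (pre > 0.0)     # ReLU "derivative" taken as 0 at the kink
    grad_b = grad_pre
    grad_W = np.outer(grad_pre, x)
    v -= gamma * grad_v
    b -= gamma * grad_b
    W -= gamma * grad_W

# Monte Carlo estimate of the risk E[(network(X) - c)^2]; in this toy run it
# should be close to zero after training, in line with the convergence statement.
X_test = rng.uniform(-1.0, 1.0, size=(10_000, d))
risk = np.mean((np.maximum(X_test @ W.T + b, 0.0) @ v - c) ** 2)
print(f"estimated risk after training: {risk:.3e}")
```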

MSC:

68T99 Artificial intelligence
41A60 Asymptotic approximations, asymptotic expansions (steepest descent, etc.)
65D15 Algorithms for approximation of functions
