
Scaling description of generalization with number of parameters in deep learning. (English) Zbl 1459.82250

Summary: Supervised deep learning involves the training of neural networks with a large number \(N\) of parameters. For large enough \(N\), in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as \(N\) grows past a certain threshold \(N^*\). Instead, empirical studies have shown that in the over-parametrized regime the generalization error keeps decreasing with \(N\). We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations \(\|f_N-\langle f_N\rangle\|\sim N^{-1/4}\) of the neural net output function \(f_N\) around its expectation \(\langle f_N\rangle\). These fluctuations affect the generalization error \(\epsilon(f_N)\) for classification: under natural assumptions, it decays to a plateau value \(\epsilon(f_{\infty})\) in a power-law fashion, \(\sim N^{-1/2}\). This description breaks down at the so-called jamming transition \(N = N^*\), at which we argue that \(\|f_N\|\) diverges; this provides a plausible explanation for the cusp in test error known to occur at \(N^*\). Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate size, just beyond \(N^*\), and averaging their outputs.
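To make the two quantities discussed in the summary concrete, the following is a minimal illustrative sketch (not the authors' code, which works on MNIST and CIFAR with much larger networks): it trains several one-hidden-layer classifiers of a given width from different random initializations, uses their average output as a proxy for \(\langle f_N\rangle\), measures the spread of the individual outputs around that average as a proxy for \(\|f_N-\langle f_N\rangle\|\), and compares the test error of a single network with that of the ensemble average. The synthetic teacher task, the widths, the number of seeds, the hinge loss and the full-batch gradient-descent schedule are all assumptions made here for brevity; the sketch only illustrates the qualitative picture (fluctuations shrink with width, ensembling helps), not the paper's scaling exponents.

# Illustrative sketch in PyTorch; all setup choices below are assumptions, not the authors' protocol.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_train, n_test = 20, 256, 1024

# Synthetic binary classification task: labels from a fixed random "teacher" direction.
w_teacher = torch.randn(d)
def make_data(n):
    x = torch.randn(n, d)
    y = torch.sign(x @ w_teacher)
    return x, y

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

def train_net(width, seed, steps=2000, lr=0.1):
    # One hidden layer of the given width; width controls the parameter count N.
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        margins = y_tr * net(x_tr).squeeze()
        loss = torch.relu(1 - margins).mean()   # hinge loss, since the paper studies classification
        loss.backward()
        opt.step()
    return net

seeds = range(10)
for width in [16, 64, 256, 1024]:
    nets = [train_net(width, s) for s in seeds]
    with torch.no_grad():
        outputs = torch.stack([net(x_te).squeeze() for net in nets])  # one f_N per initialization
    f_bar = outputs.mean(dim=0)                     # ensemble average, proxy for <f_N>
    fluct = (outputs - f_bar).pow(2).mean().sqrt()  # spread of f_N around <f_N> across seeds
    err_single = (torch.sign(outputs[0]) != y_te).float().mean()
    err_ens = (torch.sign(f_bar) != y_te).float().mean()
    print(f"width={width:5d}  fluctuation={fluct.item():.3f}  "
          f"single-net test error={err_single.item():.3f}  ensemble test error={err_ens.item():.3f}")

The hinge loss and full-batch gradient descent are used here only as a simple stand-in for the training setup; the point of the comparison in the last two columns is the summary's closing suggestion that averaging the outputs of several moderately sized networks can beat a single network of the same size.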

MSC:

82C32 Neural nets applied to problems in time-dependent statistical mechanics
68T05 Learning and adaptive systems in artificial intelligence

Software:

ImageNet; AlexNet; Adam; MNIST

References:

[1] Krizhevsky A, Sutskever I and Hinton G E 2012 ImageNet classification with deep convolutional neural networks Advances in Neural Information Processing Systems pp 1097-105
[2] LeCun Y, Bengio Y and Hinton G 2015 Deep learning Nature521 436 · doi:10.1038/nature14539
[3] Hinton G et al 2012 Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups IEEE Signal Process. Mag.29 82-97 · doi:10.1109/MSP.2012.2205597
[4] Sutskever I, Vinyals O and Le Q V 2014 Sequence to sequence learning with neural networks Proceedings of the 27th International Conference on Neural Information Processing Systems2 3104-12 · doi:10.5555/2969033.2969173
[5] Zhang C, Bengio S, Hardt M, Recht B and Vinyals O 2017 Understanding deep learning requires rethinking generalization Int. Conf. on Learning Representations (arXiv:1611.03530)
[6] Freeman C D and Bruna J 2017 Topology and geometry of deep rectified network optimization landscapes Int. Conf. on Learning Representations (arXiv:1611.01540)
[7] Venturi L, Bandeira A and Bruna J 2018 Neural networks with finite intrinsic dimension have no spurious valleys (arXiv:1802.06384)
[8] Hoffer E, Hubara I and Soudry D 2017 Train longer, generalize better: closing the generalization gap in large batch training of neural networks Advances in Neural Information Processing Systems pp 1729-39
[9] Soudry D and Carmon Y 2016 No bad local minima: data independent training error guarantees for multilayer neural networks (arXiv:1605.08361)
[10] Cooper Y 2018 The loss landscape of overparameterized neural networks (arXiv:1804.10200)
[11] Sagun L, Bottou L and LeCun Y 2017 Singularity of the Hessian in deep learning Int. Conf. on Learning Representations
[12] Sagun L, Evci U, Güney V U, Dauphin Y and Bottou L 2017 Empirical analysis of the Hessian of over-parametrized neural networks ICLR 2018 Workshop Contribution (arXiv:1706.04454)
[13] Ballard A J, Das R, Martiniani S, Mehta D, Sagun L, Stevenson J D and Wales D J 2017 Energy landscapes for machine learning Phys. Chem. Chem. Phys.19 12585-603 · doi:10.1039/C7CP01108C
[14] Lipton Z C 2016 Stuck in a what? Adventures in weight space Int. Conf. on Learning Representations (arXiv:1602.07320)
[15] Baity-Jesi M, Sagun L, Geiger M, Spigler S, Ben Arous G, Cammarota C, LeCun Y, Wyart M and Biroli G 2018 Comparing dynamics: deep neural networks versus glassy systems J. Stat. Mech. 124013 · Zbl 1459.82317 · doi:10.1088/1742-5468/ab3281
[16] Geiger M, Spigler S, d’Ascoli S, Sagun L, Baity-Jesi M, Biroli G and Wyart M 2018 The jamming transition as a paradigm to understand the loss landscape of deep neural networks (arXiv:1809.09349)
[17] Spigler S, Geiger M, d’Ascoli S, Sagun L, Biroli G and Wyart M 2018 A jamming transition from under- to over-parametrization affects loss landscape and generalization (arXiv:1810.09665)
[18] Dauphin Y, Pascanu R, Gulcehre C, Cho K, Ganguli S and Bengio Y 2014 Identifying and attacking the saddle point problem in high-dimensional non-convex optimization Proceedings of the 27th International Conference on Neural Information Processing Systems2 2933-41 · doi:10.5555/2969033.2969154
[19] Choromanska A, Henaff M, Mathieu M, Ben Arous G and LeCun Y 2015 The loss surfaces of multilayer networks J. Mach. Learn. Res.38 192-204
[20] Jacot A, Gabriel F and Hongler C 2018 Neural tangent kernel: convergence and generalization in neural networks Proceedings of the 32nd International Conference on Neural Information Processing Systems 8580-9 · doi:10.5555/3327757.3327948
[21] Du S S, Zhai X, Póczos B and Singh A 2019 Gradient descent provably optimizes over-parameterized neural networks (arXiv:1810.02054)
[22] Allen-Zhu Z, Li Y and Song Z 2018 A convergence theory for deep learning via over-parameterization (arXiv:1811.03962)
[23] Arora S, Du S S, Hu W, Li Z, Salakhutdinov R and Wang R 2019 On exact computation with an infinitely wide neural net (arXiv:1904.11955)
[24] Neyshabur B, Tomioka R, Salakhutdinov R and Srebro N 2017 Geometry of optimization and implicit regularization in deep learning (arXiv:1705.03071)
[25] Neyshabur B, Li Z, Bhojanapalli S, LeCun Y and Srebro N 2018 Towards understanding the role of over-parametrization in generalization of neural networks (arXiv:1805.12076)
[26] Bansal Y, Advani M, Cox D D and Saxe A M 2018 Minnorm training: an algorithm for training over-parameterized deep neural networks (arXiv:1806.00730)
[27] Advani M S and Saxe A M 2017 High-dimensional dynamics of generalization error in neural networks (arXiv:1710.03667)
[28] Liao Z and Couillet R 2018 The dynamics of learning: a random matrix approach (arXiv:1805.11917)
[29] Neal B, Mittal S, Baratin A, Tantia V, Scicluna M, Lacoste-Julien S and Mitliagkas I 2018 A modern take on the bias-variance tradeoff in neural networks (arXiv:1810.08591)
[30] Soudry D, Hoffer E, Nacson M S, Gunasekar S and Srebro N 2018 The implicit bias of gradient descent on separable data J. Mach. Learn. Res.19 2822-78 · Zbl 1477.62192 · doi:10.5555/3291125.3309632
[31] Liang T and Rakhlin A 2018 Just interpolate: kernel ‘ridgeless’ regression can generalize (arXiv:1808.00387)
[32] Chizat L and Bach F 2018 A note on lazy training in supervised differentiable programming (arXiv:1812.07956)
[33] Rotskoff G M and Vanden-Eijnden E 2018 Neural networks as interacting particle systems: asymptotic convexity of the loss landscape and universal scaling of the approximation error (arXiv:1805.00915)
[34] Mei S, Montanari A and Nguyen P-M 2018 A mean field view of the landscape of two-layers neural networks (arXiv:1804.06561)
[35] Sirignano J and Spiliopoulos K 2018 Mean field analysis of neural networks (arXiv:1805.01053)
[36] Belkin M, Hsu D, Ma S and Mandal S 2019 Reconciling modern machine-learning practice and the classical bias-variance trade-off Proc. Natl Acad. Sci.116 15849-54 · Zbl 1433.68325 · doi:10.1073/pnas.1903070116
[37] Belkin M, Hsu D and Xu J 2019 Two models of double descent for weak features (arXiv:1903.07571)
[38] Mei S and Montanari A 2019 The generalization error of random features regression: precise asymptotics and double descent curve (arXiv:1908.05355)
[39] Hastie T, Montanari A, Rosset S and Tibshirani R J 2019 Surprises in high-dimensional ridgeless least squares interpolation (arXiv:1903.08560)
[40] Hanin B and Nica M 2019 Finite depth and width corrections to the neural tangent kernel (arXiv:1909.05989)
[41] Dyer E and Gur-Ari G 2019 Asymptotics of wide networks from Feynman diagrams (arXiv:1909.11304)
[42] Mei S, Misiakiewicz T and Montanari A 2019 Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit (arXiv:1902.06015)
[43] Nguyen P-M 2019 Mean field limit of the learning dynamics of multilayer neural networks (arXiv:1902.02880)
[44] LeCun Y, Cortes C and Burges C J 1998 The MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist)
[45] Domingos P 2000 A unified bias-variance decomposition Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence 564-9 · doi:10.5555/647288.721421
[46] Neal R M 1996 Bayesian Learning for Neural Networks (New York: Springer) · Zbl 0888.62021 · doi:10.1007/978-1-4612-0745-0
[47] Cho Y and Saul L K 2009 Kernel methods for deep learning Advances in Neural Information Processing Systems
[48] Lee J H, Bahri Y, Novak R, Schoenholz S S, Pennington J and Sohl-Dickstein J 2018 Deep neural networks as Gaussian processes Int. Conf. on Learning Representations
[49] Lee J, Xiao L, Schoenholz S S, Bahri Y, Sohl-Dickstein J and Pennington J 2019 Wide neural networks of any depth evolve as linear models under gradient descent (arXiv:1902.06720)
[50] Saad D and Solla S A 1995 On-line learning in soft committee machines Phys. Rev. E 52 4225 · doi:10.1103/PhysRevE.52.4225
[51] Engel A and Van den Broeck C 2001 Statistical Mechanics of Learning (Cambridge: Cambridge University Press) · Zbl 0984.82034 · doi:10.1017/CBO9781139164542
[52] Bös S and Opper M 1997 Dynamics of training Advances in Neural Information Processing Systems pp 141-7
[53] Le Cun Y, Kanter I and Solla S A 1991 Eigenvalues of covariance matrices: application to neural-network learning Phys. Rev. Lett.66 2396 · doi:10.1103/PhysRevLett.66.2396
[54] Franz S and Parisi G 2016 The simplest model of jamming J. Phys. A: Math. Theor.49 145001 · Zbl 1342.82111 · doi:10.1088/1751-8113/49/14/145001
[55] Franz S, Hwang S and Urbani P 2018 Jamming in multilayer supervised learning models Phys. Rev. Lett.123 160602 · doi:10.1103/PhysRevLett.123.160602
[56] Franz S, Parisi G, Urbani P and Zamponi F 2015 Universal spectrum of normal modes in low-temperature glasses Proc. Natl Acad. Sci.112 14539-44 · doi:10.1073/pnas.1511134112
[57] Franz S, Parisi G, Sevelev M, Urbani P and Zamponi F 2017 Universality of the sat-unsat (jamming) threshold in non-convex continuous constraint satisfaction problems SciPost Phys.2 019 · doi:10.21468/SciPostPhys.2.3.019
[58] Saxe A M, McClelland J L and Ganguli S 2014 Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Int. Conf. on Learning Representations (arXiv:1312.6120)
[59] Kingma D P and Ba J 2015 Adam: a method for stochastic optimization Int. Conf. on Learning Representations (arXiv:1412.6980)