
Random neural networks in the infinite width limit as Gaussian processes. (English) Zbl 1531.68113

Neural networks, originally introduced in the 1940s and 1950s, have become powerful tools for a wide range of mathematical problems across many subjects, for example image processing, machine learning, neuroscience, signal processing, manifold learning, language processing, and probability; see the paper under review for many references in this regard. This fascinating paper studies an important probabilistic question for the class of fully connected neural networks, which are defined as follows:
Fix a positive integer \(L\) as well as \(L+2\) positive integers \(n_0,\dots,n_{L+1}\) and a function \(\sigma: \mathbb R\to \mathbb R\). A fully connected depth-\(L\) neural network with input dimension \(n_0\), output dimension \(n_{L+1}\), hidden layer widths \(n_1,\dots,n_{L}\) and nonlinearity \(\sigma\) is any function \(x_{\alpha}\in \mathbb R^{n_0}\mapsto z_{\alpha}^{(L+1)}\in \mathbb R^{n_{L+1}}\) of the form \[ z_{\alpha}^{(l)}=\left\{ \begin{array}{ll} W^{(1)}x_{\alpha}+b^{(1)}, & l=1,\\ W^{(l)}\sigma(z_{\alpha}^{(l-1)})+b^{(l)}, & l=2,\dots,L+1, \end{array} \right. \] where \(W^{(l)}\in \mathbb R^{n_l\times n_{l-1}}\) are matrices, \(b^{(l)}\in \mathbb R^{n_l}\) are vectors, and \(\sigma\) applied to a vector is shorthand for \(\sigma\) applied to each component.
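To make the definition concrete, here is a minimal NumPy sketch (not taken from the paper) that samples a random fully connected network and evaluates the pre-activation recursion layer by layer. The \(1/n_{l-1}\) scaling of the weight variance is the standard normalization under which the infinite-width limit is nontrivial; the Gaussian weight distribution, the constants c_w and c_b, and the helper name random_fc_network are illustrative assumptions, not choices made in the paper.

import numpy as np

def random_fc_network(x, widths, sigma=np.tanh, c_w=1.0, c_b=1.0, rng=None):
    """Sample a random fully connected network and return its pre-activations.

    widths = [n_0, n_1, ..., n_{L+1}]; the entries of W^{(l)} are drawn i.i.d.
    N(0, c_w / n_{l-1}) and the entries of b^{(l)} are i.i.d. N(0, c_b).
    (These Gaussian choices are for illustration; the paper allows more
    general weight distributions.)
    """
    rng = np.random.default_rng(rng)
    h = np.asarray(x, dtype=float)      # input x_alpha in R^{n_0}
    pre_activations = []
    for l in range(1, len(widths)):
        n_in, n_out = widths[l - 1], widths[l]
        W = rng.normal(0.0, np.sqrt(c_w / n_in), size=(n_out, n_in))
        b = rng.normal(0.0, np.sqrt(c_b), size=n_out)
        z = W @ h + b                   # z^{(l)} = W^{(l)} h + b^{(l)}
        pre_activations.append(z)
        h = sigma(z)                    # input to the next layer (unused after l = L+1)
    return pre_activations              # [z^{(1)}, ..., z^{(L+1)}]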
The parameters \(L, n_0, \dots, n_{L+1}\) are called the network architecture, and \(z_{\alpha}^{(l)}\in \mathbb R^{n_l}\) is called the vector of pre-activations at layer \(l\) corresponding to input \(x_{\alpha}\). A fully connected network with a fixed architecture and given nonlinearity \(\sigma\) is therefore a finite-dimensional (though typically very high-dimensional) family of functions, parameterized by the network weights (entries of the weight matrices \(W^{(l)}\)) and biases (components of the bias vectors \(b^{(l)}\)). This article considers the mapping \(x_{\alpha}\mapsto z_{\alpha}^{(L+1)}\) when the network's weights and biases are chosen independently at random and the hidden layer widths \(n_1,\dots,n_L\) are sent to infinity while the input dimension \(n_0\), output dimension \(n_{L+1}\), and network depth \(L\) are fixed. In this infinite-width limit, akin to the large-matrix limit in random matrix theory, neural networks with random weights and biases converge to Gaussian processes. The main result of the paper under review is that this convergence holds for general nonlinearities \(\sigma\) and general distributions of the network weights.
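The Gaussian limit can also be probed numerically. The following sketch (again purely illustrative, and assuming the hypothetical random_fc_network helper above) fixes an input, resamples the weights and biases of a wide network many times, and checks that the empirical excess kurtosis of one output coordinate is close to zero, as it must be for a Gaussian.

# Fix an input and an architecture with wide hidden layers.
x = np.ones(5)                       # input dimension n_0 = 5
widths = [5, 1000, 1000, 1]          # L = 2 hidden layers of width 1000, output dimension 1

# Resample the network 2000 times and record the scalar output z^{(L+1)}.
samples = np.array([
    random_fc_network(x, widths, rng=seed)[-1][0] for seed in range(2000)
])

# For a Gaussian the excess kurtosis vanishes; at large width it should be small.
m, s = samples.mean(), samples.std()
excess_kurtosis = np.mean(((samples - m) / s) ** 4) - 3.0
print(f"mean={m:.3f}, std={s:.3f}, excess kurtosis={excess_kurtosis:.3f}")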
The paper is well written with an excellent set of references.

MSC:

68T07 Artificial neural networks and deep learning
60G15 Gaussian processes

Software:

ImageNet; GPT-3; AlexNet

References:

[1] ADLAM, B., LEVINSON, J. and PENNINGTON, J. (2022). A random matrix perspective on mixtures of nonlinearities for deep learning. AISTATS.
[2] ADLAM, B. and PENNINGTON, J. (2020). The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning 74-84. PMLR.
[3] AHN, A. (2022). Fluctuations of \(\beta\)-Jacobi product processes. Probab. Theory Related Fields 183 57-123. Digital Object Identifier: 10.1007/s00440-022-01109-0 Google Scholar: Lookup Link MathSciNet: MR4421171 · Zbl 1489.60124 · doi:10.1007/s00440-022-01109-0
[4] AKEMANN, G. and BURDA, Z. (2012). Universal microscopic correlation functions for products of independent Ginibre matrices. J. Phys. A 45 465201. Digital Object Identifier: 10.1088/1751-8113/45/46/465201 Google Scholar: Lookup Link MathSciNet: MR2993423 · Zbl 1261.15041 · doi:10.1088/1751-8113/45/46/465201
[5] AKEMANN, G., BURDA, Z. and KIEBURG, M. (2019). From integrable to chaotic systems: Universal local statistics of Lyapunov exponents. Europhys. Lett. 126 40001.
[6] AKEMANN, G., BURDA, Z., KIEBURG, M. and NAGAO, T. (2014). Universal microscopic correlation functions for products of truncated unitary matrices. J. Phys. A 47 255202. Digital Object Identifier: 10.1088/1751-8113/47/25/255202 Google Scholar: Lookup Link MathSciNet: MR3224113 · Zbl 1296.15016 · doi:10.1088/1751-8113/47/25/255202
[7] BARTLETT, P. L., LONG, P. M., LUGOSI, G. and TSIGLER, A. (2020). Benign overfitting in linear regression. Proc. Natl. Acad. Sci. USA 117 30063-30070. Digital Object Identifier: 10.1073/pnas.1907378117 Google Scholar: Lookup Link MathSciNet: MR4263288 · Zbl 1485.62085 · doi:10.1073/pnas.1907378117
[8] Belkin, M., Hsu, D., Ma, S. and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. USA 116 15849-15854. Digital Object Identifier: 10.1073/pnas.1903070116 Google Scholar: Lookup Link MathSciNet: MR3997901 · Zbl 1433.68325 · doi:10.1073/pnas.1903070116
[9] BROWN, T. B., MANN, B., RYDER, N., SUBBIAH, M., KAPLAN, J., DHARIWAL, P., NEELAKANTAN, A., SHYAM, P., SASTRY, G. et al. (2020). Language models are few-shot learners. ArXiv Preprint. Available at arXiv:2005.14165.
[10] CRISANTI, A., PALADIN, G. and VULPIANI, A. (2012). Products of Random Matrices: In Statistical Physics 104. Springer, Berlin.
[11] DANIELY, A., FROSTIG, R. and SINGER, Y. (2016). Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems 2253-2261.
[12] DAUBECHIES, I., DEVORE, R., FOUCART, S., HANIN, B. and PETROVA, G. (2022). Nonlinear approximation and (deep) ReLU networks. Constr. Approx. 55 127-172. Digital Object Identifier: 10.1007/s00365-021-09548-z Google Scholar: Lookup Link MathSciNet: MR4376561 · Zbl 1501.41003 · doi:10.1007/s00365-021-09548-z
[13] DEVORE, R., HANIN, B. and PETROVA, G. (2021). Neural network approximation. Acta Numer. 30 327-444. Digital Object Identifier: 10.1017/S0962492921000052 Google Scholar: Lookup Link MathSciNet: MR4298220 · Zbl 1518.65022 · doi:10.1017/S0962492921000052
[14] Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2019). Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.
[15] ELDAN, R., MIKULINCER, D. and SCHRAMM, T. (2021). Non-asymptotic approximations of neural networks by Gaussian processes. In Conference on Learning Theory 1754-1775. PMLR.
[16] FAN, Z. and WANG, Z. (2020). Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. Adv. Neural Inf. Process. Syst. 33 7710-7721.
[17] FURSTENBERG, H. (1963). Noncommuting random products. Trans. Amer. Math. Soc. 108 377-428. Digital Object Identifier: 10.2307/1993589 Google Scholar: Lookup Link MathSciNet: MR0163345 · Zbl 0203.19102 · doi:10.2307/1993589
[18] FURSTENBERG, H. and KESTEN, H. (1960). Products of random matrices. Ann. Math. Stat. 31 457-469. Digital Object Identifier: 10.1214/aoms/1177705909 Google Scholar: Lookup Link MathSciNet: MR0121828 · Zbl 0137.35501 · doi:10.1214/aoms/1177705909
[19] GARRIGA-ALONSO, A., RASMUSSEN, C. E. and AITCHISON, L. (2021). Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations.
[20] GORIN, V. and SUN, Y. (2022). Gaussian fluctuations for products of random matrices. Amer. J. Math. 144 287-393. Digital Object Identifier: 10.1353/ajm.2022.0006 Google Scholar: Lookup Link MathSciNet: MR4401507 · Zbl 1498.60032 · doi:10.1353/ajm.2022.0006
[21] HANIN, B. (2018). Which neural net architectures give rise to exploding and vanishing gradients? In Advances in Neural Information Processing Systems.
[22] HANIN, B. (2019). Universal function approximation by deep neural nets with bounded width and ReLU activations. Mathematics 7 992.
[23] HANIN, B. and NICA, M. (2019). Finite depth and width corrections to the neural tangent kernel. ICLR 2020 and available at arXiv:1909.05989.
[24] HANIN, B. and NICA, M. (2020). Products of many large random matrices and gradients in deep neural networks. Comm. Math. Phys. 376 287-322. Digital Object Identifier: 10.1007/s00220-019-03624-z Google Scholar: Lookup Link MathSciNet: MR4093863 · Zbl 1446.60007 · doi:10.1007/s00220-019-03624-z
[25] HANIN, B. and PAOURIS, G. (2021). Non-asymptotic results for singular values of Gaussian matrix products. Geom. Funct. Anal. 31 268-324. Digital Object Identifier: 10.1007/s00039-021-00560-w Google Scholar: Lookup Link MathSciNet: MR4268303 · Zbl 1471.15032 · doi:10.1007/s00039-021-00560-w
[26] HANIN, B. and ROLNICK, D. (2019). Deep ReLU networks have surprisingly few activation patterns. NeurIPS.
[27] HANIN, B. and ROLNICK, D. (2019). Complexity of linear regions in deep networks. ICML.
[28] HASTIE, T., MONTANARI, A., ROSSET, S. and TIBSHIRANI, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. Ann. Statist. 50 949-986. Digital Object Identifier: 10.1214/21-aos2133 Google Scholar: Lookup Link MathSciNet: MR4404925 · Zbl 1486.62202 · doi:10.1214/21-aos2133
[29] HEBB, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory. Wiley, New York.
[30] HUANG, J. and YAU, H.-T. (2020). Dynamics of deep neural networks and neural tangent hierarchy. In International Conference on Machine Learning 4542-4551. PMLR.
[31] Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 8571-8580.
[32] Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 1097-1105.
[33] LATAŁA, R. (1997). Estimation of moments of sums of independent real random variables. Ann. Probab. 25 1502-1513. · Zbl 0885.60011
[34] LEE, J., BAHRI, Y., NOVAK, R., SCHOENHOLZ, S. S., PENNINGTON, J. and SOHL-DICKSTEIN, J. (2018). Deep neural networks as Gaussian processes. ICML 2018 and available at arXiv:1711.00165.
[35] LIU, C., ZHU, L. and BELKIN, M. (2020). On the linearity of large non-linear models: When and why the tangent kernel is constant. Adv. Neural Inf. Process. Syst. 33 15954-15964.
[36] MATTHEWS, A. G. D. G., ROWLAND, M., HRON, J., TURNER, R. E. and GHAHRAMANI, Z. (2018). Gaussian process behaviour in wide deep neural networks. ArXiv Preprint. Available at arXiv:1804.11271.
[37] Neal, R. M. (1996). Priors for infinite networks. In Bayesian Learning for Neural Networks 29-53. Springer, Berlin. · Zbl 0888.62021
[38] Nica, A. and Speicher, R. (2006). Lectures on the Combinatorics of Free Probability. London Mathematical Society Lecture Note Series 335. Cambridge Univ. Press, Cambridge. Digital Object Identifier: 10.1017/CBO9780511735127 Google Scholar: Lookup Link MathSciNet: MR2266879 · Zbl 1133.60003 · doi:10.1017/CBO9780511735127
[39] NOCI, L., BACHMANN, G., ROTH, K., NOWOZIN, S. and HOFMANN, T. (2021). Precise characterization of the prior predictive distribution of deep ReLU networks. Adv. Neural Inf. Process. Syst. 34 20851-20862.
[40] NOVAK, R., XIAO, L., LEE, J., BAHRI, Y., YANG, G., HRON, J., ABOLAFIA, D. A., PENNINGTON, J. and SOHL-DICKSTEIN, J. (2018). Bayesian deep convolutional networks with many channels are Gaussian processes. ArXiv Preprint. Available at arXiv:1810.05148.
[41] PÉCHÉ, S. (2019). A note on the Pennington-Worah distribution. Electron. Commun. Probab. 24 66. Digital Object Identifier: 10.1214/19-ecp262 Google Scholar: Lookup Link MathSciNet: MR4029435 · Zbl 1423.60032 · doi:10.1214/19-ecp262
[42] PENNINGTON, J. and WORAH, P. (2017). Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems 2634-2643.
[43] ROBERTS, D. A., YAIDA, S. and HANIN, B. (2022). The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Cambridge Univ. Press, Cambridge. · Zbl 1507.68003
[44] ROSENBLATT, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 65 386.
[45] RUELLE, D. (1979). Ergodic theory of differentiable dynamical systems. Publ. Math. Inst. Hautes Études Sci. 50 27-58. · Zbl 0426.58014
[46] SILVER, D., HUANG, A., MADDISON, C. J., GUEZ, A., SIFRE, L., VAN DEN DRIESSCHE, G., SCHRITTWIESER, J., ANTONOGLOU, I., PANNEERSHELVAM, V. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529 484-489.
[47] VOICULESCU, D. (1986). Addition of certain noncommuting random variables. J. Funct. Anal. 66 323-346. Digital Object Identifier: 10.1016/0022-1236(86)90062-5 Google Scholar: Lookup Link MathSciNet: MR0839105 · Zbl 0651.46063 · doi:10.1016/0022-1236(86)90062-5
[48] Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Ann. of Math. (2) 67 325-327. Digital Object Identifier: 10.2307/1970008 Google Scholar: Lookup Link MathSciNet: MR0095527 · Zbl 0085.13203 · doi:10.2307/1970008
[49] YAIDA, S. (2020). Non-Gaussian processes and neural networks at finite widths. MSML.
[50] YANG, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. ArXiv Preprint. Available at arXiv:1902.04760.
[51] YANG, G. (2019). Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. ArXiv Preprint. Available at arXiv:1910.12478.
[52] YANG, G. (2020). Tensor programs II: Neural tangent kernel for any architecture. ArXiv Preprint. Available at arXiv:2006.14548.
[53] YANG, G. (2020). Tensor programs III: Neural matrix laws. ArXiv Preprint. Available at arXiv:2009.10685.
[54] YAROTSKY, D. (2016). Error bounds for approximations with deep ReLU networks. ArXiv Preprint. Available at arXiv:1610.01145.
[55] YAROTSKY, D. (2018). Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory 639-649. PMLR.
[56] ZAVATONE-VETH, J. and PEHLEVAN, C. (2021). Exact marginal prior distributions of finite Bayesian neural networks. Adv. Neural Inf. Process. Syst. 34.
[57] ZHANG, C., BENGIO, S., HARDT, M., RECHT, B. and VINYALS, O. (2017). Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, (ICLR).