
Memory capacity of two layer neural networks with smooth activations. (English) Zbl 07892810

Summary: Determining the memory capacity of two layer neural networks with \(m\) hidden neurons and input dimension \(d\) (i.e., \(md+2m\) total trainable parameters), that is, the largest size of general data the network can memorize, is a fundamental machine learning question. For activations that are real analytic at a point and, if they restrict to a polynomial there, have sufficiently high degree, we establish a lower bound of \(\lfloor md/2\rfloor\) on the memory capacity and show that it is optimal up to a factor of approximately 2. All practical activations, such as sigmoids, the Heaviside function, and the rectified linear unit (ReLU), are real analytic at a point. Furthermore, the degree condition is mild, requiring, for example, that \(\binom{k+d-1}{d-1} \geq n\) if the activation is \(x^k\). Analogous prior results were limited to Heaviside and ReLU activations; our result covers almost everything else. In order to analyze general activations, we derive the precise generic rank of the network’s Jacobian, which can be written in terms of Hadamard powers and the Khatri-Rao product. Our analysis extends classical linear algebraic facts about the rank of Hadamard powers. Overall, our approach differs from prior works on memory capacity and holds promise for extending to deeper models and other architectures.
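
To make the rank statement concrete, here is a minimal numerical sketch (not taken from the paper): for a two layer network \(f(x)=\sum_{j=1}^m a_j\,\sigma(w_j^\top x+b_j)\) with the smooth activation \(\sigma=\tanh\) (an illustrative choice; any activation satisfying the hypotheses would serve), one can check that at generic data and parameters the Jacobian of the \(n\) network outputs with respect to the \(md+2m\) parameters has full row rank once \(n\le\lfloor md/2\rfloor\). The weight block of this Jacobian has the row-wise Khatri-Rao structure mentioned in the summary. All variable names and the choice of dimensions below are illustrative assumptions.

```python
# Illustrative sketch only: check full row rank of the Jacobian of a
# two layer network at random ("generic") data and parameters.
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 8                       # input dimension and number of hidden neurons
n = (m * d) // 2                  # data-set size at the claimed lower bound floor(md/2)

X = rng.standard_normal((n, d))   # generic data points, one per row
W = rng.standard_normal((m, d))   # hidden-layer weights
b = rng.standard_normal(m)        # hidden-layer biases
a = rng.standard_normal(m)        # output-layer weights

Z = X @ W.T + b                   # pre-activations, shape (n, m)
S = np.tanh(Z)                    # sigma(Z)
D = 1.0 - S**2                    # sigma'(Z) for tanh

# Jacobian of the outputs f(x_1), ..., f(x_n) with respect to (W, b, a).
# The W-block is a row-wise Khatri-Rao product: its (i, (j, k)) entry is
# a_j * sigma'(w_j . x_i + b_j) * x_{i,k}.
J_W = ((D * a)[:, :, None] * X[:, None, :]).reshape(n, m * d)
J_b = D * a                       # derivatives w.r.t. biases, shape (n, m)
J_a = S                           # derivatives w.r.t. output weights, shape (n, m)
J = np.hstack([J_W, J_b, J_a])    # shape (n, m*d + 2*m)

print("n =", n, " rank(J) =", np.linalg.matrix_rank(J))
# Full row rank n is the first-order condition consistent with a memory
# capacity of at least floor(md/2) for this activation.
```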

MSC:

68-XX Computer science
