Optimal approximation of piecewise smooth functions using deep ReLU neural networks. (English) Zbl 1434.68516

Summary: We study the necessary and sufficient complexity of ReLU neural networks, in terms of depth and number of weights, for approximating classifier functions in an \(L^p\)-sense.
As a model class, we consider the set \(\mathcal{E}^\beta(\mathbb{R}^d)\) of possibly discontinuous piecewise \(C^\beta\) functions \(f:[-\frac{1}{2},\frac{1}{2}]^d\rightarrow\mathbb{R}\), where the different “smooth regions” of \(f\) are separated by \(C^\beta\) hypersurfaces. For given dimension \(d\geq 2\), regularity \(\beta>0\), and accuracy \(\varepsilon>0\), we construct artificial neural networks with ReLU activation function that approximate functions from \(\mathcal{E}^\beta(\mathbb{R}^d)\) up to an \(L^2\) error of \(\varepsilon\). The constructed networks have a fixed number of layers, depending only on \(d\) and \(\beta\), and they have \(\mathcal{O}(\varepsilon^{-2(d-1)/\beta})\) many nonzero weights, which we prove to be optimal.

For the proof of optimality, we establish a lower bound on the description complexity of the class \(\mathcal{E}^\beta(\mathbb{R}^d)\). By showing that a family of approximating neural networks gives rise to an encoder for \(\mathcal{E}^\beta(\mathbb{R}^d)\), we then prove that one cannot approximate a general function \(f\in\mathcal{E}^\beta(\mathbb{R}^d)\) using neural networks that are less complex than those produced by our construction.

In addition to the optimality in terms of the number of weights, we show that in order to achieve this optimal approximation rate, one needs ReLU networks of a certain minimal depth. Precisely, for piecewise \(C^\beta(\mathbb{R}^d)\) functions, this minimal depth is given, up to a multiplicative constant, by \(\beta/d\). Up to a log factor, our constructed networks match this bound. This partly explains the benefits of depth for ReLU networks by showing that deep networks are necessary to achieve efficient approximation of (piecewise) smooth functions.
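As a back-of-the-envelope illustration of these two bounds (not part of the paper), the following Python sketch evaluates the stated weight count \(\mathcal{O}(\varepsilon^{-2(d-1)/\beta})\) and the minimal depth of order \(\beta/d\) for a few parameter choices. The constants `C` and `c` are hypothetical placeholders; the summary fixes only the asymptotic order, not the constants.

```python
# Illustrative only: evaluates the asymptotic bounds stated above.
# The constants C and c are assumed placeholders, since only the
# asymptotic order (not the constants) is given.

def weight_bound(eps: float, d: int, beta: float, C: float = 1.0) -> float:
    """Optimal number of nonzero weights, up to the constant C:
    C * eps**(-2*(d-1)/beta) for L^2 accuracy eps on [-1/2, 1/2]^d."""
    return C * eps ** (-2.0 * (d - 1) / beta)

def min_depth(d: int, beta: float, c: float = 1.0) -> float:
    """Minimal depth needed to attain the optimal rate, up to the
    multiplicative constant c: c * beta / d."""
    return c * beta / d

if __name__ == "__main__":
    d, beta = 2, 2.0  # e.g. piecewise C^2 functions on [-1/2, 1/2]^2
    for eps in (1e-1, 1e-2, 1e-3):
        print(f"eps={eps:.0e}  weights ~ {weight_bound(eps, d, beta):.3g}"
              f"  depth ~ {min_depth(d, beta):.2g}")
```

For \(d=2\) and \(\beta=2\) the exponent is \(-1\), so the weight count grows like \(1/\varepsilon\); increasing \(d\) or decreasing \(\beta\) makes the rate worse, which is the curse of dimensionality addressed in the final part of the summary.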
Finally, we analyze approximation in high-dimensional spaces where the function \(f\) to be approximated can be factorized into a smooth, dimension-reducing feature map \(\tau\) and a classifier function \(g\), defined on a low-dimensional feature space, as \(f=g\circ\tau\). We show that in this case the approximation rate depends only on the dimension of the feature space, not on the input dimension.
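To see why this factorization matters in high dimension, one can compare the two exponents directly. The sketch below (again illustrative, with an assumed feature dimension \(k\) and constants suppressed) contrasts the ambient-dimension bound \(\varepsilon^{-2(d-1)/\beta}\) with the feature-space bound \(\varepsilon^{-2(k-1)/\beta}\).

```python
# Illustrative comparison of the two rates for f = g ∘ τ:
# approximating f directly in R^d versus exploiting the factorization
# through a k-dimensional feature space (k = 2 is an assumed example).

def weight_bound(eps: float, dim: int, beta: float) -> float:
    """Weight count eps**(-2*(dim-1)/beta), constants suppressed."""
    return eps ** (-2.0 * (dim - 1) / beta)

eps, beta = 1e-2, 2.0
d, k = 100, 2  # ambient dimension vs. assumed feature-space dimension

print(f"direct in R^{d}:         ~ {weight_bound(eps, d, beta):.3g}")  # ~1e198
print(f"via feature space R^{k}: ~ {weight_bound(eps, k, beta):.3g}")  # ~1e2
```

The gap of almost two hundred orders of magnitude in this toy comparison is why a rate governed by the feature-space dimension, rather than the input dimension, is significant.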

MSC:

68T07 Artificial neural networks and deep learning
65D15 Algorithms for approximation of functions

Software:

ImageNet; AlexNet
