
Deep nonparametric regression on approximate manifolds: nonasymptotic error bounds with polynomial prefactors. (English) Zbl 1539.62111

Summary: We study the properties of nonparametric least squares regression using deep neural networks. We derive nonasymptotic upper bounds for the excess risk of the empirical risk minimizer over feedforward deep neural networks. Our error bounds achieve the minimax optimal rate and improve over existing results in that their prefactors depend polynomially on the dimension of the predictor rather than exponentially. We show that the neural regression estimator can circumvent the curse of dimensionality under the assumption that the predictor is supported on an approximate low-dimensional manifold or a set with low Minkowski dimension. We also establish the optimal convergence rate under the exact manifold support assumption. We investigate how the prediction error of the neural regression estimator depends on the structure of the neural networks and propose a notion of network relative efficiency between two types of neural networks, which provides a quantitative measure for evaluating the relative merits of different network structures. To establish these results, we derive a novel approximation error bound for Hölder smooth functions using ReLU-activated neural networks, which may be of independent interest. Our results are derived under weaker assumptions on the data distribution and the neural network structure than those in the existing literature.
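
A minimal simulation sketch of the setting described in the summary (assuming PyTorch is available; the network width and depth, sample size, noise levels, and the specific one-dimensional manifold below are illustrative choices, not the configurations analyzed in the paper): predictors are drawn from a small neighborhood of a circle embedded in a higher-dimensional ambient space, a Hölder smooth regression function generates noisy responses, and a feedforward ReLU network is fit by empirical least squares.

# Hypothetical illustration only: deep least-squares regression with predictors
# supported near a low-dimensional (approximate) manifold; not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d = 2000, 20  # sample size and ambient dimension of the predictor

# Intrinsic one-dimensional structure: points on a circle embedded in R^d,
# perturbed slightly so the support is only approximately a manifold.
theta = 2 * torch.pi * torch.rand(n)
X = torch.zeros(n, d)
X[:, 0], X[:, 1] = torch.cos(theta), torch.sin(theta)
X += 0.01 * torch.randn(n, d)

# A Hölder smooth target function and noisy responses.
f0 = lambda x: torch.sin(3 * x[:, 0]) + torch.abs(x[:, 1])
y = f0(X) + 0.1 * torch.randn(n)

# Feedforward ReLU network; depth and width play the role of tuning parameters.
net = nn.Sequential(
    nn.Linear(d, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

# Empirical risk minimization with the least-squares loss (full-batch gradient steps).
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for epoch in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(X).squeeze(-1), y)
    loss.backward()
    opt.step()

# Excess-risk proxy: squared error against the true regression function on fresh
# data drawn from the same near-manifold distribution.
with torch.no_grad():
    theta_t = 2 * torch.pi * torch.rand(n)
    Xt = torch.zeros(n, d)
    Xt[:, 0], Xt[:, 1] = torch.cos(theta_t), torch.sin(theta_t)
    Xt += 0.01 * torch.randn(n, d)
    excess = torch.mean((net(Xt).squeeze(-1) - f0(Xt)) ** 2)
    print(f"approximate excess risk: {excess.item():.4f}")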

MSC:

62G08 Nonparametric regression and quantile regression
62G05 Nonparametric estimation
62G20 Asymptotic properties of nonparametric inference
62R30 Statistics on manifolds
68T07 Artificial neural networks and deep learning

Software:

MNIST; ImageNet; CIFAR

References:

[1] AAMARI, E., KIM, J., CHAZAL, F., MICHEL, B., RINALDO, A. and WASSERMAN, L. (2019). Estimating the reach of a manifold. Electron. J. Stat. 13 1359-1399. · Zbl 1418.62100 · doi:10.1214/19-ejs1551
[2] ALLEN-ZHU, Z., LI, Y. and SONG, Z. (2019). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning 242-252. PMLR.
[3] Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press, Cambridge. · Zbl 0968.68126 · doi:10.1017/CBO9780511624216
[4] ASWANI, A., BICKEL, P. and TOMLIN, C. (2011). Regression on manifolds: Estimation of the exterior derivative. Ann. Statist. 39 48-81. · Zbl 1209.62063 · doi:10.1214/10-AOS823
[5] BARANIUK, R. G. and WAKIN, M. B. (2009). Random projections of smooth manifolds. Found. Comput. Math. 9 51-77. · Zbl 1172.53005 · doi:10.1007/s10208-007-9011-z
[6] BARTLETT, P. L., HARVEY, N., LIAW, C. and MEHRABIAN, A. (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 20 Paper No. 63, 17. · Zbl 1489.62302
[7] Bauer, B. and Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist. 47 2261-2285. · Zbl 1421.62036 · doi:10.1214/18-AOS1747
[8] Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15 1373-1396. · Zbl 1085.68119
[9] BELKIN, M. and NIYOGI, P. (2004). Semi-supervised learning on Riemannian manifolds. Mach. Learn. 56 209-239. · Zbl 1089.68086
[10] Bickel, P. J. and Li, B. (2007). Local polynomial regression on unknown manifolds. In Complex Datasets and Inverse Problems. Institute of Mathematical Statistics Lecture Notes—Monograph Series 54 177-186. IMS, Beachwood, OH. · doi:10.1214/074921707000000148
[11] Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97 113-150. · Zbl 0805.62037 · doi:10.1007/BF01199316
[12] Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli 4 329-375. · Zbl 0954.62033
[13] BISHOP, C. J. and PERES, Y. (2017). Fractals in Probability and Analysis. Cambridge Studies in Advanced Mathematics 162. Cambridge Univ. Press, Cambridge. · Zbl 1390.28012 · doi:10.1017/9781316460238
[14] Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford Univ. Press, Oxford. With a foreword by Michel Ledoux. · Zbl 1279.60005 · doi:10.1093/acprof:oso/9780199535255.001.0001
[15] BRAND, M. (2002). Charting a manifold. Adv. Neural Inf. Process. Syst. 15.
[16] CANCER GENOME ATLAS RESEARCH NETWORK, WEINSTEIN, J. N., COLLISSON, E. A., MILLS, G. B., SHAW, K. R. M., OZENBERGER, B. A., ELLROTT, K., SHMULEVICH, I., SANDER, C. et al. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45 1113-1120. · doi:10.1038/ng.2764
[17] CARLSSON, G. (2009). Topology and data. Bull. Amer. Math. Soc. (N.S.) 46 255-308. · Zbl 1172.62002 · doi:10.1090/S0273-0979-09-01249-X
[18] CHEN, M., JIANG, H., LIAO, W. and ZHAO, T. (2022). Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Inf. Inference 11 1203-1253. · Zbl 1524.62171 · doi:10.1093/imaiai/iaac001
[19] CHEN, M., JIANG, H. and ZHAO, T. (2019). Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. Adv. Neural Inf. Process. Syst.
[20] Cheng, M.-Y. and Wu, H.-T. (2013). Local linear regression on manifolds and its geometric interpretation. J. Amer. Statist. Assoc. 108 1421-1434. · Zbl 1426.62402 · doi:10.1080/01621459.2013.827984
[21] CHUI, C. K., LI, X. and MHASKAR, H. N. (1996). Limitations of the approximation capabilities of neural networks with one hidden layer. Adv. Comput. Math. 5 233-243. · Zbl 0855.41026 · doi:10.1007/BF02124745
[22] CLONINGER, A. and KLOCK, T. (2020). ReLU nets adapt to intrinsic dimensionality beyond the target domain. Available at arXiv:2008.02545.
[23] COX, D. D. (1988). Approximation of least squares regression on nested subspaces. Ann. Statist. 16 713-732. · Zbl 0669.62047 · doi:10.1214/aos/1176350830
[24] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248-255. IEEE.
[25] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York) 31. Springer, New York. · Zbl 0853.68150 · doi:10.1007/978-1-4612-0711-5
[26] DONOHO, D. L. and GRIMES, C. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. USA 100 5591-5596. · Zbl 1130.62337 · doi:10.1073/pnas.1031596100
[27] DU, S., LEE, J., LI, H., WANG, L. and ZHAI, X. (2019). Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning 1675-1685. PMLR.
[28] ELDAN, R. and SHAMIR, O. (2016). The power of depth for feedforward neural networks. In Conference on Learning Theory 907-940. PMLR.
[29] Falconer, K. (2003). Fractal Geometry: Mathematical Foundations and Applications, 2nd ed. Wiley, Hoboken, NJ. · Zbl 1060.28005 · doi:10.1002/0470013850
[30] FARRELL, M. H., LIANG, T. and MISRA, S. (2021). Deep neural networks for estimation and inference. Econometrica 89 181-213. · Zbl 1479.62082 · doi:10.3982/ecta16901
[31] FEDERER, H. (1959). Curvature measures. Trans. Amer. Math. Soc. 93 418-491. · Zbl 0089.38402 · doi:10.2307/1993504
[32] Fefferman, C., Mitter, S. and Narayanan, H. (2016). Testing the manifold hypothesis. J. Amer. Math. Soc. 29 983-1049. · Zbl 1359.62122 · doi:10.1090/jams/852
[33] Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817-823.
[34] GEMAN, S. and HWANG, C.-R. (1982). Nonparametric maximum likelihood estimation by the method of sieves. Ann. Statist. 10 401-414. · Zbl 0494.62041
[35] GHORBANI, B., MEI, S., MISIAKIEWICZ, T. and MONTANARI, A. (2020). Discussion of: “Nonparametric regression using deep neural networks with ReLU activation function” [MR4134774]. Ann. Statist. 48 1898-1901. · doi:10.1214/19-AOS1910
[36] Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA. · Zbl 1373.68009
[37] GYÖRFI, L., KOHLER, M., KRZYŻAK, A. and WALK, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York. · Zbl 1021.62024 · doi:10.1007/b97848
[38] Härdle, W., Hall, P. and Ichimura, H. (1993). Optimal smoothing in single-index models. Ann. Statist. 21 157-178. · Zbl 0770.62049 · doi:10.1214/aos/1176349020
[39] HENDRIKS, H. (1990). Nonparametric estimation of a probability density on a Riemannian manifold using Fourier expansions. Ann. Statist. 18 832-849. · Zbl 0711.62036 · doi:10.1214/aos/1176347628
[40] HOFFMANN, H., SCHAAL, S. and VIJAYAKUMAR, S. (2009). Local dimensionality reduction for non-parametric regression. Neural Process. Lett. 29 109.
[41] HOROWITZ, J. L. and HÄRDLE, W. (1996). Direct semiparametric estimation of single-index models with discrete covariates. J. Amer. Statist. Assoc. 91 1632-1640. · Zbl 0881.62037 · doi:10.2307/2291590
[42] HUBBARD, J. H. and HUBBARD, B. B. (2015). Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach. Matrix Editions. · Zbl 1397.00010
[43] JIAO, Y., SHEN, G., LIN, Y. and HUANG, J. (2023). Supplement to “Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors.” https://doi.org/10.1214/23-AOS2266SUPP
[44] KOHLER, M., KRZYŻAK, A. and LANGER, S. (2022). Estimation of a function of low local dimensionality by deep neural networks. IEEE Trans. Inf. Theory 68 4032-4042. · Zbl 1505.62504
[45] KOHLER, M. and LANGER, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. Ann. Statist. 49 2231-2249. · Zbl 1486.62112 · doi:10.1214/20-aos2034
[46] Kong, E. and Xia, Y. (2007). Variable selection for the single-index model. Biometrika 94 217-229. · Zbl 1142.62353 · doi:10.1093/biomet/asm008
[47] KPOTUFE, S. (2011). k-NN regression adapts to local intrinsic dimension. Available at arXiv:1110.4300.
[48] KPOTUFE, S. and GARG, V. K. (2013). Adaptivity to local smoothness and dimension in kernel regression. In NIPS 3075-3083.
[49] KRIZHEVSKY, A. (2009). Learning multiple layers of features from tiny images. Technical report, the CIFAR-10 dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html.
[50] LECUN, Y., CORTES, C. and BURGES, C. (2010). MNIST handwritten digit database. AT&T Labs [online]. Available at http://yann.lecun.com/exdb/mnist, 2.
[51] LEE, J. A. and VERLEYSEN, M. (2007). Nonlinear Dimensionality Reduction. Information Science and Statistics. Springer, New York. · Zbl 1128.68024 · doi:10.1007/978-0-387-39351-3
[52] LEE, J. M. (2003). Introduction to Smooth Manifolds. Graduate Texts in Mathematics 218. Springer, New York. · doi:10.1007/978-0-387-21752-9
[53] LEE, J. M. (2006). Riemannian Manifolds: An Introduction to Curvature. Graduate Texts in Mathematics 176. Springer Science & Business Media. · doi:10.1007/b98852
[54] LEE, W. S., BARTLETT, P. L. and WILLIAMSON, R. C. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inf. Theory 42 2118-2132. · Zbl 0874.68253 · doi:10.1109/18.556601
[55] LIANG, S. and SRIKANT, R. (2016). Why deep neural networks for function approximation? Available at arXiv:1610.04161.
[56] LU, J., SHEN, Z., YANG, H. and ZHANG, S. (2021). Deep network approximation for smooth functions. SIAM J. Math. Anal. 53 5465-5506. · Zbl 07407717 · doi:10.1137/20M134695X
[57] LU, Z., PU, H., WANG, F., HU, Z. and WANG, L. (2017). The expressive power of neural networks: A view from the width. In Proceedings of the 31st International Conference on Neural Information Processing Systems 6232-6240.
[58] MOHRI, M., ROSTAMIZADEH, A. and TALWALKAR, A. (2018). Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA. Second edition of [MR3057769]. · Zbl 1407.68007
[59] NAKADA, R. and IMAIZUMI, M. (2020). Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. J. Mach. Learn. Res. 21 Paper No. 174, 38. · Zbl 1525.68135
[60] NEMIROVSKIĬ, A. S., POLYAK, B. T. and TSYBAKOV, A. B. (1983). Estimates of the maximum likelihood type for a nonparametric regression. Dokl. Akad. Nauk SSSR 273 1310-1314.
[61] NEMIROVSKIĬ, A. S., POLYAK, B. T. and TSYBAKOV, A. B. (1984). Signal processing by the nonparametric maximum likelihood method. Problemy Peredachi Informatsii 20 29-46. · Zbl 0565.62028
[62] NEMIROVSKIĬ, A. S., POLYAK, B. T. and TSYBAKOV, A. B. (1985). The rate of convergence of nonparametric estimates of maximum likelihood type. Problemy Peredachi Informatsii 21 17-33. · Zbl 0616.62048
[63] NGUYEN, P.-M. and PHAM, H. T. (2020). A rigorous framework for the mean field limit of multilayer neural networks. Available at arXiv:2001.11443.
[64] NIYOGI, P., SMALE, S. and WEINBERGER, S. (2008). Finding the homology of submanifolds with high confidence from random samples. Discrete Comput. Geom. 39 419-441. · Zbl 1148.68048 · doi:10.1007/s00454-008-9053-2
[65] Pelletier, B. (2005). Kernel density estimation on Riemannian manifolds. Statist. Probab. Lett. 73 297-304. · Zbl 1065.62063 · doi:10.1016/j.spl.2005.04.004
[66] Pollard, D. (1984). Convergence of Stochastic Processes. Springer Series in Statistics. Springer, New York. · Zbl 0544.60045 · doi:10.1007/978-1-4612-5254-2
[67] POPE, P., ZHU, C., ABDELKADER, A., GOLDBLUM, M. and GOLDSTEIN, T. (2020). The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations.
[68] RAFAJŁOWICZ, E. (1987). Nonparametric orthogonal series estimators of regression: A class attaining the optimal convergence rate in \(\mathit{L}_2\). Statist. Probab. Lett. 5 219-224. · Zbl 0605.62030 · doi:10.1016/0167-7152(87)90044-7
[69] RECANATESI, S., FARRELL, M., ADVANI, M., MOORE, T., LAJOIE, G. and SHEA-BROWN, E. (2019). Dimensionality compression and expansion in deep neural networks. Available at arXiv:1906.00443.
[70] Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290 2323-2326.
[71] SCHMIDT-HIEBER, J. (2019). Deep ReLU network approximation of functions on a manifold. Available at arXiv:1908.00695.
[72] Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. Ann. Statist. 48 1875-1897. · Zbl 1459.62059 · doi:10.1214/19-AOS1875
[73] SHEN, X. and WONG, W. H. (1994). Convergence rate of sieve estimates. Ann. Statist. 22 580-615. · Zbl 0805.62008 · doi:10.1214/aos/1176325486
[74] SHEN, Z., YANG, H. and ZHANG, S. (2019). Nonlinear approximation via compositions. Neural Netw. 119 74-84. · Zbl 1475.41013
[75] SHEN, Z., YANG, H. and ZHANG, S. (2020). Deep network approximation characterized by number of neurons. Commun. Comput. Phys. 28 1768-1811. · Zbl 1507.68276 · doi:10.4208/cicp.oa-2020-0149
[76] SHEN, Z., YANG, H. and ZHANG, S. (2022). Optimal approximation rate of ReLU networks in terms of width and depth. J. Math. Pures Appl. (9) 157 101-135. · Zbl 1501.41010 · doi:10.1016/j.matpur.2021.07.009
[77] Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10 1040-1053. · Zbl 0511.62048
[78] Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. Ann. Statist. 14 590-606. · Zbl 0603.62050 · doi:10.1214/aos/1176349940
[79] Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. Ann. Statist. 22 118-184. With discussion by Andreas Buja and Trevor Hastie and a rejoinder by the author. · Zbl 0827.62038 · doi:10.1214/aos/1176325361
[80] SUZUKI, T. (2018). Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: Optimal rate and curse of dimensionality. Available at arXiv:1810.08033.
[81] TELGARSKY, M. (2016). Benefits of depth in neural networks. In Conference on Learning Theory 1517-1539. PMLR.
[82] Tenenbaum, J. B., de Silva, V. and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290 2319-2323.
[83] van de Geer, S. (1987). A new approach to least-squares estimation, with applications. Ann. Statist. 15 587-602. · Zbl 0625.62046 · doi:10.1214/aos/1176350362
[84] van de Geer, S. (1990). Estimating a regression function. Ann. Statist. 18 907-924. · Zbl 0709.62040 · doi:10.1214/aos/1176347632
[85] VAN DE GEER, S. and WEGKAMP, M. (1996). Consistency for the least squares estimator in nonparametric regression. Ann. Statist. 24 2513-2523. · Zbl 0867.62027 · doi:10.1214/aos/1032181165
[86] van de Geer, S. A. (2000). Applications of Empirical Process Theory. Cambridge Series in Statistical and Probabilistic Mathematics 6. Cambridge Univ. Press, Cambridge. · Zbl 0953.62049
[87] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, New York. · Zbl 0862.60002 · doi:10.1007/978-1-4757-2545-2
[88] Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics 47. Cambridge Univ. Press, Cambridge. With a foreword by Sara van de Geer. · Zbl 1430.60005 · doi:10.1017/9781108231596
[89] Yang, Y. and Dunson, D. B. (2016). Bayesian manifold regression. Ann. Statist. 44 876-905. · Zbl 1341.62196 · doi:10.1214/15-AOS1390
[90] Yang, Y. and Tokdar, S. T. (2015). Minimax-optimal nonparametric regression in high dimensions. Ann. Statist. 43 652-674. · Zbl 1312.62052 · doi:10.1214/14-AOS1289
[91] YAROTSKY, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Netw. 94 103-114. · Zbl 1429.68260 · doi:10.1016/j.neunet.2017.07.002
[92] YAROTSKY, D. (2018). Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory 639-649. PMLR.