
A selective overview of deep learning. (English) Zbl 07368237

Summary: Deep learning has achieved tremendous success in recent years. In simple terms, deep learning uses the composition of many nonlinear functions to model the complex dependence between input features and labels. While neural networks have a long history, recent advances have significantly improved their empirical performance in computer vision, natural language processing and other predictive tasks. From a statistical and scientific perspective, it is natural to ask: What is deep learning? What are the new characteristics of deep learning, compared with classical statistical methods? What are the theoretical foundations of deep learning?
To answer these questions, we introduce common neural network models (e.g., convolutional neural nets, recurrent neural nets, generative adversarial nets) and training techniques (e.g., stochastic gradient descent, dropout, batch normalization) from a statistical point of view. Along the way, we highlight new characteristics of deep learning (including depth and overparametrization) and explain their practical and theoretical benefits. We also sample recent results on theories of deep learning, many of which are only suggestive. While a complete understanding of deep learning remains elusive, we hope that our perspectives and discussions serve as a stimulus for new statistical research.
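Concretely (a schematic illustration, not notation taken from the paper), a depth-\(L\) feedforward network composes affine maps with a nonlinearity \(\sigma\), namely \(f(x)=W_L\,\sigma(W_{L-1}\cdots\sigma(W_1x+b_1)\cdots+b_{L-1})+b_L\), and its parameters are typically fit by stochastic gradient descent on mini-batches. The NumPy sketch below trains a two-layer ReLU network with plain mini-batch SGD; the architecture, squared loss and hyperparameters are illustrative assumptions, not choices made in the paper.

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def loss_and_grads(params, x, y):
    """Squared loss and gradients for f(x) = W2 relu(W1 x + b1) + b2."""
    W1, b1, W2, b2 = params
    pre = x @ W1 + b1                 # first affine map
    h = relu(pre)                     # nonlinearity
    y_hat = h @ W2 + b2               # composition with second affine map
    resid = y_hat - y
    n = x.shape[0]
    loss = 0.5 * np.mean(resid ** 2)
    gW2 = h.T @ resid / n
    gb2 = resid.mean(axis=0)
    dh = (resid @ W2.T) * (pre > 0)   # backpropagate through the ReLU
    gW1 = x.T @ dh / n
    gb1 = dh.mean(axis=0)
    return loss, (gW1, gb1, gW2, gb2)

d, m = 10, 32                         # input dimension, hidden width
params = [0.1 * rng.standard_normal((d, m)), np.zeros(m),
          0.1 * rng.standard_normal((m, 1)), np.zeros(1)]

X = rng.standard_normal((1000, d))    # synthetic features
Y = np.sin(X[:, :1])                  # synthetic labels

lr, batch = 0.1, 64
for step in range(500):               # mini-batch stochastic gradient descent
    idx = rng.choice(len(X), size=batch, replace=False)
    loss, grads = loss_and_grads(params, X[idx], Y[idx])
    params = [p - lr * g for p, g in zip(params, grads)]
    if step % 100 == 0:
        print(f"step {step:4d}  batch loss {loss:.4f}")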

MSC:

62-XX Statistics

References:

[1] Abadi, M. et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
[2] Abbasi-Asl, R., Chen, Y., Bloniarz, A., Oliver, M., Willmore, B. D., Gallant, J. L. and Yu, B. (2018). The DeepTune framework for modeling and characterizing neurons in visual cortex area V4. BioRxiv 465534.
[3] Allen-Zhu, Z. and Li, Y. (2019). Can SGD learn recurrent neural networks with provable generalization? Preprint. Available at arXiv:1902.01028.
[4] Allen-Zhu, Z., Li, Y. and Song, Z. (2018). A convergence theory for deep learning via over-parameterization. Preprint. Available at arXiv:1811.03962.
[5] Anthony, M. and Bartlett, P. L. (2009). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press, Cambridge. · Zbl 0968.68126 · doi:10.1017/CBO9780511624216
[6] Arjovsky, M., Chintala, S. and Bottou, L. (2017). Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning. PMLR 70 214-223.
[7] Arnold, V. I. (2009). On functions of three variables. In Collected Works: Representations of Functions, Celestial Mechanics and KAM Theory, 1957-1965 5-8.
[8] Arora, S., Ge, R., Liang, Y., Ma, T. and Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th International Conference on Machine Learning 70 224-232. JMLR.org.
[9] Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Preprint. Available at arXiv:1901.08584.
[10] Bai, Y., Ma, T. and Risteski, A. (2018). Approximability of discriminators implies diversity in GANs. Preprint. Available at arXiv:1806.10586.
[11] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39 930-945. · Zbl 0818.68126 · doi:10.1109/18.256500
[12] Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inf. Theory 44 525-536. · Zbl 0901.68177 · doi:10.1109/18.661502
[13] Bartlett, P. L., Foster, D. J. and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett, eds.) 6240-6249. Curran Associates, Red Hook.
[14] Bauer, B. and Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist. 47 2261-2285. · Zbl 1421.62036 · doi:10.1214/18-AOS1747
[15] Belkin, M., Hsu, D., Ma, S. and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. USA 116 15849-15854. · Zbl 1433.68325 · doi:10.1073/pnas.1903070116
[16] Bottou, L. (1998). Online learning and stochastic approximations. In Online Learning in Neural Networks. Publications of the Newton Institute 17. Cambridge Univ. Press, Cambridge. · Zbl 0968.68127
[17] Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Mach. Learn. Res. 2 499-526. · Zbl 1007.68083 · doi:10.1162/153244302760200704
[18] Breiman, L. (1996a). Bagging predictors. Mach. Learn. 24 123-140. · Zbl 0858.68080
[19] Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. Ann. Statist. 24 2350-2383. · Zbl 0867.62055 · doi:10.1214/aos/1032181158
[20] Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theory 56 2053-2080. · Zbl 1366.15021 · doi:10.1109/TIT.2010.2044061
[21] Cao, C., Liu, F., Tan, H., Song, D., Shu, W., Li, W., Zhou, Y., Bo, X. and Xie, Z. (2018). Deep learning and its applications in biomedicine. Genomics Proteomics Bioinform. 16 17-32.
[22] Chen, S., Dobriban, E. and Lee, J. H. (2019). Invariance reduces variance: Understanding data augmentation in deep learning and beyond. Preprint. Available at arXiv:1907.10905.
[23] Chen, T., Rubanova, Y., Bettencourt, J. and Duvenaud, D. (2018). Neural ordinary differential equations. Preprint. Available at arXiv:1806.07366.
[24] Chen, Y., Chi, Y., Fan, J. and Ma, C. (2019a). Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Math. Program. 176 5-37. · Zbl 1415.90086 · doi:10.1007/s10107-019-01363-6
[25] Chen, Y., Chi, Y., Fan, J., Ma, C. and Yan, Y. (2019b). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. Preprint. Available at arXiv:1902.07698. · Zbl 1477.90060
[26] Chizat, L. and Bach, F. (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems 3040-3050.
[27] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Preprint. Available at arXiv:1406.1078.
[28] Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M. and Yang, S. (2017). Adanet: Adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning 70 874-883. JMLR.org.
[29] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2 303-314. · Zbl 0679.94019 · doi:10.1007/BF02551274
[30] Devroye, L. P. and Wagner, T. J. (1979). Distribution-free performance bounds for potential function rules. IEEE Trans. Inf. Theory 25 601-604. · Zbl 0432.62040 · doi:10.1109/TIT.1979.1056087
[31] De Fauw, J., Ledsam, J. R., Romera-Paredes, B., Nikolov, S., Tomasev, N., Blackwell, S., Askham, H., Glorot, X., O’Donoghue, B. et al. (2018). Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24 1342.
[32] Donoho, D. L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math. Chall. Lect. 1 32.
[33] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425-455. · Zbl 0815.62019 · doi:10.1093/biomet/81.3.425
[34] Du, S. S., Lee, J. D., Li, H., Wang, L. and Zhai, X. (2018). Gradient descent finds global minima of deep neural networks. Preprint. Available at arXiv:1811.03804.
[35] Duchi, J., Hazan, E. and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12 2121-2159. · Zbl 1280.68164
[36] E, W., Han, J. and Jentzen, A. (2017). Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Commun. Math. Stat. 5 349-380. · Zbl 1382.65016 · doi:10.1007/s40304-017-0117-6
[37] E, W., Ma, C. and Wang, Q. (2019). A priori estimates of the population risk for residual networks. Preprint. Available at arXiv:1903.02154.
[38] Eldan, R. and Shamir, O. (2016). The power of depth for feedforward neural networks. In Conference on Learning Theory 907-940.
[39] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[40] Feldman, V. and Vondrak, J. (2019). High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. Preprint. Available at arXiv:1902.10710.
[41] Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817-823.
[42] Fu, H., Chi, Y. and Liang, Y. (2018). Local geometry of one-hidden-layer neural networks for logistic regression. Preprint. Available at arXiv:1802.06463.
[43] Fukushima, K. and Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets 267-285. Springer, Berlin.
[44] Gao, C., Liu, J., Yao, Y. and Zhu, W. (2018). Robust estimation and generative adversarial nets. Preprint. Available at arXiv:1810.02030.
[45] Goel, S., Klivans, A. and Meka, R. (2018). Learning one convolutional layer with overlapping patches. Preprint. Available at arXiv:1802.02547.
[46] Golowich, N., Rakhlin, A. and Shamir, O. (2020). Size-independent sample complexity of neural networks. Inf. Inference 9 473-504. · Zbl 1528.68354 · doi:10.1093/imaiai/iaz007
[47] Golub, G. H. and Van Loan, C. F. (2013). Matrix Computations, 4th ed. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins Univ. Press, Baltimore, MD. · Zbl 1268.65037
[48] Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA. · Zbl 1373.68009
[49] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 2672-2680.
[50] Gunasekar, S., Lee, J., Soudry, D. and Srebro, N. (2018a). Characterizing implicit bias in terms of optimization geometry. Preprint. Available at arXiv:1802.08246.
[51] Gunasekar, S., Lee, J. D., Soudry, D. and Srebro, N. (2018b). Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems 9482-9491.
[52] Härdle, W., Hall, P. and Ichimura, H. (1993). Optimal smoothing in single-index models. Ann. Statist. 21 157-178. · Zbl 0770.62049 · doi:10.1214/aos/1176349020
[53] Hardt, M., Recht, B. and Singer, Y. (2015). Train faster, generalize better: Stability of stochastic gradient descent. Preprint. Available at arXiv:1509.01240.
[54] Hastie, T., Montanari, A., Rosset, S. and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. Preprint. Available at arXiv:1903.08560.
[55] He, K., Zhang, X., Ren, S. and Sun, J. (2016a). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770-778.
[56] He, K., Zhang, X., Ren, S. and Sun, J. (2016b). Identity mappings in deep residual networks. In European Conference on Computer Vision 630-645. Springer, Berlin.
[57] Hinton, G., Srivastava, N. and Swersky, K. (2012). Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent. Coursera lecture slides.
[58] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Preprint. Available at arXiv:1207.0580. · Zbl 1318.68153
[59] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9 1735-1780. · doi:10.1162/neco.1997.9.8.1735
[60] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Netw. 4 251-257.
[61] Huang, G., Liu, Z., Van Der Maaten, L. and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4700-4708.
[62] Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. 160 106-154.
[63] Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J. and Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. Preprint. Available at arXiv:1602.07360.
[64] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Preprint. Available at arXiv:1502.03167.
[65] Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 8580-8589.
[66] Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P. and Sidford, A. (2017). Accelerating stochastic gradient descent. Preprint. Available at arXiv:1704.08227. · Zbl 1469.68088
[67] Javanmard, A., Mondelli, M. and Montanari, A. (2019). Analysis of a two-layer neural network via displacement convexity. Preprint. Available at arXiv:1901.01375. · Zbl 1464.62401
[68] Ji, Z. and Telgarsky, M. (2018). Risk and parameter convergence of logistic regression. Preprint. Available at arXiv:1803.07300.
[69] Kidambi, R., Netrapalli, P., Jain, P. and Kakade, S. (2018). On the insufficiency of existing momentum schemes for stochastic optimization. In 2018 Information Theory and Applications Workshop (ITA) 1-9. IEEE.
[70] Kiefer, J. and Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23 462-466. · Zbl 0049.36601 · doi:10.1214/aoms/1177729392
[71] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. Preprint. Available at arXiv:1412.6980.
[72] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. Preprint. Available at arXiv:1312.6114.
[73] Klusowski, J. M. and Barron, A. R. (2016). Risk bounds for high-dimensional ridge function combinations including neural networks. Preprint. Available at arXiv:1607.01434.
[74] Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 1097-1105.
[75] Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications: Stochastic Modelling and Applied Probability, 2nd ed. Applications of Mathematics (New York) 35. Springer, New York. · Zbl 1026.62084
[76] LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature 521 436-444. · doi:10.1038/nature14539
[77] LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proc. IEEE 86 2278-2324.
[78] Li, Y., Swersky, K. and Zemel, R. (2015). Generative moment matching networks. In International Conference on Machine Learning 1718-1727.
[79] Li, H., Xu, Z., Taylor, G., Studer, C. and Goldstein, T. (2018a). Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems 6391-6401.
[80] Li, X., Lu, J., Wang, Z., Haupt, J. and Zhao, T. (2018b). On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond. Preprint. Available at arXiv:1806.05159.
[81] Liang, T. (2017). How well can generative adversarial networks (GAN) learn densities: A nonparametric view. Preprint. Available at arXiv:1712.08244.
[82] Lin, M., Chen, Q. and Yan, S. (2013). Network in network. Preprint. Available at arXiv:1312.4400.
[83] Lin, H. W., Tegmark, M. and Rolnick, D. (2017). Why does deep and cheap learning work so well? J. Stat. Phys. 168 1223-1247. · Zbl 1373.82061 · doi:10.1007/s10955-017-1836-5
[84] Maas, A. L., Hannun, A. Y. and Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML 30 3.
[85] Maiorov, V. E. and Meir, R. (2000). On the near optimality of the stochastic approximation of smooth functions by neural networks. Adv. Comput. Math. 13 79-103. · Zbl 0939.41013 · doi:10.1023/A:1018993908478
[86] Makovoz, Y. (1996). Random approximants and neural networks. J. Approx. Theory 85 98-109. · Zbl 0857.41024 · doi:10.1006/jath.1996.0031
[87] Mei, S., Misiakiewicz, T. and Montanari, A. (2019). Mean-field theory of two-layers neural networks: Dimension-free bounds and kernel limit. Preprint. Available at arXiv:1902.06015.
[88] Mei, S., Montanari, A. and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. USA 115 E7665-E7671. · Zbl 1416.92014 · doi:10.1073/pnas.1806579115
[89] Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Comput. 8 164-177.
[90] Mhaskar, H., Liao, Q. and Poggio, T. (2016). Learning functions: When is deep better than shallow. Preprint. Available at arXiv:1603.00988.
[91] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 529.
[92] Mondelli, M. and Montanari, A. (2018). On the connection between learning two-layers neural networks and tensor decomposition. Preprint. Available at arXiv:1802.07301.
[93] Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR 269 543-547.
[94] Neyshabur, B., Tomioka, R. and Srebro, N. (2015). Norm-based capacity control in neural networks. In Conference on Learning Theory 1376-1401.
[95] Nowozin, S., Cseke, B. and Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 271-279.
[96] Parberry, I. (1994). Circuit Complexity and Neural Networks. Foundations of Computing Series. MIT Press, Cambridge, MA. · Zbl 0864.68082
[97] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L. et al. (2017). Automatic differentiation in PyTorch.
[98] Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numer. 8 143-195. · Zbl 0959.68109 · doi:10.1017/S0962492900002919
[99] Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B. and Liao, Q. (2017). Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Int. J. Autom. Comput. 14 503-519.
[100] Poljak, B. T. (1964). Some methods of speeding up the convergence of iterative methods. USSR Comput. Math. Math. Phys. 4 1-17. · Zbl 0147.35301
[101] Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30 838-855. · Zbl 0762.62022 · doi:10.1137/0330046
[102] Polyak, B. T. and Tsypkin, Y. Z. (1979). Adaptive estimation algorithms (convergence, optimality, stability). Autom. Remote Control 3 71-84.
[103] Poultney, C., Chopra, S., LeCun, Y. et al. (2007). Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems 1137-1144.
[104] Reddi, S. J., Kale, S. and Kumar, S. (2018). On the convergence of Adam and beyond. In International Conference on Learning Representations.
[105] Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Stat. 22 400-407. · Zbl 0054.05901 · doi:10.1214/aoms/1177729586
[106] Rogers, W. H. and Wagner, T. J. (1978). A finite sample distribution-free performance bound for local discrimination rules. Ann. Statist. 6 506-514. · Zbl 0385.62041
[107] Rolnick, D. and Tegmark, M. (2017). The power of deeper networks for expressing natural functions. Preprint. Available at arXiv:1705.05502.
[108] Romano, Y., Sesia, M. and Candès, E. J. (2018). Deep knockoffs. Preprint. Available at arXiv:1811.06687. · Zbl 1452.62710
[109] Rotskoff, G. M. and Vanden-Eijnden, E. (2018). Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. Preprint. Available at arXiv:1805.00915.
[110] Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1985). Learning internal representations by error propagation. Technical report, California Univ. San Diego La Jolla Inst. for Cognitive Science.
[111] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A. et al. (2015). ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115 211-252. · doi:10.1007/s11263-015-0816-y
[112] Sak, H., Senior, A. and Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
[113] Salakhutdinov, R. and Hinton, G. (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics 448-455. · Zbl 1247.68223
[114] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. and Chen, X. (2016). Improved techniques for training GANs. In Advances in Neural Information Processing Systems 2234-2242.
[115] Schmidt-Hieber, J. (2017). Nonparametric regression using deep neural networks with ReLU activation function. Preprint. Available at arXiv:1708.06633. · Zbl 1459.62059
[116] Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge Univ. Press, Cambridge. · Zbl 1305.68005
[117] Shalev-Shwartz, S., Shamir, O., Srebro, N. and Sridharan, K. (2010). Learnability, stability and uniform convergence. J. Mach. Learn. Res. 11 2635-2670. · Zbl 1242.68247
[118] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M. et al. (2017). Mastering the game of Go without human knowledge. Nature 550 354.
[119] Silverman, B. W. (1998). Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. CRC Press, London. · doi:10.1007/978-1-4899-3324-9
[120] Singh, C., Murdoch, W. J. and Yu, B. (2018). Hierarchical interpretations for neural network predictions. Preprint. Available at arXiv:1806.05337.
[121] Sirignano, J. and Spiliopoulos, K. (2020). Mean field analysis of neural networks: A law of large numbers. SIAM J. Appl. Math. 80 725-752. · Zbl 1440.60008 · doi:10.1137/18M1192184
[122] Soltanolkotabi, M. (2017). Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems 2007-2017.
[123] Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S. and Srebro, N. (2018). The implicit bias of gradient descent on separable data. J. Mach. Learn. Res. 19 Paper No. 70, 57. · Zbl 1477.62192
[124] Sprecher, D. A. (1965). On the structure of continuous functions of several variables. Trans. Amer. Math. Soc. 115 340-355. · Zbl 0142.30401 · doi:10.2307/1994273
[125] Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10 1040-1053. · Zbl 0511.62048
[126] Stone, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 689-705. · Zbl 0605.62065 · doi:10.1214/aos/1176349548
[127] Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. Ann. Statist. 22 118-184. · Zbl 0827.62038 · doi:10.1214/aos/1176325361
[128] Sutskever, I., Martens, J., Dahl, G. and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning 1139-1147.
[129] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R. (2013). Intriguing properties of neural networks. Preprint. Available at arXiv:1312.6199.
[130] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1-9.
[131] Telgarsky, M. (2016). Benefits of depth in neural networks. Preprint. Available at arXiv:1602.04485.
[132] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. · Zbl 0850.62538
[133] Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264-280. · Zbl 0247.60005
[134] Vincent, P., Larochelle, H., Bengio, Y. and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning 1096-1103. ACM, New York.
[135] Wager, S., Wang, S. and Liang, P. S. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 351-359.
[136] Wilson, A. C., Roelofs, R., Stern, M., Srebro, N. and Recht, B. (2017). The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett, eds.) 4148-4158. Curran Associates, Red Hook.
[137] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q. et al. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. Preprint. Available at arXiv:1609.08144.
[138] Yosinski, J., Clune, J., Bengio, Y. and Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 3320-3328.
[139] Yosinski, J., Clune, J., Nguyen, A., Fuchs, T. and Lipson, H. (2015). Understanding neural networks through deep visualization. Preprint. Available at arXiv:1506.06579.
[140] Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. Preprint. Available at arXiv:1611.03530.
[141] Zhong, K., Song, Z., Jain, P., Bartlett, P. L. and Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the 34th International Conference on Machine Learning 70 4140-4149. JMLR.org.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.