Entropic gradient descent algorithms and wide flat minima. (English) Zbl 1539.68318

Summary: The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests that they generalize better than sharp minima. In this work we first discuss the relationship between two alternative measures of flatness: the local entropy, which is useful for analysis and algorithm development, and the local energy, which is easier to compute and was shown empirically, in extensive tests on state-of-the-art networks, to be the best predictor of generalization capabilities. We show semi-analytically, in simple controlled scenarios, that these two measures correlate strongly with each other and with generalization. We then extend the analysis to the deep learning setting through extensive numerical validation. We study two algorithms, entropy-stochastic gradient descent (Entropy-SGD) and replicated stochastic gradient descent (replicated SGD), that explicitly include the local entropy in the optimization objective. We devise a training schedule with which we consistently find flatter minima (under both flatness measures) and reduce the generalization error for common architectures (e.g. ResNet, EfficientNet).
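
To make the local-entropy objective mentioned above concrete: in the Entropy-SGD literature the quantity being optimized is, up to constants, F(w) = log ∫ exp(-β L(w') - (γ/2)‖w' - w‖²) dw', whose gradient with respect to w is proportional to γ(⟨w'⟩ - w). The following is a minimal sketch, not the authors' implementation, of one such update in PyTorch: a short stochastic-gradient Langevin chain samples a replica w' around the current weights and the outer step moves w toward the replica average. The model, loss function, batch, and all hyper-parameter names and values are illustrative assumptions.

```python
# Hedged sketch of an Entropy-SGD-style step (illustrative, not the paper's code).
import copy
import torch

def entropy_sgd_step(model, loss_fn, batch, outer_lr=0.1, gamma=1e-3,
                     inner_steps=5, inner_lr=0.1, noise_std=1e-4, ema=0.25):
    x, y = batch
    replica = copy.deepcopy(model)                            # w' initialized at w
    mu = [p.detach().clone() for p in replica.parameters()]   # running estimate of <w'>

    for _ in range(inner_steps):
        replica.zero_grad()
        loss_fn(replica(x), y).backward()                     # gradient of L(w')
        with torch.no_grad():
            for p, w, m in zip(replica.parameters(), model.parameters(), mu):
                # Langevin step on the coupled loss L(w') + (gamma/2) * ||w' - w||^2
                p.add_(-inner_lr * (p.grad + gamma * (p - w)))
                p.add_(noise_std * torch.randn_like(p))
                # exponential moving average approximating <w'>
                m.mul_(1 - ema).add_(ema * p)

    with torch.no_grad():
        for w, m in zip(model.parameters(), mu):
            # ascend the local entropy: grad_w F is proportional to gamma * (<w'> - w)
            w.add_(-outer_lr * gamma * (w - m))
```

In this reading, the inner chain probes the volume of low-loss configurations around w, so the outer update is biased toward wide, flat regions rather than toward the sharpest nearby minimum; replicated SGD pursues the same goal by coupling several interacting replicas instead of a single sampled one.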

MSC:

68T07 Artificial neural networks and deep learning
90C26 Nonconvex programming, global optimization
90C90 Applications of mathematical programming
