
Improved complexities for stochastic conditional gradient methods under interpolation-like conditions. (English) Zbl 1525.90314

Summary: We analyze stochastic conditional gradient methods for constrained optimization problems arising in over-parametrized machine learning. We show that one can leverage the interpolation-like conditions satisfied by such models to obtain improved oracle complexities. Specifically, when the objective function is convex, we show that the conditional gradient method requires \(\mathcal{O}(\epsilon^{-2})\) calls to the stochastic gradient oracle to find an \(\epsilon\)-optimal solution. Furthermore, by including a gradient sliding step, we show that the number of calls reduces to \(\mathcal{O}(\epsilon^{-1.5})\).
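The complexity results are stated for the conditional gradient (Frank-Wolfe) template: each iteration forms a stochastic gradient estimate and then calls a linear minimization oracle (LMO) over the constraint set instead of a projection. The following is a minimal sketch of that generic template, not the paper's specific algorithm; the names `grad_oracle`, `lmo`, `batch_size`, the fixed mini-batch, and the step size \(\gamma_t = 2/(t+2)\) are illustrative assumptions rather than the batch-size and step-size schedules used in the paper's analysis.

```python
import numpy as np

def stochastic_frank_wolfe(grad_oracle, lmo, x0, num_iters=1000, batch_size=64):
    # Generic stochastic conditional gradient (Frank-Wolfe) loop.
    # grad_oracle(x, m): mini-batch stochastic gradient at x built from m samples.
    # lmo(g): linear minimization oracle, returns argmin_{v in C} <g, v> over the constraint set C.
    x = np.asarray(x0, dtype=float)
    for t in range(num_iters):
        g = grad_oracle(x, batch_size)      # calls to the stochastic gradient oracle
        v = lmo(g)                          # projection-free step: solve a linear problem over C
        gamma = 2.0 / (t + 2)               # standard open-loop step size (illustrative choice)
        x = (1.0 - gamma) * x + gamma * v   # convex combination, so x stays feasible
    return x

# Illustrative instance: least squares over the l1 ball of radius r,
# whose LMO returns a signed, scaled coordinate vertex.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)
r = 5.0

def grad_oracle(x, m):
    idx = rng.integers(0, A.shape[0], size=m)   # sample a mini-batch of rows
    Ai = A[idx]
    return Ai.T @ (Ai @ x - b[idx]) / m         # unbiased estimate of the gradient of (1/(2n))||Ax - b||^2

def lmo(g):
    i = int(np.argmax(np.abs(g)))
    v = np.zeros_like(g)
    v[i] = -r * np.sign(g[i])                   # best vertex of the l1 ball for direction g
    return v

x_hat = stochastic_frank_wolfe(grad_oracle, lmo, np.zeros(50), num_iters=2000)
```

The sketch makes no attempt to reproduce the improved \(\mathcal{O}(\epsilon^{-2})\) and \(\mathcal{O}(\epsilon^{-1.5})\) oracle complexities; those follow from the interpolation-like conditions and, for the latter, the gradient sliding modification analyzed in the paper.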

MSC:

90C15 Stochastic programming
90C52 Methods of reduced gradient type

Software:

Spearmint
