
Block stochastic gradient iteration for convex and nonconvex optimization. (English) Zbl 1342.93125

Summary: The stochastic gradient (SG) method can quickly solve a problem with a large number of components in the objective, or a stochastic optimization problem, to a moderate accuracy. The block coordinate descent/update (BCD) method, on the other hand, can quickly solve problems with multiple (blocks of) variables. This paper proposes a block SG (BSG) method, for both convex and nonconvex programs, that combines the strengths of SG and BCD for problems with many components in the objective and with multiple (blocks of) variables. BSG generalizes SG by updating all the blocks of variables in a Gauss-Seidel fashion (the update of the current block depends on the blocks already updated), in either a fixed or randomly shuffled order. Although BSG requires slightly more work per iteration, it typically outperforms SG because of its Gauss-Seidel updates and larger step sizes, the latter being determined by the smaller per-block Lipschitz constants. The convergence of BSG is established for both convex and nonconvex cases. In the convex case, BSG has the same order of convergence rate as SG. In the nonconvex case, its convergence is established in terms of the expected violation of a first-order optimality condition. In both cases, our analysis is nontrivial since the typical unbiasedness assumption no longer holds. BSG is numerically evaluated on the following problems: stochastic least squares and logistic regression, which are convex, and low-rank tensor recovery and bilinear logistic regression, which are nonconvex. On the convex problems, BSG performed significantly better than SG. On the nonconvex problems, BSG significantly outperformed the deterministic BCD method because the latter tends to stagnate early near local minimizers. Overall, BSG inherits the benefits of both SG approximation and block coordinate updates and is especially useful for solving large-scale nonconvex problems.
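To make the Gauss-Seidel block update concrete, the following is a minimal Python/NumPy sketch of one possible BSG-style loop on a synthetic stochastic least-squares instance (one of the convex test problems mentioned above). The equal-sized block partition, the minibatch size, and the 1/(L_i sqrt(k)) step-size rule are illustrative assumptions for this sketch, not the authors' exact algorithm or parameter choices.

# Minimal sketch of a block stochastic gradient (BSG) style iteration on a
# synthetic stochastic least-squares problem.  The problem setup, block
# partition, and step-size rule are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: minimize E[(a^T x - b)^2 / 2] over x.
n_samples, dim = 10_000, 60
A = rng.standard_normal((n_samples, dim))
x_true = rng.standard_normal(dim)
b = A @ x_true + 0.01 * rng.standard_normal(n_samples)

# Partition the variables into blocks (equal-sized here for simplicity).
n_blocks = 4
blocks = np.array_split(np.arange(dim), n_blocks)

x = np.zeros(dim)
batch_size = 64
for k in range(1, 501):
    # One minibatch per outer iteration; all blocks reuse it.
    idx = rng.choice(n_samples, size=batch_size, replace=False)
    A_k, b_k = A[idx], b[idx]
    # Gauss-Seidel sweep in a randomly shuffled order: each block's
    # stochastic gradient uses the blocks already updated in this sweep,
    # since x is modified in place.
    for i in rng.permutation(n_blocks):
        blk = blocks[i]
        residual = A_k @ x - b_k                      # uses the freshest x
        grad_blk = A_k[:, blk].T @ residual / batch_size
        # Per-block step size ~ 1/(L_i * sqrt(k)), with L_i estimated from
        # the minibatch's block columns (an assumption for this demo).
        L_blk = np.linalg.norm(A_k[:, blk], 2) ** 2 / batch_size
        x[blk] -= (1.0 / (L_blk * np.sqrt(k))) * grad_blk

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))

In this sketch the larger, block-specific step sizes come from estimating a Lipschitz constant per block rather than one global constant, which is the mechanism the summary credits for BSG's advantage over plain SG.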

MSC:

93E25 Computational methods in stochastic control (MSC2010)
93E20 Optimal stochastic control
49M05 Numerical methods based on necessary conditions
90C15 Stochastic programming
90C25 Convex programming
90C26 Nonconvex programming, global optimization
90C30 Nonlinear programming
65K05 Numerical mathematical programming methods
65K10 Numerical optimization and variational techniques
65B99 Acceleration of convergence in numerical analysis
