
Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization. (English) Zbl 1459.62156

Summary: Composite penalties are widely used to induce structured solutions in the empirical risk minimization (ERM) framework in machine learning. Although such composite regularizers capture structured sparsity well, they are often nonsmooth and even nonconvex, which makes the resulting problems difficult to optimize. The proximal average (PA) is a recently proposed approximation technique for these regularizers that is tractable both to implement and to analyze theoretically. However, existing PA-based methods, despite handling composite penalties better than traditional techniques, either converge slowly or do not scale well to large datasets. To make PA a practical tool for optimizing ERM with composite penalties, this paper proposes a new PA-based algorithm, IncrePA, which incorporates the PA approximation into an incremental gradient framework. The resulting method has lower per-iteration cost, achieves a faster convergence rate for convex composite penalties, and is guaranteed to converge even for nonconvex composite penalties. Experiments on both synthetic and real datasets demonstrate the efficacy of the proposed method in optimizing convex and nonconvex ERM with composite penalties.
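To fix ideas, the following display is a sketch of the setting in standard notation from the proximal-average literature; the symbols \(f_i\), \(r_k\), \(\alpha_k\) and \(\mu\) are generic and need not coincide with the paper's own notation. The problem is the composite-penalty ERM objective
\[
\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n} f_i(w) + \bar r(w),
\qquad \bar r(w)=\sum_{k=1}^{K}\alpha_k r_k(w),\quad \alpha_k\ge 0,\ \sum_{k=1}^{K}\alpha_k=1,
\]
where each \(f_i\) is a smooth loss and the \(r_k\) are simple (possibly nonsmooth, possibly nonconvex) penalties. The proximal average \(\hat r_\mu\) with smoothing parameter \(\mu>0\) replaces \(\bar r\) by the function whose Moreau envelope is the weighted average of the envelopes of the \(r_k\); its key computational property is
\[
\operatorname{prox}_{\mu \hat r_\mu}(x)=\sum_{k=1}^{K}\alpha_k\,\operatorname{prox}_{\mu r_k}(x),
\qquad \operatorname{prox}_{\mu r}(x)=\operatorname*{arg\,min}_{u}\Big\{\tfrac{1}{2\mu}\|u-x\|^2+r(u)\Big\},
\]
so that one proximal step on the composite penalty reduces to \(K\) simple proximal steps, while the approximation error \(\bar r-\hat r_\mu\) is of order \(\mu\) for Lipschitz convex penalties. IncrePA applies such PA proximal steps inside an incremental gradient scheme, so each iteration touches only one component \(f_i\) rather than the full sum.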

MSC:

62L20 Stochastic approximation
68T05 Learning and adaptive systems in artificial intelligence
90C52 Methods of reduced gradient type

Software:

Finito; Saga
