
Stabilized sparse online learning for sparse data. (English) Zbl 1435.68276

Summary: Stochastic gradient descent (SGD) is commonly used for optimization in large-scale machine learning problems. J. Langford et al. [ibid. 10, 777–801 (2009; Zbl 1235.68167)] introduce a sparse online learning method that induces sparsity via a truncated gradient. With high-dimensional sparse data, however, this method suffers from slow convergence and high variance due to heterogeneity in feature sparsity. To mitigate this issue, we introduce a stabilized truncated stochastic gradient descent algorithm. We employ a soft-thresholding scheme on the weight vector in which the imposed shrinkage adapts to the amount of information available in each feature. The variability of the resulting sparse weight vector is further controlled by stability selection integrated with the informative truncation. To facilitate better convergence, we adopt an annealing strategy on the truncation rate, which leads to a balanced trade-off between exploration and exploitation in learning a sparse weight vector. Numerical experiments show that our algorithm compares favorably with the original truncated gradient SGD in terms of prediction accuracy while achieving both better sparsity and stability.
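
The following is a minimal sketch (in Python/NumPy, not the authors' implementation) of the core mechanics summarized above: truncated-gradient SGD in which the soft-thresholding applied every K steps is scaled per feature by the number of informative (nonzero) updates that feature received in the current window, with the base truncation rate annealed over time. The stability-selection step that aggregates sparsity patterns across runs is omitted, and the squared-loss model, function names, and parameter defaults are all assumptions for illustration.

```python
import numpy as np

def stabilized_truncated_sgd(X, y, eta=0.01, K=10, g=0.05,
                             anneal=0.99, n_epochs=5, seed=0):
    """Sketch of truncated-gradient SGD with feature-wise ("informative")
    soft-thresholding and an annealed truncation rate, for squared loss on
    a linear model. All names and default values are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    k = np.zeros(d)          # per-feature count of nonzero updates in the window
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            x = X[i]
            w -= eta * (w @ x - y[i]) * x   # plain SGD step on 0.5*(w.x - y)^2
            k += (x != 0)                   # information received per feature
            t += 1
            if t % K == 0:
                # informative truncation: shrink each coordinate only in
                # proportion to the number of nonzero updates it saw in this
                # window, so rarely observed features are not over-penalized
                alpha = eta * g * k
                w = np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)
                k[:] = 0
                g *= anneal                 # anneal the truncation rate
    return w

# toy usage on synthetic sparse data
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((200, 50)) * (rng.random((200, 50)) < 0.1)  # sparse design
    w_true = np.zeros(50)
    w_true[:5] = 1.0
    y = X @ w_true + 0.01 * rng.standard_normal(200)
    w_hat = stabilized_truncated_sgd(X, y)
    print("nonzero coefficients:", np.count_nonzero(w_hat))
```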

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62J07 Ridge regression; shrinkage estimators (Lasso)
68T09 Computational aspects of data analysis and big data
68W27 Online algorithms; streaming algorithms

Citations:

Zbl 1235.68167

Software:

AdaGrad

References:

[1] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003. · Zbl 1046.90057
[2] Léon Bottou. Online learning and stochastic approximations. Online Learning in Neural Networks, 17(9):142, 1998. · Zbl 0968.68127
[3] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177-186. Springer, 2010. · Zbl 1436.68293
[4] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499-526, 2002. · Zbl 1007.68083
[5] Deng Cai and Xiaofei He. Manifold adaptive experimental design for text categorization. IEEE Transactions on Knowledge and Data Engineering, 24(4):707-719, 2012.
[6] Bob Carpenter. Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Alias-i, Inc., Tech. Rep., pages 1-20, 2008.
[7] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of online learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004. · Zbl 1295.68182
[8] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46, 1960.
[9] John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009. · Zbl 1235.62151
[10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011. · Zbl 1280.68164
[11] John C Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pages 14-26. Citeseer, 2010.
[12] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545-552, 2004.
[13] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
[14] Samuel Kutin and Partha Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 275-282. Morgan Kaufmann Publishers Inc., 2002.
[15] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems, pages 905-912, 2009. · Zbl 1235.68167
[16] Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964-979, 1979. · Zbl 0426.65050
[17] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417-473, 2010. · Zbl 1411.62142
[18] Andrew P Morris and Eleftheria Zeggini. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology, 34(2):188-193, 2010.
[19] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009. · Zbl 1191.90038
[20] Hidekazu Oiwa, Shin Matsushima, and Hiroshi Nakagawa. Frequency-aware truncated methods for sparse online learning. In Machine Learning and Knowledge Discovery in Databases, pages 533-548. Springer, 2011. · Zbl 1343.68202
[21] Hidekazu Oiwa, Shin Matsushima, and Hiroshi Nakagawa. Healing truncation bias: self-weighted truncation framework for dual averaging. In 2012 IEEE Twelfth International Conference on Data Mining (ICDM), pages 575-584. IEEE, 2012.
[22] Alexander Rakhlin, Sayan Mukherjee, and Tomaso Poggio. Stability results in learning theory. Analysis and Applications, 3(04):397-417, 2005. · Zbl 1101.68621
[23] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1-regularized loss minimization. The Journal of Machine Learning Research, 12:1865-1892, 2011. · Zbl 1280.62081
[24] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635-2670, 2010. · Zbl 1242.68247
[25] Wei Sun, Junhui Wang, and Yixin Fang. Consistent selection of tuning parameters via variable selection stability. The Journal of Machine Learning Research, 14(1):3419-3440, 2013. · Zbl 1318.62241
[26] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267-288, 1996. · Zbl 0850.62538
[27] Panos Toulis, Dustin Tran, and Edoardo M Airoldi. Stability and optimality in stochastic gradient descent. arXiv preprint arXiv:1505.02417, 2015. · Zbl 1332.62291
[28] Dayong Wang, Pengcheng Wu, Peilin Zhao, and Steven CH Hoi. A framework of sparse online learning and its applications. arXiv preprint arXiv:1507.07146, 2015.
[29] Jialei Wang, Peilin Zhao, Steven CH Hoi, and Rong Jin. Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering, 26(3):698-710, 2014.
[30] Yue Wu, Steven CH Hoi, Tao Mei, and Nenghai Yu. Large-scale online feature selection for ultrahigh dimensional sparse data. arXiv preprint arXiv:1409.7794, 2014.
[31] Lin Xiao. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, pages 2116-2124, 2009.
[32] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM, 2004.
[33] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595-2603, 2010.