Abstract
We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy \({\epsilon}\) is \({\tilde{O}(1 / \epsilon)}\), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require \({\Omega(1 / \epsilon^2)}\) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is \({\tilde{O}(d/(\lambda \epsilon))}\), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
Similar content being viewed by others
References
Amari S.: Natural gradient works efficiently in learning. Neural Comput. 10, 251–276 (1998)
Bordes A., Ertekin S., Weston J., Bottou L.: Fast kernel classifiers with online and active learning. J. Mach. Learn. Res. 6, 1579–1619 (2005)
Bottou L.: Online Algorithms and Stochastic Approximations. In: Saad, D. (eds) Online learning and neural networks, Cambridge University Press, Cambridge (1998)
Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems 20, pp. 161–168 (2008)
Bottou L., LeCun Y.: Large scale online learning. In: Thrun, S., Saul, L., Schölkopf, B. (eds) Advances in Neural Information Processing Systems 16, MIT Press, Cambridge (2004)
Bottou L., Murata N.: Stochastic approximations and efficient learning. In: Arbib, M.A. (eds) The Handbook of Brain Theory and Neural Networks, The MIT Press, Cambridge (2002)
Boyd S., Vandenberghe L.: Convex Optimization, 2nd edn. Cambridge University Press, Cambridge (2004)
Censor Y., Zenios S.: Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, New York (1997)
Cesa-Bianchi N., Conconi A., Gentile C.: On the generalization ability of on-line learning algorithms. IEEE Trans. Inf, Theory 50(9), 2050–2057 (2004)
Chapelle, O.: Training a support vector machine in the primal. Neural Comput. 19(5), 1155–1178 (2007). doi:10.1162/neco.2007.19.5.1155. http://www.mitpressjournals.org/doi/abs/10.1162/neco.2007.19.5.1155
Crammer K., Dekel O., Keshet J., Shalev-Shwartz S., Singer Y.: Online passive aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)
Cristianini N., Shawe-Taylor J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge (2000)
Do, C., Le, Q., Foo, C.: Proximal regularization for online and batch learning. In: Proceedings of the 26th International Conference on Machine Learning (2009)
Duda R.O., Hart P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
Fine S., Scheinberg K.: Efficient SVM training using low-rank kernel representations. J. Mach. Lear. Res. 2, 242–264 (2001)
Freund Y., Schapire R.E.: Large margin classification using the perceptron algorithm. Mach. Learn. 37(3), 277–296 (1999)
Hazan, E., Kalai, A., Kale, S., Agarwal, A.: Logarithmic regret algorithms for online convex optimization. In: Proceedings of the Nineteenth Annual Conference on Computational Learning Theory (2006)
Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: ICML, pp. 408–415 (2008)
Hush, D., Kelly, P., Scovel, C., Steinwart, I.: Qp algorithms with guaranteed accuracy and run time for support vector machines. J. Mach. Learn. Res. (2006)
Joachims T.: Making large-scale support vector machine learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds) Advances in Kernel Methods—Support Vector Learning., MIT Press, Cambridge (1998)
Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), pp. 216–226 (2006)
Kakade, S., Tewari, A.: On the generalization ability of online strongly convex programming algorithms. In: Advances in Neural Information Processing Systems 22 (2009)
Kimeldorf G., Wahba G.: Some results on tchebycheffian spline functions. J. Math. Anal. Appl. 33, 82–95 (1971)
Kivinen J., Smola A.J., Williamson R.C.: Online learning with kernels. IEEE Trans. Signal Process. 52(8), 2165–2176 (2002)
Kushner H., Yin G.: Stochastic Approximation Algorithms and Applications. Springer, New York (1997)
Murata N.: A statistical study of on-line learning. In: Saad, D. (eds) Online Learning and Neural Networks, Cambridge University Press, Cambridge (1998)
Murata N., Amari S.: Statistical analysis of learning dynamics. Signal Process. 74(1), 3–28 (1999)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Tech. rep., Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL) (2005)
Platt J.C.: Fast training of Support Vector Machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds) Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge (1998)
Rockafellar R.: Convex Analysis. Princeton University Press, Princeton (1970)
Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, pp. 807–814 (2007)
Shalev-Shwartz, S., Srebro, N.: SVM optimization: inverse dependence on training set size. In: Proceedings of the 25th International Conference on Machine Learning, pp. 928–935 (2008)
Smola, A., Vishwanathan, S., Le, Q.: Bundle methods for machine learning. In: Advances in Neural Information Processing Systems 21 (2007)
Spall J.C.: Introduction to Stochastic Search and Optimization. Wiley, New York (2003)
Sridharan, K., Srebro, N., Shalev-Shwartz, S.: Fast rates for regularized objectives. In: Advances in Neural Information Processing Systems 22 (2009)
Vapnik V.N.: Statistical Learning Theory. Wiley, New York (1998)
Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning (2004)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Shalev-Shwartz, S., Singer, Y., Srebro, N. et al. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127, 3–30 (2011). https://doi.org/10.1007/s10107-010-0420-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10107-010-0420-4