
Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. (English) Zbl 1469.68088

Summary: This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, it presents a sharp analysis of: (1) mini-batching, a method that averages many stochastic gradient samples, both to reduce the variance of the stochastic gradient estimate and to parallelize SGD, and (2) tail-averaging, a method that averages the final few iterates of SGD to decrease the variance of SGD’s final iterate. This work presents sharp finite-sample generalization error bounds for these schemes for the stochastic approximation problem of least squares regression.
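To make the two schemes concrete, the following is a minimal sketch (not the paper’s exact algorithm, step-size choice, or constants) of mini-batch SGD with tail-averaging for least squares regression; the routine name tail_averaged_sgd and the parameters batch_size and tail_fraction are illustrative assumptions.

```python
# Minimal sketch: mini-batch SGD with tail-averaging for least squares,
# i.e. approximately minimizing (1/2) E[(y - <w, x>)^2] from samples.
# The routine name, parameters, and constant step size are illustrative
# assumptions, not the paper's exact prescriptions.
import numpy as np

def tail_averaged_sgd(X, y, step_size, batch_size=1, tail_fraction=0.5):
    n, d = X.shape
    w = np.zeros(d)
    iterates = []
    num_steps = n // batch_size
    for t in range(num_steps):
        # Mini-batching: average batch_size stochastic gradients per update.
        idx = slice(t * batch_size, (t + 1) * batch_size)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch_size
        w = w - step_size * grad
        iterates.append(w.copy())
    # Tail-averaging: average only the last tail_fraction of the iterates
    # to damp the variance of the final iterate.
    start = int((1 - tail_fraction) * num_steps)
    return np.mean(iterates[start:], axis=0)
```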
Furthermore, this work establishes a precise problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. This characterization is used to understand the relationship between the learning rate and the batch size when considering the excess risk of the final iterate of an SGD procedure. Next, the mini-batching characterization is used to provide a highly parallelizable SGD method that achieves the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD-style methods. Following this, a non-asymptotic excess risk bound is provided for model averaging, a communication-efficient parallelization scheme.
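The model-averaging scheme mentioned above can be pictured as follows; this is a hedged sketch, assuming each worker runs the illustrative tail_averaged_sgd routine from the previous snippet on its own data shard and the resulting estimates are averaged in a single communication round (the function name model_averaging is hypothetical).

```python
# Hedged sketch of model averaging (parameter mixing): each worker runs SGD
# on its own data shard, and only the final estimates are communicated and
# averaged, so a single round of communication suffices. Assumes numpy and
# the illustrative tail_averaged_sgd routine from the previous sketch.
import numpy as np

def model_averaging(X, y, num_workers, step_size, batch_size=1):
    shards_X = np.array_split(X, num_workers)
    shards_y = np.array_split(y, num_workers)
    local_estimates = [
        tail_averaged_sgd(Xs, ys, step_size, batch_size)
        for Xs, ys in zip(shards_X, shards_y)
    ]
    # One communication round: average the per-worker estimates.
    return np.mean(local_estimates, axis=0)
```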
Finally, this work sheds light on fundamental differences in SGD’s behavior when dealing with misspecified models, that is, the non-realizable least squares problem. The paper shows that the maximal stepsizes ensuring minimax risk in the misspecified case must depend on the noise properties.
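For orientation, the misspecified (non-realizable) least squares setup can be written in standard notation as below; this is a sketch of the usual quantities, not the paper’s exact conditions or constants.

```latex
% Standard notation for the misspecified least squares problem (a sketch,
% not the paper's exact conditions or constants).
\[
  w^{*} \;=\; \arg\min_{w}\; \tfrac{1}{2}\,\mathbb{E}\big[(y - \langle w, x\rangle)^{2}\big],
  \qquad
  \epsilon \;=\; y - \langle w^{*}, x\rangle,
  \qquad
  H \;=\; \mathbb{E}\big[x x^{\top}\big],
  \qquad
  \Sigma \;=\; \mathbb{E}\big[\epsilon^{2}\, x x^{\top}\big].
\]
```

In the well-specified case the noise is independent of the covariates and \(\Sigma = \sigma^{2} H\), so an admissible stepsize can be chosen from the covariate distribution alone; in the misspecified case \(\Sigma\) need not be comparable to \(H\), which gives one intuition for why the maximal stepsize guaranteeing minimax risk must also reflect the noise.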
The analysis tools used in this paper generalize the operator view of averaged SGD (Défossez and Bach, 2015) and develop a novel analysis that bounds these operators to characterize the generalization error. These techniques are of broader interest for analyzing various computational aspects of stochastic approximation.
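The operator view referred to above can be sketched in standard notation for SGD on least squares with constant stepsize \(\gamma\) (a sketch, not the paper’s exact development): the centered recursion unrolls into a bias term and a variance term, each governed by products of random linear operators, and bounding these products is what yields the finite-sample bounds.

```latex
% Sketch of the operator view of SGD on least squares (standard notation).
% With A_s := I - \gamma x_s x_s^T, the centered recursion and its unrolled
% bias-variance decomposition read:
\[
  w_{t} - w^{*} \;=\; A_{t}\,(w_{t-1} - w^{*}) \;+\; \gamma\,\epsilon_{t}\,x_{t},
  \qquad A_{s} \;:=\; I - \gamma\, x_{s} x_{s}^{\top},
\]
\[
  w_{t} - w^{*}
  \;=\; \underbrace{A_{t} A_{t-1} \cdots A_{1}\,(w_{0} - w^{*})}_{\text{bias: decay of the initial error}}
  \;+\; \underbrace{\gamma \sum_{s=1}^{t} A_{t} \cdots A_{s+1}\,\epsilon_{s}\, x_{s}}_{\text{variance: accumulated noise}} .
\]
```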

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62J05 Linear regression; mixed models
68W10 Parallel algorithms in computer science
90C15 Stochastic programming

Software:

DiSCO; HOGWILD; Saga

References:

[1] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 2012. · Zbl 1365.94132
[2] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. CoRR, abs/1603.05953, 2016. · Zbl 1369.68273
[3] Dan Anbar. On Optimal Estimation Methods Using Stochastic Approximation Procedures. University of California, 1971. · Zbl 0277.62064
[4] Francis Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Neural Information Processing Systems (NIPS) 24, 2011.
[5] Francis R. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. In Journal of Machine Learning Research (JMLR), volume 15, 2014. · Zbl 1318.62224
[6] Francis R. Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Neural Information Processing Systems (NIPS) 26, 2013.
[7] Rajendra Bhatia. Positive Definite Matrices. Princeton Series in Applied Mathematics. Princeton University Press, 2007. · Zbl 1133.15017
[8] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Neural Information Processing Systems (NIPS) 20, 2007.
[9] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016. · Zbl 1397.65085
[10] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In International Conference on Machine Learning (ICML), 2011.
[11] Augustin-Louis Cauchy. Méthode générale pour la résolution des systèmes d’équations simultanées. C. R. Acad. Sci. Paris, 1847.
[12] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Neural Information Processing Systems (NIPS) 24, 2011.
[13] Aaron Defazio. A simple practical accelerated method for finite sums. In Neural Information Processing Systems (NIPS) 29, 2016.
[14] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Neural Information Processing Systems (NIPS) 27, 2014.
[15] Alexandre Défossez and Francis R. Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics (AISTATS), 2015.
[16] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research (JMLR), volume 13, 2012. · Zbl 1283.68404
[17] Aymeric Dieuleveut and Francis Bach. Non-parametric stochastic approximation with large step sizes. The Annals of Statistics, 2015. · Zbl 1346.60041
[18] John C. Duchi, Sorathan Chaturapruek, and Christopher Ré. Asynchronous stochastic convex optimization. CoRR, abs/1508.00882, 2015.
[19] Vaclav Fabian. Asymptotically efficient stochastic approximation; the RM case. Annals of Statistics, 1(3), 1973. · Zbl 0258.62048
[20] Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning (ICML), 2015a.
[21] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Competing with the empirical risk minimizer in a single pass. In Conference on Learning Theory (COLT), 2015b.
[22] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[23] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. In Conference on Learning Theory (COLT), 2016a.
[24] Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic approximation through mini-batching and tail-averaging. arXiv preprint arXiv:1610.03774, 2016b.
[25] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Venkata Krishna Pillutla, and Aaron Sidford. A Markov chain theory approach to characterizing the minimax optimality of stochastic gradient descent (for least squares). arXiv preprint arXiv:1710.09430, 2017a.
[26] Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent. arXiv preprint arXiv:1704.08227, 2017b.
[27] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Neural Information Processing Systems (NIPS) 26, 2013.
[28] Harold J. Kushner and Dean S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978. · Zbl 0381.60004
[29] Harold J. Kushner and G. Yin. Asymptotic properties of distributed and communicating stochastic approximation algorithms. SIAM Journal on Control and Optimization, 25(5):1266–1290, 1987. · Zbl 0637.62078
[30] Harold J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, 2003. · Zbl 1026.62084
[31] Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer, 1998. · Zbl 0916.62017
[32] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient mini-batch training for stochastic optimization. In Knowledge Discovery and Data Mining (KDD), 2014.
[33] Hongzhou Lin, Julien Mairal, and Zaïd Harchaoui. A universal catalyst for first-order optimization. In Neural Information Processing Systems (NIPS), 2015.
[34] Gideon Mann, Ryan T. McDonald, Mehryar Mohri, Nathan Silberman, and Dan Walker. Efficient large-scale distributed training of conditional maximum entropy models. In Neural Information Processing Systems (NIPS) 22, 2009.
[35] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
[36] Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, volume 155, 2016. · Zbl 1333.65070
[37] Arkadi S. Nemirovsky and David B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983. · Zbl 0501.90062
[38] Yurii E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR, 269, 1983.
[39] Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Neural Information Processing Systems (NIPS) 24, 2011.
[40] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4, 1964.
[41] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, volume 30, 1992. · Zbl 0762.62022
[42] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Stat., vol. 22, 1951. · Zbl 0054.05901
[43] Jonathan Rosenblatt and Boaz Nadler. On the optimality of averaging in distributed statistical learning. CoRR, abs/1407.2724, 2014. · Zbl 1426.68241
[44] Nicolas Le Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Neural Information Processing Systems (NIPS) 25, 2012.
[45] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Tech. Report, ORIE, Cornell University, 1988.
[46] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. CoRR, abs/1209.1873, 2012. · Zbl 1307.68073
[47] Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Neural Information Processing Systems (NIPS) 26, 2013a. · Zbl 1307.68073
[48] Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Neural Information Processing Systems (NIPS) 26, 2013b. · Zbl 1307.68073
[49] Samuel L Smith, Pieter-Jan Kindermans, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
[50] Martin Takáč, Avleen Singh Bijral, Peter Richtárik, and Nati Srebro. Mini-batch primal and dual methods for SVMs. In International Conference on Machine Learning (ICML), volume 28, 2013.
[51] Martin Takáč, Peter Richtárik, and Nati Srebro. Distributed mini-batch SDCA. CoRR, abs/1507.08322, 2015.
[52] Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000. · Zbl 1013.62031
[53] Yuchen Zhang and Lin Xiao. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning (ICML), 2015.
[54] Yuchen Zhang, John C. Duchi, and Martin Wainwright. Divide and conquer ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research (JMLR), volume 16, 2015. · Zbl 1351.62142
[55] Martin A. Zinkevich, Alex Smola, Markus Weimer, and Lihong Li. Parallelized stochastic gradient descent. In Neural Information Processing Systems (NIPS) 24, 2011.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.