RidgeSketch: a fast sketching based solver for large scale ridge regression. (English) Zbl 1508.62001

Summary: We propose new variants of the sketch-and-project method for solving large scale ridge regression problems. First, we propose a new momentum alternative and provide a theorem showing it can speed up the convergence of sketch-and-project through a fast sublinear convergence rate. We carefully delimit the settings under which this new sublinear rate is faster than the previously known linear convergence rate of sketch-and-project without momentum. Second, we consider combining the sketch-and-project method with modern sketching methods such as the Count sketch, the SubCount sketch (a new method we propose), and subsampled Hadamard transforms. We show experimentally that, when combined with the sketch-and-project method, the (Sub)Count sketch is very effective on sparse data and the standard Subsample sketch is effective on dense data. Indeed, we show that these sketching methods, combined with our new momentum scheme, result in methods that are competitive even when compared to the conjugate gradient method on real large scale data. In contrast, we show that the subsampled Hadamard transform does not perform well in this setting, despite the use of fast Hadamard transforms, nor do recently proposed acceleration schemes work well in practice. To support all of our experimental findings, and to invite the community to validate and extend our results, we are also releasing with this paper an open source software package: RidgeSketch. We designed this object-oriented package in Python for testing sketch-and-project methods and benchmarking ridge regression solvers. RidgeSketch is highly modular, and new sketching methods can easily be added as subclasses. We provide code snippets of our package in the appendix.
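The core iteration behind the approach described above can be illustrated in a few lines of NumPy. The following is a minimal, self-contained sketch of a sketch-and-project solver for the ridge normal equations (AᵀA + λI)w = Aᵀb, using a plain subsample sketch and heavy-ball momentum; the function name, default parameters, and stopping rule are illustrative assumptions and do not reproduce the actual RidgeSketch package API.

```python
import numpy as np

def ridge_sketch_project(A, b, lam, sketch_size=32, momentum=0.0,
                         max_iter=500, tol=1e-6, seed=0):
    """Illustrative sketch-and-project solver for (A^T A + lam*I) w = A^T b,
    with a subsample sketch and heavy-ball momentum.
    A minimal sketch of the method, not the RidgeSketch package itself."""
    rng = np.random.default_rng(seed)
    n_features = A.shape[1]
    H = A.T @ A + lam * np.eye(n_features)   # ridge system matrix (SPD)
    c = A.T @ b
    w = np.zeros(n_features)
    w_prev = w.copy()
    for _ in range(max_iter):
        residual = H @ w - c
        if np.linalg.norm(residual) <= tol * np.linalg.norm(c):
            break
        # Subsample sketch: S selects a random subset of coordinates,
        # so S^T H S is a principal submatrix of H.
        idx = rng.choice(n_features, size=min(sketch_size, n_features),
                         replace=False)
        Hs = H[np.ix_(idx, idx)]
        # Sketch-and-project step: w <- w - S (S^T H S)^{-1} S^T (H w - c)
        delta = np.linalg.solve(Hs, residual[idx])
        w_new = w.copy()
        w_new[idx] -= delta
        # Heavy-ball momentum: add beta * (w_k - w_{k-1})
        w_new += momentum * (w - w_prev)
        w_prev, w = w, w_new
    return w

# Example usage on synthetic data (illustrative only)
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 200))
b = rng.standard_normal(1000)
w = ridge_sketch_project(A, b, lam=1.0, sketch_size=20, momentum=0.5)
```

Swapping the subsample sketch for a Count sketch or a subsampled Hadamard transform amounts to replacing how the sketched matrix SᵀHS and sketched residual Sᵀ(Hw - c) are formed, which is the modularity the package exposes through sketching subclasses.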

MSC:

62-04 Software, source code, etc. for problems pertaining to statistics
62J07 Ridge regression; shrinkage estimators (Lasso)
65K10 Numerical optimization and variational techniques
65Y20 Complexity and performance of numerical algorithms
90C06 Large-scale problems in mathematical programming
90C90 Applications of mathematical programming
