
Two-level preconditioning for ridge regression. (English) Zbl 07396251

Summary: Solving linear systems is often the computational bottleneck in real-life problems. Iterative solvers are frequently the only option, either because direct algorithms are too expensive or because the system matrix is not explicitly available. Here, we develop a two-level preconditioner for regularized least-squares linear systems involving a feature or data matrix. Variants of this linear system appear in machine learning applications such as ridge regression, logistic regression, support vector machines, and Bayesian regression. We use clustering algorithms to create a coarser level that preserves the principal components of the covariance or Gram matrix. This coarser level approximates the dominant eigenvectors and is used to build a subspace preconditioner that accelerates the Conjugate Gradient method. We observe speed-ups on both artificial and real-life data.
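
The summary omits implementation details, so the following is only a rough Python sketch of the general idea, assuming NumPy, SciPy, and scikit-learn: solve the ridge normal equations (AᵀA + λI)x = Aᵀb with preconditioned CG, where the preconditioner solves exactly on a coarse subspace W built by clustering and aggregating the features and acts as the identity on the orthogonal complement. The k-means aggregation, the QR orthonormalization, and the additive two-level combination are illustrative choices, not the authors' exact construction.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    m, n, k = 1000, 300, 15            # samples, features, coarse-space size (illustrative)
    A = rng.standard_normal((m, n))    # feature/data matrix
    lam = 1e-2                         # ridge parameter (assumed given)
    b = rng.standard_normal(m)

    # Matrix-free application of K = A^T A + lam * I (ridge normal equations).
    def apply_K(x):
        return A.T @ (A @ x) + lam * x

    K = LinearOperator((n, n), matvec=apply_K)

    # Coarse level: cluster the features (columns of A), aggregate each
    # cluster into one coarse basis vector, and orthonormalize with QR.
    labels = KMeans(n_clusters=k, n_init=5, random_state=0).fit_predict(A.T)
    P = np.zeros((n, k))
    P[np.arange(n), labels] = 1.0
    W, _ = np.linalg.qr(P)             # n x k orthonormal coarse basis

    # Galerkin coarse operator K_c = W^T K W (small and dense).
    Kc = W.T @ np.column_stack([apply_K(W[:, j]) for j in range(k)])

    def apply_M(r):
        # Additive two-level preconditioner: exact coarse solve on span(W),
        # identity on the orthogonal complement. This operator is SPD,
        # so plain preconditioned CG applies.
        rc = W.T @ r
        return W @ np.linalg.solve(Kc, rc) + (r - W @ rc)

    M = LinearOperator((n, n), matvec=apply_M)

    x, info = cg(K, A.T @ b, M=M)      # info == 0 signals convergence

If span(W) approximates the dominant eigenvectors of AᵀA, the coarse solve handles the largest, most isolated eigenvalues exactly, and CG only has to resolve the remaining, better-clustered part of the spectrum; this is where the iteration-count reduction comes from.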

MSC:

65F08 Preconditioners for iterative methods
65F22 Ill-posedness and regularization problems in numerical linear algebra
68T05 Learning and adaptive systems in artificial intelligence
68T10 Pattern recognition, speech recognition

References:

[1] Demmel J. Applied numerical linear algebra. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1997. · Zbl 0879.65017
[2] Trefethen LN, Bau D III. Numerical linear algebra. Vol 50. Philadelphia, PA: SIAM; 1997. · Zbl 0874.65013
[3] van der Sluis A. Condition numbers and equilibration of matrices. Numer Math. 1969;14(1):14-23. https://doi.org/10.1007/BF02165096. · Zbl 0182.48906 · doi:10.1007/BF02165096
[4] Kershaw DS. The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations. J Comput Phys. 1978;26(1):43-65. http://www.sciencedirect.com/science/article/pii/0021999178900980. · Zbl 0367.65018
[5] Lin CJ, Moré JJ. Incomplete Cholesky factorizations with limited memory. SIAM J Sci Comput. 1999;21(1):24-45. https://doi.org/10.1137/S1064827597327334. · Zbl 0941.65033 · doi:10.1137/S1064827597327334
[6] Saad Y. ILUT: a dual threshold incomplete LU factorization. Numer Linear Algebra Appl. 1994;1(4):387-402. https://doi.org/10.1002/nla.1680010405. · Zbl 0838.65026 · doi:10.1002/nla.1680010405
[7] Benzi M, Meyer CD, Tůma M. A sparse approximate inverse preconditioner for the conjugate gradient method. SIAM J Sci Comput. 1996;17(5):1135-49. https://doi.org/10.1137/S1064827594271421. · Zbl 0856.65019 · doi:10.1137/S1064827594271421
[8] Briggs W, Henson V, McCormick S. A multigrid tutorial. 2nd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2000. · Zbl 0958.65128
[9] Ruge JW, Stüben K. Algebraic multigrid. Multigrid methods. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1987. p. 73-130. https://doi.org/10.1137/1.9781611971057. · doi:10.1137/1.9781611971057
[10] Stüben K. A review of algebraic multigrid. In: Sloan D, Süli E, Vandewalle S, editors. Partial differential equations. Numerical Analysis 2000, Vol. 7. Amsterdam, Netherlands: Elsevier; 2001. p. 281-309. · Zbl 0979.65111
[11] Vassilevski PS. Multilevel block factorization preconditioners: matrix-based analysis and algorithms for solving finite element equations. New York, NY: Springer-Verlag; 2008. · Zbl 1170.65001
[12] Brezina M, Falgout R, MacLachlan S, Manteuffel T, McCormick S, Ruge J. Adaptive smoothed aggregation (αSA) multigrid. SIAM Rev. 2005;47(2):317-46. https://doi.org/10.1137/050626272. · Zbl 1075.65042 · doi:10.1137/050626272
[13] Brezina M, Falgout R, MacLachlan S, Manteuffel T, McCormick S, Ruge J. Adaptive algebraic multigrid. SIAM J Sci Comput. 2006;27(4):1261-86. https://doi.org/10.1137/040614402. · Zbl 1100.65025 · doi:10.1137/040614402
[14] MacLachlan S, Manteuffel T, McCormick S. Adaptive reduction-based AMG. Numer Linear Algebra Appl. 2006;13(8):599-620. https://doi.org/10.1002/nla.486. · Zbl 1174.65549 · doi:10.1002/nla.486
[15] Brandt A, Brannick J, Kahl K, Livshits I. Bootstrap AMG. SIAM J Sci Comput. 2011;33(2):612-32. https://doi.org/10.1137/090752973. · Zbl 1227.65120 · doi:10.1137/090752973
[16] D'Ambra P, Vassilevski PS. Improving solve time of aggregation-based adaptive AMG. Numer Linear Algebra Appl. 2019;26(6):e2269. https://doi.org/10.1002/nla.2269. · Zbl 1463.65401 · doi:10.1002/nla.2269
[17] Reichel L, Shyshkov A. Cascadic multilevel methods for ill-posed problems. J Comput Appl Math. 2010;233(5):1314-25. Special issue dedicated to William B. Gragg on the occasion of his 70th birthday. http://www.sciencedirect.com/science/article/pii/S0377042709002192. · Zbl 1186.65069
[18] Jacobsen M, Hansen PC, Saunders MA. Subspace preconditioned LSQR for discrete ill-posed problems. BIT Numer Math. 2003;43(5):975-89. https://doi.org/10.1023/B:BITN.0000014547.88978.05. · Zbl 1046.65030 · doi:10.1023/B:BITN.0000014547.88978.05
[19] Paige CC, Saunders MA. LSQR: an algorithm for sparse linear equations and sparse least squares. ACM Trans Math Softw. 1982;8(1):43-71. · Zbl 0478.65016
[20] Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. 1951;22(3):400-7. http://www.jstor.org/stable/2236626. · Zbl 0054.05901
[21] Bottou L. Large-scale machine learning with stochastic gradient descent. In: Lechevallier Y, Saporta G, editors. Proceedings of COMPSTAT'2010. Heidelberg, Germany: Physica-Verlag HD; 2010. p. 177-86. · Zbl 1436.68293
[22] Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in neural information processing systems 26. New York, NY: Curran Associates, Inc.; 2013. p. 315-323. http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-desce
[23] Zouzias A, Freris NM. Randomized extended Kaczmarz for solving least squares. SIAM J Matrix Anal Appl. 2013;34(2):773-93. https://doi.org/10.1137/120889897. · Zbl 1273.65053 · doi:10.1137/120889897
[24] Gower RM, Richtárik P. Randomized iterative methods for linear systems. SIAM J Matrix Anal Appl. 2015;36(4):1660-90. https://doi.org/10.1137/15M1025487. · Zbl 1342.65110 · doi:10.1137/15M1025487
[25] Bishop CM. Pattern recognition and machine learning. New York, NY: Springer-Verlag; 2006. · Zbl 1107.68072
[26] Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian data analysis. Boca Raton, FL: Chapman & Hall/CRC Press; 2013.
[27] Simm J, Arany A, Zakeri P, et al. Macau: scalable Bayesian factorization with high-dimensional side information using MCMC. Paper presented at: Proceedings of the 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP). Tokyo, Japan; 2017. p. 1-6.
[28] Salakhutdinov R, Mnih A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. Proceedings of the 25th International Conference on Machine Learning. ICML '08. New York, NY: ACM; 2008. p. 880-887. https://doi.org/10.1145/1390156.1390267
[29] Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied logistic regression. Vol 398. Hoboken, NJ: John Wiley & Sons; 2013. · Zbl 1276.62050
[30] Vapnik VN. An overview of statistical learning theory. IEEE Trans Neural Netw. 1999;10(5):988-99.
[31] Lin CJ, Weng RC, Keerthi SS. Trust region Newton method for logistic regression. J Mach Learn Res. 2008;9(Apr):627-50. · Zbl 1225.68195
[32] Cai D, He X, Han J. SRDA: an efficient algorithm for large-scale discriminant analysis. IEEE Trans Knowl Data Eng. 2008;20(1):1-12.
[33] Saad Y. Iterative methods for sparse linear systems. 2nd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2003. https://doi.org/10.1137/1.9780898718003. · Zbl 1031.65046 · doi:10.1137/1.9780898718003
[34] Douglas C, Douglas J Jr. A unified convergence theory for abstract multigrid or multilevel algorithms, serial and parallel. SIAM J Numer Anal. 1993;30(1):136-58. · Zbl 0772.65020
[35] Hartigan JA. Clustering algorithms. New York, NY: John Wiley & Sons, Inc.; 1975. · Zbl 0372.62040
[36] Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2007. p. 1027-1035. · Zbl 1302.68273
[37] Girolami M. Orthogonal series density estimation and the kernel eigenvalue problem. Neural Comput. 2002;14(3):669-88. https://doi.org/10.1162/089976602317250942. · Zbl 0986.62021 · doi:10.1162/089976602317250942
[38] Rényi A. On measures of entropy and information. Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1, Contributions to the Theory of Statistics. Berkeley; 1961. p. 547-561. · Zbl 0106.33001
[39] De Brabanter K, De Brabanter J, Suykens JAK, De Moor B. Optimized fixed-size kernel models for large data sets. Comput Stat Data Anal. 2010;54(6):1484-504. http://www.sciencedirect.com/science/article/pii/S0167947310000393. · Zbl 1284.62369
[40] Schaeffer SE. Graph clustering. Comput Sci Rev. 2007;1(1):27-64. · Zbl 1302.68237
[41] Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exper. 2008;2008(10):P10008. · Zbl 1459.91130
[42] Lancichinetti A, Fortunato S. Community detection algorithms: a comparative analysis. Phys Rev E. 2009;80(5):056117.
[43] D'Ambra P, Cutillo L, Vassilevski PS. Bootstrap AMG for spectral clustering. Comput Math Methods. 2019;1(2):e1020. https://doi.org/10.1002/cmm4.1020. · doi:10.1002/cmm4.1020
[44] Muja M, Lowe DG. Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans Pattern Anal Mach Intell. 2014;36(11):2227-40.
[45] Brannick J, Chen Y, Zikatanov L. An algebraic multilevel method for anisotropic elliptic equations based on subgraph matching. Numer Linear Algebra Appl. 2012;19(2):279-95. · Zbl 1274.65314
[46] Brannick J, Chen Y, Kraus J, Zikatanov L. Algebraic multilevel preconditioners for the graph Laplacian based on matching in graphs. SIAM J Numer Anal. 2013;51(3):1805-27. · Zbl 1281.65152
[47] D'Ambra P, Filippone S, Vassilevski PS. BootCMatch: a software package for bootstrap AMG based on graph weighted matching. ACM Trans Math Softw. 2018;44(4):39. https://doi.org/10.1145/3190647. · Zbl 1484.65059 · doi:10.1145/3190647
[48] Manteuffel T, McCormick S, Park M, Ruge J. Operator-based interpolation for bootstrap algebraic multigrid. Numer Linear Algebra Appl. 2010;17(2-3):519-37. · Zbl 1240.65115
[49] Arnoldi WE. The principle of minimized iterations in the solution of the matrix eigenvalue problem. Q Appl Math. 1951;9(1):17-29. · Zbl 0042.12801
[50] Brandt A. Multiscale scientific computation: review 2001. In: Barth TJ, Chan T, Haimes R, editors. Multiscale and multiresolution methods. Berlin/Heidelberg, Germany: Springer; 2002. p. 3-95. · Zbl 0989.65140
[51] Saad Y. A flexible inner-outer preconditioned GMRES algorithm. SIAM J Sci Comput. 1993;14(2):461-9. https://doi.org/10.1137/0914028. · Zbl 0780.65022 · doi:10.1137/0914028
[52] Notay Y. Flexible conjugate gradients. SIAM J Sci Comput. 2000;22(4):1444-60. https://doi.org/10.1137/S1064827599362314. · Zbl 0980.65030 · doi:10.1137/S1064827599362314
[53] Notay Y, Vassilevski PS. Recursive Krylov-based multigrid cycles. Numer Linear Algebra Appl. 2008;15(5):473-87. https://doi.org/10.1002/nla.542. · Zbl 1212.65132 · doi:10.1002/nla.542
[54] Davis TA, Hu Y. The University of Florida sparse matrix collection. ACM Trans Math Softw. 2011;38(1):1. · Zbl 1365.65123
[55] Lichman M. UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Sciences; 2013. http://archive.ics.uci.edu/ml.
[56] MATLAB. Statistics and machine learning toolbox. Natick, MA: The MathWorks; 2018.
[57] Hickey JM, Gorjanc G, Hearne S, Huang BE. AlphaMPSim: flexible simulation of multi-parent crosses. Bioinformatics. 2014;30(18):2686-8.
[58] Lewis DD, Yang Y, Rose TG, Li F. RCV1: a new benchmark collection for text categorization research. J Mach Learn Res. 2004;5:361-97. http://dl.acm.org/citation.cfm?id=1005332.1005345.
[59] Keerthi SS, DeCoste D. A modified finite Newton method for fast solution of large scale linear SVMs. J Mach Learn Res. 2005;6(Mar):341-61. · Zbl 1222.68231
[60] Rübel O, Greiner A, Cholia S, Louie K, Bethel EW, Northen TR. OpenMSI: a high-performance web-based platform for mass spectrometry imaging. Anal Chem. 2013;85(21):10354-61. https://doi.org/10.1021/ac402540a. · doi:10.1021/ac402540a
[61] Kogan S, Levin D, Routledge BR, Sagi JS, Smith NA. Predicting risk from financial reports with regression. Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Boulder, CO: Association for Computational Linguistics; 2009. p. 272-280. https://www.aclweb.org/anthology/N09-1031.
[62] Bento AP, Gaulton A, Hersey A, et al. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 2014;42(D1):D1083-90.
[63] Anderson E, Bai Z, Bischof C, et al. LAPACK users' guide. 3rd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1999. · Zbl 0934.65030
[64] Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing. Paper presented at: Proceedings of the 25th International Conference on Very Large Data Bases. VLDB '99. San Francisco, CA: Morgan Kaufmann Publishers Inc.; 1999. p. 518-529. http://dl.acm.org/citation.cfm?id=645925.671516.
[65] Andoni A, Razenshteyn I. Optimal data-dependent hashing for approximate near neighbors. Proceedings of the 47th Annual ACM Symposium on Theory of Computing. STOC '15. New York, NY: ACM; 2015. p. 793-801. https://doi.org/10.1145/2746539.2746553. · Zbl 1321.68212 · doi:10.1145/2746539.2746553
[66] SIAM Text Mining Competition. 2007. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#siam-competition2007