
An adaptive factorized Nyström preconditioner for regularized kernel matrices. (English) Zbl 1543.65033

Summary: The spectrum of a kernel matrix depends significantly on the parameter values of the kernel function used to define it. This makes it challenging to design a preconditioner for a regularized kernel matrix that is robust across different parameter values. This paper proposes the adaptive factorized Nyström (AFN) preconditioner. The preconditioner is designed for the case where the rank \(k\) of the Nyström approximation is large, i.e., for kernel function parameters that lead to kernel matrices whose eigenvalues decay slowly. AFN deliberately chooses a well-conditioned submatrix to solve with and corrects the resulting Nyström approximation with a factorized sparse approximate matrix inverse. This makes AFN efficient for kernel matrices with large numerical ranks. AFN also adaptively chooses the size of this submatrix to balance accuracy and cost.
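For orientation, the sketch below illustrates only the landmark-based Nyström preconditioning idea that AFN builds on, not the AFN method itself: landmark points are picked uniformly at random (AFN instead selects a well-conditioned submatrix adaptively), a rank-\(k\) Nyström approximation of the kernel matrix is formed, and its regularized inverse is applied through the Woodbury identity as a preconditioner for conjugate gradients. The Gaussian kernel, the synthetic data, and all parameter values are assumptions chosen for illustration, and the factorized sparse approximate inverse correction that distinguishes AFN is omitted.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gaussian_kernel(X, Y, ell=1.0):
    # Gaussian (RBF) kernel matrix: K_ij = exp(-||x_i - y_j||^2 / (2 ell^2)).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

def nystrom_preconditioner(X, mu, k, ell=1.0, seed=0):
    # Rank-k Nystrom approximation K ~ K1 K11^{-1} K1^T built from k landmark columns;
    # (Nystrom + mu*I) is inverted via the Woodbury identity.  Landmarks are chosen
    # uniformly at random here, whereas AFN selects them adaptively.
    n = X.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=k, replace=False)
    K1 = gaussian_kernel(X, X[idx], ell)            # n-by-k block of K
    K11 = K1[idx] + 1e-10 * np.eye(k)               # k-by-k landmark block (jittered)
    L = np.linalg.cholesky(K11)
    C = np.linalg.solve(L, K1.T).T                  # C C^T equals the Nystrom approximation
    U, sv, _ = np.linalg.svd(C, full_matrices=False)
    s = sv ** 2                                     # Nystrom approximation = U diag(s) U^T

    def apply_inv(v):
        # (U diag(s) U^T + mu I)^{-1} v = U diag(1/(s+mu)) U^T v + (1/mu)(v - U U^T v)
        w = U.T @ v
        return U @ (w / (s + mu)) + (v - U @ w) / mu

    return LinearOperator((n, n), matvec=apply_inv)

# Usage: solve (K + mu*I) x = b with preconditioned CG on synthetic data.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
K = gaussian_kernel(X, X, ell=2.0)
mu = 1e-3
b = rng.standard_normal(500)
M = nystrom_preconditioner(X, mu, k=100, ell=2.0)
x, info = cg(K + mu * np.eye(500), b, M=M)
print("CG info:", info, "residual:", np.linalg.norm((K + mu * np.eye(500)) @ x - b))

When the eigenvalues of the kernel matrix decay slowly, the rank \(k\) needed for such a plain Nyström preconditioner to be effective grows quickly; this is the regime that the AFN construction, with its adaptively sized landmark block and sparse approximate inverse correction, is designed to handle.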

MSC:

65F08 Preconditioners for iterative methods
65F55 Numerical methods for low-rank matrix approximation; matrix compression
65F15 Numerical computation of eigenvalues and eigenvectors of matrices
62G08 Nonparametric regression and quantile regression
60G15 Gaussian processes

References:

[1] Alaoui, A. and Mahoney, M. W., Fast randomized kernel ridge regression with statistical guarantees, in Advances in Neural Information Processing Systems 28, Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R., eds., Curran Associates, 2015.
[2] Ambikasaran, S. and Darve, E., An \(\mathcal{O}(n \log n)\) fast direct solver for partial hierarchically semi-separable matrices: With application to radial basis function interpolation, J. Sci. Comput., 57 (2013), pp. 477-501. · Zbl 1292.65030
[3] Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D. W., and O’Neil, M., Fast direct methods for Gaussian processes, IEEE Trans. Pattern Anal. Mach. Intell., 38 (2016), pp. 252-265, doi:10.1109/TPAMI.2015.2448083.
[4] Belabbas, M.-A. and Wolfe, P. J., Spectral methods in machine learning and new strategies for very large datasets, Proc. Natl. Acad. Sci. USA, 106 (2009), pp. 369-374.
[5] Belkin, M., Approximation beats concentration? An approximation view on inference with smooth radial kernels, in Proceedings of the 31st Conference on Learning Theory, PMLR, 2018, pp. 1348-1361.
[6] Binois, M. and Wycoff, N., A survey on high-dimensional Gaussian process modeling with application to Bayesian optimization, ACM Trans. Evol. Learn. Optim., 2 (2022), pp. 1-26.
[7] Cai, D., Chow, E., Erlandson, L., Saad, Y., and Xi, Y., SMASH: Structured matrix approximation by separation and hierarchy, Numer. Linear Algebra Appl., 25 (2018), e2204. · Zbl 1513.65128
[8] Cai, D., Chow, E., and Xi, Y., Data-driven linear complexity low-rank approximation of general kernel matrices: A geometric approach, Numer. Linear Algebra Appl., 30 (2023), e2519. · Zbl 07790637
[9] Cai, D., Huang, H., Chow, E., and Xi, Y., Data-driven construction of hierarchical matrices with nested bases, SIAM J. Sci. Comput., 46 (2024), pp. S24-S50. · Zbl 07837042
[10] Cai, D., Nagy, J., and Xi, Y., Fast deterministic approximation of symmetric indefinite kernel matrices with high dimensional datasets, SIAM J. Matrix Anal. Appl., 43 (2022), pp. 1003-1028, doi:10.1137/21M1424627. · Zbl 1508.65040
[11] Chang, C.-C. and Lin, C.-J., LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST), 2 (2011), pp. 1-27.
[12] Chen, Y., Epperly, E. N., Tropp, J. A., and Webber, R. J., Randomly Pivoted Cholesky: Practical Approximation of a Kernel Matrix with Few Entry Evaluations, preprint, arXiv:2207.06503, 2022.
[13] Yu, C. D., Reiz, S., and Biros, G., Distributed O(n) linear solver for dense symmetric hierarchical semi-separable matrices, in Proceedings of the 2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC), IEEE, 2019, pp. 1-8.
[14] Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E., Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets, J. Amer. Statist. Assoc., 111 (2016), pp. 800-812.
[15] Domingos, P., A few useful things to know about machine learning, Commun. ACM, 55 (2012), pp. 78-87.
[16] Drineas, P. and Mahoney, M. W., On the Nyström method for approximating a Gram matrix for improved kernel-based learning, J. Mach. Learn. Res., 6 (2005), pp. 2153-2175. · Zbl 1222.68186
[17] Eldar, Y., Lindenbaum, M., Porat, M., and Zeevi, Y. Y., The farthest point strategy for progressive image sampling, IEEE Trans. Image Process., 6 (1997), pp. 1305-1315.
[18] Erlandson, L., Cai, D., Xi, Y., and Chow, E., Accelerating parallel hierarchical matrix-vector products via data-driven sampling, in Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, 2020, pp. 749-758, doi:10.1109/IPDPS47924.2020.00082.
[19] Fasshauer, G. E., Meshfree Approximation Methods with Matlab, World Scientific, 2007, doi:10.1142/6437. · Zbl 1123.65001
[20] Frangella, Z., Tropp, J. A., and Udell, M., Randomized Nyström preconditioning, SIAM J. Matrix Anal. Appl., 44 (2023), pp. 718-752, doi:10.1137/21M1466244. · Zbl 1517.65020
[21] Gittens, A. and Mahoney, M. W., Revisiting the Nyström method for improved large-scale machine learning, J. Mach. Learn. Res., 17 (2016), pp. 3977-4041. · Zbl 1367.68223
[22] Gonzalez, T. F., Clustering to minimize the maximum intercluster distance, Theoret. Comput. Sci., 38 (1985), pp. 293-306. · Zbl 0567.62048
[23] Greengard, L. and Strain, J., The fast Gauss transform, SIAM J. Sci. Statist. Comput., 12 (1991), pp. 79-94, doi:10.1137/0912004. · Zbl 0721.65089
[24] Guinness, J., Permutation and grouping methods for sharpening Gaussian process approximations, Technometrics, 60 (2018), pp. 415-429.
[25] Huang, H., Xing, X., and Chow, E., H2Pack: High-performance \(H^2\) matrix package for kernel matrices using the proxy point method, ACM Trans. Math. Softw., 47 (2020), pp. 1-29, doi:10.1145/3412850. · Zbl 07467963
[26] Katzfuss, M. and Guinness, J., A General Framework for Vecchia Approximations of Gaussian Processes, preprint, arXiv:1708.06302, 2019.
[27] Kelly, M., Longjohn, R., and Nottingham, K., The UCI Machine Learning Repository, https://archive.ics.uci.edu.
[28] Kolotilina, L. Y. and Yeremin, A. Yu., Factorized sparse approximate inverse preconditionings \({\rm I.}\) Theory, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 45-58, doi:10.1137/0614004. · Zbl 0767.65037
[29] Lazzaro, D. and Montefusco, L. B., Radial basis functions for the multivariate interpolation of large scattered data sets, J. Comput. Appl. Math., 140 (2002), pp. 521-536. · Zbl 1025.65015
[30] Li, C., Jegelka, S., and Sra, S., Fast DPP sampling for Nyström with application to kernel methods, in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, PMLR, 2016, pp. 2061-2070.
[31] March, W. B., Xiao, B., Tharakan, S., Yu, C. D., and Biros, G., Robust treecode approximation for kernel machines, in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 775-784.
[32] Martinsson, P.-G. and Tropp, J. A., Randomized numerical linear algebra: Foundations and algorithms, Acta Numer., 29 (2020), pp. 403-572, doi:10.1017/S0962492920000021. · Zbl 07674565
[33] Müller, S., Komplexität und Stabilität von kernbasierten Rekonstruktionsmethoden [Complexity and stability of kernel-based reconstruction methods], Ph.D. thesis, Niedersächsische Staats- und Universitätsbibliothek Göttingen, 2009.
[34] Musco, C. and Musco, C., Recursive sampling for the Nyström method, in Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, Curran Associates, 2017, pp. 3836-3848.
[35] Peyré, G. and Cohen, L. D., Geodesic remeshing using front propagation, Int. J. Comput. Vision, 69 (2006), pp. 145-156.
[36] Pourahmadi, M., Joint mean-covariance models with applications to longitudinal data: Unconstrained parameterisation, Biometrika, 86 (1999), pp. 677-690. · Zbl 0949.62066
[37] Rasmussen, C. and Williams, C., Gaussian Processes for Machine Learning, MIT Press, 2005.
[38] Rebrova, E., Chávez, G., Liu, Y., Ghysels, P., and Li, X. S., A study of clustering techniques and hierarchical matrix formats for kernel ridge regression, in Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2018, pp. 883-892.
[39] Schäfer, F., Katzfuss, M., and Owhadi, H., Sparse Cholesky factorization by Kullback-Leibler minimization, SIAM J. Sci. Comput., 43 (2021), pp. A2019-A2046, doi:10.1137/20M1336254. · Zbl 07364386
[40] Schäfer, F., Sullivan, T. J., and Owhadi, H., Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity, Multiscale Model. Simul., 19 (2021), pp. 688-730, doi:10.1137/19M129526X. · Zbl 1461.65067
[41] Schlömer, T., Heck, D., and Deussen, O., Farthest-point optimized point sets with maximized minimum distance, in Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, 2011, pp. 135-142.
[42] Shabat, G., Choshen, E., Or, D. B., and Carmel, N., Fast and accurate Gaussian kernel ridge regression using matrix decompositions for preconditioning, SIAM J. Matrix Anal. Appl., 42 (2021), pp. 1073-1095, doi:10.1137/20M1343993. · Zbl 1528.65019
[43] Si, S., Hsieh, C.-J., and Dhillon, I., Memory efficient kernel approximation, J. Mach. Learn. Res., 18 (2017), pp. 3-32. · Zbl 1433.68399
[44] Stein, M. L., The screening effect in kriging, Ann. Statist., 30 (2002), pp. 298-323. · Zbl 1012.62102
[45] Stein, M. L., 2010 Rietz lecture: When does the screening effect hold?, Ann. Statist., 39 (2011), pp. 2795-2819. · Zbl 1246.60061
[46] Stein, M. L., When does the screening effect not hold?, Spat. Stat., 11 (2015), pp. 65-80.
[47] Vecchia, A. V., Estimation and model identification for continuous spatial processes, J. R. Stat. Soc. Ser. B Methodol., 50 (1988), pp. 297-312.
[48] Wendland, H., Scattered Data Approximation, Cambridge University Press, 2004.
[49] Wendland, H., Computational aspects of radial basis function approximation, Stud. Comput. Math., 12 (2006), pp. 231-256. · Zbl 1205.65044
[50] Wenger, J., Pleiss, G., Hennig, P., Cunningham, J., and Gardner, J., Preconditioning for scalable Gaussian process hyperparameter optimization, in Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, 2022, pp. 23751-23780.
[51] White, D. J., The maximal-dispersion problem, IMA J. Manag. Math., 3 (1991), pp. 131-140. · Zbl 0741.90042
[52] Williams, C. K. and Seeger, M., Using the Nyström method to speed up kernel machines, in Advances in Neural Information Processing Systems 13, MIT Press, 2001, pp. 682-688.
[53] Wu, Y., Packing, Covering, and Consequences on Minimax Risk, Yale University, 2016.
[54] Yang, C., Duraiswami, R., and Davis, L. S., Efficient kernel machines using the improved fast Gauss transform, in Advances in Neural Information Processing Systems 17, Saul, L., Weiss, Y., and Bottou, L., eds., MIT Press, 2004, https://proceedings.neurips.cc/paper/2004/file/85353d3b2f39b9c9b5ee3576578c04b7-Paper.pdf.
[55] Zhang, K. and Kwok, J. T., Clustered Nyström method for large scale manifold learning and dimension reduction, IEEE Trans. Neural Netw., 21 (2010), pp. 1576-1587.