
Learning in high-dimensional feature spaces using ANOVA-based fast matrix-vector multiplication

  • *Corresponding author: Theresa Wagner
Abstract

Kernel matrices are crucial in many learning tasks such as support vector machines or kernel ridge regression. The kernel matrix is typically dense and large-scale, and depending on the dimension of the feature space, even computing all of its entries in reasonable time becomes challenging. For such dense matrices, the cost of a matrix-vector product scales quadratically with the number of data points $ N $ unless customized methods are applied. We propose the use of an ANOVA kernel, for which we construct several kernels based on lower-dimensional feature spaces and provide fast algorithms realizing the matrix-vector products. We employ the non-equispaced fast Fourier transform (NFFT), which is of linear complexity for fixed accuracy. Based on a feature grouping approach, we then show how the fast matrix-vector products can be embedded into a learning method, choosing kernel ridge regression and the conjugate gradient solver. We illustrate the performance of our approach on several data sets.

    Mathematics Subject Classification: Primary: 65F45, 65T50; Secondary: 62J10.

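The last step described in the abstract, embedding the fast matrix-vector product into kernel ridge regression with the conjugate gradient (CG) solver, amounts to solving $ (K + \lambda I) \alpha = y $ while touching $ K $ only through products $ Kv $. The snippet below is a minimal sketch of that pattern, not the authors' implementation: the ANOVA/NFFT-accelerated kernel product is stood in for by a plain dense Gaussian kernel matvec, and the data, kernel width and regularization parameter are placeholder assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def krr_fit_cg(kernel_matvec, y, lam):
    """Solve (K + lam*I) alpha = y by conjugate gradients, accessing K only
    through the black-box product kernel_matvec(v) = K @ v."""
    N = y.shape[0]
    A = LinearOperator((N, N), matvec=lambda v: kernel_matvec(v) + lam * v)
    alpha, info = cg(A, y, maxiter=500)
    if info != 0:
        raise RuntimeError(f"CG did not converge (info={info})")
    return alpha

# Placeholder for the fast ANOVA/NFFT product: a dense Gaussian kernel matvec,
# which costs O(N^2) and is exactly what the NFFT-based approach avoids.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = rng.choice([-1.0, 1.0], size=500)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 100.0 ** 2)

alpha = krr_fit_cg(lambda v: K @ v, y, lam=1.0)
labels = np.sign(K @ alpha)   # KRR output thresholded to class labels
```

Swapping the dense matvec for a genuinely fast kernel-vector product is the only change needed for fitting to scale to the sample sizes reported in the tables below.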
  • Figure 1.  Kernel-vector multiplication with and without the NFFT approach on balanced samples of various sizes of synthetic data, with $ \sigma = 100 $ and $ n_{\text{runs}} = 100 $

    Figure 2.  Embedding 2-dimensional data into a 3-dimensional space via an embedding map

    Figure 3.  The univariate kernel function $ \kappa $, here a Gaussian, is truncated and a polynomial $ p $ is constructed such that we end up with a periodic function $ \tilde\kappa $ that is smooth up to a certain degree. In higher-dimensional settings, the kernel function is truncated at $ \|\boldsymbol{r}\| = L $, i.e., radially. The resulting periodic function then has period $ h $ with respect to each dimension. A one-dimensional sketch of this construction is given after the figure list below.

    Figure 4.  Accuracy and runtime of GridSearch on various balanced samples of different size
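The one-dimensional construction from the Figure 3 caption can be prototyped in a few lines. The sketch below is an illustration under assumptions: a Gaussian $ \kappa(r) = \exp(-r^2/\sigma^2) $, arbitrary values for $ \sigma $, the truncation radius $ L $ and the period $ h $, and SciPy's Hermite interpolation (`BPoly.from_derivatives`) to build a polynomial $ p $ matching $ \kappa $ and its first derivative at $ r = L $, so that the periodized $ \tilde\kappa $ is continuously differentiable; the value imposed at $ r = h/2 $ is an arbitrary choice.

```python
import numpy as np
from scipy.interpolate import BPoly

sigma, L, h = 1.0, 2.0, 6.0   # assumed kernel width, truncation radius, period

kappa = lambda r: np.exp(-(r ** 2) / sigma ** 2)       # univariate Gaussian kernel
dkappa = lambda r: -2.0 * r / sigma ** 2 * kappa(r)    # its first derivative

# Hermite (two-point Taylor) polynomial p on [L, h/2]: match value and slope of
# kappa at r = L and impose a flat value at r = h/2, so that the h-periodic
# extension of kappa_tilde is C^1.
p = BPoly.from_derivatives([L, h / 2.0],
                           [[kappa(L), dkappa(L)],   # value and slope at r = L
                            [kappa(L), 0.0]])        # assumed value, zero slope at r = h/2

def kappa_tilde(r):
    """Truncated, periodized kernel: kappa for |r| <= L, the polynomial p on (L, h/2]."""
    r = np.abs(np.asarray(r, dtype=float)) % h
    r = np.minimum(r, h - r)                          # fold into [0, h/2]
    return np.where(r <= L, kappa(r), p(r))

print(kappa_tilde([0.0, L, 0.5 * h, h]))              # smooth across r = L and the period
```

In the higher-dimensional setting described in the caption, the same idea is applied radially, i.e. to $ \|\boldsymbol{r}\| $ instead of $ |r| $.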

    Table 1.  Benchmark data sets for numerical experiments

    | data set | $ N $ | $ d $ | number of data points in under-represented class | largest balanced sample possible |
    |---|---|---|---|---|
    | Telescope | 19020 | 10 | 6688 | 13376 |
    | HIGGS | 11000000 | 28 | 5170877 | 10341754 |
    | SUSY | 5000000 | 18 | 2287827 | 4575654 |
    | YearPredictionMSD | 515345 | 90 | 206945 | 413890 |
    | cod-rna | 488565 | 8 | 162855 | 2098847 |
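The experiments are run on balanced samples, i.e. subsets containing equally many points from each class, so the size of a balanced sample is bounded by twice the size of the under-represented class (e.g. $ 2 \cdot 6688 = 13376 $ for the Telescope data). Below is a minimal sketch of drawing such a sample; it is an illustration, not the authors' preprocessing code, and the function name and fixed seed are placeholders.

```python
import numpy as np

def balanced_sample(X, y, size, seed=0):
    """Draw `size` points with equal counts per class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    per_class = size // len(classes)
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), per_class, replace=False)
                          for c in classes])
    rng.shuffle(idx)
    return X[idx], y[idx]
```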

    Table 2.  Parameter-grids for the considered classifiers

    | classifier | kernel coefficient | regularization parameter |
    |---|---|---|
    | NFFTKernelRidge | $ \sigma: [10^{-3}, 10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}, 10^{3}] $ | $ \lambda: [1, 10^{1}, 10^{2}, 10^{3}] $ |
    | sklearn KRR | $ \gamma: [10^{6}, 10^{4}, 10^{2}, 1, 10^{-2}, 10^{-4}, 10^{-6}] $ | $ \alpha: [1, 10^{1}, 10^{2}, 10^{3}] $ |
    | sklearn SVC | $ \gamma: [10^{6}, 10^{4}, 10^{2}, 1, 10^{-2}, 10^{-4}, 10^{-6}] $ | $ C: [1, 10^{-1}, 10^{-2}, 10^{-3}] $ |
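For the two scikit-learn baselines, the grids in Table 2 translate directly into a grid search. The snippet below is a sketch under assumptions: the cross-validation setting, the RBF kernel choice and the sign-thresholding used to score the KernelRidge regressor's continuous predictions as class labels are not taken from the paper; the NFFTKernelRidge grid over $ \sigma $ and $ \lambda $ would be searched analogously with the authors' estimator.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

# Parameter grids from Table 2 for the sklearn baselines.
krr_grid = {"gamma": [1e6, 1e4, 1e2, 1, 1e-2, 1e-4, 1e-6],
            "alpha": [1, 1e1, 1e2, 1e3]}
svc_grid = {"gamma": [1e6, 1e4, 1e2, 1, 1e-2, 1e-4, 1e-6],
            "C":     [1, 1e-1, 1e-2, 1e-3]}

# KernelRidge is a regressor; to report classification accuracy on labels in
# {-1, +1}, its continuous predictions are thresholded by sign (an assumption).
sign_accuracy = make_scorer(lambda y_true, y_pred: accuracy_score(y_true, np.sign(y_pred)))

krr_search = GridSearchCV(KernelRidge(kernel="rbf"), krr_grid, scoring=sign_accuracy, cv=5)
svc_search = GridSearchCV(SVC(kernel="rbf"), svc_grid, scoring="accuracy", cv=5)

# Usage, with labels in {-1, +1}:
# krr_search.fit(X_train, y_train); print(krr_search.best_params_, krr_search.best_score_)
# svc_search.fit(X_train, y_train); print(svc_search.best_params_, svc_search.best_score_)
```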

    Table 3.  Telescope Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 69.6 / 74.2 / 57.0 | 0.1091 / 0.0026 / 0.0008 | 0.0121 / 0.0011 / 0.0003 | 0.1212 / 0.0036 / 0.0011 |
    | 500 | 80.4 / 78.1 / 76.6 | 0.1600 / 0.0075 / 0.0036 | 0.0168 / 0.0029 / 0.0021 | 0.1768 / 0.0104 / 0.0056 |
    | 1000 | 81.3 / 79.7 / 78.3 | 0.2104 / 0.0121 / 0.0125 | 0.0225 / 0.0089 / 0.0082 | 0.2328 / 0.0210 / 0.0207 |
    | 5000* | 83.2 / 82.0 / 81.4 | 0.4207 / 0.2086 / 0.1694 | 0.0376 / 0.1138 / 0.1300 | 0.4583 / 0.3224 / 0.2994 |
    | 10000* | 83.9 / 83.8 / 83.7 | 0.8128 / 0.8929 / 0.7027 | 0.0675 / 0.3991 / 0.5210 | 0.8803 / 1.2929 / 1.2237 |
    | 13376* | 83.9 / 83.8 / 83.6 | 1.2122 / 1.7200 / 1.2932 | 0.0978 / 0.7063 / 0.8661 | 1.3100 / 2.4264 / 2.1594 |

    Table 4.  HIGGS Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 55.8 / 54.4 / 40.4 | 0.2624 / 0.0020 / 0.0011 | 0.0344 / 0.0008 / 0.0006 | 0.2968 / 0.0028 / 0.0017 |
    | 500 | 58.8 / 55.7 / 54.3 | 0.3561 / 0.0048 / 0.0031 | 0.0392 / 0.0020 / 0.0020 | 0.3953 / 0.0068 / 0.0051 |
    | 1000 | 62.1 / 60.5 / 59.3 | 0.4325 / 0.0133 / 0.0107 | 0.0450 / 0.0040 / 0.0080 | 0.4776 / 0.0173 / 0.0187 |
    | 5000* | 63.3 / 62.9 / 62.0 | 1.0610 / 0.1914 / 0.2557 | 0.1056 / 0.1056 / 0.2084 | 1.1666 / 0.2970 / 0.4641 |
    | 10000* | 66.7 / 65.5 / 65.0 | 2.1512 / 0.8663 / 1.0339 | 0.1992 / 0.3673 / 0.7972 | 2.3504 / 1.2336 / 1.8312 |
    | 25000* | 64.8 / 67.0 / 66.5 | 6.3951 / 10.4802 / 9.4575 | 0.4987 / 7.5416 / 4.3869 | 6.8938 / 18.0218 / 13.8443 |
    | 50000* | 65.5 / - / 67.3 | 14.4947 / - / 56.6365 | 0.9397 / - / 20.4737 | 15.4344 / - / 77.1102 |
    | 100000* | 67.1 / - / 67.9 | 34.6103 / - / 197.9109 | 1.8479 / - / 84.8191 | 36.4582 / - / 282.7300 |
    | 200000* | 68.6 / - / 68.5 | 105.0926 / - / 991.7101 | 4.1666 / - / 337.1600 | 109.2592 / - / 1328.8701 |

    Table 5.  SUSY Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 62.2 / 62.0 / 55.6 | 0.1218 / 0.0009 / 0.0005 | 0.0159 / 0.0004 / 0.0001 | 0.1377 / 0.0014 / 0.0006 |
    | 500 | 74.3 / 74.2 / 71.8 | 0.1754 / 0.0044 / 0.0028 | 0.0216 / 0.0029 / 0.0016 | 0.1969 / 0.0073 / 0.0044 |
    | 1000 | 76.8 / 75.5 / 73.6 | 0.2427 / 0.0114 / 0.0091 | 0.0307 / 0.0039 / 0.0058 | 0.2734 / 0.0153 / 0.0149 |
    | 5000* | 77.6 / 77.3 / 76.9 | 0.8570 / 0.1879 / 0.2033 | 0.0775 / 0.0939 / 0.1614 | 0.9345 / 0.2817 / 0.3647 |
    | 10000* | 78.4 / 78.4 / 78.0 | 1.5382 / 1.0586 / 0.8214 | 0.1356 / 0.3487 / 0.6078 | 1.6737 / 1.4073 / 1.4291 |
    | 50000* | 78.9 / 79.4 / 79.2 | 9.1113 / 42.7143 / 31.1025 | 0.6305 / 9.0420 / 15.5091 | 9.7417 / 51.7563 / 46.6116 |
    | 100000* | 78.7 / - / 79.2 | 23.9190 / - / 148.1214 | 1.2657 / - / 62.3419 | 25.1847 / - / 210.4633 |
    | 200000* | 78.9 / - / 79.3 | 56.7895 / - / 732.4476 | 2.3603 / - / 255.1349 | 59.1497 / - / 987.5824 |

    Table 6.  YearPredictionMSD Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 46.0 / 50.8 / 42.6 | 0.6578 / 0.0011 / 0.0007 | 0.0899 / 0.0005 / 0.0006 | 0.7476 / 0.0016 / 0.0013 |
    | 500 | 61.4 / 64.2 / 64.2 | 0.8981 / 0.0051 / 0.0064 | 0.1122 / 0.0022 / 0.0054 | 1.0103 / 0.0074 / 0.0118 |
    | 1000 | 66.0 / 69.0 / 68.2 | 1.1817 / 0.0107 / 0.0218 | 0.1405 / 0.0066 / 0.0196 | 1.3221 / 0.0173 / 0.0413 |
    | 5000* | 70.3 / 71.7 / 71.7 | 3.2502 / 0.1861 / 0.5164 | 0.3882 / 0.0966 / 0.4705 | 3.6384 / 0.2827 / 0.9869 |
    | 10000* | 72.3 / 73.4 / 73.1 | 6.7636 / 0.8400 / 2.1104 | 0.7016 / 0.3625 / 1.8575 | 7.4651 / 1.2025 / 3.9679 |
    | 50000* | 73.4 / 74.8 / 74.8 | 53.3774 / 43.2971 / 122.4035 | 3.9307 / 9.1279 / 50.3554 | 57.3081 / 52.4250 / 172.7589 |
    | 100000* | 74.2 / - / 75.8 | 102.0414 / - / 426.8239 | 6.1424 / - / 201.7097 | 108.1838 / - / 628.5336 |
    | 200000* | 74.4 / - / 76.4 | 245.1812 / - / 2006.6694 | 12.3315 / - / 785.3145 | 257.5127 / - / 2791.9839 |
    | 413890* | 74.3 / - / 76.7 | 669.4028 / - / 7696.3063 | 24.4887 / - / 3335.6310 | 693.8915 / - / 11031.9373 |

    Table 7.  cod-rna Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 78.0 / 77.2 / 72.2 | 0.0459 / 0.0010 / 0.0005 | 0.0066 / 0.0006 / 0.0002 | 0.0524 / 0.0016 / 0.0007 |
    | 500 | 89.2 / 92.7 / 89.6 | 0.0671 / 0.0039 / 0.0021 | 0.0074 / 0.0023 / 0.0012 | 0.0745 / 0.0063 / 0.0034 |
    | 1000 | 92.5 / 94.7 / 93.7 | 0.0882 / 0.0110 / 0.0071 | 0.0100 / 0.0037 / 0.0044 | 0.0982 / 0.0147 / 0.0116 |
    | 5000* | 95.4 / 95.7 / 95.6 | 0.3046 / 0.1788 / 0.1564 | 0.0274 / 0.0952 / 0.1031 | 0.3320 / 0.2740 / 0.2595 |
    | 10000* | 95.7 / 95.8 / 95.9 | 0.5864 / 0.8309 / 0.6044 | 0.0472 / 0.3723 / 0.4101 | 0.6336 / 1.2032 / 1.0145 |
    | 50000* | 95.7 / 95.9 / 96.0 | 3.9697 / 43.4380 / 17.9735 | 0.2376 / 9.5084 / 9.8435 | 4.2073 / 52.9464 / 27.8170 |
    | 100000* | 95.9 / - / 96.2 | 10.1743 / - / 73.8995 | 0.4686 / - / 39.0615 | 10.6430 / - / 112.9610 |
    | 209884* | 96.0 / - / 96.5 | 28.9638 / - / 422.5391 | 1.0058 / - / 170.0743 | 29.9696 / - / 592.6134 |
