
Learning in high-dimensional feature spaces using ANOVA-based fast matrix-vector multiplication

  • *Corresponding author: Theresa Wagner
Abstract

Kernel matrices are crucial in many learning tasks such as support vector machines or kernel ridge regression. The kernel matrix is typically dense and large-scale, and depending on the dimension of the feature space, even computing all of its entries in reasonable time becomes challenging. For such dense matrices, the cost of a matrix-vector product scales quadratically with the number of data points $ N $ unless customized methods are applied. We propose the use of an ANOVA kernel, for which we construct several kernels based on lower-dimensional feature spaces and provide fast algorithms realizing the matrix-vector products. We employ the non-equispaced fast Fourier transform (NFFT), which is of linear complexity for fixed accuracy. Based on a feature grouping approach, we then show how the fast matrix-vector products can be embedded into a learning method, choosing kernel ridge regression and the conjugate gradient solver. We illustrate the performance of our approach on several data sets.

    Mathematics Subject Classification: Primary: 65F45, 65T50; Secondary: 62J10.

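The last step described in the abstract, embedding the fast matrix-vector product into kernel ridge regression with the conjugate gradient (CG) solver, amounts to solving $ (K + \lambda I) \alpha = y $ while touching $ K $ only through products $ Kv $. The snippet below is a minimal sketch of that pattern, not the authors' implementation: the ANOVA/NFFT-accelerated kernel product is stood in for by a plain dense Gaussian kernel matvec, and the data, kernel width and regularization parameter are placeholder assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def krr_fit_cg(kernel_matvec, y, lam):
    """Solve (K + lam*I) alpha = y by conjugate gradients, accessing K only
    through the black-box product kernel_matvec(v) = K @ v."""
    N = y.shape[0]
    A = LinearOperator((N, N), matvec=lambda v: kernel_matvec(v) + lam * v)
    alpha, info = cg(A, y, maxiter=500)
    if info != 0:
        raise RuntimeError(f"CG did not converge (info={info})")
    return alpha

# Placeholder for the fast ANOVA/NFFT product: a dense Gaussian kernel matvec,
# which costs O(N^2) and is exactly what the NFFT-based approach avoids.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = rng.choice([-1.0, 1.0], size=500)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 100.0 ** 2)

alpha = krr_fit_cg(lambda v: K @ v, y, lam=1.0)
labels = np.sign(K @ alpha)   # KRR output thresholded to class labels
```

Swapping the dense matvec for a genuinely fast kernel-vector product is the only change needed for fitting to scale to the sample sizes reported in the tables below.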
  • Figure 1.  Kernel-vector multiplication with and without the NFFT approach on balanced samples of various sizes of synthetic data, with $ \sigma = 100 $ and $ n_{\text{runs}} = 100 $

    Figure 2.  Embedding 2-dimensional data into a 3-dimensional space via an embedding map

    Figure 3.  The univariate kernel function $ \kappa $, here a Gaussian, is truncated and a polynomial $ p $ is constructed such that we end up with a periodic function $ \tilde\kappa $ that is smooth up to a certain degree. In higher-dimensional settings, the kernel function is truncated at $ \|\boldsymbol{r}\| = L $, i.e., radially. The resulting periodic function then has period $ h $ with respect to each dimension. A one-dimensional sketch of this construction is given after the figure list below.

    Figure 4.  Accuracy and runtime of GridSearch on various balanced samples of different size
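The one-dimensional construction from the Figure 3 caption can be prototyped in a few lines. The sketch below is an illustration under assumptions: a Gaussian $ \kappa(r) = \exp(-r^2/\sigma^2) $, arbitrary values for $ \sigma $, the truncation radius $ L $ and the period $ h $, and SciPy's Hermite interpolation (`BPoly.from_derivatives`) to build a polynomial $ p $ matching $ \kappa $ and its first derivative at $ r = L $, so that the periodized $ \tilde\kappa $ is continuously differentiable; the value imposed at $ r = h/2 $ is an arbitrary choice.

```python
import numpy as np
from scipy.interpolate import BPoly

sigma, L, h = 1.0, 2.0, 6.0   # assumed kernel width, truncation radius, period

kappa = lambda r: np.exp(-(r ** 2) / sigma ** 2)       # univariate Gaussian kernel
dkappa = lambda r: -2.0 * r / sigma ** 2 * kappa(r)    # its first derivative

# Hermite (two-point Taylor) polynomial p on [L, h/2]: match value and slope of
# kappa at r = L and impose a flat value at r = h/2, so that the h-periodic
# extension of kappa_tilde is C^1.
p = BPoly.from_derivatives([L, h / 2.0],
                           [[kappa(L), dkappa(L)],   # value and slope at r = L
                            [kappa(L), 0.0]])        # assumed value, zero slope at r = h/2

def kappa_tilde(r):
    """Truncated, periodized kernel: kappa for |r| <= L, the polynomial p on (L, h/2]."""
    r = np.abs(np.asarray(r, dtype=float)) % h
    r = np.minimum(r, h - r)                          # fold into [0, h/2]
    return np.where(r <= L, kappa(r), p(r))

print(kappa_tilde([0.0, L, 0.5 * h, h]))              # smooth across r = L and the period
```

In the higher-dimensional setting described in the caption, the same idea is applied radially, i.e. to $ \|\boldsymbol{r}\| $ instead of $ |r| $.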

    Table 1.  Benchmark data sets for numerical experiments

    | data set | $ N $ | $ d $ | number of data points in under-represented class | largest balanced sample possible |
    |---|---|---|---|---|
    | Telescope | 19020 | 10 | 6688 | 13376 |
    | HIGGS | 11000000 | 28 | 5170877 | 10341754 |
    | SUSY | 5000000 | 18 | 2287827 | 4575654 |
    | YearPredictionMSD | 515345 | 90 | 206945 | 413890 |
    | cod-rna | 488565 | 8 | 162855 | 2098847 |
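The experiments are run on balanced samples, i.e. subsets containing equally many points from each class, so the size of a balanced sample is bounded by twice the size of the under-represented class (e.g. $ 2 \cdot 6688 = 13376 $ for the Telescope data). Below is a minimal sketch of drawing such a sample; it is an illustration, not the authors' preprocessing code, and the function name and fixed seed are placeholders.

```python
import numpy as np

def balanced_sample(X, y, size, seed=0):
    """Draw `size` points with equal counts per class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    per_class = size // len(classes)
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), per_class, replace=False)
                          for c in classes])
    rng.shuffle(idx)
    return X[idx], y[idx]
```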

    Table 2.  Parameter-grids for the considered classifiers

    | classifier | kernel coefficient | regularization parameter |
    |---|---|---|
    | NFFTKernelRidge | $ \sigma: [10^{-3}, 10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}, 10^{3}] $ | $ \lambda: [1, 10^{1}, 10^{2}, 10^{3}] $ |
    | sklearn KRR | $ \gamma: [10^{6}, 10^{4}, 10^{2}, 1, 10^{-2}, 10^{-4}, 10^{-6}] $ | $ \alpha: [1, 10^{1}, 10^{2}, 10^{3}] $ |
    | sklearn SVC | $ \gamma: [10^{6}, 10^{4}, 10^{2}, 1, 10^{-2}, 10^{-4}, 10^{-6}] $ | $ C: [1, 10^{-1}, 10^{-2}, 10^{-3}] $ |
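For the two scikit-learn baselines, the grids in Table 2 translate directly into a grid search. The snippet below is a sketch under assumptions: the cross-validation setting, the RBF kernel choice and the sign-thresholding used to score the KernelRidge regressor's continuous predictions as class labels are not taken from the paper; the NFFTKernelRidge grid over $ \sigma $ and $ \lambda $ would be searched analogously with the authors' estimator.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

# Parameter grids from Table 2 for the sklearn baselines.
krr_grid = {"gamma": [1e6, 1e4, 1e2, 1, 1e-2, 1e-4, 1e-6],
            "alpha": [1, 1e1, 1e2, 1e3]}
svc_grid = {"gamma": [1e6, 1e4, 1e2, 1, 1e-2, 1e-4, 1e-6],
            "C":     [1, 1e-1, 1e-2, 1e-3]}

# KernelRidge is a regressor; to report classification accuracy on labels in
# {-1, +1}, its continuous predictions are thresholded by sign (an assumption).
sign_accuracy = make_scorer(lambda y_true, y_pred: accuracy_score(y_true, np.sign(y_pred)))

krr_search = GridSearchCV(KernelRidge(kernel="rbf"), krr_grid, scoring=sign_accuracy, cv=5)
svc_search = GridSearchCV(SVC(kernel="rbf"), svc_grid, scoring="accuracy", cv=5)

# Usage, with labels in {-1, +1}:
# krr_search.fit(X_train, y_train); print(krr_search.best_params_, krr_search.best_score_)
# svc_search.fit(X_train, y_train); print(svc_search.best_params_, svc_search.best_score_)
```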

    Table 3.  Telescope Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 69.6 / 74.2 / 57.0 | 0.1091 / 0.0026 / 0.0008 | 0.0121 / 0.0011 / 0.0003 | 0.1212 / 0.0036 / 0.0011 |
    | 500 | 80.4 / 78.1 / 76.6 | 0.1600 / 0.0075 / 0.0036 | 0.0168 / 0.0029 / 0.0021 | 0.1768 / 0.0104 / 0.0056 |
    | 1000 | 81.3 / 79.7 / 78.3 | 0.2104 / 0.0121 / 0.0125 | 0.0225 / 0.0089 / 0.0082 | 0.2328 / 0.0210 / 0.0207 |
    | 5000* | 83.2 / 82.0 / 81.4 | 0.4207 / 0.2086 / 0.1694 | 0.0376 / 0.1138 / 0.1300 | 0.4583 / 0.3224 / 0.2994 |
    | 10000* | 83.9 / 83.8 / 83.7 | 0.8128 / 0.8929 / 0.7027 | 0.0675 / 0.3991 / 0.5210 | 0.8803 / 1.2929 / 1.2237 |
    | 13376* | 83.9 / 83.8 / 83.6 | 1.2122 / 1.7200 / 1.2932 | 0.0978 / 0.7063 / 0.8661 | 1.3100 / 2.4264 / 2.1594 |

    Table 4.  HIGGS Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 55.8 / 54.4 / 40.4 | 0.2624 / 0.0020 / 0.0011 | 0.0344 / 0.0008 / 0.0006 | 0.2968 / 0.0028 / 0.0017 |
    | 500 | 58.8 / 55.7 / 54.3 | 0.3561 / 0.0048 / 0.0031 | 0.0392 / 0.0020 / 0.0020 | 0.3953 / 0.0068 / 0.0051 |
    | 1000 | 62.1 / 60.5 / 59.3 | 0.4325 / 0.0133 / 0.0107 | 0.0450 / 0.0040 / 0.0080 | 0.4776 / 0.0173 / 0.0187 |
    | 5000* | 63.3 / 62.9 / 62.0 | 1.0610 / 0.1914 / 0.2557 | 0.1056 / 0.1056 / 0.2084 | 1.1666 / 0.2970 / 0.4641 |
    | 10000* | 66.7 / 65.5 / 65.0 | 2.1512 / 0.8663 / 1.0339 | 0.1992 / 0.3673 / 0.7972 | 2.3504 / 1.2336 / 1.8312 |
    | 25000* | 64.8 / 67.0 / 66.5 | 6.3951 / 10.4802 / 9.4575 | 0.4987 / 7.5416 / 4.3869 | 6.8938 / 18.0218 / 13.8443 |
    | 50000* | 65.5 / - / 67.3 | 14.4947 / - / 56.6365 | 0.9397 / - / 20.4737 | 15.4344 / - / 77.1102 |
    | 100000* | 67.1 / - / 67.9 | 34.6103 / - / 197.9109 | 1.8479 / - / 84.8191 | 36.4582 / - / 282.7300 |
    | 200000* | 68.6 / - / 68.5 | 105.0926 / - / 991.7101 | 4.1666 / - / 337.1600 | 109.2592 / - / 1328.8701 |

    Table 5.  SUSY Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 62.2 / 62.0 / 55.6 | 0.1218 / 0.0009 / 0.0005 | 0.0159 / 0.0004 / 0.0001 | 0.1377 / 0.0014 / 0.0006 |
    | 500 | 74.3 / 74.2 / 71.8 | 0.1754 / 0.0044 / 0.0028 | 0.0216 / 0.0029 / 0.0016 | 0.1969 / 0.0073 / 0.0044 |
    | 1000 | 76.8 / 75.5 / 73.6 | 0.2427 / 0.0114 / 0.0091 | 0.0307 / 0.0039 / 0.0058 | 0.2734 / 0.0153 / 0.0149 |
    | 5000* | 77.6 / 77.3 / 76.9 | 0.8570 / 0.1879 / 0.2033 | 0.0775 / 0.0939 / 0.1614 | 0.9345 / 0.2817 / 0.3647 |
    | 10000* | 78.4 / 78.4 / 78.0 | 1.5382 / 1.0586 / 0.8214 | 0.1356 / 0.3487 / 0.6078 | 1.6737 / 1.4073 / 1.4291 |
    | 50000* | 78.9 / 79.4 / 79.2 | 9.1113 / 42.7143 / 31.1025 | 0.6305 / 9.0420 / 15.5091 | 9.7417 / 51.7563 / 46.6116 |
    | 100000* | 78.7 / - / 79.2 | 23.9190 / - / 148.1214 | 1.2657 / - / 62.3419 | 25.1847 / - / 210.4633 |
    | 200000* | 78.9 / - / 79.3 | 56.7895 / - / 732.4476 | 2.3603 / - / 255.1349 | 59.1497 / - / 987.5824 |

    Table 6.  YearPredictionMSD Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 46.0 / 50.8 / 42.6 | 0.6578 / 0.0011 / 0.0007 | 0.0899 / 0.0005 / 0.0006 | 0.7476 / 0.0016 / 0.0013 |
    | 500 | 61.4 / 64.2 / 64.2 | 0.8981 / 0.0051 / 0.0064 | 0.1122 / 0.0022 / 0.0054 | 1.0103 / 0.0074 / 0.0118 |
    | 1000 | 66.0 / 69.0 / 68.2 | 1.1817 / 0.0107 / 0.0218 | 0.1405 / 0.0066 / 0.0196 | 1.3221 / 0.0173 / 0.0413 |
    | 5000* | 70.3 / 71.7 / 71.7 | 3.2502 / 0.1861 / 0.5164 | 0.3882 / 0.0966 / 0.4705 | 3.6384 / 0.2827 / 0.9869 |
    | 10000* | 72.3 / 73.4 / 73.1 | 6.7636 / 0.8400 / 2.1104 | 0.7016 / 0.3625 / 1.8575 | 7.4651 / 1.2025 / 3.9679 |
    | 50000* | 73.4 / 74.8 / 74.8 | 53.3774 / 43.2971 / 122.4035 | 3.9307 / 9.1279 / 50.3554 | 57.3081 / 52.4250 / 172.7589 |
    | 100000* | 74.2 / - / 75.8 | 102.0414 / - / 426.8239 | 6.1424 / - / 201.7097 | 108.1838 / - / 628.5336 |
    | 200000* | 74.4 / - / 76.4 | 245.1812 / - / 2006.6694 | 12.3315 / - / 785.3145 | 257.5127 / - / 2791.9839 |
    | 413890* | 74.3 / - / 76.7 | 669.4028 / - / 7696.3063 | 24.4887 / - / 3335.6310 | 693.8915 / - / 11031.9373 |

    Table 7.  cod-rna Data Set - Accuracy and runtime of GridSearch on various balanced samples of different size (each cell lists NFFTKernelRidge / sklearn KRR / sklearn SVC)

    | sample size | best accuracy | mean runtime fit | mean runtime predict | mean total runtime |
    |---|---|---|---|---|
    | 100 | 78.0 / 77.2 / 72.2 | 0.0459 / 0.0010 / 0.0005 | 0.0066 / 0.0006 / 0.0002 | 0.0524 / 0.0016 / 0.0007 |
    | 500 | 89.2 / 92.7 / 89.6 | 0.0671 / 0.0039 / 0.0021 | 0.0074 / 0.0023 / 0.0012 | 0.0745 / 0.0063 / 0.0034 |
    | 1000 | 92.5 / 94.7 / 93.7 | 0.0882 / 0.0110 / 0.0071 | 0.0100 / 0.0037 / 0.0044 | 0.0982 / 0.0147 / 0.0116 |
    | 5000* | 95.4 / 95.7 / 95.6 | 0.3046 / 0.1788 / 0.1564 | 0.0274 / 0.0952 / 0.1031 | 0.3320 / 0.2740 / 0.2595 |
    | 10000* | 95.7 / 95.8 / 95.9 | 0.5864 / 0.8309 / 0.6044 | 0.0472 / 0.3723 / 0.4101 | 0.6336 / 1.2032 / 1.0145 |
    | 50000* | 95.7 / 95.9 / 96.0 | 3.9697 / 43.4380 / 17.9735 | 0.2376 / 9.5084 / 9.8435 | 4.2073 / 52.9464 / 27.8170 |
    | 100000* | 95.9 / - / 96.2 | 10.1743 / - / 73.8995 | 0.4686 / - / 39.0615 | 10.6430 / - / 112.9610 |
    | 209884* | 96.0 / - / 96.5 | 28.9638 / - / 422.5391 | 1.0058 / - / 170.0743 | 29.9696 / - / 592.6134 |
