Abstract
The block version of GMRES (BGMRES) is most advantageous over its single right-hand side (RHS) counterpart when the cost of communication is high while the cost of floating point operations is not. This is precisely the situation on modern graphics processing units (GPUs), while it is generally not the case on traditional central processing units (CPUs). In this paper, experiments on both GPUs and CPUs compare the performance of BGMRES against GMRES as the number of RHS increases, with a particular focus on GPU performance. The experiments indicate that there are many cases in which BGMRES is slower than GMRES on CPUs, but faster on GPUs. Furthermore, when varying the number of RHS on the GPU, there is an optimal number of RHS at which BGMRES is most clearly advantageous over GMRES. A computational model for the GPU is developed using hardware-specific parameters, providing insight into how the qualitative behavior of BGMRES changes as the number of RHS increases; this model also helps explain the phenomena observed in the experiments.
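The trade-off described above, amortizing memory traffic over a block of RHS until floating point work eventually dominates, can be illustrated with a generic roofline-style sketch. The hardware constants and cost terms below are illustrative assumptions (loosely inspired by a V100-class GPU), not the paper's calibrated model:

```python
# Generic roofline-style sketch of why an optimal block size s can exist.
# All constants are assumptions for illustration, not measured values.
PEAK_FLOPS = 7.0e12   # assumed peak double-precision rate, flop/s
BANDWIDTH = 9.0e11    # assumed memory bandwidth, bytes/s

def roofline(flops, bytes_moved):
    """Time estimate: the slower of compute and memory traffic."""
    return max(flops / PEAK_FLOPS, bytes_moved / BANDWIDTH)

def cost_per_rhs(nnz, n, s, k):
    """Toy cost of one restart cycle of block GMRES, divided by s RHS.

    SpMM: the sparse matrix (~12 bytes per nonzero in CSR) is streamed
    once per block step regardless of s, so its traffic is amortized
    over the block, while its flops grow linearly in s.
    Orthogonalization against a basis of up to k*s vectors has flops
    growing like s^2 and eventually dominates.
    """
    spmm = k * roofline(2 * nnz * s, 12 * nnz + 16 * n * s)
    ortho = roofline(2 * n * s**2 * k**2, 8 * n * s * k**2)
    return (spmm + ortho) / s
```

Scanning s under this sketch, the per-RHS cost first falls (SpMM traffic is amortized) and then rises (orthogonalization work grows like s²), giving an interior optimum, which is the qualitative behavior the abstract describes.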
Notes
Alternatively, nz can represent the number of non-zeros needed in the application of the coefficient matrix and the preconditioner.
Acknowledgements
We benefited from helpful comments and questions we received from Valeria Simoncini and Yousef Saad. We thank the two referees for their careful reading of the manuscript and their comments and questions which helped improve our presentation.
Funding
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the US Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525. This work was in part supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US Department of Energy Office of Science and the National Nuclear Security Administration. This research includes calculations carried out on HPC resources at Temple University supported in part by the National Science Foundation through major research instrumentation grant number 1625061 and by the US Army Research Laboratory under contract number W911NF-16-2-0189.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Appendix: Complete results
In the tables, entries marked with a ‘-’ indicate that the method failed to converge. All times are reported in seconds, and k is the restart parameter. Iteration counts reported for s×GMRES are the average number of iterations required per RHS until convergence. Data for GMRES-LI are included only for problems where the method converges for at least s = 5 RHS.
We note that BGMRES is often slower than s×GMRES when s is small, e.g., s = 1. This is due to the different least-squares solvers used by the two algorithms: BGMRES uses Householder reflections, which are advantageous for multiple RHS, while the Givens rotations in s×GMRES are tailored to the single-RHS case with minimal per-iteration effort.
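As an illustration of this difference, here is a minimal numpy sketch (assumed code for exposition, not the paper's implementation): the single-RHS Hessenberg least-squares update via Givens rotations, next to a block solve via a Householder-based QR (numpy's `qr`) applied to all RHS columns at once:

```python
import numpy as np

def givens_ls(H, beta):
    """Solve min ||beta*e1 - H y|| for an (m+1) x m Hessenberg matrix H
    via Givens rotations, as in single-RHS GMRES.  Returns the solution
    and the residual norm, which the rotations yield for free."""
    m = H.shape[1]
    R = H.astype(float).copy()
    g = np.zeros(m + 1)
    g[0] = beta
    for j in range(m):
        a, b = R[j, j], R[j + 1, j]
        r = np.hypot(a, b)
        G = np.array([[a / r, b / r], [-b / r, a / r]])
        R[[j, j + 1], j:] = G @ R[[j, j + 1], j:]   # zeros out R[j+1, j]
        g[[j, j + 1]] = G @ g[[j, j + 1]]
    y = np.linalg.solve(R[:m, :m], g[:m])
    return y, abs(g[m])

def householder_block_ls(H, B):
    """Solve min ||B - H Y||_F for all s RHS columns of B at once,
    using a Householder-based QR factorization (numpy's qr)."""
    Q, R = np.linalg.qr(H)   # reduced QR; H assumed full column rank
    return np.linalg.solve(R, Q.T @ B)
```

For s = 1 the Givens update is cheaper, but the Householder-based block solve factors once and applies the result to every column, which pays off as s grows.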
Cite this article
Boman, E.G., Higgins, A.J. & Szyld, D.B. Optimal size of the block in block GMRES on GPUs: computational model and experiments. Numer Algor 92, 119–147 (2023). https://doi.org/10.1007/s11075-022-01439-z