
Pebbling game and alternative basis for high performance matrix multiplication. (English) Zbl 1527.15001

Summary: Matrix multiplication is one of the most extensively used kernels in scientific computing. Although subcubic algorithms exist, most high performance implementations are based on the classical \(\Theta(n^3)\) matrix multiplication. Designing an algorithm that obtains even modest performance improvements over existing implementations requires carefully addressing challenges such as reducing computation costs, communication costs, and memory footprint. We provide the first high performance general matrix-matrix multiplication that applies the alternative basis method to Strassen’s algorithm. We reduce the basis transformation overhead and decrease the memory footprint of the bilinear phase by using the pebbling game optimization scheme, consequently improving both arithmetic and communication costs. Our algorithm outperforms DGEMM on feasible matrix dimensions starting at \(n=96\). It obtains an increasing speedup of up to nearly \(\times 2\) for larger matrix dimensions when running sequentially, and even larger speedups for certain matrix dimensions when running in parallel.
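For orientation, a minimal sketch of the classical Strassen recursion that the summarized work builds on. This reproduces only the standard seven-product scheme of Strassen (1969); the paper's contributions, i.e. the alternative basis transformation (which reduces the number of additions) and the pebbling game schedule (which reduces intermediate memory), are not reproduced here. The `cutoff` parameter, below which the code falls back to classical multiplication, is an illustrative choice, not a value from the paper.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Classical Strassen recursion for square matrices.

    Assumes n is a power of two times the cutoff so that blocks
    split evenly; a production kernel would pad or peel odd sizes.
    """
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # classical multiply for small blocks

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven recursive products (Strassen, 1969); the linear
    # combinations below are the additions an alternative basis
    # reduces and a pebbling schedule orders to save memory.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    # Assemble the four quadrants of the product
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

In a high performance setting the recursion bottoms out into a tuned DGEMM kernel rather than `A @ B`, and the seven temporaries would be scheduled (as in the pebbling approach) so that only a few live at once.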

MSC:

15-04 Software, source code, etc. for problems pertaining to linear algebra
65F99 Numerical linear algebra
65K05 Numerical mathematical programming methods
