
Scaling up parallel computation of tiled QR factorizations by a distributed scheduling runtime system and analytical modeling. (English) Zbl 1492.65097

Summary: Achieving scalable performance for QR factorizations on massively parallel manycore systems requires a comprehensive design spanning algorithm redesign, efficient runtime systems, synchronization and communication reduction, and analytical performance modeling. This paper presents tiled communication-avoiding QR factorization software that scales efficiently for matrices of general dimensions. We design a tiled communication-avoiding QR factorization algorithm and implement it on a fully distributed dynamic scheduling runtime system to minimize both synchronization and communication. The whole class of communication-avoiding QR factorization algorithms depends on an important parameter \(D\) (the number of domains), whose best value is not known in advance and has so far required manual tuning and empirical search. To that end, we introduce a simplified analytical performance model that determines an optimal number of domains \(D^\ast\). Experimental results show that our new parallel implementation is up to 30% faster than a state-of-the-art multicore-based numerical library, and up to 30 times faster than ScaLAPACK on thousands of CPU cores. Furthermore, the number of domains predicted by the new analytical model is as competitive as one found by exhaustive search, with an average performance difference of 1%.
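The domain-based reduction at the heart of communication-avoiding QR can be illustrated with a minimal one-level TSQR sketch: split the rows of a tall panel into \(D\) domains, factor each domain locally, then factor the stack of local \(R\) factors. The `best_d` cost model below is purely illustrative (it is NOT the paper's analytical model, which is not reproduced in this summary): it balances local panel work, which shrinks as \(D\) grows, against a \(\log_2 D\)-deep reduction of small \(R\) factors plus per-message latency.

```python
import math
import numpy as np

def tsqr_r(A, D):
    """One-level TSQR sketch: returns the R factor of a tall matrix A.

    Splits A's rows into D domains, computes a local QR per domain,
    then factors the stack of local R factors to obtain the global R.
    The paper's implementation uses a tiled, task-scheduled tree
    reduction; this flat one-level variant only shows the idea.
    """
    blocks = np.array_split(A, D, axis=0)            # D row domains
    local_rs = [np.linalg.qr(B)[1] for B in blocks]  # local n-by-n R factors
    _, R = np.linalg.qr(np.vstack(local_rs))         # reduce the R stack
    return R

def best_d(m, n, d_max, alpha=1e-6, gamma=1e-9):
    """Illustrative-only model for choosing D (hypothetical constants).

    gamma: time per flop; alpha: per-message latency. Local QR work on
    m/D rows decreases with D, while the log2(D)-deep tree reduction of
    stacked R factors adds work and messages.
    """
    def cost(D):
        local = gamma * 2.0 * (m / D) * n * n
        tree = math.ceil(math.log2(D)) * (gamma * 2.0 * (2 * n) * n * n + alpha)
        return local + tree
    return min(range(1, d_max + 1), key=cost)

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))
R_tsqr = tsqr_r(A, D=4)
R_ref = np.linalg.qr(A)[1]
# R is unique up to the sign of each row, so compare magnitudes.
assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))
D_star = best_d(m=1000, n=8, d_max=16)
```

Under this toy model, increasing \(D\) trades local flops for extra reduction steps and messages, which is why an intermediate \(D^\ast\) rather than the largest possible \(D\) tends to win.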

MSC:

65F25 Orthogonalization in numerical linear algebra
65Y05 Parallel numerical computation
68M07 Mathematical problems of computer architecture
68M14 Distributed systems

Software:

LAPACK; ScaLAPACK
