Towards dense linear algebra for hybrid GPU accelerated manycore systems. (English) Zbl 1204.68268

Summary: We highlight the trends leading to the increased appeal of hybrid multicore + GPU systems for high-performance computing, and we present a set of techniques that can be used to develop efficient dense linear algebra algorithms for such systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm in which the computation is split between a multicore processor and a graphics processor, using particular techniques to reduce both the amount of pivoting and the communication between the hybrid components. The result is an efficient algorithm that makes balanced use of the multicore processor and the graphics processor.
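
The hybrid LU scheme described above follows the structure of the classical blocked right-looking factorization: a narrow panel is factored on the multicore CPU (small, latency-bound work) while the large BLAS-3 trailing-matrix update runs on the GPU (throughput-bound work). The following is a minimal, single-device NumPy sketch of that structure, not the authors' implementation; the function name `hybrid_blocked_lu` and the CPU/GPU comments are illustrative assumptions, and pivoting is omitted entirely as a stand-in for the reduced-pivoting techniques the summary alludes to (so it is only safe on matrices, such as diagonally dominant ones, that do not require pivoting).

```python
import numpy as np

def hybrid_blocked_lu(A, nb=64):
    """In-place blocked right-looking LU factorization without pivoting.

    Comments mark which steps a hybrid scheme would assign to the
    multicore CPU and which to the GPU.
    """
    n = A.shape[0]
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # CPU side: unblocked factorization of the narrow panel
        # A[k:n, k:k+kb]; this part is sequential and memory-bound.
        for j in range(k, k + kb):
            A[j + 1:, j] /= A[j, j]                        # column of L
            A[j + 1:, j + 1:k + kb] -= np.outer(A[j + 1:, j],
                                                A[j, j + 1:k + kb])
        if k + kb < n:
            # GPU side: triangular solve producing the block row of U ...
            L11 = np.tril(A[k:k + kb, k:k + kb], -1) + np.eye(kb)
            A[k:k + kb, k + kb:] = np.linalg.solve(L11, A[k:k + kb, k + kb:])
            # ... and the large GEMM-shaped trailing-matrix update,
            # which dominates the flop count and suits the GPU.
            A[k + kb:, k + kb:] -= A[k + kb:, k:k + kb] @ A[k:k + kb, k + kb:]
    return A

if __name__ == "__main__":
    n = 512
    rng = np.random.default_rng(0)
    # Diagonally dominant test matrix, so skipping pivoting is safe here.
    A = rng.standard_normal((n, n)) + n * np.eye(n)
    LU = hybrid_blocked_lu(A.copy(), nb=64)
    L = np.tril(LU, -1) + np.eye(n)
    U = np.triu(LU)
    print("residual:", np.linalg.norm(L @ U - A) / np.linalg.norm(A))
```

In an actual hybrid code the two sides are typically overlapped with look-ahead: the CPU factors the panel for step k+1 while the GPU is still updating the remainder of the trailing matrix, which is what balances the use of the two processors.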

MSC:

68W10 Parallel algorithms in computer science
68M99 Computer system organization
65F99 Numerical linear algebra
65Y05 Parallel numerical computation
