×

A fine-grained block ILU scheme on regular structures for GPGPUs. (English) Zbl 1390.65139

Summary: Iterative methods based on block incomplete LU (BILU) factorization are considered highly effective for solving large-scale block-sparse linear systems resulting from coupled PDE systems with \(n\) equations. However, efforts on porting implicit PDE solvers to massively parallel shared-memory heterogeneous architectures, such as general-purpose graphics processing units (GPGPUs), have largely avoided BILU, leaving their enormous performance potential unfulfilled in many applications where the use of implicit schemes and BILU-type preconditioners/solvers is highly preferred. Indeed, strong inherent data dependency and high memory bandwidth demanded by block matrix operations render naive adoptions of existing sequential BILU algorithms extremely inefficient on GPGPUs. In this study, we present a fine-grained BILU (FGBILU) scheme which is particularly effective on GPGPUs. A straightforward one-sweep wavefront ordering is employed to resolve data dependency. Granularity is substantially refined as block matrix operations are carried out in a true element-wise approach. Particularly, the inversion of diagonal blocks, a well-known bottleneck, is accomplished by a parallel in-place Gauss-Jordan elimination. As a result, FGBILU is able to offer low-overhead concurrent computation at \(O(n^2 N^2)\) scale on a 3D PDE domain with a linear scale of \(N\). FGBILU has been implemented with both OpenACC and CUDA and tested as a block-sparse linear solver on a structured 3D grid. While FGBILU remains mathematically identical to sequential global BILU, numerical experiments confirm its exceptional performance on an Nvidia GPGPU.

MSC:

65N22 Numerical solution of discretized equations for boundary value problems involving PDEs
65F08 Preconditioners for iterative methods
65Y05 Parallel numerical computation
65Y10 Numerical algorithms for specific classes of architectures

Software:

CUDA; FParser; OpenACC; GPGPU
Full Text: DOI

References:

[1] Chan, T. F.; van der Vorst, H. A., Approximate and incomplete factorizations, (Keyes, D.; Sameh, A.; Venkatakrishnan, V., Parallel numerical algorithms, ICASE/LaRC Interdiscipl Ser Sci Eng, vol. 4, (1997), Springer Netherlands), 167-202 · Zbl 0865.65015
[2] Hysom, D.; Pothen, A., A scalable parallel algorithm for incomplete factor preconditioning, SIAM J Sci Comput, 22, 6, 2194-2215, (2001) · Zbl 0986.65048
[3] Hénon, P.; Saad, Y., A parallel multistage ILU factorization based on a hierarchical graph decomposition, SIAM J Sci Comput, 28, 6, 2266-2293, (2006) · Zbl 1126.65028
[4] monga Made, M. M.; van der Vorst, H. A., Parallel incomplete factorizations with pseudo-overlapped subdomains, Parallel Comput, 27, 8, 989-1008, (2001) · Zbl 0971.68060
[5] Thies J, Wubs F. Design of a parallel hybrid direct/iterative solver for CFD problems. In: IEEE 7th international conference on e-Science; 2011. p. 387-94.
[6] Heuveline, V.; Lukarski, D.; Trost, N.; Weiss, J.-P., Parallel smoothers for matrix-based geometric multigrid methods on locally refined meshes using multicore CPUs and gpus, (Keller, R.; Kramer, D.; Weiss, J.-P., Facing the multicore - challenge II, Lect Notes Comput Sci, vol. 7174, (2012), Springer Berlin, Heidelberg), 158-171
[7] Saad, Y.; van der Vorst, H. A., Iterative solution of linear systems in the 20th century, J Comput Appl Math, 123, 1-33, (2000), numerical Analysis 2000. Vol. III: Linear Algebra · Zbl 0965.65051
[8] Duff, I.; Meurant, G. A., The effect of ordering on preconditioned conjugate gradients, BIT Numer Math, 29, 4, 635-657, (1989) · Zbl 0687.65037
[9] Doi, S.; Washio, T., Ordering strategies and related techniques to overcome the trade-off between parallelism and convergence in incomplete factorizations, Parallel Comput, 25, 1995-2014, (1999)
[10] Georgescu, S.; Chow, P.; Okuda, H., GPU acceleration for FEM-based structural analysis, Arch Comput Meth Eng, 20, 2, 111-121, (2013) · Zbl 1354.65246
[11] Xia Y, Luo H, Luo L, Edwards J, Lou J, Mueller F. OpenACC-based GPU acceleration of a 3-D unstructured discontinuous galerkin method. In: 52nd AIAA aerospace sciences meeting; 2014.
[12] Jacobsen DA, Thibault JC, Senocak I. An MPI-CUDA implementation for massively parallel incompressible flow computations on multi-GPU clusters. In: 48th AIAA aerospace sciences meeting and exhibit, vol. 16; 2010.
[13] Brandvik T, Pullan G. Acceleration of a 3D Euler solver using commodity graphics hardware. In: 46th AIAA aerospace sciences meeting and exhibit; 2008. p. 607.
[14] Corrigan, A.; Camelli, F.; Löhner, R.; Mut, F., Semi-automatic porting of a large-scale Fortran CFD code to gpus, Int J Numer Meth Fluids, 69, 2, 314-331, (2012) · Zbl 1245.76003
[15] Duffy AC, Hammond DP, Nielsen EJ. Production level CFD code acceleration for hybrid many-core architectures, Tech. rep., NASA/TM-2012-217770; 2012.
[16] Fu, L.; Gao, Z.; Xu, K.; Xu, F., A multi-block viscous flow solver based on GPU parallel methodology, Comput Fluids, 95, 19-39, (2014) · Zbl 1391.76218
[17] Luo L, Edwards JR, Luo H, Mueller F. GPU port of a parallel incompressible Navier-Stokestokes solver based on OpenACC and MVAPICH2. In: AIAA aviation and aeronautics forum and exposition; 2014.
[18] van der Vorst, H., High performance preconditioning, SIAM J Sci Statist Comput, 10, 6, 1174-1185, (1989) · Zbl 0693.65027
[19] The OpenACC application programming interface; 2013.
[20] CAPS Enterprise. The OpenHMPP open standard; 2009.
[21] Beyer, J.; Stotzer, E.; Hart, A.; de Supinski, B., Openmp for accelerators, (Chapman, B.; Gropp, W.; Kumaran, K.; M uller, M., OpenMP in the Petascale Era, Lect Notes Comput Sci, vol. 6665, (2011), Springer Berlin, Heidelberg), 108-121
[22] Meijerink, J.; van der Vorst, H. A., An iterative solution method for linear systems of which the coefficient matrix is a symmetric m-matrix, Math Comput, 31, 137, 148-162, (1977) · Zbl 0349.65020
[23] Axelsson, O.; Brinkkemper, S.; Iln, V., On some versions of incomplete block-matrix factorization iterative methods, Linear Algebra Appl, 58, 3-15, (1984) · Zbl 0548.65016
[24] Wittum, G., On the robustness of ILU smoothing, SIAM J Sci Statist Comput, 10, 4, 699-717, (1989) · Zbl 0677.65096
[25] Meurant, G., The block preconditioned conjugate gradient method on vector computers, BIT Numer Math, 24, 4, 623-633, (1984) · Zbl 0556.65023
[26] Concus, P.; Golub, G.; Meurant, G., Block preconditioning for the conjugate gradient method, SIAM J Sci Statist Comput, 6, 1, 220-252, (1985) · Zbl 0556.65022
[27] Edwards, J. R.; Liou, M.-S., Low-diffusion flux-splitting methods for flows at all speeds, AIAA J, 36, 9, 1610-1617, (1998)
[28] Choi, J.-I.; Oberoi, R. C.; Edwards, J. R.; Rosati, J. A., An immersed boundary method for complex incompressible flows, J Comput Phys, 224, 2, 757-784, (2007) · Zbl 1123.76351
[29] Chorin, A. J., Numerical solution of the Navier-Stokes equations, Math Comput, 22, 104, 745-762, (1968) · Zbl 0198.50103
[30] Ramesh, K.; Gopalarathnam, A.; Edwards, J. R.; Ol, M. V.; Granlund, K., An unsteady airfoil theory applied to pitching motions validated against experiment and computation, Theoret Comput Fluid Dyn, 27, 6, 843-864, (2013)
[31] McGowan, G. Z.; Granlund, K.; Ol, M. V.; Gopalarathnam, A.; Edwards, J. R., Investigations of lift-based pitch-plunge equivalence for airfoils at low Reynolds numbers, AIAA J, 49, 7, 1511-1524, (2011)
[32] Cassidy, D. A.; Edwards, J. R.; Tian, M., An investigation of interface-sharpening schemes for multi-phase mixture flows, J Comput Phys, 228, 16, 5628-5649, (2009) · Zbl 1280.76033
[33] Choi, J.-I.; Edwards, J. R., Large eddy simulation and zonal modeling of human-induced contaminant transport, Indoor Air, 18, 3, 233-249, (2008)
[34] Choi, J.-I.; Edwards, J. R., Large-eddy simulation of human-induced contaminant transport in room compartments, Indoor Air, 22, 1, 77-87, (2012)
[35] Narsipur, S.; Gopalarathnam, A.; Edwards, J. R., A time-lag approach for prediction of trailing edge separation in unsteady flow, (AIAA aviation and aeronautics forum and exposition, (2014), AIAA)
[36] Forsythe, G. E.; Moler, C. B., Computer solution of linear algebraic systems, (1967), Prentice-Hall Englewood Cliffs, NJ, vol. 7 · Zbl 0154.40401
[37] Xiao, S.; Feng, W., Inter-block GPU communication via fast barrier synchronization, (2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), (2010), IEEE), 1-12
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.