Document Zbl 1397.65340

Akhtar, Muhammad Naveed; Durad, Muhammad Hanif; Usman, Anila; Mughal, Muhammad Abid

Efficient memory access patterns for solving 3D Laplace equation on GPU. (English) Zbl 1397.65340

Iran. J. Sci. Technol., Trans. A, Sci. 42, No. 2, 623-633 (2018).

Summary: Graphics processing units (GPUs) are highly scalable parallel platforms for computation. A GPU contains thousands of cores along with different types of memory spaces having varying bandwidths. The maximum throughput of GPU computation depends on efficient use of these memory types. This paper presents research involving 12 different kernels to solve the standard Laplace equation in three dimensions. Each kernel uses a unique memory access pattern. Benchmarks are established for this problem, and a novel efficient kernel is suggested after in-depth analysis. A throughput of more than 50 giga floating point operations per second (GFLOPS) has been obtained on an average GPU as a consequence of optimizing the memory access path. The best approach achieves a speedup of about 70 on the GPU in comparison to a CPU.
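
For orientation, the kind of computation the summary refers to can be sketched as a 7-point finite-difference sweep over a 3D grid stored in one flat array. This is a minimal CPU sketch in Python, not the authors' kernels: the use of Jacobi iteration, the x-fastest memory layout, and the names idx, jacobi_sweep and solve_laplace are all assumptions made for illustration. On a GPU, an x-fastest layout of this kind is what would let neighbouring threads read neighbouring addresses.

```python
def idx(i, j, k, nx, ny):
    """Flatten (i, j, k) so that i (the x index) varies fastest in memory."""
    return i + nx * (j + ny * k)


def jacobi_sweep(u, nx, ny, nz):
    """Return a new grid after one 7-point Jacobi sweep; boundary values are copied unchanged."""
    new = list(u)
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):  # innermost loop walks contiguous memory
                c = idx(i, j, k, nx, ny)
                new[c] = (u[c - 1] + u[c + 1]                  # x neighbours, stride 1
                          + u[c - nx] + u[c + nx]              # y neighbours, stride nx
                          + u[c - nx * ny] + u[c + nx * ny]    # z neighbours, stride nx*ny
                          ) / 6.0
    return new


def solve_laplace(nx, ny, nz, top=1.0, sweeps=200):
    """Hypothetical test problem: u = top on the k = nz-1 face, 0 on all other faces."""
    u = [0.0] * (nx * ny * nz)
    for j in range(ny):
        for i in range(nx):
            u[idx(i, j, nz - 1, nx, ny)] = top
    for _ in range(sweeps):
        u = jacobi_sweep(u, nx, ny, nz)
    return u
```

The three neighbour pairs have strides 1, nx and nx*ny in memory, so only the x neighbours are adjacent; the y and z reads are the ones that alternative memory spaces (shared memory, texture or surface references, as listed in the keywords) would typically be used to serve.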

MSC:

65Y10	Numerical algorithms for specific classes of architectures
65N06	Finite difference methods for boundary value problems involving PDEs

Keywords:

GPU; Laplace 3D; GPU texture; GPU surface references; GPU shared memory; lockless synchronization; compute unified device architecture (CUDA)

Software:

Mint; CUDA

References:

[1]	Gray A, Sjöström A, Ilieva-Litova N (2013) Best Practice mini-guide accelerated clusters: using general purpose GPUs
[2]	Chen, F, A new framework of GPU-accelerated spectral solvers: collocation and Galerkin methods for systems of coupled elliptic equations, J Sci Comput, 62, 575-600, (2015) · Zbl 1320.65183 · doi:10.1007/s10915-014-9868-3
[3]	Cheney E, Kincaid D (2012) Numerical mathematics and computing. Nelson Education · Zbl 0487.65001
[4]	Cheng J, Grossman M, McKercher T (2014) Professional Cuda C Programming. Wiley
[5]	Dugan, N; Genovese, L; Goedecker, S, A customized 3D GPU Poisson solver for free boundary conditions, Comput Phys Commun, 184, 1815-1820, (2013) · doi:10.1016/j.cpc.2013.02.024
[6]	Glaskowsky PN (2009) NVIDIA’s Fermi: the first complete GPU computing architecture. White paper
[7]	Helfenstein, R; Koko, J, Parallel preconditioned conjugate gradient algorithm on GPU, J Comput Appl Math, 236, 3584-3590, (2012) · Zbl 1245.65034 · doi:10.1016/j.cam.2011.04.025
[8]	Jiang, B; Dai, W; Khaliq, A; Carey, M; Zhou, X; Zhang, L, Novel 3D GPU based numerical parallel diffusion algorithms in cylindrical coordinates for health care simulation, Math Comput Simul, 109, 1-19, (2015) · Zbl 1519.92098 · doi:10.1016/j.matcom.2014.07.003
[9]	Jost T, Contassot-Vivier S, Vialle S (2009) An efficient multi-algorithms sparse linear solver for GPUs. Paper presented at the ParCo
[10]	Konstantinidis, E; Cotronis, Y, Graphics processing unit acceleration of the red/black SOR method, Concurr Comput Pract Exp, 25, 1107-1120, (2013) · doi:10.1002/cpe.2952
[11]	Kumar V, Grama A, Gupta A, Karypis G (1994) Introduction to parallel computing: design and analysis of algorithms. Benjamin/Cummings Publishing Company, Redwood City · Zbl 0861.68040
[12]	Heath MT (2002) Scientific computing: an introductory survey. The McGraw-Hill Companies Inc., New York · Zbl 0903.68072
[13]	Nvidia (2011) Tuning CUDA applications for Fermi version 1.0. NVIDIA, May
[14]	Nvidia (2012) NVIDIA GeForce GTX 680 Whitepaper. NVIDIA Corporation
[15]	Nvidia (2014a) CUDA C Best Practices Guide version 6.5
[16]	Nvidia (2014b) CUDA C programming guide version 6.5. NVIDIA Corporation, Santa Clara
[17]	Nvidia (2014c) Tuning CUDA applications for Kepler
[18]	Papageorgiou, A; Platis, N, Triangular mesh simplification on the GPU, Vis Comp, 31, 235-244, (2015) · doi:10.1007/s00371-014-1039-x
[19]	Unat D, Cai X, Baden SB (2011) Mint: realizing CUDA performance in 3D stencil methods with annotated C. In: Paper presented at the Proceedings of the international conference on Supercomputing
[20]	Whitehead N, Fit-Florea A (2011) Precision and performance: floating point and IEEE 754 compliance for NVIDIA GPUs
[21]	Xiao S, Feng WC (2010) Inter-block GPU communication via fast barrier synchronization. In: Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium, IEEE, pp 1-12

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.