Document Zbl 1397.65337

Pikle, Nileshchandra K.; Sathe, Shailesh R.; Vyavhare, Arvind Y.

GPGPU-based parallel computing applied in the FEM using the conjugate gradient algorithm: a review. (English) Zbl 1397.65337

Sādhanā 43, No. 7, Paper No. 111, 21 p. (2018).

Summary: Parallelization of the finite-element method (FEM) has been studied by the scientific and high-performance computing community for over a decade. Most of the computations in the FEM are linear algebra operations on matrices and vectors. These operations follow the single-instruction multiple-data (SIMD) computation pattern, which is well suited to shared-memory parallel architectures. General-purpose graphics processing units (GPGPUs) have been used effectively to parallelize FEM computations since 2007. The solver step of the FEM is often carried out using conjugate gradient (CG)-type iterative methods because of their faster convergence and greater opportunities for parallelization. Although the SIMD computation patterns in the FEM map naturally onto GPU computing, there are some pitfalls, such as underutilization of threads, uncoalesced memory access, low arithmetic intensity, the limited amount of fast memory on GPUs, and synchronization overhead. Nevertheless, FEM applications have been successfully deployed on GPUs over the last 10 years to achieve significant performance improvements. This paper presents a comprehensive review of the parallel optimization strategies applied in each step of the FEM. The pitfalls and trade-offs linked to each step in the FEM are also discussed. Furthermore, some notable methods that exploit the large computing power of a GPU are discussed. The review is not limited to a single field of engineering. Rather, it is applicable to all fields of engineering and science in which FEM-based simulations are necessary.
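The solver step described in the summary can be illustrated with a minimal sketch. It is not taken from the reviewed paper: it runs on the CPU with NumPy rather than on a GPU, the function names (`csr_spmv`, `conjugate_gradient`, `poisson_1d_csr`) are hypothetical, and the 1D Poisson matrix tridiag(-1, 2, -1) is only an assumed stand-in for an assembled FEM stiffness matrix. The sketch shows why the sparse matrix-vector product (SpMV) dominates each CG iteration and why it parallelizes naturally: every row of the product is an independent dot product.

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for A stored in CSR form.

    Each row is an independent dot product, which is the part a GPU
    kernel would typically assign to a thread or a group of threads.
    """
    y = np.zeros(len(indptr) - 1)
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive-definite operator."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rs_old = r @ r
    for iteration in range(max_iter):
        if np.sqrt(rs_old) <= tol:
            return x, iteration
        Ap = matvec(p)         # the only SpMV in each iteration
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, max_iter

def poisson_1d_csr(n):
    """CSR arrays for the n x n matrix tridiag(-1, 2, -1) (assumed test problem)."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, value in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(value)
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

n = 50
indptr, indices, data = poisson_1d_csr(n)
b = np.ones(n)
x, iterations = conjugate_gradient(
    lambda v: csr_spmv(indptr, indices, data, v), b
)
residual = np.linalg.norm(b - csr_spmv(indptr, indices, data, x))
```

Because `conjugate_gradient` only needs a `matvec` callable, the same loop could be driven by an assembly-free (element-by-element) product instead of a CSR product; that substitution is the general idea behind the AF-FEM keyword listed below, although this sketch does not implement it.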

MSC:

65Y05	Parallel numerical computation
65N30	Finite element, Rayleigh-Ritz and Galerkin methods for boundary value problems involving PDEs
65M60	Finite element, Rayleigh-Ritz and Galerkin methods for initial value and initial-boundary value problems involving PDEs
65-02	Research exposition (monographs, survey articles) pertaining to numerical analysis

Keywords:

finite-element method (FEM); conjugate gradient (CG); sparse matrix-vector multiplication (SpMV); assembly-free FEM (AF-FEM); graphics processing units (GPUs); compute unified device architecture (CUDA); parallel computing

Software:

CUSPARSE; CUDA; FEniCS; SyFi; ITER-REF; OpenCL; CUBLAS; MKL; COFFEE; SELL_C_sigma

