BLIS: a framework for rapidly instantiating BLAS functionality. (English) Zbl 1347.65054


MSC:

65Fxx Numerical linear algebra
65-04 Software, source code, etc. for problems pertaining to numerical analysis
65Y15 Packaged methods for numerical algorithms
65Y20 Complexity and performance of numerical algorithms

References:

[1] R. Agarwal, F. Gustavson, and M. Zubair. 1994. Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM J. Res. Dev. 38, 5. · doi:10.1147/rd.385.0563
[2] E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou, and L. Rosenberg. 2011. Towards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures. In High Performance Computing for Computational Science (VECPAR 2010). Lecture Notes in Computer Science, vol. 6449, Springer, 129–138. · Zbl 1323.65022 · doi:10.1007/978-3-642-19328-6_14
[3] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. J. Phys. Conference Series 180. · doi:10.1088/1742-6596/180/1/012037
[4] AMD. 2012. AMD Core Math Library. http://developer.amd.com/tools/cpu/acml/pages/default.aspx.
[5] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. J. Dongarra, J. D. Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen. 1999. LAPACK Users’ Guide. 3rd Ed. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. · Zbl 0934.65030 · doi:10.1137/1.9780898719604
[6] G. Belter, E. R. Jessup, I. Karlin, and J. G. Siek. 2009. Automating the generation of composed linear algebra kernels. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC’09). 59:1–59:12. · doi:10.1145/1654059.1654119
[7] P. Bientinesi, J. A. Gunnels, M. E. Myers, E. S. Quintana-Ortí, and R. A. van de Geijn. 2005. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Softw. 31, 1, 1–26. · Zbl 1073.65036
[8] J. Bilmes, K. Asanović, C.-W. Chin, and J. Demmel. 1997. Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing.
[9] C. Bischof and C. Van Loan. 1987. The WY representation for products of Householder matrices. SIAM J. Sci. Stat. Comput. 8, 1, s2–s13. · Zbl 0628.65033 · doi:10.1137/0908009
[10] BLAS 2012. http://www.netlib.org/blas/.
[11] BLAST 2002. Basic linear algebra subprograms technical forum standard. Int. J. High Perf. Appl. Supercomput. 16, 1. · Zbl 1070.65521 · doi:10.1177/10943420020160010101
[12] E. Chan, E. S. Quintana-Ortí, G. Quintana-Ortí, and R. van de Geijn. 2007. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In Proceedings of the 19th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’07). ACM, New York, 116–125.
[13] E. Chan, F. G. Van Zee, P. Bientinesi, E. S. Quintana-Ortí, G. Quintana-Ortí, and R. van de Geijn. 2008. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming (PPoPP’08). ACM, New York, 123–132.
[14] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. 1992. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the 4th Symposium on the Frontiers of Massively Parallel Computation. IEEE Computer Society Press, 120–127. · doi:10.1109/FMPC.1992.234898
[15] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1, 1–17. · Zbl 0900.65115 · doi:10.1145/77626.79170
[16] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. 1988. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Softw. 14, 1, 1–17. · Zbl 0639.65016 · doi:10.1145/42288.42291
[17] J. J. Dongarra, R. A. van de Geijn, and R. C. Whaley. 1993. Two dimensional basic linear algebra communication subprograms. In Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing.
[18] K. Goto and R. van de Geijn. 2008a. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw. 34, 3, 12:1–12:25. · Zbl 1190.65064
[19] K. Goto and R. van de Geijn. 2008b. High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw. 35, 1, 1–14. · doi:10.1145/1377603.1377607
[20] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. 2001a. FLAME: Formal linear algebra methods environment. ACM Trans. Math. Softw. 27, 4, 422–455. · Zbl 1070.65522 · doi:10.1145/504210.504213
[21] J. A. Gunnels, G. M. Henry, and R. A. van de Geijn. 2001b. A family of high-performance matrix multiplication algorithms. In Proceedings of the International Conference on Computational Science (ICCS 2001), Part I, V. N. Alexandrov, J. J. Dongarra, B. A. Juliano, R. S. Renner, and C. K. Tan, Eds., Lecture Notes in Computer Science, vol. 2073, Springer-Verlag, 51–60. · Zbl 0982.68505 · doi:10.1007/3-540-45545-0_15
[22] J. A. Gunnels and R. A. van de Geijn. 2001. Formal methods for high-performance linear algebra libraries. In The Architecture of Scientific Software, R. F. Boisvert and P. T. P. Tang, Eds., Kluwer Academic Press, 193–210. · doi:10.1007/978-0-387-35407-1_12
[23] G. W. Howell, J. W. Demmel, C. T. Fulton, S. Hammarling, and K. Marmol. 2008. Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Trans. Math. Softw. 34, 3, 14:1–14:33. · Zbl 1190.65056
[24] K. Huang and J. Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. 33, 6, 518–528. · Zbl 0557.68027 · doi:10.1109/TC.1984.1676475
[25] IBM. 2012. Engineering and Scientific Subroutine Library. http://www.ibm.com/systems/software/essl/.
[26] Intel. 2012. Math Kernel Library. http://developer.intel.com/software/products/mkl/.
[27] T. Joffrain, T. M. Low, E. S. Quintana-Ortí, R. van de Geijn, and F. Van Zee. 2006. Accumulating Householder transformations, revisited. ACM Trans. Math. Softw. 32, 2, 169–179. · Zbl 1365.65106
[28] B. Kågström, P. Ling, and C. V. Loan. 1998. GEMM-based level 3 BLAS: High performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw. 24, 3, 268–302. · Zbl 0930.65047
[29] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. 1979. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw. 5, 3, 308–323. · Zbl 0412.65022 · doi:10.1145/355841.355847
[30] B. Marker, J. Poulson, D. Batory, and R. van de Geijn. 2012. Designing linear algebra algorithms by transformation: Mechanizing the expert developer. In Proceedings of the International Workshop on Automatic Performance Tuning (iWAPT 2012), held in conjunction with VECPAR.
[31] C. Moler, J. Little, and S. Bangert. 1987. Pro-Matlab, User’s Guide. The Mathworks, Inc.
[32] OpenBLAS 2012. http://xianyi.github.com/OpenBLAS/.
[33] A. Pedram, A. Gerstlauer, and R. A. van de Geijn. 2012a. On the efficiency of register file versus broadcast interconnect for collective communications in data-parallel hardware accelerators. In Proceedings of the International Symposium on Computer Architecture and High Performance Computing, 19–26. · doi:10.1109/SBAC-PAD.2012.35
[34] A. Pedram, R. A. van de Geijn, and A. Gerstlauer. 2012b. Codesign tradeoffs for high-performance, low-power linear algebra architectures. IEEE Trans. Comput. 61, 12, 1724–1736. · Zbl 1365.65315 · doi:10.1109/TC.2012.132
[35] J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and N. A. Romero. 2013. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw. 39, 2, 13:1–13:24. · Zbl 1295.65137
[36] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. 2005. SPIRAL: Code generation for DSP transforms. Proc. IEEE, Special Issue on Program Generation, Optimization, and Adaptation 93, 2, 232–275.
[37] G. Quintana-Ortí, E. S. Quintana-Ortí, R. A. van de Geijn, F. G. Van Zee, and E. Chan. 2009. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Trans. Math. Softw. 36, 3, 14:1–14:26. · Zbl 1364.65105
[38] M. D. Schatz, T. M. Low, R. A. van de Geijn, and T. G. Kolda. 2014. Exploiting symmetry in tensors for high performance multiplication with symmetric tensors. SIAM J. Sci. Comput. 36, 5, C453–C479. · Zbl 1307.65057 · doi:10.1137/130907215
[39] R. Schreiber and C. Van Loan. 1989. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. 10, 1, 53–57. · Zbl 0664.65025 · doi:10.1137/0910005
[40] J. G. Siek, I. Karlin, and E. R. Jessup. 2008. Build to order linear algebra kernels. In Proceedings of the International Symposium on Parallel and Distributed Processing 2008 (IPDPS 2008). 1–8. · doi:10.1109/IPDPS.2008.4536183
[41] T. M. Smith, R. A. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee. 2014. Anatomy of high-performance many-threaded matrix multiplication. In Proceedings of the 28th International Parallel & Distributed Processing Symposium (IPDPS). IEEE Computer Society Press, 1049–1059. · doi:10.1109/IPDPS.2014.110
[42] E. Solomonik, J. Hammond, and J. Demmel. 2014. A preliminary analysis of Cyclops Tensor Framework. Tech. Rep. UCB/EECS-2012-29, EECS Department, University of California, Berkeley.
[43] R. A. van de Geijn. 1997. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press.
[44] R. A. van de Geijn and E. S. Quintana-Ortí. 2008. The science of programming matrix computations. www.lulu.com.
[45] F. G. Van Zee. 2012. libflame: The complete reference. www.lulu.com.
[46] F. G. Van Zee, E. Chan, R. van de Geijn, E. S. Quintana-Ortí, and G. Quintana-Ortí. 2009. The libflame library for dense matrix computations. IEEE Computat. Sci. Eng. 11, 6, 56–62.
[47] F. G. Van Zee, T. Smith, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, T. M. Low, B. Marker, L. Killough, and R. A. van de Geijn. 2013. Implementing level-3 BLAS with BLIS: Early experience. FLAME Working Note #69. Tech. Rep. TR-13-03, Department of Computer Sciences, The University of Texas at Austin. To appear in ACM TOMS.
[48] F. G. Van Zee, R. A. van de Geijn, and G. Quintana-Ortí. 2014. Restructuring the tridiagonal and bidiagonal QR algorithms for performance. ACM Trans. Math. Softw. 40, 3, Article 18. · Zbl 1322.65051
[49] F. G. Van Zee, R. A. van de Geijn, G. Quintana-Ortí, and G. J. Elizondo. 2012. Families of algorithms for reducing a matrix to condensed form. ACM Trans. Math. Softw. 39, 1, 2:1–2:32. · Zbl 1295.65052
[50] V. Volkov and J. Demmel. 2008. LU, QR and Cholesky factorizations using vector capabilities of GPUs. Tech. Rep. UCB/EECS-2008-49, EECS Department, University of California, Berkeley.
[51] R. C. Whaley and J. J. Dongarra. 1998. Automatically tuned linear algebra software. In Proceedings of SC’98. · doi:10.1109/SC.1998.10004
[52] K. Yotov, X. Li, M. J. Garzarán, D. Padua, K. Pingali, and P. Stodghill. 2005. Is search really necessary to generate high-performance BLAS? Proc. IEEE, Special Issue on Program Generation, Optimization, and Adaptation 93, 2.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or perfect matching.