A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. (English) Zbl 1008.68530

Summary: Despite extensive research, optimal performance has previously been difficult to attain for matrix multiplication (especially for large matrices) on most architectures, owing to the lack of a structured approach and the limitations imposed by matrix storage formats. A simple but effective framework is presented here that lays the foundation for building high-performance matrix-multiplication codes in a structured, portable and efficient manner. The resulting codes are validated on three representative RISC and CISC architectures, on which they significantly outperform highly optimized libraries such as ATLAS as well as other competing methodologies reported in the literature. The main component of the proposed approach is a hierarchical storage format that efficiently generalizes the memory-hierarchy-friendly Morton ordering to matrices of arbitrary size. The storage format supports polyalgorithms, which are shown here to be essential for obtaining the best possible performance over a range of problem sizes. Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion-cutoff criterion for Strassen's algorithm. The authors expose the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high-performance code; these kernel routines operate on small blocks that fit in the L1 cache. The performance advantages of the proposed framework can be delivered to new and existing applications through object-oriented or compiler-based approaches.
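To make the storage-format idea concrete, the following is a minimal C sketch of Morton-order (Z-order) tile indexing via the standard bit-interleaving ("integer dilation") trick. The tile size B, the helper names, and the implicit padding of the tile grid to a power of two are illustrative assumptions for this sketch, not the paper's exact format, which generalizes Morton ordering to arbitrary matrix sizes.

```c
#include <stdint.h>
#include <stddef.h>

/* Dilate the low 16 bits of x: insert a zero bit between consecutive
 * bits (standard magic-number bit twiddling). */
static uint32_t dilate16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Z-order (Morton) index of tile (ti, tj): row bits occupy the odd
 * bit positions, column bits the even ones. */
static uint32_t morton_index(uint32_t ti, uint32_t tj)
{
    return (dilate16(ti) << 1) | dilate16(tj);
}

/* Linear offset of element (i, j) when the matrix is stored as B x B
 * tiles laid out in Morton order, row-major inside each tile.  The
 * power-of-two tile grid is an assumption of this sketch only. */
static size_t morton_offset(uint32_t i, uint32_t j, uint32_t B)
{
    return (size_t)morton_index(i / B, j / B) * B * B
         + (size_t)(i % B) * B + (j % B);
}
```

Because nearby tiles in the Z-order curve are nearby in memory, recursive and blocked multiplication algorithms traverse this layout with good locality at every level of the memory hierarchy. The variable recursion cutoff for Strassen's algorithm can likewise be illustrated by a recursion that falls back to an L1-resident kernel once the block size drops below a threshold. The sketch below makes simplifying assumptions: square power-of-two operands in row-major storage, a plain triple loop standing in for the tuned kernel, and a caller-supplied `cutoff` parameter in place of the paper's cutoff criterion, which would select the cutoff per problem size and architecture.

```c
#include <stdlib.h>
#include <string.h>

/* Base case standing in for a tuned L1-resident kernel:
 * C += A * B for n x n blocks with independent leading dimensions. */
static void kernel_mm(int n, const double *A, int lda,
                      const double *B, int ldb, double *C, int ldc)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*ldc + j] += A[i*lda + k] * B[k*ldb + j];
}

/* Z = X + s*Y for n x n blocks; Z is contiguous (leading dimension n). */
static void mat_axpy(int n, const double *X, int ldx,
                     const double *Y, int ldy, double s, double *Z)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Z[i*n + j] = X[i*ldx + j] + s * Y[i*ldy + j];
}

/* C += s*M, where M is h x h and contiguous. */
static void acc(int h, double *C, int ldc, const double *M, double s)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            C[i*ldc + j] += s * M[i*h + j];
}

/* C += A * B by Strassen's recursion, n a power of two.  Recursion
 * stops once n <= cutoff; a variable cutoff is modeled by letting
 * the caller choose it per problem and machine. */
static void strassen_mm(int n, const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc, int cutoff)
{
    if (n <= cutoff) { kernel_mm(n, A, lda, B, ldb, C, ldc); return; }

    const int h = n / 2;
    const size_t bytes = (size_t)h * h * sizeof(double);
    const double *A11 = A, *A12 = A + h,
                 *A21 = A + (size_t)h*lda, *A22 = A21 + h;
    const double *B11 = B, *B12 = B + h,
                 *B21 = B + (size_t)h*ldb, *B22 = B21 + h;
    double *C11 = C, *C12 = C + h,
           *C21 = C + (size_t)h*ldc, *C22 = C21 + h;
    double *T1 = malloc(bytes), *T2 = malloc(bytes), *M = malloc(bytes);

    /* M1 = (A11+A22)(B11+B22); C11 += M1, C22 += M1 */
    mat_axpy(h, A11, lda, A22, lda, +1.0, T1);
    mat_axpy(h, B11, ldb, B22, ldb, +1.0, T2);
    memset(M, 0, bytes); strassen_mm(h, T1, h, T2, h, M, h, cutoff);
    acc(h, C11, ldc, M, +1.0); acc(h, C22, ldc, M, +1.0);
    /* M2 = (A21+A22)B11; C21 += M2, C22 -= M2 */
    mat_axpy(h, A21, lda, A22, lda, +1.0, T1);
    memset(M, 0, bytes); strassen_mm(h, T1, h, B11, ldb, M, h, cutoff);
    acc(h, C21, ldc, M, +1.0); acc(h, C22, ldc, M, -1.0);
    /* M3 = A11(B12-B22); C12 += M3, C22 += M3 */
    mat_axpy(h, B12, ldb, B22, ldb, -1.0, T2);
    memset(M, 0, bytes); strassen_mm(h, A11, lda, T2, h, M, h, cutoff);
    acc(h, C12, ldc, M, +1.0); acc(h, C22, ldc, M, +1.0);
    /* M4 = A22(B21-B11); C11 += M4, C21 += M4 */
    mat_axpy(h, B21, ldb, B11, ldb, -1.0, T2);
    memset(M, 0, bytes); strassen_mm(h, A22, lda, T2, h, M, h, cutoff);
    acc(h, C11, ldc, M, +1.0); acc(h, C21, ldc, M, +1.0);
    /* M5 = (A11+A12)B22; C11 -= M5, C12 += M5 */
    mat_axpy(h, A11, lda, A12, lda, +1.0, T1);
    memset(M, 0, bytes); strassen_mm(h, T1, h, B22, ldb, M, h, cutoff);
    acc(h, C11, ldc, M, -1.0); acc(h, C12, ldc, M, +1.0);
    /* M6 = (A21-A11)(B11+B12); C22 += M6 */
    mat_axpy(h, A21, lda, A11, lda, -1.0, T1);
    mat_axpy(h, B11, ldb, B12, ldb, +1.0, T2);
    memset(M, 0, bytes); strassen_mm(h, T1, h, T2, h, M, h, cutoff);
    acc(h, C22, ldc, M, +1.0);
    /* M7 = (A12-A22)(B21+B22); C11 += M7 */
    mat_axpy(h, A12, lda, A22, lda, -1.0, T1);
    mat_axpy(h, B21, ldb, B22, ldb, +1.0, T2);
    memset(M, 0, bytes); strassen_mm(h, T1, h, T2, h, M, h, cutoff);
    acc(h, C11, ldc, M, +1.0);

    free(T1); free(T2); free(M);
}
```

A fixed compile-time cutoff would forfeit performance across the range of problem sizes the summary mentions; making `cutoff` a runtime parameter is what lets the recursion depth adapt, which is the essence of the variable-cutoff idea described above.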

MSC:

68U99 Computing methodologies and applications
65F30 Other matrix algorithms (MSC2010)
65Y10 Numerical algorithms for specific classes of architectures
68W15 Distributed algorithms