Software for sparse tensor decomposition on emerging computing architectures. (English) Zbl 07099291

Summary: In this paper, we develop software for decomposing sparse tensors that is portable to, and performant on, a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches that of optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP), the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approach and implementation useful for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to traverse tensor nonzeros in a permuted order, reducing atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and on NVIDIA K80 and P100 GPUs, showing that it is competitive with state-of-the-art architecture-specific codes while having the advantage of running on a variety of architectures.
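To make the key kernel concrete, here is a minimal sketch (not the authors' implementation, whose names and structure are not reproduced here) of a mode-1 MTTKRP over a third-order sparse tensor stored in coordinate (COO) format, written against the Kokkos API. For a rank-R decomposition with factor matrices B and C, each nonzero x(i,j,k) contributes x(i,j,k)·B(j,r)·C(k,r) to row i of the output M. All function and variable names below are hypothetical.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical mode-1 MTTKRP sketch: M(i,r) += vals(n) * B(j,r) * C(k,r)
// for each COO nonzero n with coordinates (i, j, k).
void mttkrp_mode1(int nnz, int rank,
                  Kokkos::View<const int*>     i_idx,  // mode-1 coordinates
                  Kokkos::View<const int*>     j_idx,  // mode-2 coordinates
                  Kokkos::View<const int*>     k_idx,  // mode-3 coordinates
                  Kokkos::View<const double*>  vals,   // nonzero values
                  Kokkos::View<const double**> B,      // mode-2 factor matrix
                  Kokkos::View<const double**> C,      // mode-3 factor matrix
                  Kokkos::View<double**>       M) {    // output, pre-zeroed
  // Coarse-grained parallelism over nonzeros; Kokkos maps this loop to
  // threads on a CPU or to a grid of threads on a GPU from the same source.
  Kokkos::parallel_for("mttkrp_mode1", nnz, KOKKOS_LAMBDA(const int n) {
    const int i = i_idx(n), j = j_idx(n), k = k_idx(n);
    const double v = vals(n);
    // The rank loop is the fine-grained level of parallelism; a tuned
    // version would map it to vector lanes (e.g., a Kokkos TeamPolicy
    // combined with ThreadVectorRange).
    for (int r = 0; r < rank; ++r) {
      // Distinct nonzeros can share the same output row i, so the
      // accumulation must be atomic.
      Kokkos::atomic_add(&M(i, r), v * B(j, r) * C(k, r));
    }
  });
}
```

The sketch assumes Kokkos::initialize has been called and that all Views reside in the default execution space's memory. The permuted-traversal optimization mentioned in the summary would reorder the nonzeros so that updates to the same output row land close together, reducing contention on the atomic adds above.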

MSC:

65Y10 Numerical algorithms for specific classes of architectures
15A72 Vector and tensor algebra, theory of invariants
15-04 Software, source code, etc. for problems pertaining to linear algebra
