Abstract
We introduce a high-performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half-precision) floating-point operands. Our code is specifically designed for efficient machine learning inference (and, to a certain extent, also training) with deep neural networks. The results on the NVIDIA Carmel multicore processor, which implements the ARMv8.2 architecture, show considerable performance gains for the gemm kernel, close to the theoretical peak acceleration that can be expected when moving from 32-bit to 16-bit arithmetic and data. When combined with the type of convolution operator arising in convolutional neural networks, the speed-ups are more modest, though still relevant.
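The trade-off the abstract describes can be emulated in a few lines of NumPy. This is only an illustrative sketch, not the paper's ARM NEON kernel: it shows that 16-bit operands halve the memory footprint of the gemm inputs while introducing a small rounding error relative to a 32-bit reference, which is the precision/performance balance the paper exploits on native FP16 hardware.

```python
import numpy as np

# Illustrative sketch (not the paper's ARMv8.2 kernel): compare a
# single-precision GEMM, C = A @ B, against the same product computed
# with half-precision (FP16) operands.

m, n, k = 64, 64, 64
rng = np.random.default_rng(0)
A32 = rng.standard_normal((m, k), dtype=np.float32)
B32 = rng.standard_normal((k, n), dtype=np.float32)

# FP32 reference result
C32 = A32 @ B32

# Same product with 16-bit operands: half the storage (and, on hardware
# with native FP16 arithmetic, roughly twice the peak FLOP rate)
A16, B16 = A32.astype(np.float16), B32.astype(np.float16)
C16 = A16 @ B16

print(A16.nbytes == A32.nbytes // 2)   # True: 16-bit data halves the footprint
err = float(np.max(np.abs(C32 - C16.astype(np.float32))))
print(err)                             # small rounding error from FP16
```

On ARMv8.2, the analogous kernel would perform the inner products directly with FP16 NEON instructions, so the bandwidth saving comes with an arithmetic speed-up as well.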
Acknowledgements
This work was supported by projects TIN2017-82972-R and RTI2018-093684-B-I00 from the Ministerio de Ciencia, Innovación y Universidades, project S2018/TCS-4423 of the Comunidad de Madrid, project PR65/19-22445 of the UCM, and project Prometeo/2019/109 of the Generalitat Valenciana.
Cite this article
San Juan, P., Rodríguez-Sánchez, R., Igual, F.D. et al. Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors. J Supercomput 77, 11257–11269 (2021). https://doi.org/10.1007/s11227-021-03636-4