Open access

Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures

Published: 03 June 2024

Abstract

Loop tiling and loop interchange (or permutation) are transformations that can expose task- and data-level parallelism and exploit the data locality available in multi-dimensional loop nests. Choosing an appropriate tile size and loop order is critical to achieving significant performance improvement. However, the effect of these transformations on the performance of a loop nest is not straightforward, owing to the complex interplay of several architectural features in multi-/many-core architectures. In this work, we apply supervised learning and develop a Support Vector Machine (SVM) based hierarchical classifier to identify the best-performing tile size and loop order for a given loop nest. Our approach identifies tile sizes and loop orders whose performance, on average, is within 18% and 9% of the optimal performance for two sets of loop nests on the Intel Xeon Cascade Lake architecture. Further, our method outperforms the state-of-the-art techniques Pluto and Polly, with geometric mean speedups of 1.35x to 1.58x.
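To make the two transformations concrete, the following is a minimal C sketch (not taken from the paper) of a matrix-multiply loop nest before and after loop interchange and tiling. The tile size T = 16 and the i-k-j order are illustrative choices only; selecting such values automatically is precisely what the paper's classifier does.

```c
#define N 64
#define T 16 /* illustrative tile size, not a tuned value */

/* Baseline i-j-k matrix multiply. C must be zero-initialized by the caller. */
static void matmul_naive(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Same computation after loop interchange (i-k-j) and tiling with T x T tiles.
 * The interchange makes the innermost accesses to B and C unit-stride
 * (A[i][k] is invariant in the inner loop), and tiling keeps T x T blocks
 * of each array resident in cache across the outer iterations. */
static void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++)
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Both versions perform the same additions per element in the same order, so the results are identical; only the traversal of the iteration space, and hence the cache behavior, differs.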

References

[1]
Felix Agakov, Edwin Bonilla, John Cavazos, Björn Franke, Grigori Fursin, Michael FP O’Boyle, John Thomson, Marc Toussaint, and Christopher KI Williams. 2006. Using machine learning to focus iterative optimization. In International Symposium on Code Generation and Optimization (CGO’06). IEEE, 295–305.
[2]
Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014. Opentuner: An extensible framework for program autotuning. In Proceedings of the 23rd international conference on Parallel architectures and compilation. 303–316.
[3]
Mohamed Arafa, Bahaa Fahim, Sailesh Kottapalli, Akhilesh Kumar, Lily P Looi, Sreenivas Mandava, Andy Rudoff, Ian M Steiner, Bob Valentine, Geetha Vedaraman, et al. 2019. Cascade Lake: Next generation Intel Xeon Scalable processor. IEEE Micro 39, 2 (2019), 29–36.
[4]
Amir H Ashouri, Andrea Bignoli, Gianluca Palermo, Cristina Silvano, Sameer Kulkarni, and John Cavazos. 2017. Micomp: Mitigating the compiler phase-ordering problem using optimization sub-sequences and machine learning. ACM Transactions on Architecture and Code Optimization (TACO) 14, 3 (2017), 1–28.
[5]
Shilpa Babalad, Shirish K Shevade, Matthew Jacob Thazhuthaveetil, and R Govindarajan. 2023. A Machine Learning Approach to Identify the Best-Performing Loop Order. https://github.com/knightlander2023/OptLoopOrder, Technical Report, Department of Computer Science and Automation, Indian Institute of Science, Bengaluru.
[6]
David F Bacon, Susan L Graham, and Oliver J Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys (CSUR) 26, 4 (1994), 345–420.
[7]
David Bailey, Tim Harris, William Saphir, Rob Van Der Wijngaart, Alex Woo, and Maurice Yarrow. 1995. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center.
[8]
David H Bailey, Eric Barszcz, John T Barton, David S Browning, Robert L Carter, Leonardo Dagum, Rod A Fatoohi, Paul O Frederickson, Thomas A Lasinski, Rob S Schreiber, et al. 1991. The NAS parallel benchmarks summary and preliminary results. In Supercomputing’91: Proceedings of the 1991 ACM/IEEE conference on Supercomputing. IEEE, 158–165.
[9]
Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques. 72–81.
[10]
Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, Jagannathan Ramanujam, Atanas Rountev, and Ponnuswamy Sadayappan. 2008. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In International Conference on Compiler Construction. Springer, 132–146.
[11]
Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. 101–113.
[12]
Marius Cornea. 2015. Intel AVX-512 instructions and their use in the implementation of math functions. Intel Corporation (2015), 1–20.
[13]
Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning 20, 3 (1995), 273–297.
[14]
Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. Synthesizing benchmarks for predictive modeling. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 86–99.
[15]
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimaraes, and Fernando Magno Quinão Pereira. 2021. AnghaBench: A suite with one million compilable C benchmarks for code-size reduction. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 378–390.
[16]
Pradipta De, Ravi Kothari, and Vijay Mann. 2007. Identifying sources of operating system jitter through fine-grained kernel instrumentation. In 2007 IEEE International Conference on Cluster Computing. IEEE, 331–340.
[17]
Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. 2006. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. International Journal of Parallel Programming 34, 3 (2006), 261–317.
[18]
Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. 2011. Polly-Polyhedral optimization in LLVM. In Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), Vol. 2011. 1.
[19]
Ameer Haj-Ali, Nesreen K Ahmed, Ted Willke, Yakun Sophia Shao, Krste Asanovic, and Ion Stoica. 2020. Neurovectorizer: End-to-end vectorization with deep reinforcement learning. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization. 242–255.
[20]
Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and Ponnuswamy Sadayappan. 2013. When polyhedral transformations meet SIMD code generation. In Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation. 127–138.
[21]
David Meyer, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, Friedrich Leisch, Chih-Chung Chang, and Chih-Jen Lin. 2019. Package ‘e1071’. The R Journal (2019).
[22]
Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, and Uday Bondhugula. 2021. A practical tile size selection model for affine loop nests. In Proceedings of the ACM International Conference on Supercomputing. 27–39.
[23]
Louis-Noël Pouchet. 2012. PolyBench: The polyhedral benchmark suite. http://www.cs.ucla.edu/pouchet/software/polybench.
[24]
Louis-Noël Pouchet and Scott Grauer-Gray. 2011. PolyBench: The Polyhedral Benchmark Suite, Version 3.2. http://www-roc.inria.fr/~pouchet/software/polybench.
[25]
Louis-Noël Pouchet, Cédric Bastoul, and Uday Bondhugula. 2019. PoCC: the Polyhedral Compiler Collection. http://web.cs.ucla.edu/~pouchet/software/pocc/.
[26]
Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, Jagannathan Ramanujam, Ponnuswamy Sadayappan, and Nicolas Vasilache. 2011. Loop transformations: convexity, pruning and optimization. ACM SIGPLAN Notices 46, 1 (2011), 549–562.
[27]
Kishore Kumar Pusukuri, Rajiv Gupta, and Laxmi N Bhuyan. 2012. Thread tranquilizer: Dynamically reducing performance variation. ACM Transactions on Architecture and Code Optimization (TACO) 8, 4 (2012), 1–21.
[28]
Peter J Rousseeuw and Mia Hubert. 2011. Robust statistics for outlier detection. Wiley interdisciplinary reviews: Data mining and knowledge discovery 1, 1 (2011), 73–79.
[29]
Savvas Sioutas, Sander Stuijk, Henk Corporaal, Twan Basten, and Lou Somers. 2018. Loop transformations leveraging hardware prefetching. In Proceedings of the 2018 International Symposium on Code Generation and Optimization. 254–264.
[30]
Avinash Sodani. 2015. Knights Landing (KNL): 2nd generation Intel® Xeon Phi processor. In 2015 IEEE Hot Chips 27 Symposium (HCS). IEEE, 1–24.
[31]
Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. 2016. Knights Landing: Second-generation Intel Xeon Phi product. IEEE Micro 36, 2 (2016), 34–46.
[32]
Mark Stephenson and Saman Amarasinghe. 2005. Predicting unroll factors using supervised classification. In International symposium on code generation and optimization. IEEE, 123–134.
[33]
Kevin Stock, Louis-Noël Pouchet, and P Sadayappan. 2012. Using machine learning to improve automatic vectorization. ACM Transactions on Architecture and Code Optimization (TACO) 8, 4 (2012), 1–23.
[34]
Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks, and Ira Rosen. 2009. Polyhedral-model guided loop-nest auto-vectorization. In 2009 18th International Conference on Parallel Architectures and Compilation Techniques. IEEE, 327–337.
[35]
Sven Verdoolaege. 2010. isl: An integer set library for the polyhedral model. In International Congress on Mathematical Software. Springer, 299–302.
[36]
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, Jose Ignacio Gomez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO) 9, 4 (2013), 1–23.
[37]
Rui Xu, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Yuhong Song, and Han Wang. 2023. Loop interchange and tiling for multi-dimensional loops to minimize write operations on NVMs. Journal of Systems Architecture 135 (2023), 102799.

    Published In

    ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing
    May 2024
    582 pages
    ISBN:9798400706103
    DOI:10.1145/3650200
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Hierarchical Classifier
    2. Loop transformations
    3. Supervised learning
    4. Support Vector Machine
    5. Vectorization and Parallelization

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Acceptance Rates

    ICS '24 Paper Acceptance Rate 45 of 125 submissions, 36%;
    Overall Acceptance Rate 629 of 2,180 submissions, 29%

