1 VU Amsterdam, Netherlands
2 Netherlands eScience Center, Amsterdam, Netherlands
3 Leiden Institute of Advanced Computer Science, Leiden, Netherlands

Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs

Milo Lurati (1, 2), Stijn Heldens (2, ORCID 0000-0001-8792-6305), Alessio Sclocco (2, ORCID 0000-0003-3278-0518), Ben van Werkhoven (3, 2, ORCID 0000-0002-7508-3272)
Abstract

Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to address this gap by introducing an auto-tuner for AMD’s HIP. We do so by extending Kernel Tuner, an open-source Python library for auto-tuning GPU programs. We analyze the performance impact and tuning difficulty for four highly-tunable benchmark kernels on four different GPUs: two from Nvidia and two from AMD. Our results demonstrate that auto-tuning has a significantly higher impact on performance on AMD compared to Nvidia (10x vs 2x). Additionally, we show that applications tuned for Nvidia do not perform optimally on AMD, underscoring the importance of auto-tuning specifically for AMD to achieve high performance on these GPUs.

Keywords:
Auto-tuning, GPU Programming, HIP, CUDA.

1 Introduction

Graphics Processing Units (GPUs) are widely used in High-Performance Computing (HPC) and artificial intelligence because of their high parallel processing power and ability to accelerate complex workloads [10, 14]. Eight out of nine supercomputers funded by EuroHPC JU use GPUs as the main source of compute power (https://eurohpc-ju.europa.eu/supercomputers/our-supercomputers_en, accessed March 2024). GPUs excel in terms of compute performance and energy efficiency for tasks that involve large data sets and dense computation, making them increasingly vital in various scientific domains [31].

GPU programming models – such as HIP, CUDA, and OpenCL – allow developers to create highly parallel functions, called kernels. However, GPU programmers are confronted with a myriad of implementation choices and optimization techniques related to thread organization, memory usage, and computation strategies to achieve optimal compute performance [11]. The optimal kernel configuration depends on the specific GPU architecture and the task at hand, and finding this configuration is a process known as performance tuning; automating this process is called auto-tuning [4].

While auto-tuning techniques have been extensively studied for Nvidia GPUs [15, 3, 18, 17, 23, 21, 9, 30], their effectiveness on AMD GPUs has received considerably less attention. The studies that do consider AMD GPUs predominantly use OpenCL [25, 26, 29]. In 2016, AMD introduced HIP: an open-source GPU programming model that enables applications to run on both AMD and Nvidia GPUs through a single source code. HIP creates new opportunities for auto-tuning. For example, OpenCL on AMD was restricted to at most 256 threads per block [13, 6, 33, 24], whereas HIP increases this limit to 1024.

After a long period of market dominance by Nvidia, the HPC landscape is rapidly diversifying, with the first generation of exascale supercomputers featuring, for example, Intel [2] and AMD [1] GPUs. Europe’s #1 supercomputer LUMI, which uses AMD’s MI250X GPUs, is part of this trend. It is urgent that we understand how the lessons learned from more than a decade of optimizing and tuning applications, predominantly on Nvidia GPUs, can be migrated to GPUs from different vendors.

To this end, this paper introduces the first auto-tuning tool for HIP kernels and studies the performance impact of tuning HIP kernels on AMD GPUs. Since HIP applications can run on both AMD and Nvidia GPUs, we subsequently compare the impact, tuning difficulty, and performance portability of tuned HIP applications on both AMD and Nvidia GPUs.

The contributions of this work are as follows:

  • We extend Kernel Tuner [29], an open-source Python tool for auto-tuning GPU applications, with support for HIP by integrating PyHIP, an open-source Python library to access the HIP runtime library and compiler [32].

  • We compare performance and portability of four highly-optimized auto-tuned HIP kernels on two AMD and two Nvidia GPUs.

  • We show that GPUs by Nvidia are generally easier to tune than those from AMD, both manually and using optimization algorithms, while the performance impact of tuning the same code on AMD GPUs is much larger compared to Nvidia (10x vs 2x).

  • We show that kernels tuned for AMD generally perform well on Nvidia GPUs, but not the other way around.

These findings demonstrate that it is even more important to use auto-tuning for HIP applications on AMD GPUs, compared to Nvidia, and thus emphasize the need for new tools that enable auto-tuning HIP code for AMD GPUs.

2 Related Work

Auto-tuning is widely used in various contexts such as optimizing numerical libraries, compilers, and application performance [4]. Examples of applications using auto-tuning include FFTW [8] for optimizing Fast Fourier Transforms on CPUs [28] and MAGMA for linear algebra [3]. In this paper, we focus on software-level auto-tuning, and in particular on the automatic tuning of code that targets GPUs.

There are several generic auto-tuners targeting GPU code. CLTune [20] is an auto-tuner for OpenCL. KTT [7] tunes parameters in OpenCL, CUDA, and GLSL applications focusing on pipelines of multiple kernels. ATF [23] focuses on OpenCL and CUDA kernels that have interdependent parameters.

HIP was released by AMD in March 2016 and is increasingly being adopted as a programming model for HPC applications, such as AMBER (https://ambermd.org/GPUSupport.php), NAMD (http://www.ks.uiuc.edu/Research/namd/alpha/2.15_amdhip/), PeleC (https://amrex-combustion.github.io/PeleC/), and AMReX (https://amrex-codes.github.io/amrex/docs_html/GPU.html). However, HIP is, to the best of our knowledge, not supported by any current auto-tuning framework.

In general, most auto-tuning studies have focused primarily on auto-tuning applications for Nvidia GPUs [15, 18, 17, 23, 21, 9, 30, 7, 27]. Many auto-tuning studies have included one or more AMD GPUs using OpenCL [13, 20, 12, 33, 29, 24]. To the best of our knowledge, this paper is the first study to investigate and compare the impact, tuning difficulty, and performance portability on both AMD and Nvidia GPUs for auto-tuned HIP applications.

3 Design and Implementation

Figure 1: Kernel Tuner software architecture.
Figure 2: Fitness Flow Graph of 2D Convolution search space for A4000.

The layered software architecture of Kernel Tuner, extended to accommodate our contributions, is shown in Figure 1. This revised architecture incorporates the HIP functions interface built on top of PyHIP (https://github.com/jatinx/PyHIP).

Users of Kernel Tuner create a small Python script that describes how the GPU code can be tuned. The strategies layer implements a great variety of optimization algorithms, which in turn rely on a runner. The runners interact with the diverse set of supported compilers and hardware through a unified device function interface, which abstracts the device-specific functionalities offered by various backends such as PyCUDA, CuPy, and PyHIP. This allows the higher-level layers (e.g. runners, optimization strategies) to operate independently of the underlying hardware and runtime.

The HIP backend in Kernel Tuner builds on PyHIP, a Python wrapper for HIP. We have made various contributions to PyHIP to increase its coverage of the HIP Runtime API and simplified the installation procedure. To integrate the new HIP backend with the rest of Kernel Tuner several changes were made. Due to the very high similarity between CUDA and HIP kernels, Kernel Tuner is not able to automatically detect the kernel language. To solve this problem, we require the user to manually specify when HIP is used.
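As an illustration of what such a tuning script can look like, the following is a minimal sketch that assumes Kernel Tuner's documented `tune_kernel` function and its `lang` argument. The vector-add kernel, the problem size, and the block-size values are illustrative choices and are not taken from our benchmarks. Running `tune_vector_add()` requires a HIP-capable GPU with Kernel Tuner and PyHIP installed.

```python
# Illustrative HIP kernel (not one of the benchmark kernels). Kernel Tuner
# inserts a "#define block_size_x <value>" for every configuration it compiles.
KERNEL_SOURCE = r"""
__global__ void vector_add(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

# Candidate thread-block sizes, up to the 1024-thread limit supported by HIP.
TUNE_PARAMS = {"block_size_x": [64 * i for i in range(1, 17)]}


def tune_vector_add(size=10_000_000):
    """Tune the kernel above; needs a GPU, so imports are done lazily."""
    import numpy as np
    from kernel_tuner import tune_kernel

    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    n = np.int32(size)

    # CUDA and HIP kernels look alike, so the language is stated explicitly.
    results, env = tune_kernel(
        "vector_add", KERNEL_SOURCE, size, [c, a, b, n], TUNE_PARAMS, lang="HIP"
    )
    return results, env
```

Apart from `lang="HIP"`, the script would be the same as one written for a CUDA kernel, which is what makes it practical to tune the same source on both vendors.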

Kernel Tuner performs empirical measurements of the execution time of each kernel configuration it compiles and benchmarks. As with CUDA, the execution time of HIP kernels is measured by recording events before and after the kernel and calling hipEventElapsedTime to retrieve the execution time.

Finally, Kernel Tuner already supported auto-tuning of partial loop unrolling factors in CUDA kernels, and we have extended this support to HIP kernels. Loop unrolling is a code optimization that aims to improve program performance by reducing loop overhead while increasing instruction-level parallelism [11].

4 Evaluation metrics

We compare auto-tuning GPU codes for either vendor along three main axes: performance impact of auto-tuning, the tuning difficulty, and the performance portability of tuned kernels.

Tuning impact. To quantify the performance impact of auto-tuning we analyze the statistical properties of the performance distribution of the full tuning search space of a kernel. More specifically, we define tuning impact as the factor between the performance of the global optimum and the median performance of configurations in the space. The rationale is that without auto-tuning one can expect to achieve performance that is the most common among configurations, and with auto-tuning the application can achieve optimal performance. In addition, violin plots are used to visualize the performance distributions relative to the optimum across devices, allowing for direct comparison and pattern identification.
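The definition of tuning impact can be sketched in a few lines of Python. This is only an illustration: the function name `tuning_impact` is hypothetical, the list of performances is made up, and higher values are assumed to be better (for example GFLOP/s).

```python
from statistics import median


def tuning_impact(performances):
    """Ratio between the best and the median performance of a search space.

    Assumes higher is better and that all values are positive.
    """
    return max(performances) / median(performances)


# Made-up search space with a bottom-heavy distribution (GFLOP/s).
example = [120.0, 130.0, 137.0, 150.0, 4370.0]
print(f"tuning impact: {tuning_impact(example):.1f}x")
```

In this toy example the optimum is an extreme outlier, so the impact is large even though most configurations perform similarly to each other.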

Tuning difficulty. For some tuning spaces, the global optimum may be a statistical outlier in terms of performance, but that does not necessarily mean that the global optimum is also difficult to find for an optimization algorithm. To assess the respective tuning difficulty on GPUs from the different vendors, we quantify how difficult it is for an optimization algorithm to arrive at a configuration of acceptable performance. For this, we use the proportion of centrality [24].

The proportion of centrality is computed on a fitness flow graph (FFG), which has directed edges between neighbouring points with better fitness values, as shown in Figure 2. The idea is that a random walk on the FFG simulates the path taken by a local search algorithm. We use PageRank [5] centrality to quantify the likelihood of arriving at a local minimum. Given a proportion $p$, consider $f_{opt}$ as the optimal fitness, $L(X)$ as the set of local minima of $X$, and $L_p(X)$ as the collection of local minima with fitness values less than $(1+p)f_{opt}$. The $p$-proportion of centrality is defined, with $c_G$ as the centrality function, as:

$$C_p(G,X)=\frac{\sum_{x\in L_p(X)}c_G(x)}{\sum_{x\in L(X)}c_G(x)} \tag{1}$$
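To make the definition concrete, the following sketch computes the proportion of centrality for a toy one-dimensional search space. It is not the code of Schoonhoven et al. [24]: the PageRank routine is a plain power iteration with an assumed damping factor of 0.85, the fitness values and neighbourhood are made up, and the threshold uses `<=` so that $p=0$ selects the global optimum itself.

```python
def fitness_flow_graph(fitness, neighbours):
    """Edge i -> j whenever neighbour j has strictly lower (better) fitness."""
    return [[j for j in neighbours[i] if fitness[j] < fitness[i]]
            for i in range(len(fitness))]


def pagerank(out_edges, damping=0.85, iterations=200):
    """Power-iteration PageRank; nodes without out-edges spread rank uniformly."""
    n = len(out_edges)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        dangling = sum(rank[i] for i in range(n) if not out_edges[i])
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for i, targets in enumerate(out_edges):
            for j in targets:
                new_rank[j] += damping * rank[i] / len(targets)
        rank = new_rank
    return rank


def proportion_of_centrality(fitness, neighbours, p):
    """Share of local-minima centrality held by minima within (1 + p) * f_opt."""
    out_edges = fitness_flow_graph(fitness, neighbours)
    centrality = pagerank(out_edges)
    # Local minima are the nodes with no better neighbour (no outgoing edge).
    minima = [i for i, targets in enumerate(out_edges) if not targets]
    f_opt = min(fitness)
    acceptable = [i for i in minima if fitness[i] <= (1.0 + p) * f_opt]
    return (sum(centrality[i] for i in acceptable)
            / sum(centrality[i] for i in minima))


# Toy search space: fitness is a runtime (lower is better); each point
# neighbours the points directly to its left and right.
FITNESS = [5.0, 3.0, 4.0, 1.0, 2.0, 6.0, 2.5, 7.0]
NEIGHBOURS = [[j for j in (i - 1, i + 1) if 0 <= j < len(FITNESS)]
              for i in range(len(FITNESS))]

for p in (0.0, 1.5, 2.0):
    print(p, round(proportion_of_centrality(FITNESS, NEIGHBOURS, p), 3))
```

The toy space has three local minima (runtimes 3.0, 1.0, and 2.5), so the value grows as $p$ is relaxed and reaches 1 once every local minimum counts as acceptable.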

Performance portability. Performance portability examines how well a configuration that gives optimal performance on one device or set of devices performs when moving to another device. We use the metric defined by Pennycook et al. [22], denoted as $\reflectbox{P}\text{P}$, which measures the performance portability across a set of devices $H$ for configuration $x$ of kernel $p$ as:

$$\reflectbox{P}\text{P}(x,p,H)=\frac{|H|}{\sum_{i\in H}\frac{1}{e_{i}(x,p)}} \tag{2}$$

$$e_{i}(x,p)=\frac{P_{i}(x,p)}{\underset{x\in X}{\max}~P_{i}(x,p)} \tag{3}$$

Here, $e_{i}(x,p)$ represents the performance efficiency of configuration $x$ for kernel $p$ on device $i$ as the ratio of the achieved performance $P_{i}(x,p)$ to the highest observed performance across all configurations called $X$.
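Equations 2 and 3 amount to a harmonic mean of per-device efficiencies. The sketch below illustrates this on made-up numbers; the function names, device names, and performance values are hypothetical, and all performance values are assumed to be positive.

```python
def efficiency(perf, device, config):
    """Eq. 3: performance of `config` relative to the best on `device`."""
    return perf[device][config] / max(perf[device].values())


def performance_portability(perf, config, devices):
    """Eq. 2: harmonic mean of the efficiencies of `config` over `devices`."""
    return len(devices) / sum(1.0 / efficiency(perf, d, config) for d in devices)


def most_portable(perf, devices):
    """Configuration with the highest portability score over `devices`."""
    configs = perf[devices[0]].keys()
    return max(configs, key=lambda c: performance_portability(perf, c, devices))


# Made-up measurements: performance per device and configuration.
PERF = {
    "gpu_a": {"cfg1": 100.0, "cfg2": 50.0},
    "gpu_b": {"cfg1": 10.0, "cfg2": 40.0},
}

devices = ["gpu_a", "gpu_b"]
for cfg in ("cfg1", "cfg2"):
    print(cfg, round(performance_portability(PERF, cfg, devices), 3))
print("most portable:", most_portable(PERF, devices))
```

Because the harmonic mean is dominated by the smallest efficiency, `cfg1` scores poorly even though it is optimal on `gpu_a`: a configuration is only considered portable when it performs well on every device in $H$.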

5 Experimental setup

In this section, we introduce the benchmark applications and the hardware and software used to compare auto-tuning HIP code on AMD and Nvidia GPUs.

Benchmark kernels. For the evaluation, we use four benchmark kernels taken from the CLBlast library [19] (namely GEMM) and the BAT benchmark suite [27] (namely Convolution, Hotspot, and Dedispersion). The problems implemented by these kernels and an explanation of their tunable parameters can be found in [19] and [27]. The tunable parameter values are listed in Tables 1 and 3. For GEMM, the input matrices are 4096x4096. The full source code of the kernels, input problem dimensions, and analysis tools are provided in the accompanying GitHub repository (https://github.com/MiloLurati/AutoTuning_AMD_vs_Nvidia_GPUs).

Hardware and software description. For the evaluation, we focus on four different GPU models available in the DAS-6 cluster and the LUMI supercomputer. The GPU specifications are listed in Table 2. On DAS-6 we use Rocky-8 Linux 4.18.0, ROCm 6.0.2 with AMD clang 17.0.0, and CUDA 12.2 with GCC 9.4.0. For the MI250X, LUMI is running SUSE Linux 5.14.21 and ROCm 5.2.3 with AMD clang 14.0.0. Note that the MI250X is a multi-chip module with two individually operating GPU dies and we use only a single die. All measurements have been performed with Kernel Tuner 1.0.0b6, into which our modifications have been merged. For the calculation and visualization of the proportion of centrality, we adapted the code from Schoonhoven et al. [24].

Table 1: Tunable parameters for Convolution, Hotspot, and Dedispersion kernels. A dash marks a parameter that is not used by the kernel.

| Parameter | Convolution | Hotspot | Dedispersion |
|---|---|---|---|
| block_size_x | $16k$ for $k$ in $1, 2, \ldots, 16$ | $1, 2, 4, 8, 16$, and $32k$ for $k$ in $1, 2, \ldots, 32$ | 1, 2, 4, 8, 16, 32 |
| block_size_y | 1, 2, 4, 8, 16 | 1, 2, 4, 8, 16, 32 | $8k$ for $k$ in $4, 5, \ldots, 32$ |
| tile_size_x | 1, 2, 3, 4 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | 1, 2, 3, 4 |
| tile_size_y | 1, 2, 3, 4 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | 1, 2, 3, 4, 5, 6, 7, 8 |
| read_only | 0, 1 | - | - |
| use_padding | 0, 1 | - | - |
| use_shmem | 0, 1 | - | - |
| temporal_tiling_factor | - | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | - |
| loop_unroll_factor_t | - | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | - |
| sh_power | - | 0, 1 | - |
| tile_stride_x | - | - | 0, 1 |
| tile_stride_y | - | - | 0, 1 |
Table 2: GPUs used in our experiments. *Only one out of two dies of MI250X is used.

| GPU | Year | Architecture | Cores | Memory | Cache | Bandwidth (GB/s) | Peak SP (GFLOP/s) |
|---|---|---|---|---|---|---|---|
| AMD W6600 | 2021 | RDNA 2 | 1792 | 16 GB GDDR6 | 32 MB L3 | 224 | 10404 |
| AMD MI250X* | 2021 | CDNA 2 | 7040 | 64 GB HBM2e | 8 MB L2 | 1638 | 28160 |
| Nvidia A4000 | 2021 | Ampere | 6144 | 8 GB GDDR6 | 4 MB L2 | 448 | 17800 |
| Nvidia A100 | 2020 | Ampere | 6912 | 40 GB HBM2 | 40 MB L2 | 1555 | 19500 |

6 Evaluation

In this section, we first present the results on the four benchmark applications by analyzing the tuning impact, tuning difficulty, and performance portability. We also present the top five best performing configurations in each auto-tuning search space to discuss how the results obtained by the tuner can be explained by properties of the hardware.

6.1 Convolution

Table 3: GEMM tunable parameters, as explained in [19].

| Parameter | Values |
|---|---|
| MWG | 16, 32, 64, 128 |
| NWG | 16, 32, 64, 128 |
| KWG | 16, 32 |
| MDIMC | 8, 16, 32 |
| NDIMC | 8, 16, 32 |
| MDIMA | 8, 16, 32 |
| NDIMB | 8, 16, 32 |
| VWM | 1, 2, 4, 8 |
| VWN | 1, 2, 4, 8 |
| STRM | 0, 1 |
| STRN | 0, 1 |
| SA | 0, 1 |
| SB | 0, 1 |
Table 4: Statistical properties of the benchmarks. Tuning impact is the maximum over the median.

| Kernel | Statistic | W6600 | MI250X | A4000 | A100 |
|---|---|---|---|---|---|
| Convolution (GFLOP/s), 4,362 configurations | median | 137 | 380 | 2284 | 4117 |
| | maximum | 4370 | 11460 | 7393 | 13637 |
| | impact | 31.9x | 30.1x | 3.2x | 3.3x |
| Hotspot (GFLOP/s), 10,5412 configurations | median | 94 | 334 | 92 | 632 |
| | maximum | 229 | 1781 | 177 | 1776 |
| | impact | 2.5x | 5.3x | 1.9x | 2.8x |
| Dedispersion (GB/s), 11,130 configurations | median | 427 | 667 | 470 | 1085 |
| | maximum | 582 | 1586 | 532 | 1154 |
| | impact | 1.4x | 2.4x | 1.1x | 1.1x |
| GEMM (GFLOP/s), 116,928 configurations | median | 1154 | 7799 | 4802 | 10748 |
| | maximum | 6010 | 19807 | 10502 | 17145 |
| | impact | 5.2x | 2.5x | 2.2x | 1.6x |

Figure 3 presents the performance distributions of the convolution kernel tuning space on all four GPUs. The distributions are rather bottom-heavy, meaning that the optimal configurations are extreme outliers in terms of performance. This is even more pronounced for the two AMD devices. These results make clear that manual performance optimization of the convolution kernel is, if not impossible, at least very unlikely to result in optimal performance.

Table 4 lists the median and maximum performance of each kernel on each device, and shows how important tuning is for this kernel: tuning provides a ${\sim}30$x performance improvement on the AMD GPUs and a ${\sim}3$x improvement on the Nvidia ones. This is a whole order of magnitude difference between the two vendors, meaning that the impact of auto-tuning is especially high for our AMD devices.

Figure 4 shows the proportion of centrality of the convolution kernel, for all platforms, at different levels of acceptable optima $p$, ranging from 0% (the global optimum) to 15%. Here we see that, while manual tuning was more difficult for the AMD GPUs, the results for this experiment are different. Instead of a vendor split, we see that finding the global optimum of the A100 is more difficult than finding the optimum of the other devices, and that by relaxing the constraints on the optimum the A4000 becomes easier to tune than the rest.

Figure 3: 2D Convolution tuning search space.
Figure 4: 2D Convolution proportion of centrality.
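Table 5: Top configurations for convolution. Parameters match Table 1. Performance in TFLOP/s.

| GPU | block_size_x | block_size_y | tile_size_x | tile_size_y | read_only | use_padding | use_shmem | Perf. |
|---|---|---|---|---|---|---|---|---|
| W6600 | 128 | 1 | 1 | 4 | 1 | 0 | 0 | 4.37 |
| W6600 | 32 | 1 | 1 | 4 | 1 | 0 | 0 | 4.35 |
| W6600 | 64 | 1 | 1 | 4 | 1 | 0 | 0 | 4.33 |
| W6600 | 256 | 1 | 1 | 4 | 1 | 0 | 0 | 4.32 |
| W6600 | 16 | 16 | 4 | 2 | 1 | 1 | 1 | 3.65 |
| MI250X | 64 | 1 | 2 | 4 | 1 | 0 | 0 | 11.46 |
| MI250X | 128 | 1 | 2 | 4 | 1 | 0 | 0 | 11.46 |
| MI250X | 256 | 1 | 2 | 4 | 1 | 0 | 0 | 11.29 |
| MI250X | 128 | 1 | 1 | 4 | 1 | 0 | 0 | 11.28 |
| MI250X | 64 | 1 | 1 | 4 | 1 | 0 | 0 | 11.23 |
| A4000 | 256 | 1 | 2 | 4 | 0 | 0 | 0 | 7.39 |
| A4000 | 32 | 1 | 2 | 4 | 0 | 0 | 0 | 7.36 |
| A4000 | 128 | 1 | 2 | 4 | 0 | 0 | 0 | 7.31 |
| A4000 | 256 | 1 | 1 | 4 | 0 | 0 | 0 | 7.30 |
| A4000 | 32 | 1 | 4 | 4 | 0 | 0 | 0 | 7.30 |
| A100 | 32 | 4 | 1 | 3 | 1 | 0 | 1 | 13.64 |
| A100 | 128 | 2 | 1 | 3 | 1 | 0 | 1 | 12.69 |
| A100 | 128 | 1 | 1 | 3 | 1 | 0 | 1 | 12.27 |
| A100 | 48 | 2 | 1 | 4 | 1 | 0 | 1 | 12.09 |
| A100 | 48 | 2 | 1 | 3 | 1 | 0 | 1 | 12.08 |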

Table 5 shows the top 5 configurations for each device. A first observation is that these configurations are different for each device. However, we can observe certain patterns. All GPUs prefer small thread blocks, with at most 256 threads, but while the two AMD devices and the A4000 prefer one-dimensional block configurations, the A100 prefers two-dimensional ones. So, even if the total number of threads is similar, the distribution of threads in the two-dimensional block is not. Another similarity between the GPUs is that all configurations use some form of tiling in the $y$ dimension, to compensate for the lack of thread-level parallelism within thread blocks. In contrast, tiling in the $x$ dimension is mainly used by the MI250X and the A4000, and not by the other two devices. Two more facts are worth highlighting. First, the A100 is the only GPU that consistently prefers using shared memory, although without padding to avoid bank conflicts. Second, padding is used by only one configuration among all top 5 lists, namely the W6600 configuration with a 16x16 thread block.

6.2 Hotspot

Next, we study the Hotspot kernel. Figure 5 shows a clear separation between consumer and server grade GPUs, with the consumer GPUs having more configurations that lead to reasonably good performance, and the server grade GPUs showing that only a few configurations achieve high performance. As shown in Table 4, the impact of auto-tuning the hotspot kernel varies from 1.9x on the A4000 to 5.3x on the MI250X.

The consumer grade GPUs are also easier to tune for optimization algorithms, as shown in Figure 6, although in this case the tuning difficulty of the two AMD devices is not that different from each other once we relax the amount of acceptable configurations.

Figure 5: Hotspot tuning search space.
Figure 6: Hotspot proportion of centrality.
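Table 6: Top configurations for Hotspot. Parameters match Table 1. Performance in GFLOP/s.

| GPU | block_size_x | block_size_y | tile_size_x | tile_size_y | temporal_tiling_factor | loop_unroll_factor_t | sh_power | Perf. |
|---|---|---|---|---|---|---|---|---|
| W6600 | 8 | 32 | 4 | 2 | 3 | 3 | 1 | 229.31 |
| W6600 | 16 | 16 | 4 | 2 | 3 | 3 | 1 | 227.13 |
| W6600 | 8 | 32 | 8 | 1 | 3 | 3 | 1 | 226.72 |
| W6600 | 8 | 32 | 7 | 1 | 4 | 4 | 1 | 226.70 |
| W6600 | 16 | 32 | 4 | 1 | 3 | 3 | 1 | 226.49 |
| MI250X | 16 | 32 | 2 | 1 | 5 | 5 | 1 | 1781.16 |
| MI250X | 16 | 32 | 2 | 1 | 4 | 4 | 1 | 1738.68 |
| MI250X | 32 | 32 | 4 | 1 | 4 | 4 | 1 | 1723.78 |
| MI250X | 32 | 16 | 2 | 1 | 4 | 4 | 1 | 1690.77 |
| MI250X | 16 | 32 | 6 | 1 | 8 | 8 | 1 | 1685.52 |
| A4000 | 64 | 1 | 8 | 2 | 1 | 1 | 0 | 177.93 |
| A4000 | 64 | 2 | 8 | 2 | 1 | 1 | 0 | 177.59 |
| A4000 | 32 | 2 | 8 | 3 | 1 | 1 | 0 | 177.40 |
| A4000 | 64 | 2 | 8 | 3 | 1 | 1 | 0 | 177.31 |
| A4000 | 32 | 2 | 4 | 8 | 1 | 1 | 0 | 177.29 |
| A100 | 8 | 32 | 4 | 1 | 4 | 4 | 1 | 1776.14 |
| A100 | 4 | 32 | 4 | 1 | 9 | 1 | 1 | 1770.40 |
| A100 | 4 | 32 | 5 | 2 | 7 | 1 | 1 | 1763.47 |
| A100 | 8 | 32 | 4 | 1 | 4 | 2 | 1 | 1747.24 |
| A100 | 8 | 32 | 6 | 1 | 4 | 4 | 1 | 1741.92 |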

Table 6 shows the top 5 configurations on all four devices. Again, we see that no configuration appears twice, underlining the need to tune for each device individually. The A4000 stands out: it has the lowest performance of all four GPUs and is the only GPU that does not store the power input data in shared memory. The A4000 also does not use temporal tiling, and instead uses relatively small block sizes combined with spatial tiling, all to reduce register usage and improve thread-level parallelism at the cost of data reuse in shared memory.

The other GPUs all use some degree of temporal tiling, which performs the work of multiple kernel calls in a single kernel call, trading increased SM-level resource usage and even redundant work for reduced DRAM traffic. The AMD GPUs prefer to fully unroll the temporal tiling loop, whereas this preference is less pronounced on the A100. The MI250X uses large thread blocks of up to 1024 threads, much larger than the A100, showing that while the distributions, and even the performance, of the two devices are similar, the optimal configurations are not.

6.3 Dedispersion

We now turn to the Dedispersion kernel. In Figure 7, we can see a clear distinction between the distribution of the MI250X, where the optimum is clearly an outlier in terms of performance, and those of the other GPUs. In particular, looking at the median values, the A100 and A4000 achieve 94% and 88% of the optimum, respectively, making these devices not difficult to tune manually.

In terms of absolute performance, shown in Table 4, we can see that the MI250X achieves the highest overall performance, and over 96% of its peak bandwidth, and while it is more difficult to tune than the others, the impact is also higher. The proportion of centrality (Figure 8) shows that the MI250X remains difficult even if we include more configurations in the acceptable range. In contrast, the A100 achieves only 74% of its peak, but the majority of configurations come close to the optimal performance on A100.

Figure 7: Dedispersion tuning search space.
Figure 8: Dedispersion proportion of centrality.
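Table 7: Top configurations for Dedispersion. Parameters match Table 1. Performance in GB/s.

| GPU | block_size_x | block_size_y | tile_size_x | tile_size_y | tile_stride_x | tile_stride_y | Perf. |
|---|---|---|---|---|---|---|---|
| W6600 | 32 | 32 | 1 | 1 | 0 | 0 | 582.19 |
| W6600 | 2 | 96 | 1 | 1 | 0 | 0 | 575.18 |
| W6600 | 16 | 64 | 1 | 1 | 0 | 0 | 573.26 |
| W6600 | 2 | 128 | 1 | 8 | 0 | 1 | 568.87 |
| W6600 | 4 | 112 | 1 | 1 | 0 | 0 | 568.40 |
| MI250X | 8 | 32 | 1 | 1 | 0 | 0 | 1586.43 |
| MI250X | 8 | 64 | 1 | 1 | 0 | 0 | 1584.01 |
| MI250X | 4 | 64 | 1 | 1 | 0 | 0 | 1579.71 |
| MI250X | 16 | 32 | 1 | 1 | 0 | 0 | 1576.24 |
| MI250X | 4 | 32 | 1 | 1 | 0 | 0 | 1576.03 |
| A4000 | 8 | 96 | 1 | 6 | 0 | 1 | 532.46 |
| A4000 | 8 | 96 | 1 | 4 | 0 | 1 | 532.32 |
| A4000 | 16 | 48 | 1 | 5 | 0 | 1 | 532.25 |
| A4000 | 8 | 64 | 1 | 5 | 0 | 1 | 532.00 |
| A4000 | 8 | 64 | 1 | 7 | 0 | 1 | 531.99 |
| A100 | 4 | 64 | 1 | 3 | 0 | 1 | 1154.54 |
| A100 | 8 | 96 | 1 | 7 | 0 | 1 | 1153.83 |
| A100 | 8 | 96 | 3 | 7 | 1 | 1 | 1153.06 |
| A100 | 16 | 48 | 1 | 7 | 0 | 1 | 1151.78 |
| A100 | 4 | 64 | 1 | 4 | 0 | 1 | 1151.38 |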

Table 7 shows the top configurations on each device for the Dedispersion kernel. One thing that stands out is that all GPUs have a strong preference for large thread blocks, something that we could not have found for the AMD GPUs using OpenCL instead of HIP. More importantly, all GPUs prefer to do more work in the y-dimension, either per block or per thread, which is the one dimension where data reuse can be exploited. In particular, the W6600 benefits from its large L3 cache (32 MB), achieving up to 582 GB/s, which is more than double its theoretical peak memory bandwidth.

6.4 GEMM (General Matrix Multiplication)

Finally, we study the GEMM kernel. In Figure 9 we notice that the shapes of the violin plots for the W6600 and the MI250X are quite similar, although the median performance of the W6600 is barely 20% of the optimum. The outlier for GEMM is the A100, for which the distribution is more top-heavy, with half of the configurations within 60% of the optimum. However, Table 4 shows that the speedup over the median is still 1.6x even for the A100. The GEMM kernel on the A100 achieves 88% of the theoretical peak performance of the GPU.

Looking at the proportion of centrality in Figure 10 we see that, while the optimal configurations are outliers on all GPUs, including more configurations in the acceptable range makes tuning easier for all devices. The Nvidia GPUs do become easier to tune, compared to AMD, even after a modest increase of the optimality criterion.

Table 8 shows again that no single configuration appears in the top 5 for more than one GPU. At the same time, there is a lot of similarity between the top configurations on all four GPUs. For example, all GPUs prefer to store both matrix A and B in shared memory and use 16 as the loop blocking value for the K loop (KWG, 3rd column in Table 8). The thread block dimensions (MDIMC and NDIMC, 4th and 5th columns) show that AMD GPUs overall prefer larger thread blocks than the A4000 and the A100. The two server grade GPUs strongly prefer to assign 8 by 8 blocks of work to each thread ($\frac{\text{MWG}}{\text{MDIMC}}$ in the x dimension and $\frac{\text{NWG}}{\text{NDIMC}}$ in the y dimension), while the top configurations for the A4000 use 16 in either x or y, and those for the W6600 use 4 in either x or y. We see here the effects of the small cache size of the A4000, which prefers to rely on data reuse in registers, compared with the large L3 cache of the W6600, which instead prefers to rely more heavily on the hardware-managed cache.

Figure 9: GEMM tuning search space.
Figure 10: GEMM proportion of centrality.
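Table 8: Top configurations for GEMM. Parameters match Table 3. Performance in GFLOP/s.

| GPU | MWG | NWG | KWG | MDIMC | NDIMC | MDIMA | NDIMB | VWM | VWN | STRM | STRN | SA | SB | Perf. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W6600 | 128 | 128 | 32 | 32 | 16 | 32 | 16 | 1 | 2 | 1 | 1 | 1 | 1 | 6010.47 |
| W6600 | 128 | 128 | 32 | 32 | 16 | 16 | 16 | 2 | 2 | 1 | 1 | 1 | 1 | 6010.17 |
| W6600 | 128 | 128 | 32 | 32 | 16 | 32 | 32 | 2 | 2 | 1 | 1 | 1 | 1 | 5992.99 |
| W6600 | 128 | 128 | 32 | 32 | 16 | 16 | 32 | 2 | 2 | 1 | 1 | 1 | 1 | 5985.31 |
| W6600 | 128 | 128 | 32 | 16 | 32 | 32 | 16 | 2 | 2 | 1 | 1 | 1 | 1 | 5982.45 |
| MI250X | 128 | 128 | 16 | 16 | 16 | 32 | 32 | 4 | 2 | 1 | 1 | 1 | 1 | 19806.86 |
| MI250X | 128 | 128 | 16 | 16 | 16 | 32 | 32 | 4 | 4 | 1 | 1 | 1 | 1 | 19718.46 |
| MI250X | 128 | 128 | 16 | 16 | 16 | 32 | 32 | 4 | 4 | 1 | 0 | 1 | 1 | 19686.09 |
| MI250X | 128 | 128 | 16 | 16 | 16 | 16 | 16 | 2 | 2 | 1 | 1 | 1 | 1 | 19651.96 |
| MI250X | 128 | 128 | 16 | 16 | 16 | 16 | 16 | 4 | 2 | 1 | 1 | 1 | 1 | 19569.83 |
| A4000 | 128 | 128 | 16 | 16 | 8 | 8 | 8 | 4 | 4 | 1 | 1 | 1 | 1 | 10502.17 |
| A4000 | 128 | 128 | 16 | 8 | 16 | 16 | 16 | 4 | 4 | 1 | 1 | 1 | 1 | 10489.60 |
| A4000 | 128 | 128 | 16 | 16 | 8 | 16 | 16 | 4 | 4 | 1 | 1 | 1 | 1 | 10479.14 |
| A4000 | 128 | 128 | 16 | 16 | 8 | 16 | 8 | 4 | 4 | 1 | 1 | 1 | 1 | 10469.28 |
| A4000 | 128 | 128 | 16 | 8 | 16 | 16 | 8 | 4 | 4 | 1 | 1 | 1 | 1 | 10418.63 |
| A100 | 128 | 64 | 16 | 16 | 8 | 32 | 8 | 2 | 4 | 1 | 1 | 1 | 1 | 17145.04 |
| A100 | 64 | 128 | 16 | 8 | 16 | 16 | 16 | 4 | 4 | 1 | 1 | 1 | 1 | 17138.06 |
| A100 | 128 | 64 | 16 | 16 | 8 | 8 | 8 | 4 | 4 | 1 | 1 | 1 | 1 | 17135.74 |
| A100 | 128 | 64 | 16 | 16 | 8 | 16 | 8 | 4 | 4 | 1 | 1 | 1 | 1 | 17123.11 |
| A100 | 64 | 128 | 16 | 8 | 16 | 16 | 8 | 4 | 4 | 1 | 1 | 1 | 1 | 17116.28 |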

6.5 Performance Portability

Next, we consider the performance portability of our benchmarks. Given that the performance portability score $\reflectbox{P}\text{P}$ is computed over a specific set of devices $H$, we can consider different aspects of performance portability by using different subsets of devices for $H$. For instance, by identifying the configuration with the optimal $\reflectbox{P}\text{P}$ score for $H=\{\text{W6600},\text{MI250X}\}$ we can determine the most portable configuration across the two AMD devices. In this work, we consider the following seven options for $H$:

  • Each of the four GPUs individually.

  • The two AMD devices together: W6600 and MI250X.

  • The two Nvidia devices together: A4000 and A100.

  • All four devices together.

For each combination of subset $H$ and kernel, we calculated the performance portability $\reflectbox{P}\text{P}$ across all configurations and selected the one with the highest score. Figure 11 shows the results for each of the four kernels. From these results, we can make the following observations.

For the dedispersion and GEMM kernels, we observe that a highly portable configuration exists that achieves an application efficiency of at least 85% across all devices (bottom row). However, for the convolution and the hotspot kernel, we do not find a configuration that qualifies as performance-portable, as each configuration results in a performance loss of at least 15% on one or more devices.

Another observation is that, in general, configurations performing well on Nvidia tend not to translate to good performance on AMD. This is especially evident when looking at GEMM and convolution, where configurations exist that obtain more than 80% of the performance on both Nvidia devices (sixth row), but achieve abysmal performance of less than 10% on AMD. Similar patterns can be observed for the other two kernels, albeit with less pronounced differences. Figure 13 shows the average results, revealing that the configuration most portable across Nvidia gives 93% of the performance on Nvidia and only 41% on AMD. These findings underscore the necessity of re-tuning applications previously optimized for Nvidia GPUs when porting to AMD.

However, the converse is not true, and configurations that perform well on AMD typically also perform well on Nvidia. For example, for GEMM, the configuration that exhibits the highest portability across AMD (fifth row) also delivers 97% of the performance on the A4000 and 96% on the A100. On average, the most portable configurations for AMD across the four kernels achieve 97% of the optimal performance on AMD and 81% on Nvidia.

Another observation is that the convolution kernel presents an especially difficult target to tune for, since configurations that perform well on each GPU individually (top four rows) perform poorly on the other devices. In particular, the optimal configuration for the A100 delivers poor performance on AMD.

Figure 11: Performance portability results. Each row considers a different subset $H$ and shows the results for the configuration $x$ with a maximum $\reflectbox{P}\text{P}(x,p,H)$ score as defined by Eq. 2. Values shown are the application efficiencies $e_{i}(x,p)$ of $x$ as defined in Eq. 3 for the different devices.

7 Discussion

In this section, we look at the results of all experiments presented in Section 6 and provide some highlights on tuning impact, difficulty, and performance portability for all applications and GPUs.

We defined the tuning impact as the performance improvement of the optimum over the median of the tuning space. There are clear differences between the impact of auto-tuning on AMD and on Nvidia GPUs: the average performance improvement over all applications is 10x for AMD, while for Nvidia it is only 2x. Our results show that auto-tuning is crucial to achieving high performance for all applications and GPUs in our experiments, but the performance impact is much larger for AMD GPUs than for Nvidia GPUs.
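As an illustration of this definition, the following minimal sketch computes the tuning impact from a list of measured kernel run times. It assumes that performance is expressed as run time (lower is better), so the improvement of the optimum over the median is the ratio of the median time to the minimum time. The function name and the timings are hypothetical and chosen only to reproduce ratios of the same magnitude as those reported above.

```python
from statistics import median

def tuning_impact(times):
    """Speedup of the best configuration over the median configuration.
    `times` holds one kernel run time per configuration in the tuning
    space (lower is better)."""
    return median(times) / min(times)

# Hypothetical run times in milliseconds for one kernel on two devices.
amd_times = [1.0, 8.0, 10.0, 12.0, 30.0]
nvidia_times = [5.0, 9.0, 10.0, 11.0, 14.0]

amd_impact = tuning_impact(amd_times)
nvidia_impact = tuning_impact(nvidia_times)
print(f"AMD: {amd_impact:.1f}x, Nvidia: {nvidia_impact:.1f}x")
```

Using the median as the baseline, rather than the mean, keeps the measure insensitive to a few extremely slow configurations in the tuning space.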

Auto-tuning is not only more important in terms of achieved performance on AMD compared to Nvidia, it is also more difficult. We observe that for all applications the optimum is more of an outlier for AMD than it is for Nvidia. This does not mean that tuning these applications on the A4000 or A100 is particularly easy, but rather that tuning for the W6600 or the MI250X is, on average, more difficult.

In Figure 12, we see the proportion of centrality averaged over all applications, showing that while the global optimum is difficult to find for both vendors, if we relax the constraint on optimality the Nvidia GPUs become easier to tune than the AMD GPUs. We can conclude that, for our benchmarks, tuning HIP kernels is overall more difficult on AMD than on Nvidia.

By using the performance portability metric, we assessed how well a kernel tuned for one specific GPU performs on the other devices. A final observation from Figure 13 is that configurations that perform well on the A4000 often fall short on AMD devices. On average, the configuration that achieves optimal performance on the A4000 attains only ${\sim}22\%$ of the optimal performance on the MI250X and ${\sim}36\%$ on the W6600.

Figure 12: Average proportion of centrality.
Figure 13: Data from Fig. 11 averaged over kernels.

8 Conclusions

In this paper, we compared the auto-tuning effectiveness between AMD and Nvidia GPUs. We integrated support for HIP into Kernel Tuner, now available in the production-ready version 1.0 of the tool, enabling us to auto-tune GPU kernels on both AMD and Nvidia devices. We have compared the impact, tuning difficulty, and performance portability on AMD and Nvidia using four different kernels: 2D convolution, hotspot, dedispersion, and GEMM.

For all four kernels, we see larger differences between the global optimum and the average performance within the search spaces on AMD, compared to Nvidia. This shows that auto-tuning is crucial for achieving high performance on AMD, while manual or no optimization may still yield relatively good performance on Nvidia hardware. Overall, the impact on performance of tuning the same HIP code on AMD GPUs is much larger (10x vs 2x) compared to Nvidia GPUs.

Our evaluation also shows that it is easier for an optimization algorithm to find near-optimal implementations on Nvidia, compared to AMD. Generally, AMD-tuned kernels perform well on Nvidia, but the reverse is not consistently true. Thus, while HIP enables code portability, it does not guarantee performance portability. Given that many current GPU applications are written in CUDA and optimized for Nvidia, re-tuning is crucial when migrating to HIP for AMD execution. Fortunately, the extensions to Kernel Tuner presented in this paper make it possible to tune GPU kernels using HIP on AMD.

This study opens up several avenues for future research. Future work could include a wider array of computational kernels and a broader range of devices from both vendors to fully assess the generalizability of the findings. Also, the disparity in performance portability between Nvidia and AMD GPUs when using HIP suggests a need for deeper investigation into the underlying reasons for these differences. This could involve analyzing the architectural differences between the GPUs of both vendors and how they interact with the HIP programming model. Finally, our extensions to Kernel Tuner bring us one step closer to investigating the effectiveness of auto-tuning for optimizing the energy efficiency of applications on AMD GPUs.

Acknowledgment and Artifact Availability

The CORTEX project has received funding from the Dutch Research Council (NWO) in the framework of the NWA-ORC Call (file number NWA.1160.18.316). Funded by the European Union. The ESiWACE3 project has received funding from the European High Performance Computing Joint Undertaking (JU) under grant agreement No 101093054. The code is available in the repository [16].

References

  • [1] Frontier: OLCF’s Exascale Future (2018), https://www.olcf.ornl.gov/2018/02/13/frontier-olcfs-exascale-future/
  • [2] U.S. Department of Energy and Intel to deliver first exascale supercomputer, Argonne National Laboratory (2019), https://www.anl.gov/article/us-department-of-energy-and-intel-to-deliver-first-exascale-supercomputer
  • [3] Agullo, E., Demmel, J., et al.: Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In: Journal of Physics: Conference Series. IOP Publishing (2009)
  • [4] Balaprakash, P., Dongarra, J., et al.: Autotuning in high-performance computing applications. Proc. IEEE (2018)
  • [5] Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems (1998)
  • [6] Dolbeau, R., Bodin, F., et al.: One OpenCL to rule them all? In: 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS). IEEE (2013)
  • [7] Filipovič, J., Petrovič, F., et al.: Autotuning of OpenCL kernels with global optimizations. In: Autotuning and Adaptivity Approaches for Energy Efficient HPC Systems (2017)
  • [8] Frigo, M., Johnson, S.G.: FFTW: An adaptive software architecture for the FFT. In: International Conference on Acoustics, Speech and Signal Processing (1998)
  • [9] Grauer-Gray, S., Xu, L., et al.: Auto-tuning a high-level language targeted to GPU codes. In: Innovative Parallel Computing. IEEE (2012)
  • [10] Heldens, S., Hijma, P., et al.: The landscape of exascale research: A data-driven literature analysis. ACM Comput. Surv. (2020)
  • [11] Hijma, P., Heldens, S., et al.: Optimization techniques for GPU programming. ACM Comput. Surv. (2023)
  • [12] Hou, K., Feng, W., et al.: Auto-tuning strategies for parallelizing sparse matrix-vector (SPMV) multiplication on multi-and many-core processors. In: International Parallel and Distributed Processing Symposium Workshops. IEEE (2017)
  • [13] Komatsu, K., Sato, K., et al.: Evaluating performance and portability of OpenCL programs. In: 5th International Workshop on Automatic Performance Tuning (2010)
  • [14] LeCun, Y., Bengio, Y., et al.: Deep learning. Nature (2015)
  • [15] Li, Y., Dongarra, J., et al.: A note on auto-tuning GEMM for GPUs. In: Computational Science–ICCS. Springer (2009)
  • [16] Lurati, M., Heldens, S., Sclocco, A., van Werkhoven, B.: Artifact of the paper: Bringing auto-tuning to HIP: Analysis of tuning impact and difficulty on AMD and Nvidia GPUs (Jun 2024). https://doi.org/10.5281/zenodo.11617999
  • [17] Magni, A., Grewe, D., et al.: Input-aware auto-tuning for directive-based GPU programming. In: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (2013)
  • [18] Nath, R., Tomov, S., et al.: An improved MAGMA GEMM for Fermi graphics processing units. Int J High Perform Comput Appl (2010)
  • [19] Nugteren, C.: CLBlast: A tuned OpenCL BLAS library. In: International Workshop on OpenCL (2018)
  • [20] Nugteren, C., Codreanu, V.: CLTune: A generic auto-tuner for OpenCL kernels. In: 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (2015)
  • [21] Nukada, A., Matsuoka, S.: Auto-tuning 3-D FFT library for CUDA GPUs. In: Conference on High Performance Computing Networking, Storage and Analysis (2009)
  • [22] Pennycook, S.J., Sewall, J.D., et al.: A metric for performance portability (2016)
  • [23] Rasch, A., Schulze, R., et al.: Efficient auto-tuning of parallel programs with interdependent tuning parameters via auto-tuning framework (ATF). ACM Trans. Archit. Code. Optim. (TACO) (2021)
  • [24] Schoonhoven, R., van Werkhoven, B., et al.: Benchmarking optimization algorithms for auto-tuning GPU kernels. IEEE Trans. Evol. Comput. (2022)
  • [25] Sclocco, A., Bal, H.E., et al.: Auto-tuning dedispersion for many-core accelerators. In: IEEE 28th International Parallel and Distributed Processing Symposium (2014)
  • [26] Sclocco, A., Heldens, S., et al.: AMBER: a real-time pipeline for the detection of single pulse astronomical transients. SoftwareX (2020)
  • [27] Tørring, J.O., van Werkhoven, B., et al.: Towards a benchmarking suite for kernel tuners. In: International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE (2023)
  • [28] Vuduc, R., Demmel, J.W.: Code generators for automatic tuning of numerical kernels: Experiences with FFTW position paper. In: Semantics, Applications, and Implementation of Program Generation. Springer (2000)
  • [29] van Werkhoven, B.: Kernel Tuner: A search-optimizing GPU code auto-tuner. Future Gener. Comput. Syst. (2019)
  • [30] van Werkhoven, B., Maassen, J., et al.: Optimizing convolution operations on GPUs using adaptive tiling. Future Gener. Comput. Syst. (2014)
  • [31] van Werkhoven, B., Palenstijn, W.J., Sclocco, A.: Lessons learned in a decade of research software engineering GPU applications. In: ICCS (2020)
  • [32] Xavier, J.: Python interface to HIP and hiprtc library (2022)
  • [33] Yu, C.L., Tsao, S.L.: Efficient and portable workgroup size tuning. Trans. Parallel Distrib. Syst. (2019)