LUMI training - Omniperf by Example - part 1 - Oslo, Norway - June 2024

Environment for LUMI

module load CrayEnv
module load buildtools/23.09

module load PrgEnv-cray/8.4.0
module load cce/16.0.1
module load craype-accel-amd-gfx90a
module load craype-x86-trento

module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules

module load rocm/5.4.3 omniperf/2.0.1-rocm-5.4.x

You can set up the following environment variables for the project you want to use:

export SALLOC_ACCOUNT=project_<your project ID>
export SBATCH_ACCOUNT=project_<your project ID>

The allocation can be set up with:

salloc -N 1 --gpus=8 -p standard-g --exclusive -t 20:00 --reservation <reservation name>

Omniperf Advanced Exercises

These exercises are meant to provide extra insight into the tuning of kernels. The exercise files are included in:

git clone
cd HPCTrainingExamples/OmniperfExamples

Just navigate the different folders, each corresponding to a section of these exercises.

Exercise 1: Launch Parameter Tuning

Simple kernel implementing a version of yAx, to demonstrate effects of Launch Parameters on kernel
execution time.

Background: Acronyms and terms used in this exercise

Initial Roofline Analysis:

The roofline model is a way to gauge kernel performance in terms of maximum achievable bandwidth and floating-point operations.
It can be used to determine how efficiently a kernel makes use of the available hardware. It is a key tool in initially determining
which kernels are performing well, and which kernels should be able to perform better. Below are roofline plots for the yAx kernel in problem.cpp:

(figure: table of roofline plots with columns Roofline Type, Roofline Legend, Roofline Plot)

These plots were generated by running:

srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe

The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200 directory, if generated on MI200 hardware.

We see that the kernel's performance is well below the achievable bandwidth of the hardware, which makes it a good candidate for optimization.
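As a rough illustration of the roofline bound described in this section, the attainable performance at a given arithmetic intensity is the smaller of the compute peak and the bandwidth ceiling multiplied by that intensity. The sketch below is in Python and uses made-up ceilings (`PEAK_GFLOPS` and `PEAK_GBPS` are hypothetical placeholders, not MI200 specifications); the function name is likewise only illustrative.

```python
def attainable_gflops(ai_flops_per_byte, peak_gflops, peak_gbps):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, peak_gbps * ai_flops_per_byte)

# Hypothetical ceilings, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 20000.0
PEAK_GBPS = 1000.0

for ai in (0.25, 1.0, 20.0, 100.0):
    bound = attainable_gflops(ai, PEAK_GFLOPS, PEAK_GBPS)
    regime = "bandwidth-bound" if bound < PEAK_GFLOPS else "compute-bound"
    print(f"AI={ai:6.2f} FLOP/byte -> at most {bound:8.1f} GFLOP/s ({regime})")
```

A kernel whose measured point sits far below this bound at its arithmetic intensity is not using the memory system or the compute units efficiently, which is the situation the plots show for problem.cpp.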

Exercise instructions:

From the roofline we were able to see that there is room for improvement in this kernel. One of the
first things to check is whether or not we have reasonable launch parameters for this kernel.

To get started, build and run the problem code:

srun ./problem.exe

(simulated output)

yAx time: 771.765559 milliseconds

The problem code should run very slowly because of sub-optimal launch parameters.
Let's confirm this hypothesis by looking at the omniperf profile. Start by running:

srun omniperf profile -n problem --no-roof -- ./problem.exe

This command requires omniperf to run your code a few times to collect all the necessary hardware counters.

After the profiling data is collected, we can view the profile by using this command:

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 7.1.0 7.1.1 7.1.2

This allows us to view nicely formatted profiling data directly in the command line.
The command given here has a few arguments that are noteworthy:

The output of the omniperf analyze command should look something like this:


0. Top Stat
|    | KernelName                               |   Count |      Sum(ns) |     Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 755935180.00 | 755935180.00 | 755935180.00 | 100.00 |
|    |  double*)                                |         |              |              |              |        |

7. Wavefront
7.1 Wavefront Launch Stats
| Index   | Metric           |    Avg |    Min |    Max | Unit       |
| 7.1.0   | Grid Size        | 256.00 | 256.00 | 256.00 | Work items |
| 7.1.1   | Workgroup Size   |  64.00 |  64.00 |  64.00 | Work items |
| 7.1.2   | Total Wavefronts |   4.00 |   4.00 |   4.00 | Wavefronts |

Looking through this data we see:

We can definitely get better performance by adjusting the launch parameters of our kernel.
Either try out some new values for the launch parameters, or run the provided solution to see its performance:

cd solution
srun ./solution.exe

(simulated output)

yAx time: 70 ms

We get much better performance with the new launch parameters. Note that in general it can be difficult to find the
optimal launch parameters for a given kernel due to the many factors that impact performance, so determining
launch parameters experimentally is usually necessary.

We should also confirm that omniperf reports our updated launch parameters. To do so, run:

srun omniperf profile -n solution --no-roof -- ./solution.exe

This command is the same as before, except the workload name has changed to solution.
Once the profile command has completed, run:

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 7.1.0 7.1.1 7.1.2

Again, this command largely uses the same arguments as before, except for the workload name.
The output should look something like this:


0. Top Stat
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 70115217.00 | 70115217.00 |  70115217.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |

7. Wavefront
7.1 Wavefront Launch Stats
| Index   | Metric           |       Avg |       Min |       Max | Unit       |
| 7.1.0   | Grid Size        | 131072.00 | 131072.00 | 131072.00 | Work items |
| 7.1.1   | Workgroup Size   |     64.00 |     64.00 |     64.00 | Work items |
| 7.1.2   | Total Wavefronts |   2048.00 |   2048.00 |   2048.00 | Wavefronts |
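The three launch stats in these tables are related by simple arithmetic. The Python sketch below assumes a one-dimensional launch and 64 work-items per wavefront (the wavefront size on MI200 hardware); the helper name `launch_stats` is hypothetical and only serves this illustration.

```python
WAVEFRONT_SIZE = 64  # assumption: work-items per wavefront on MI200 (CDNA2) GPUs

def launch_stats(grid_size, workgroup_size, wavefront_size=WAVEFRONT_SIZE):
    """Return (workgroups, total wavefronts) for a 1D launch.

    grid_size and workgroup_size are both counted in work-items,
    matching the units reported in the Wavefront Launch Stats table.
    """
    workgroups = -(-grid_size // workgroup_size)            # ceiling division
    waves_per_group = -(-workgroup_size // wavefront_size)  # ceiling division
    return workgroups, workgroups * waves_per_group

print(launch_stats(256, 64))     # problem:  (4, 4)
print(launch_stats(131072, 64))  # solution: (2048, 2048)
```

With only 4 wavefronts in flight, the problem launch cannot come close to filling the device, while the solution launch provides 2048 wavefronts to schedule.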

Looking through this data we see:

Omniperf Command Line Comparison Feature

On releases newer than Omniperf 1.0.10, the comparison feature of omniperf can be used to quickly compare two profiles.
To use this feature, use the command:

omniperf analyze -p workloads/problem/MI200 -p solution/workloads/solution/MI200 --dispatch 1 --block 7.1.0 7.1.1 7.1.2

This feature sets the first -p argument as the baseline, and the second as the comparison workload.
In this case, problem is set as the baseline and is compared to solution.
The output should look like:


0. Top Stat
|    | KernelName                               |   Count | Count      |      Sum(ns) | Sum(ns)              |     Mean(ns) | Mean(ns)             |   Median(ns) | Median(ns)           |    Pct | Pct          |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 1.0 (0.0%) | 754934306.50 | 69702016.5 (-90.77%) | 754934306.50 | 69702016.5 (-90.77%) | 754934306.50 | 69702016.5 (-90.77%) | 100.00 | 100.0 (0.0%) |
|    |  double*)                                |         |            |              |                      |              |                      |              |                      |        |              |

7. Wavefront
7.1 Wavefront Launch Stats
| Index   | Metric           |    Avg | Avg                 |    Min | Min                 |    Max | Max                 | Unit       |
| 7.1.0   | Grid Size        | 256.00 | 131072.0 (51100.0%) | 256.00 | 131072.0 (51100.0%) | 256.00 | 131072.0 (51100.0%) | Work items |
| 7.1.1   | Workgroup Size   |  64.00 | 64.0 (0.0%)         |  64.00 | 64.0 (0.0%)         |  64.00 | 64.0 (0.0%)         | Work items |
| 7.1.2   | Total Wavefronts |   4.00 | 2048.0 (51100.0%)   |   4.00 | 2048.0 (51100.0%)   |   4.00 | 2048.0 (51100.0%)   | Wavefronts |

Note that the comparison workload shows the percentage difference from the baseline.
This feature can be used to quickly compare filtered stats to make sure code changes fix known issues.
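The percentages in the comparison columns are consistent with a plain relative change against the baseline. The Python sketch below reproduces two of the values from the table; it is an inference from the numbers shown, not a description of how omniperf computes them internally, and `pct_change` is a hypothetical helper name.

```python
def pct_change(baseline, comparison):
    """Relative change of comparison with respect to baseline, in percent."""
    return (comparison - baseline) / baseline * 100.0

# Grid Size: 256 work-items in problem, 131072 in solution.
print(f"{pct_change(256.0, 131072.0):.1f}%")          # 51100.0%

# Kernel time: Sum(ns) for problem versus solution.
print(f"{pct_change(754934306.50, 69702016.5):.2f}%")  # -90.77%
```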

More Kernel Filtering

For this exercise, it is appropriate to filter the omniperf analyze command with the --dispatch 1 argument.
This --dispatch 1 argument filters the data shown to only include the kernel invocation with dispatch ID 1, or the second kernel run during profiling.

However, there is another way to filter kernels that may be more applicable in real use-cases.
Typically real codes launch many kernels, and only a few of them take most of the overall kernel runtime.
To see a ranking of the top kernels that take up most of the kernel runtime in your code, you can run:

omniperf analyze -p workloads/problem/MI200 --list-stats

This command will output something like:


Detected Kernels
|    | KernelName                                        |
|  0 | yax(double*, double*, double*, int, int, double*) |

Using Omniperf versions greater than 1.0.10, --list-stats will list all kernels launched by your code, in order of runtime (largest runtime first).
The number displayed beside the kernel in the output can be used to filter omniperf analyze commands.
Note that this will display aggregated stats for kernels of the same name, meaning that the invocations could differ in terms of launch parameters, and vary widely in terms of work completed.
This filtering is accomplished with the -k argument:

omniperf analyze -p workloads/problem/MI200 -k 0 --block 7.1.0 7.1.1 7.1.2

Which should show something like:


0. Top Stat
|    | KernelName                               |   Count |       Sum(ns) |     Mean(ns) |   Median(ns) |    Pct | S   |
|  0 | yax(double*, double*, double*, int, int, |    2.00 | 1508098228.00 | 754049114.00 | 754049114.00 | 100.00 | *   |
|    |  double*)                                |         |               |              |              |        |     |

7. Wavefront
7.1 Wavefront Launch Stats
| Index   | Metric           |    Avg |    Min |    Max | Unit       |
| 7.1.0   | Grid Size        | 256.00 | 256.00 | 256.00 | Work items |
| 7.1.1   | Workgroup Size   |  64.00 |  64.00 |  64.00 | Work items |
| 7.1.2   | Total Wavefronts |   4.00 |   4.00 |   4.00 | Wavefronts |

Note that the 'Count' field in Top Stat is 2 here, where filtering by dispatch ID displays a count of 1, indicating that filtering with -k returns aggregated stats for two kernel invocations in this case.
Also note that the "Top Stat" table will still show all the top kernels, but the rightmost column titled "S" will have an asterisk beside the kernel for which data is being displayed.
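To make the difference between per-dispatch and per-kernel filtering concrete, here is a conceptual Python sketch of grouping dispatches by kernel name and ranking them by total runtime. The records, kernel names and durations are invented for illustration, and the code is not omniperf's implementation; it only mirrors the Count, Sum and Mean columns of the Top Stat table.

```python
from collections import defaultdict

# Hypothetical per-dispatch records: (dispatch_id, kernel_name, duration_ns).
dispatches = [
    (0, "yax", 750),
    (1, "yax", 760),
    (2, "init", 10),
]

def aggregate_by_kernel(records):
    """Group dispatches by kernel name and rank by total runtime (largest first)."""
    totals = defaultdict(lambda: {"count": 0, "sum_ns": 0})
    for _dispatch_id, name, duration_ns in records:
        totals[name]["count"] += 1
        totals[name]["sum_ns"] += duration_ns
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["sum_ns"], reverse=True)
    return [
        {"index": i, "kernel": name, "count": t["count"],
         "sum_ns": t["sum_ns"], "mean_ns": t["sum_ns"] / t["count"]}
        for i, (name, t) in enumerate(ranked)
    ]

for row in aggregate_by_kernel(dispatches):
    print(row)
```

In this toy data, selecting kernel index 0 covers both yax dispatches (count 2), whereas selecting a single dispatch ID would cover only one of them.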

Solution Roofline

We've demonstrated better performance than problem.cpp in solution.cpp, but could we potentially do better?
To answer that we again turn to the roofline model:

(figure: table of roofline plots with columns Roofline Type, Roofline Legend, Roofline Plot)

These plots were generated with:

srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe

The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.

We see that the solution is solidly in the bandwidth-bound regime, but even so there seems to be room for improvement. Further performance improvements will be a topic for later exercises.

Roofline Comparison

(figure: roofline comparison table with columns Roofline Type, Problem Roofline, Solution Roofline)

We see that the solution has drastically increased performance over the problem code, as shown by the solution points moving up closer to the line plotted by the bandwidth limit.

Note: on statically generated roofline images, it is possible for the L1, L2, or HBM points to overlap and hide one another.

Summary and Take-aways

Launch parameters should be the first check in optimizing performance, due to the fact that they are
usually easy to change, but can have a large performance impact if they aren't tuned to your workload.
It is difficult to predict the optimal launch parameters for any given kernel, so some experimentation
may be required to achieve the best performance.

Exercise 2: LDS Occupancy Limiter

Simple kernel implementing a version of yAx, to demonstrate the downside of allocating a large
amount of LDS, and the benefit of using a smaller amount of LDS due to occupancy limits.

Background: Acronyms and terms used in this exercise

Initial Roofline Analysis

In this exercise we're using a problem code that is slightly different than where we left off in Exercise 1.
Regardless, to get started we need to get a roofline by running:

srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe

The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200 directory, if generated on MI200 hardware.

For convenience, the resulting plots on a representative system are below:

(figure: table of roofline plots with columns Roofline Type, Roofline Legend, Roofline Plot)

We see that there looks to be room for improvement here. We'll use omniperf to see what the current limiters are.

Exercise Instructions:

First, we should get an idea of the code's runtime:

srun ./problem.exe

(simulated output)

yAx time: 140 ms

This problem.cpp uses LDS allocations to move the x vector closer to the compute resources, a common optimization.
However, we see that it ends up slower than the previous solution that didn't use LDS at all.
In kernels that request a lot of LDS, it is common to see that the LDS usage limits the occupancy of the kernel.
That is, more wavefronts cannot be resident on the device, because all of them need more LDS than is available.
To confirm this hypothesis, let's start by running:

srun omniperf profile -n problem --no-roof -- ./problem.exe

The usage of omniperf profile arguments can be found in the Omniperf documentation, or by running omniperf profile --help.

This omniperf profile command will take a minute or two to run, as omniperf must run your code a few times to collect all the hardware counters.

Note: For large scientific codes, it can be useful to profile a small representative workload if possible, as profiling a full run may take prohibitively long.

Once the profiling run completes, let's take a look at the occupancy stats related to LDS allocations:

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 2.1.15 6.2.7

The metrics we're looking at are:

The SPI section (6.2) generally shows what resources limit occupancy, while Wavefront occupancy (2.1.15) shows how severely occupancy is limited in general.
The SPI 'insufficient' fields are typically either zero or very large numbers (on the order of 1 million), with large numbers indicating some resource preventing wavefronts from scheduling.
If more than one field is nonzero, the relative magnitude of the nonzero fields corresponds to how severely each resource is limiting occupancy, but if only one field is nonzero it is difficult to say how severely that field is limiting occupancy.

The output of the omniperf analyze command should look similar to this:


0. Top Stat
|    | KernelName                               |   Count |      Sum(ns) |     Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 175427205.00 | 175427205.00 | 175427205.00 | 100.00 |
|    |  double*)                                |         |              |              |              |        |

2. System Speed-of-Light
2.1 Speed-of-Light
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
| 2.1.15  | Wave Occupancy |  102.70 | Wavefronts | 3328.00 |  3.09 |

6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
| Metric_ID   | Metric              |   Avg |   Min |   Max | Unit   |
| 6.2.7       | Insufficient CU LDS | 69.53 | 69.53 | 69.53 | Pct    |
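The PoP column in the Speed-of-Light table reads as "percent of peak". The short Python sketch below checks that reading against the Wave Occupancy row above; the function name is hypothetical.

```python
def percent_of_peak(value, peak):
    """Express a measured value as a percentage of its peak."""
    return value / peak * 100.0

# Wave Occupancy for the problem code: 102.70 wavefronts out of a peak of 3328.
print(f"{percent_of_peak(102.70, 3328.00):.2f}")  # 3.09
```

On average only about 3% of the wavefront slots on the device are occupied, which is what we would expect if some resource is preventing wavefronts from being scheduled.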

Looking through this data we see:

There are two solution directories, which correspond to two ways that this occupancy limit can be addressed.
First, we have solution-no-lds, which completely removes the LDS usage. Let's build and run this solution:

cd solution-no-lds
srun ./solution.exe

(simulated output)

yAx time: 70 ms

We see that the runtime is much better for this solution than for the problem. Let's see if removing LDS did indeed increase occupancy:

srun omniperf profile -n solution --no-roof -- ./solution.exe

(output omitted)

Once the profile command completes, run:

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 2.1.15 6.2.7

The output should look something like:


0. Top Stat
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 70512671.00 | 70512671.00 |  70512671.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |

2. System Speed-of-Light
2.1 Speed-of-Light
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
| 2.1.15  | Wave Occupancy |  445.33 | Wavefronts | 3328.00 | 13.38 |

6. Shader Processor Input (SPI)
6.2 SPI Resource Allocation
| Index   | Metric              |   Avg |   Min |   Max | Unit   |
| 6.2.7   | Insufficient CU LDS |  0.00 |  0.00 |  0.00 | Pct    |

Looking through this data we see:

Can we get some runtime advantage from using smaller LDS allocations?

This is the solution implemented in the solution directory:

cd ../solution
srun ./solution.exe

(simulated output)

yAx time: 50 ms

This solution, rather than removing the LDS allocation, simply reduces the amount of LDS requested to address the occupancy limit.
This gives us the benefit of having some data pulled closer than it was in solution-no-lds, which is validated by the speedup we see.
But is this solution still occupancy limited by LDS?

srun omniperf profile -n solution --no-roof -- ./solution.exe

(output omitted)

Once the profile command completes, run:

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 2.1.15 6.2.7

The output should look something like:


0. Top Stat
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 50366185.00 | 50366185.00 |  50366185.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |

2. System Speed-of-Light
2.1 Speed-of-Light
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
| 2.1.15  | Wave Occupancy |  487.32 | Wavefronts | 3328.00 | 14.64 |

6. Shader Processor Input (SPI)
6.2 SPI Resource Allocation
| Index   | Metric              |   Avg |   Min |   Max | Unit   |
| 6.2.7   | Insufficient CU LDS |  0.00 |  0.00 |  0.00 | Pct    |

Looking at this data we see:

Pulling some data from global device memory to LDS can be an effective optimization strategy, if occupancy limits are carefully avoided.

Solution Roofline

Let's take a look at the roofline for solution, which can be generated with:

srun omniperf profile -n solution_roof_only --roof-only -- ./solution.exe

The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.

The plots are shown here:

(figure: table of roofline plots with columns Roofline Type, Roofline Legend, Roofline Plot)

We see that there is still room to move the solution roofline up towards the bandwidth limit.

Roofline Comparison

(figure: roofline comparison table with columns Roofline Type, Problem Roofline, Solution Roofline)

Again, we see that the solution's optimizations have resulted in the kernel moving up in the roofline, meaning the solution executes more efficiently than the problem.

Summary and Take-aways

Using LDS can be very helpful in reducing global memory reads where you have repeated use of the same data.
However, large LDS allocations can also negatively impact performance by limiting the number of
wavefronts that can be resident on the device at any given time. Be wary of LDS usage, and check
the SPI stats to ensure your LDS usage is not negatively impacting occupancy.
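A back-of-the-envelope model shows why a large LDS request caps occupancy. The Python sketch below assumes 64 KiB of LDS per compute unit (an assumption about MI200-class hardware; check the documentation for your GPU) and uses hypothetical request sizes rather than the actual allocations in problem.cpp or solution.cpp. It considers LDS only and ignores every other occupancy limiter.

```python
LDS_PER_CU_BYTES = 64 * 1024  # assumption: 64 KiB of LDS per compute unit

def max_workgroups_by_lds(lds_bytes_per_workgroup, lds_per_cu=LDS_PER_CU_BYTES):
    """Upper bound on workgroups resident on one CU, considering LDS only.

    Returns None when the kernel requests no LDS, since LDS then
    imposes no limit at all.
    """
    if lds_bytes_per_workgroup == 0:
        return None
    return lds_per_cu // lds_bytes_per_workgroup

# Hypothetical per-workgroup LDS requests, in bytes.
for request in (0, 8 * 1024, 32 * 1024, 64 * 1024):
    print(f"{request:6d} bytes/workgroup -> limit {max_workgroups_by_lds(request)}")
```

Under these assumptions, a workgroup that requests the whole 64 KiB leaves room for only one workgroup per compute unit, while an 8 KiB request allows up to eight before LDS becomes the limiter.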

Exercise 3: Register Occupancy Limiter

More complex yAx implementation to demonstrate a register-limited kernel using an innocuous-looking
function call.

Background: Acronyms and terms used in this exercise

Initial Roofline Analysis

This kernel is slightly different from the one we used in previous exercises. Let's see how well it measures up in the roofline:

(figure: table of roofline plots with columns Roofline Type, Roofline Legend, Roofline Plot)

You can generate these plots by running:

srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe

The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200 directory, if generated on MI200 hardware.

We see that the kernel is still a considerable amount below the maximum achievable bandwidth, so there should still be room for improvement.

Exercise Instructions:

Let's get an idea of the runtime of this code:

srun ./problem.exe

(simulated output)

yAx time: 79 ms

We see that this kernel seems to be on par with some of our other exercises, but let's see what omniperf shows us:

srun omniperf profile -n problem --no-roof -- ./problem.exe

(lots of output from this command)

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 2.1.15 6.2.5 7.1.5 7.1.6 7.1.7

0. Top Stat
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 76983902.00 | 76983902.00 |  76983902.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |

2. System Speed-of-Light
2.1 Speed-of-Light
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
| 2.1.15  | Wave Occupancy |  438.00 | Wavefronts | 3328.00 | 13.16 |

6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
| Metric_ID   | Metric                  |   Avg |   Min |   Max | Unit   |
| 6.2.5       | Insufficient SIMD VGPRs |  0.04 |  0.04 |  0.04 | Pct    |

7. Wavefront
7.1 Wavefront Launch Stats
| Index   | Metric   |    Avg |    Min |    Max | Unit      |
| 7.1.5   | VGPRs    |  92.00 |  92.00 |  92.00 | Registers |
| 7.1.6   | AGPRs    | 132.00 | 132.00 | 132.00 | Registers |
| 7.1.7   | SGPRs    |  64.00 |  64.00 |  64.00 | Registers |

Looking at this data, we see:

This is due to a call to assert that checks if our result is zeroed out on device.
To check this hypothesis, let's look at the solution code:

cd solution
srun ./solution.exe

(simulated output)

yAx time: 69 ms

Our runtime gets better from removing the assert, but we should also check that omniperf reports that our limiters are gone:

srun omniperf profile -n solution --no-roof -- ./solution.exe

(omitted output)

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 2.1.15 6.2.5 7.1.5 7.1.6 7.1.7

The output of this command should look something like:


0. Top Stat
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 69815871.00 | 69815871.00 |  69815871.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |

2. System Speed-of-Light
2.1 Speed-of-Light
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
| 2.1.15  | Wave Occupancy |  444.10 | Wavefronts | 3328.00 | 13.34 |

6. Shader Processor Input (SPI)
6.2 SPI Resource Allocation
| Index   | Metric                  |   Avg |   Min |   Max | Unit   |
| 6.2.5   | Insufficient SIMD VGPRs |  0.00 |  0.00 |  0.00 | Pct    |

7. Wavefront
7.1 Wavefront Launch Stats
| Index   | Metric   |    Avg |    Min |    Max | Unit      |
| 7.1.5   | VGPRs    |  32.00 |  32.00 |  32.00 | Registers |
| 7.1.6   | AGPRs    |   0.00 |   0.00 |   0.00 | Registers |
| 7.1.7   | SGPRs    | 112.00 | 112.00 | 112.00 | Registers |

Looking at this data, we see:

More generally, you can use this command to look at all SPI "insufficient resource" stats in the same screen, to determine if any resource is currently limiting occupancy:

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 6.2

Which will show output similar to this (note that fields 6.2.4 to 6.2.8 show which resources currently limit occupancy):


0. Top Stat
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 76983902.00 | 76983902.00 |  76983902.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |

6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
| Metric_ID   | Metric                                 |   Avg |   Min |   Max | Unit   |
| 6.2.0       | Not-scheduled Rate (Workgroup Manager) |  0.02 |  0.02 |  0.02 | Pct    |
| 6.2.1       | Not-scheduled Rate (Scheduler-Pipe)    |  0.03 |  0.03 |  0.03 | Pct    |
| 6.2.2       | Scheduler-Pipe Stall Rate              |  0.01 |  0.01 |  0.01 | Pct    |
| 6.2.3       | Scratch Stall Rate                     |  0.00 |  0.00 |  0.00 | Pct    |
| 6.2.4       | Insufficient SIMD Waveslots            |  0.00 |  0.00 |  0.00 | Pct    |
| 6.2.5       | Insufficient SIMD VGPRs                |  0.04 |  0.04 |  0.04 | Pct    |
| 6.2.6       | Insufficient SIMD SGPRs                |  0.00 |  0.00 |  0.00 | Pct    |
| 6.2.7       | Insufficient CU LDS                    |  0.00 |  0.00 |  0.00 | Pct    |
| 6.2.8       | Insufficient CU Barriers               |  0.00 |  0.00 |  0.00 | Pct    |
| 6.2.9       | Reached CU Workgroup Limit             |  0.00 |  0.00 |  0.00 | Pct    |
| 6.2.10      | Reached CU Wavefront Limit             |  0.00 |  0.00 |  0.00 | Pct    |

Solution Roofline

Let's see how the solution stacks up in the roofline:

(figure: table of roofline plots with columns Roofline Type, Roofline Legend, Roofline Plot)

You can generate these plots with:

srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe

The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.

We see there is still room for improvement in the solution, as this kernel is not getting the maximum achievable bandwidth.

Roofline Comparison

(figure: roofline comparison table with columns Roofline Type, Problem Roofline, Solution Roofline)

The most notable change between these rooflines is that the L1/L2 arithmetic intensity spread is more pronounced in the problem, which shows that the call to assert was causing more data to be moved to the L1, while not adding floating-point operations.

Note: Arithmetic Intensity is computed as (total floating point operations)/(total data movement)
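The formula in the note can be illustrated with a few lines of Python. The numbers below are hypothetical and are not measurements from this exercise; they only show how extra data movement at one cache level lowers the arithmetic intensity at that level when the floating-point work is unchanged.

```python
def arithmetic_intensity(total_flops, total_bytes_moved):
    """Arithmetic intensity in FLOP per byte."""
    return total_flops / total_bytes_moved

# Hypothetical kernel performing 2e9 floating-point operations.
flops = 2.0e9

# Same work, but twice as much traffic through one cache level.
print(arithmetic_intensity(flops, 8.0e9))   # 0.25 FLOP/byte
print(arithmetic_intensity(flops, 16.0e9))  # 0.125 FLOP/byte
```

On a roofline plot, the second case would sit further to the left for that cache level, which is the kind of shift that widens the spread between the L1 and L2 points.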

Summary and Take-aways

Function calls inside kernels can have surprisingly adverse performance side-effects. Calling assert, printf and even excessive use of math functions (e.g. pow, sin, cos) can limit performance in difficult-to-predict ways. If you see unexpected resource usage, try eliminating these sorts of function calls.
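For a feel of how register usage translates into an occupancy ceiling, the Python sketch below applies a simple model to the VGPR and AGPR counts reported for the problem and solution kernels. Every hardware parameter in it is an assumption for illustration (512 vector registers per SIMD shared between VGPRs and AGPRs, allocation in granules of 8, at most 8 waves per SIMD); consult the architecture documentation for your GPU before relying on these values. The result is a theoretical upper bound from vector registers alone, so the measured occupancy can be lower for other reasons.

```python
def max_waves_per_simd(arch_vgprs, acc_vgprs, vgprs_per_simd=512,
                       granule=8, max_waves=8):
    """Theoretical waves per SIMD, limited by vector registers only.

    All default parameters are assumptions, not values reported by omniperf.
    """
    total = arch_vgprs + acc_vgprs
    if total == 0:
        return max_waves
    allocated = -(-total // granule) * granule  # round up to a whole granule
    return min(max_waves, vgprs_per_simd // allocated)

print(max_waves_per_simd(92, 132))  # problem kernel:  2
print(max_waves_per_simd(32, 0))    # solution kernel: 8
```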

Exercise 4: Strided Data Access Patterns (and how to find them)

This exercise uses a simple implementation of a yAx kernel to show how difficult strided data access patterns can be to spot in code,
and demonstrates how to use omniperf to begin to diagnose them.

Background: Acronyms and terms used in this exercise

Background: What is a "Strided Data Access Pattern"?

Strided data patterns happen when each thread in a wavefront has to access data locations which have a lot of space between them. For instance, in the algorithm we've been using, each thread works on a row, and those rows are contiguous in device memory. This scenario is depicted below:

Here the memory addresses accessed by threads at each step of the computation have a lot of space between them, which is suboptimal for memory systems, especially on GPUs. To fix this, we have to re-structure the matrix A so that the columns of the matrix are contiguous, which will result in the rows striding, as seen below:

This new data layout has each block of threads accessing a contiguous chunk of device memory, and will use the memory system of the device much more efficiently. Importantly, the only thing that changed is the physical layout of the memory, so the result of this computation will be the same as the result of the previous data layout.
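The two layouts can be compared by looking at the flat-array index each thread touches at a given step. In the Python sketch below, thread t works on row t of A, and the matrix dimensions and helper names are hypothetical choices made for illustration rather than values from the exercise code.

```python
def row_major_index(row, col, n_cols):
    """Flat index when each row of A is contiguous in memory."""
    return row * n_cols + col

def col_major_index(row, col, n_rows):
    """Flat index when each column of A is contiguous in memory."""
    return col * n_rows + row

N_ROWS, N_COLS = 8, 1024  # hypothetical matrix dimensions
threads = range(4)        # thread t works on row t
step = 0                  # every thread reads column `step` of its row

# Rows contiguous: neighbouring threads are N_COLS elements apart.
print([row_major_index(t, step, N_COLS) for t in threads])  # [0, 1024, 2048, 3072]

# Columns contiguous: neighbouring threads read adjacent elements.
print([col_major_index(t, step, N_ROWS) for t in threads])  # [0, 1, 2, 3]
```

Only the index expression differs between the two cases, which is why the change is easy to make in this exercise yet hard to spot when reading the code.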

Initial Roofline Analysis

To start, we want to check the roofline of problem.exe, to make sure we are able to improve it.
These plots can be generated with:

srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe

The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200 directory, if generated on MI200 hardware.

They are also provided below for easy reference:

(figure: table of roofline plots with columns Roofline Type, Roofline Legend, Roofline Plot)

We have plenty of room to improve this kernel, so the next step is profiling.

Exercise Instructions:

To start, let's build and run the problem executable:

srun ./problem.exe

(simulated output)

yAx time: 70 ms

From our other experiments, this time seems reasonable. Let's look closer at the memory system usage with omniperf:

srun omniperf profile -n problem --no-roof -- ./problem.exe

(omitted output)

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 16.1 17.1

Previous examples have used specific fields inside metrics, but we can also request a whole group of metrics with just two numbers (e.g. 16.1 rather than 16.1.1).

These requested metrics are:

The speed-of-light stats are a broader overview of how the memory systems are used throughout execution of your kernel.
As such, they're great statistics for seeing if the memory system is generally being used efficiently or not.
Output from the analyze command should look like this:


0. Top Stat
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 69768072.00 | 69768072.00 |  69768072.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |

16. Vector L1 Data Cache
16.1 Speed-of-Light
| Metric_ID   | Metric      |   Avg | Unit        |
| 16.1.0      | Hit rate    |  0.00 | Pct of peak |
| 16.1.1      | Bandwidth   |  8.26 | Pct of peak |
| 16.1.2      | Utilization | 83.57 | Pct of peak |
| 16.1.3      | Coalescing  | 25.00 | Pct of peak |

17. L2 Cache
17.1 Speed-of-Light
| Metric_ID   | Metric                        |    Avg | Unit   |
| 17.1.0      | Utilization                   |  97.67 | Pct    |
| 17.1.1      | Bandwidth                     |  28.41 | Pct    |
| 17.1.2      | Hit Rate                      |  93.45 | Pct    |
| 17.1.3      | L2-Fabric Read BW             | 126.27 | Gb/s   |
| 17.1.4      | L2-Fabric Write and Atomic BW |   0.00 | Gb/s   |

Looking at this data, we see:

Since our implementation of yAx simply uses 1 for all values in y, A, and x, we do not have to change how we populate our data.
Since A is implemented as a flat array, we don't need to change our allocation either.

In real-world use-cases, these considerations add non-trivial development overhead, so data access patterns may be non-trivial to change.

To observe the performance effects of a different data access pattern, we simply need to change our indexing scheme.
Let's see how this performs by running solution:

cd solution
srun ./solution.exe

(simulated output)

yAx time: 13 ms

We see the runtime here is significantly better than our previous kernel, but we need to check how the caches behave now:

srun omniperf profile -n solution --no-roof -- ./solution.exe

(output omitted)

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 16.1 17.1

The output from this analyze command should look like:


0. Top Stat
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 12464570.00 | 12464570.00 |  12464570.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |

16. Vector L1 Data Cache
16.1 Speed-of-Light
| Metric_ID   | Metric      |   Avg | Unit        |
| 16.1.0      | Hit rate    | 49.98 | Pct of peak |
| 16.1.1      | Bandwidth   | 10.88 | Pct of peak |
| 16.1.2      | Utilization | 98.15 | Pct of peak |
| 16.1.3      | Coalescing  | 25.00 | Pct of peak |

17. L2 Cache
17.1 Speed-of-Light
| Metric_ID   | Metric                        |    Avg | Unit   |
| 17.1.0      | Utilization                   |  98.60 | Pct    |
| 17.1.1      | Bandwidth                     |   9.40 | Pct    |
| 17.1.2      | Hit Rate                      |   0.52 | Pct    |
| 17.1.3      | L2-Fabric Read BW             | 650.84 | Gb/s   |
| 17.1.4      | L2-Fabric Write and Atomic BW |   0.00 | Gb/s   |

Looking at this data, we see:

Solution Roofline Analysis

We should check where our new kernel stands on the roofline.
These plots can be generated with:

srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe

The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.

They are also provided below for easy reference:

(figure: table of roofline plots with columns Roofline Type, Roofline Legend, Roofline Plot)

We appear to be very close to being bound by the HBM bandwidth from the fp32 roofline.
To get more performance we need to look closer at our algorithm.

Roofline Comparison

(figure: roofline comparison table with columns Roofline Type, Problem Roofline, Solution Roofline)

We see that the HBM roofline point moves up, while the L1/L2 points move up and to the right from problem to solution. This means that our arithmetic intensity is increasing for the caches, so we are moving less data through the caches to do the same computation.

Summary and Take-aways

This exercise illustrates the at times insidious nature of strided data access patterns.
They can be difficult to spot in code, but profiling more readily shows when adversarial
access patterns occur, by showing poor cache hit rates, low cache bandwidth, and potentially low utilization.
Data access patterns can be non-trivial to change, so these sorts of optimizations can involve significant development and validation overhead.
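One way to reason about why strided access is costly is to count how many distinct cache lines a single wavefront touches in one load. The Python sketch below does this for 64 threads reading 8-byte doubles; the 64-byte cache line size is an assumed parameter and the helper name is hypothetical, so adjust both for your hardware.

```python
def cache_lines_touched(n_threads, stride_elems, elem_bytes=8, line_bytes=64):
    """Count distinct cache lines hit when thread t reads element t * stride_elems.

    line_bytes=64 is an assumption; pass the real line size for your cache.
    """
    lines = {(t * stride_elems * elem_bytes) // line_bytes for t in range(n_threads)}
    return len(lines)

print(cache_lines_touched(64, 1))     # contiguous access: 8 lines
print(cache_lines_touched(64, 1024))  # stride of 1024 elements: 64 lines
```

In the contiguous case every byte of each fetched line is used, while in the strided case each thread pulls in a whole line to use a single element of it.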

