LUMI training - Omniperf by Example - part 1 - Oslo, Norway - June 2024

Environment for LUMI

module load CrayEnv
module load buildtools/23.09

module load PrgEnv-cray/8.4.0
module load cce/16.0.1
module load craype-accel-amd-gfx90a
module load craype-x86-trento

module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules

module load rocm/5.4.3 omniperf/2.0.1-rocm-5.4.x

You can set up the following environment variables for the project you want to use:

export SALLOC_ACCOUNT=project_<your project ID>
export SBATCH_ACCOUNT=project_<your project ID>

The allocation can be set up with:

salloc -N 1 --gpus=8 -p standard-g --exclusive -t 20:00 --reservation <reservation name>

Omniperf Advanced Exercises

These exercises are meant to provide extra insight into the tuning of kernels. The exercise files are included in:

git clone https://github.com/amd/HPCTrainingExamples.git
cd HPCTrainingExamples/OmniperfExamples

Just navigate the different folders, each corresponding to a section of these exercises.

Exercise 1: Launch Parameter Tuning

Simple kernel implementing a version of yAx, to demonstrate effects of Launch Parameters on kernel
execution time.

Background: Acronyms and terms used in this exercise

yAx: a vector-matrix-vector product, y*A*x, where y and x are vectors, and A is a matrix
FP(32/16): 32- or 16-bit Floating Point numeric types
FLOPs: Floating Point Operations Per Second (often written FLOP/s or FLOPS)
HBM: High Bandwidth Memory is globally accessible from the GPU, and is a level of memory above the L2 cache
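To make the yAx definition concrete, here is a minimal CPU reference sketch in plain Python. It is not the HIP kernel from problem.cpp; the function name yax_reference and the small input data are hypothetical and only illustrate the arithmetic the kernel performs.

```python
def yax_reference(y, A, x):
    """Return the scalar y^T * A * x, with A given as a list of rows."""
    if len(A) != len(y) or any(len(row) != len(x) for row in A):
        raise ValueError("shape mismatch: A must be len(y) x len(x)")
    total = 0.0
    for y_i, row in zip(y, A):
        # Dot product of one matrix row with x, scaled by the matching y entry.
        total += y_i * sum(a_ij * x_j for a_ij, x_j in zip(row, x))
    return total

# Hypothetical 2x3 example, small enough to check by hand.
y = [1.0, 2.0]
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 0.0, -1.0]
result = yax_reference(y, A, x)
print(result)
```

Each row of A can be processed independently, which is why the number of threads launched matters so much for this kernel.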

Initial Roofline Analysis:

The roofline model is a way to gauge kernel performance in terms of maximum achievable bandwidth and floating-point operations.
It can be used to determine how efficiently a kernel makes use of the available hardware. It is a key tool in initially determining
which kernels are performing well, and which kernels should be able to perform better. Below are roofline plots for the yAx kernel in problem.cpp:

[Figure: roofline plots for the problem code, FP32 and FP16/INT8, each with a legend and a plot]

These plots were generated by running:

srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe

The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200 directory, if generated on MI200 hardware.

We see that the kernel's performance is far below the achievable bandwidth of the hardware, which makes it a good candidate for optimization.
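As a reminder of what the roofline plot encodes, the attainable performance at a given arithmetic intensity is the smaller of the compute peak and the bandwidth ceiling. The sketch below uses made-up peak values chosen for easy arithmetic; they are assumptions for illustration, not measurements of MI200 hardware, and the function name is hypothetical.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline ceiling: min(compute peak, arithmetic intensity * bandwidth)."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)

# Illustrative numbers only (assumed, not measured).
PEAK_GFLOPS = 20000.0   # compute ceiling in GFLOP/s
PEAK_BW_GBS = 1000.0    # memory bandwidth ceiling in GB/s

# Arithmetic intensity (FLOP/byte) at which the two ceilings meet.
ridge_point = PEAK_GFLOPS / PEAK_BW_GBS

low_ai = attainable_gflops(0.25, PEAK_GFLOPS, PEAK_BW_GBS)   # bandwidth-bound
high_ai = attainable_gflops(64.0, PEAK_GFLOPS, PEAK_BW_GBS)  # compute-bound
print(ridge_point, low_ai, high_ai)
```

A kernel whose measured point sits well below the ceiling at its arithmetic intensity, as here, is leaving achievable performance unused.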

Exercise instructions:

From the roofline we were able to see that there is room for improvement in this kernel. One of the
first things to check is whether or not we have reasonable launch parameters for this kernel.

To get started, build and run the problem code:

make
srun ./problem.exe

(simulated output)

yAx time: 771.765559 milliseconds

The problem code should run very slowly because of sub-optimal launch parameters.
Let's confirm this hypothesis by looking at the omniperf profile. Start by running:

srun omniperf profile -n problem --no-roof -- ./problem.exe

This command requires omniperf to run your code a few times to collect all the necessary hardware counters.

-n problem names the workload, meaning that the profile will appear in the ./workloads/problem/MI200/ directory, if you are profiling on an MI200 device.
--no-roof turns off the roofline, which will save some profiling time by avoiding the collection of achievable bandwidths and FLOPs on the device.
Everything after the -- is the command that will be profiled.

After the profiling data is collected, we can view the profile by using this command:

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 7.1.0 7.1.1 7.1.2

This allows us to view nicely formatted profiling data directly in the command line.
The command given here has a few arguments that are noteworthy:

-p workloads/problem/MI200 must point to the output directory of your profile run. For the above omniperf profile command, this will be workloads/problem/MI200.
--dispatch 1 filters kernel statistics by dispatch ID. In this case dispatch 0 was a "warm-up" kernel, and dispatch 1 is the one the code reports timings for.
--block displays only the requested blocks of metrics. In this case we want the metrics specific to launch parameters:
- 7.1.0 is the Grid Size
- 7.1.1 is the Workgroup Size
- 7.1.2 is the Total Wavefronts Launched

The output of the omniperf analyze command should look something like this:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
----------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |      Sum(ns) |     Mean(ns) |   Median(ns) |    Pct |
----------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 755935180.00 | 755935180.00 | 755935180.00 | 100.00 |
|    |  double*)                                |         |              |              |              |        |
----------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
---------------------------------------------------------------------
| Index   | Metric           |    Avg |    Min |    Max | Unit       |
---------------------------------------------------------------------
| 7.1.0   | Grid Size        | 256.00 | 256.00 | 256.00 | Work items |
---------------------------------------------------------------------
| 7.1.1   | Workgroup Size   |  64.00 |  64.00 |  64.00 | Work items |
---------------------------------------------------------------------
| 7.1.2   | Total Wavefronts |   4.00 |   4.00 |   4.00 | Wavefronts |
---------------------------------------------------------------------

Looking through this data we see:

Workgroup Size (7.1.1) is 64 threads, which corresponds with the size of a wavefront.
Total Wavefronts (7.1.2) shows that we are launching only 4 Wavefronts.
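To get a rough sense of how little work this launch gives the GPU, we can compare the wavefront count with the number of wavefronts the device can hold. The capacity of 3328 wavefronts is the Peak value that omniperf reports for Wave Occupancy later in this document; it depends on the device, so treat it as an assumption here.

```python
WAVEFRONT_SIZE = 64      # threads per wavefront on MI200
grid_size = 256          # metric 7.1.0, in work items
workgroup_size = 64      # metric 7.1.1, in work items

workgroups = grid_size // workgroup_size
# Ceiling division: a workgroup always occupies whole wavefronts.
waves_per_workgroup = -(-workgroup_size // WAVEFRONT_SIZE)
total_wavefronts = workgroups * waves_per_workgroup

# Assumed device capacity, taken from the Peak column of metric 2.1.15.
PEAK_WAVEFRONTS = 3328
fill_pct = 100.0 * total_wavefronts / PEAK_WAVEFRONTS
print(total_wavefronts, round(fill_pct, 2))
```

With only 4 wavefronts in flight, almost all of the device sits idle regardless of how well the kernel body is written.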

We can definitely get better performance by adjusting the launch parameters of our kernel.
Either try out some new values for the launch parameters, or run the provided solution to see its performance:

cd solution
make
srun ./solution.exe

(simulated output)

yAx time: 70 ms

We get much better performance with the new launch parameters. Note that in general it can be difficult to find the
optimal launch parameters for a given kernel due to the many factors that impact performance, so determining
launch parameters experimentally is usually necessary.

We should also confirm that our updated launch parameters are reported by omniperf. To do so, run:

srun omniperf profile -n solution --no-roof -- ./solution.exe

This command is the same as before, except the workload name has changed to solution.
Once the profile command has completed, run:

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 7.1.0 7.1.1 7.1.2

Again, this command largely uses the same arguments as before, except for the workload name.
The output should look something like this:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
--------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 70115217.00 | 70115217.00 |  70115217.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |
--------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
------------------------------------------------------------------------------
| Index   | Metric           |       Avg |       Min |       Max | Unit       |
------------------------------------------------------------------------------
| 7.1.0   | Grid Size        | 131072.00 | 131072.00 | 131072.00 | Work items |
------------------------------------------------------------------------------
| 7.1.1   | Workgroup Size   |     64.00 |     64.00 |     64.00 | Work items |
------------------------------------------------------------------------------
| 7.1.2   | Total Wavefronts |   2048.00 |   2048.00 |   2048.00 | Wavefronts |
------------------------------------------------------------------------------

Looking through this data we see:

Workgroup Size (7.1.1) corresponds to the first argument of the block launch parameter.
Total Wavefronts (7.1.2) corresponds to the first index of the grid launch parameter, because each 64-thread workgroup is exactly one wavefront.
Grid Size (7.1.0) is Workgroup Size (7.1.1) times Total Wavefronts (7.1.2), again because each workgroup here is a single wavefront.
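The relationship between launch parameters and these three metrics can be sketched as follows. The helper launch_stats is hypothetical and assumes a one-dimensional launch described by a block count and a block size; the second call uses made-up parameters to show what changes when a workgroup spans more than one wavefront.

```python
import math

WAVEFRONT_SIZE = 64  # threads per wavefront on MI200

def launch_stats(num_blocks, threads_per_block):
    """Map 1D launch parameters to the metrics shown in block 7.1."""
    waves_per_block = math.ceil(threads_per_block / WAVEFRONT_SIZE)
    return {
        "grid_size": num_blocks * threads_per_block,       # 7.1.0
        "workgroup_size": threads_per_block,               # 7.1.1
        "total_wavefronts": num_blocks * waves_per_block,  # 7.1.2
    }

# Values consistent with the solution profile above.
solution = launch_stats(2048, 64)

# Hypothetical alternative: same grid size, four wavefronts per workgroup.
larger_blocks = launch_stats(512, 256)
print(solution)
print(larger_blocks)
```

Note that in the second case the grid size is no longer the workgroup size times the wavefront count; that shortcut only works when a workgroup is a single wavefront.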

Omniperf Command Line Comparison Feature

On releases newer than Omniperf 1.0.10, the comparison feature of omniperf can be used to quickly compare two profiles.
To use this feature, use the command:

omniperf analyze -p workloads/problem/MI200 -p solution/workloads/solution/MI200 --dispatch 1 --block 7.1.0 7.1.1 7.1.2

This feature sets the first -p argument as the baseline, and the second as the comparison workload.
In this case, problem is set as the baseline and is compared to solution.
The output should look like:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count | Count      |      Sum(ns) | Sum(ns)              |     Mean(ns) | Mean(ns)             |   Median(ns) | Median(ns)           |    Pct | Pct          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 1.0 (0.0%) | 754934306.50 | 69702016.5 (-90.77%) | 754934306.50 | 69702016.5 (-90.77%) | 754934306.50 | 69702016.5 (-90.77%) | 100.00 | 100.0 (0.0%) |
|    |  double*)                                |         |            |              |                      |              |                      |              |                      |        |              |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
---------------------------------------------------------------------------------------------------------------------------------------
| Index   | Metric           |    Avg | Avg                 |    Min | Min                 |    Max | Max                 | Unit       |
---------------------------------------------------------------------------------------------------------------------------------------
| 7.1.0   | Grid Size        | 256.00 | 131072.0 (51100.0%) | 256.00 | 131072.0 (51100.0%) | 256.00 | 131072.0 (51100.0%) | Work items |
---------------------------------------------------------------------------------------------------------------------------------------
| 7.1.1   | Workgroup Size   |  64.00 | 64.0 (0.0%)         |  64.00 | 64.0 (0.0%)         |  64.00 | 64.0 (0.0%)         | Work items |
---------------------------------------------------------------------------------------------------------------------------------------
| 7.1.2   | Total Wavefronts |   4.00 | 2048.0 (51100.0%)   |   4.00 | 2048.0 (51100.0%)   |   4.00 | 2048.0 (51100.0%)   | Wavefronts |
---------------------------------------------------------------------------------------------------------------------------------------

Note that the comparison workload shows the percentage difference from the baseline.
This feature can be used to quickly compare filtered stats to make sure code changes fix known issues.
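The percentages in the comparison columns can be reproduced by hand as a relative change from the baseline. The helper name pct_diff is hypothetical; the inputs are copied from the comparison output above.

```python
def pct_diff(baseline, comparison):
    """Relative change of comparison with respect to baseline, in percent."""
    return (comparison - baseline) / baseline * 100.0

runtime_change = round(pct_diff(754934306.50, 69702016.5), 2)  # Sum(ns)
grid_change = round(pct_diff(256.0, 131072.0), 1)              # Grid Size
waves_change = round(pct_diff(4.0, 2048.0), 1)                 # Total Wavefronts
print(runtime_change, grid_change, waves_change)
```

A negative value on a runtime metric is an improvement, while the large positive values on the launch metrics confirm that far more work items were launched.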

More Kernel Filtering

For this exercise, it is appropriate to filter the omniperf analyze command with the --dispatch 1 argument.
This --dispatch 1 argument filters the data shown to only include the kernel invocation with dispatch ID 1, or the second kernel run during profiling.

However, there is another way to filter kernels that may be more applicable in real use-cases.
Typically real codes launch many kernels, and only a few of them take most of the overall kernel runtime.
To see a ranking of the top kernels that take up most of the kernel runtime in your code, you can run:

omniperf analyze -p workloads/problem/MI200 --list-stats

This command will output something like:

--------
Analyze
--------


--------------------------------------------------------------------------------
Detected Kernels
----------------------------------------------------------
|    | KernelName                                        |
----------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, double*) |
----------------------------------------------------------

Using Omniperf versions greater than 1.0.10, --list-stats will list all kernels launched by your code, in order of runtime (largest runtime first).
The number displayed beside the kernel in the output can be used to filter omniperf analyze commands.
Note that this will display aggregated stats for kernels of the same name, meaning that the invocations could differ in terms of launch parameters, and vary widely in terms of work completed.
This filtering is accomplished with the -k argument:

omniperf analyze -p workloads/problem/MI200 -k 0 --block 7.1.0 7.1.1 7.1.2

Which should show something like:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
-----------------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |       Sum(ns) |     Mean(ns) |   Median(ns) |    Pct | S   |
-----------------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    2.00 | 1508098228.00 | 754049114.00 | 754049114.00 | 100.00 | *   |
|    |  double*)                                |         |               |              |              |        |     |
-----------------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
----------------------------------------------------------------------
| Index   | Metric           |    Avg |    Min |    Max | Unit       |
----------------------------------------------------------------------
| 7.1.0   | Grid Size        | 256.00 | 256.00 | 256.00 | Work items |
----------------------------------------------------------------------
| 7.1.1   | Workgroup Size   |  64.00 |  64.00 |  64.00 | Work items |
----------------------------------------------------------------------
| 7.1.2   | Total Wavefronts |   4.00 |   4.00 |   4.00 | Wavefronts |
----------------------------------------------------------------------

Note that the "Count" field in Top Stat is 2 here, whereas filtering by dispatch ID displays a count of 1. This indicates that filtering with -k returns aggregated stats for two kernel invocations in this case.
Also note that the "Top Stat" table will still show all the top kernels, but the rightmost column titled "S" will have an asterisk beside the kernel for which data is being displayed.
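The aggregation behind the -k output can be sketched with a few lines of arithmetic. The two per-dispatch durations below are hypothetical (the profile only reports their sum); they were chosen so that the total matches the Sum(ns) value in the table above.

```python
# Hypothetical per-dispatch durations in ns; only their sum comes from the table.
durations_ns = [750000000, 758098228]

count = len(durations_ns)
total_ns = sum(durations_ns)
mean_ns = total_ns / count

# With exactly two samples the median is the average of the two, so it equals the mean.
ordered = sorted(durations_ns)
median_ns = (ordered[0] + ordered[1]) / 2
print(count, total_ns, mean_ns, median_ns)
```

This is why Mean(ns) and Median(ns) coincide in the table, and why neither tells you how different the two invocations actually were.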

Solution Roofline

We've demonstrated better performance than problem.cpp in solution.cpp, but could we potentially do better?
To answer that we again turn to the roofline model:

[Figure: roofline plots for the solution code, FP32 and FP16/INT8, each with a legend and a plot]

These plots were generated with:

srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe

The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.

We see that the solution is solidly in the bandwidth-bound regime, but there still seems to be room for improvement. Further performance improvements will be a topic for later exercises.

Roofline Comparison

[Figure: problem and solution roofline plots side by side, FP32 and FP16/INT8]

We see that the solution has drastically increased performance over the problem code, as shown by the solution points moving up closer to the line plotted by the bandwidth limit.

Note: on statically generated roofline images, it is possible for the L1, L2, or HBM points to overlap and hide one another.

Summary and Take-aways

Launch parameters should be the first check in optimizing performance, due to the fact that they are
usually easy to change, but can have a large performance impact if they aren't tuned to your workload.
It is difficult to predict the optimal launch parameters for any given kernel, so some experimentation
may be required to achieve the best performance.

Exercise 2: LDS Occupancy Limiter

Simple kernel implementing a version of yAx, to demonstrate the downside of allocating a large
amount of LDS, and the benefit of using a smaller amount of LDS due to occupancy limits.

Background: Acronyms and terms used in this exercise

Wavefront: A collection of threads, usually 64.
Workgroup: A collection of Wavefronts (at least 1), which can be scheduled on a Compute Unit (CU)
LDS: Local Data Store is Shared Memory that is accessible to the entire workgroup on a Compute Unit (CU)
CU: The Compute Unit is responsible for executing the User's kernels
SPI: Shader Processor Input, also referred to as the Workgroup Manager, is responsible for scheduling workgroups on Compute Units
Occupancy: A measure of how many wavefronts are executing on the GPU on average through the duration of the kernel
PoP: Percent of Peak refers to the ratio of an achieved value and a theoretical or actual maximum. In terms of occupancy, it is how many wavefronts on average were on the device divided by how many can fit on the device.
yAx: a vector-matrix-vector product, y*A*x, where y and x are vectors, and A is a matrix
FP(32/16): 32- or 16-bit Floating Point numeric types
FLOPs: Floating Point Operations Per Second (often written FLOP/s or FLOPS)
HBM: High Bandwidth Memory is globally accessible from the GPU, and is a level of memory above the L2 cache
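The Percent of Peak definition above is a plain ratio. As a quick check, the sketch below recomputes the PoP column from the Value and Peak columns that omniperf reports for Wave Occupancy (2.1.15) in the three profiles of this exercise; the helper name is hypothetical.

```python
def percent_of_peak(value, peak):
    """Achieved value as a percentage of the peak value."""
    return 100.0 * value / peak

PEAK_WAVEFRONTS = 3328.0  # Peak column of metric 2.1.15 in this exercise

# Average wavefronts resident on the device, from the Value column.
occupancy = {"problem": 102.70, "solution-no-lds": 445.33, "solution": 487.32}

pop = {name: round(percent_of_peak(waves, PEAK_WAVEFRONTS), 2)
       for name, waves in occupancy.items()}
print(pop)
```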

Initial Roofline Analysis

In this exercise we're using a problem code that is slightly different than where we left off in Exercise 1.
Regardless, to get started we need to get a roofline by running:

srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe

The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200 directory, if generated on MI200 hardware.

For convenience, the resulting plots on a representative system are below:

[Figure: roofline plots for the problem code, FP32 and FP16/INT8, each with a legend and a plot]

We see that there looks to be room for improvement here. We'll use omniperf to see what the current limiters are.

Exercise Instructions:

First, we should get an idea of the code's runtime:

make
srun ./problem.exe

(simulated output)

yAx time: 140 ms

This problem.cpp uses LDS allocations to move the x vector closer to the compute resources, a common optimization.
However, we see that it ends up slower than the previous solution that didn't use LDS at all.
In kernels that request a lot of LDS, it is common to see that the LDS usage limits the occupancy of the kernel.
That is, more wavefronts cannot be resident on the device, because all of them need more LDS than is available.
We need to confirm this hypothesis. Let's start by running:

srun omniperf profile -n problem --no-roof -- ./problem.exe

The omniperf profile arguments are described in the Omniperf documentation, or can be listed by running omniperf profile --help.

This omniperf profile command will take a minute or two to run, as omniperf must run your code a few times to collect all the hardware counters.

Note: For large scientific codes, it can be useful to profile a small representative workload if possible, as profiling a full run may take prohibitively long.

Once the profiling run completes, let's take a look at the occupancy stats related to LDS allocations:

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 2.1.15 6.2.7

The metrics we're looking at are:

2.1.15 Wavefront occupancy – a measure of how many wavefronts, on average, are active on the device
6.2.7 SPI: Insufficient CU LDS – indicates whether wavefronts are not able to be scheduled due to insufficient LDS

The SPI section (6.2) generally shows what resources limit occupancy, while Wavefront occupancy (2.1.15) shows how severely occupancy is limited in general.
The SPI 'insufficient' fields are typically either zero or very large numbers (on the order of 1 million), with large numbers indicating some resource preventing wavefronts from scheduling.
If more than one field is nonzero, the relative magnitude of the nonzero fields corresponds to how severely each resource is limiting occupancy, but if only one field is nonzero it is difficult to say how severely that resource is limiting occupancy.

The output of the omniperf analyze command should look similar to this:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
-----------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |      Sum(ns) |     Mean(ns) |   Median(ns) |    Pct |
-----------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 175427205.00 | 175427205.00 | 175427205.00 | 100.00 |
|    |  double*)                                |         |              |              |              |        |
-----------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
---------------------------------------------------------------------
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
---------------------------------------------------------------------
| 2.1.15  | Wave Occupancy |  102.70 | Wavefronts | 3328.00 |  3.09 |
---------------------------------------------------------------------


--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
______________________________________________________________________
| Metric_ID   | Metric              |   Avg |   Min |   Max | Unit   |
______________________________________________________________________
| 6.2.7       | Insufficient CU LDS | 69.53 | 69.53 | 69.53 | Pct    |
______________________________________________________________________

Looking through this data we see:

Wavefront occupancy (2.1.15) is around 3%, which is very low
Insufficient CU LDS (6.2.7) is far from zero (about 70%), which indicates our occupancy is currently limited by LDS allocations.
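A simple model shows how an LDS request can cap occupancy. The sketch below assumes 64 KiB of LDS and 32 wavefront slots per CU; both are assumptions about MI200-class hardware rather than values reported by omniperf, and the allocation sizes are illustrative, not read from problem.cpp. It gives an upper bound only, since real occupancy also depends on registers, scheduling, and how much work is available.

```python
LDS_PER_CU_BYTES = 64 * 1024  # assumed LDS capacity per CU
MAX_WAVES_PER_CU = 32         # assumed wavefront slots per CU
WAVEFRONT_SIZE = 64           # threads per wavefront

def max_waves_per_cu(lds_per_workgroup_bytes, workgroup_size):
    """Upper bound on resident wavefronts per CU given an LDS request."""
    waves_per_workgroup = -(-workgroup_size // WAVEFRONT_SIZE)  # ceiling division
    workgroups = MAX_WAVES_PER_CU // waves_per_workgroup
    if lds_per_workgroup_bytes > 0:
        # Workgroups on one CU share its LDS, so the request limits how many fit.
        workgroups = min(workgroups, LDS_PER_CU_BYTES // lds_per_workgroup_bytes)
    return workgroups * waves_per_workgroup

whole_lds = max_waves_per_cu(64 * 1024, 64)  # hypothetical: one workgroup takes all LDS
small_lds = max_waves_per_cu(4 * 1024, 64)   # hypothetical: 4 KiB per workgroup
no_lds = max_waves_per_cu(0, 64)
print(whole_lds, small_lds, no_lds)
```

Under these assumptions, a 64-thread workgroup that claims the entire LDS leaves a CU with a single resident wavefront, which is the kind of limit the SPI metric points to.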

There are two solution directories, which correspond to two ways that this occupancy limit can be addressed.
First, we have solution-no-lds, which completely removes the LDS usage. Let's build and run this solution:

cd solution-no-lds
make
srun ./solution.exe

(simulated output)

yAx time: 70 ms

We see that the runtime is much better for this solution than for the problem. Let's see if removing LDS did indeed increase occupancy:

srun omniperf profile -n solution --no-roof -- ./solution.exe

(output omitted)

Once the profile command completes, run:

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 2.1.15 6.2.7

The output should look something like:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
--------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 70512671.00 | 70512671.00 |  70512671.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |
--------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
---------------------------------------------------------------------
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
---------------------------------------------------------------------
| 2.1.15  | Wave Occupancy |  445.33 | Wavefronts | 3328.00 | 13.38 |
---------------------------------------------------------------------


--------------------------------------------------------------------------------
6. Shader Processor Input (SPI)
6.2 SPI Resource Allocation
------------------------------------------------------------------
| Index   | Metric              |   Avg |   Min |   Max | Unit   |
------------------------------------------------------------------
| 6.2.7   | Insufficient CU LDS |  0.00 |  0.00 |  0.00 | Pct.   |
------------------------------------------------------------------

Looking through this data we see:

Wave occupancy (2.1.15) is about 10 percentage points higher than in problem.cpp (13.38% versus 3.09% of peak)
Insufficient CU LDS (6.2.7) is now zero, indicating solution-no-lds is not occupancy limited by LDS allocations.
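The occupancy change can be stated either as a difference in percent-of-peak points or as a ratio of resident wavefronts, and the two read very differently. The values below are copied from the Wave Occupancy rows of the problem and solution-no-lds outputs.

```python
# PoP and Value columns of metric 2.1.15 for the two profiles.
problem_pop, no_lds_pop = 3.09, 13.38
problem_waves, no_lds_waves = 102.70, 445.33

# Absolute change, in percentage points of peak occupancy.
point_gain = round(no_lds_pop - problem_pop, 2)

# Relative change, as a multiple of the original occupancy.
ratio = round(no_lds_waves / problem_waves, 2)
print(point_gain, ratio)
```

So the gain is a little over 10 points of peak occupancy, which corresponds to more than four times as many wavefronts resident on the device.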

Can we get some runtime advantage from using smaller LDS allocations?

This is the solution implemented in the solution directory:

cd ../solution
make
srun ./solution.exe

(simulated output)

yAx time: 50 ms

This solution, rather than removing the LDS allocation, simply reduces the amount of LDS requested to address the occupancy limit.
This gives us the benefit of having some data pulled closer than it was in solution-no-lds, which is validated by the speedup we see.
But is this solution still occupancy limited by LDS?

srun omniperf profile -n solution --no-roof -- ./solution.exe

(output omitted)

Once the profile command completes, run:

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 2.1.15 6.2.7

The output should look something like:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
--------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 50366185.00 | 50366185.00 |  50366185.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |
--------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
---------------------------------------------------------------------
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
---------------------------------------------------------------------
| 2.1.15  | Wave Occupancy |  487.32 | Wavefronts | 3328.00 | 14.64 |
---------------------------------------------------------------------


--------------------------------------------------------------------------------
6. Shader Processor Input (SPI)
6.2 SPI Resource Allocation
------------------------------------------------------------------
| Index   | Metric              |   Avg |   Min |   Max | Unit   |
------------------------------------------------------------------
| 6.2.7   | Insufficient CU LDS |  0.00 |  0.00 |  0.00 | Pct.   |
------------------------------------------------------------------

Looking at this data we see:

Wave Occupancy (2.1.15) is even higher than before
Insufficient CU LDS (6.2.7) shows we are not occupancy limited by LDS allocations.

Pulling some data from global device memory to LDS can be an effective optimization strategy, if occupancy limits are carefully avoided.
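To put numbers on the effect of the two fixes, the kernel times from the three Top Stat tables in this exercise can be turned into speedups over the problem code. Only the variable names are introduced here; the durations are the Sum(ns) values shown above.

```python
# Sum(ns) of the yax kernel (dispatch 1) in each profile.
kernel_ns = {
    "problem": 175427205.0,
    "solution-no-lds": 70512671.0,
    "solution": 50366185.0,
}

# Speedup of each variant relative to the problem code.
speedup = {name: round(kernel_ns["problem"] / ns, 2)
           for name, ns in kernel_ns.items()}
print(speedup)
```

Removing LDS recovers most of the lost time, and the smaller LDS allocation adds a further gain on top of that.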

Solution Roofline

Let's take a look at the roofline for solution, which can be generated with:

srun omniperf profile -n solution_roof_only --roof-only -- ./solution.exe

The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.

The plots are shown here:

[Figure: roofline plots for the solution code, FP32 and FP16/INT8, each with a legend and a plot]

We see that there is still room to move the solution roofline up towards the bandwidth limit.

Roofline Comparison

[Figure: problem and solution roofline plots side by side, FP32 and FP16/INT8]

Again, we see that the solution's optimizations have resulted in the kernel moving up in the roofline, meaning the solution executes more efficiently than the problem.

Summary and Take-aways

Using LDS can be very helpful in reducing global memory reads where you have repeated use of the same data.
However, large LDS allocations can also negatively impact performance by limiting the number of
wavefronts that can be resident on the device at any given time. Be wary of LDS usage, and check
the SPI stats to ensure your LDS usage is not negatively impacting occupancy.

Exercise 3: Register Occupancy Limiter

More complex yAx implementation to demonstrate a register limited kernel using an innocuous looking
function call.

Background: Acronyms and terms used in this exercise

VGPR: Vector General Purpose Register, holds distinct values for each thread in a wavefront
SGPR: Scalar General Purpose Register, holds a single value shared by all threads in a wavefront
AGPR: Accumulation vector General Purpose Register, used for Matrix Fused Multiply-Add (MFMA) instructions, or low-cost register spills
Wavefront: A collection of threads, usually 64.
Workgroup: A collection of Wavefronts (at least 1), which can be scheduled on a Compute Unit (CU)
LDS: Local Data Store is Shared Memory that is accessible to the entire workgroup on a Compute Unit (CU)
CU: The Compute Unit is responsible for executing the User's kernels
SPI: Shader Processor Input, also referred to as the Workgroup Manager, is responsible for scheduling workgroups on Compute Units
Occupancy: A measure of how many wavefronts are executing on the GPU on average through the duration of the kernel
PoP: Percent of Peak refers to the ratio of an achieved value and a theoretical or actual maximum. In terms of occupancy, it is how many wavefronts on average were on the device divided by how many can fit on the device.
yAx: a vector-matrix-vector product, y*A*x, where y and x are vectors, and A is a matrix
FP(32/16): 32- or 16-bit Floating Point numeric types
FLOPs: Floating Point Operations Per second
HBM: High Bandwidth Memory is globally accessible from the GPU, and is a level of memory above the L2 cache
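
As a worked instance of the PoP definition above, the Python sketch below divides an achieved wave occupancy by its peak. The inputs 438 and 3328 are the Wave Occupancy value and peak that omniperf reports for problem.exe later in this exercise. The decomposition of the peak into 104 CUs, 4 SIMDs per CU, and 8 wave slots per SIMD is an assumption that happens to reproduce 3328, not something stated in the tool output.

```python
# Sketch only: Percent of Peak (PoP) for wave occupancy.
def percent_of_peak(achieved, peak):
    """Ratio of an achieved value to its peak, expressed as a percentage."""
    return 100.0 * achieved / peak


# Assumed decomposition of the peak: CUs * SIMDs per CU * wave slots per SIMD.
peak_wavefronts = 104 * 4 * 8
pop = percent_of_peak(438.0, peak_wavefronts)
print(f"{pop:.2f}% of peak")
```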

Initial Roofline Analysis

This kernel is slightly different from the one we used in previous exercises. Let's see how well it measures up in the roofline:

Roofline Type	Roofline Legend	Roofline Plot
FP32
FP16/INT8

You can generate these plots by running:

srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe

The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200 directory, if generated on MI200 hardware.

We see that the kernel is still a considerable amount below the maximum achievable bandwidth, so there should still be room for improvement.
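
For readers new to the model, the roofline bound itself is simple arithmetic: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The Python sketch below illustrates this with hypothetical peak values (they are placeholders, not MI200 specifications or measurements from this exercise).

```python
# Sketch only: the roofline bound with hypothetical hardware peaks.
def attainable_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline bound: min(compute peak, arithmetic intensity * bandwidth)."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


PEAK_GFLOPS = 20000.0        # hypothetical compute peak, GFLOP/s
PEAK_BANDWIDTH_GBS = 1600.0  # hypothetical memory bandwidth, GB/s

# Below the ridge point the kernel is bandwidth bound, above it compute bound.
ridge_point = PEAK_GFLOPS / PEAK_BANDWIDTH_GBS
memory_bound = attainable_gflops(0.25, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
compute_bound = attainable_gflops(100.0, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
print(ridge_point, memory_bound, compute_bound)
```

A kernel plotted well below the sloped part of the roof at its arithmetic intensity is not yet using the available bandwidth, which is the situation described above.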

Exercise Instructions:

Let's get an idea of the runtime of this code:

make
srun ./problem.exe

(simulated output)

yAx time 79 ms

We see that this kernel seems to be on par with some of our other exercises, but let's see what omniperf shows us:

srun omniperf profile -n problem --no-roof -- ./problem.exe

(lots of output from this command)

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 2.1.15 6.2.5 7.1.5 7.1.6 7.1.7

2.1.15 Shows Wavefront occupancy
6.2.5 Shows Insufficient SIMD VGPRs – indicating if this kernel is occupancy limited by VGPR usage
7.1.5-7 Shows the register usage: VGPRs, SGPRs, and AGPRs

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
--------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 76983902.00 | 76983902.00 |  76983902.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |
--------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
--------------------------------------------------------------------
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
--------------------------------------------------------------------
| 2.1.15  | Wave Occupancy |  438.00 | Wavefronts | 3328.00 | 13.16 |
--------------------------------------------------------------------


--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
--------------------------------------------------------------------------
| Metric_ID   | Metric                  |   Avg |   Min |   Max | Unit   |
--------------------------------------------------------------------------
| 6.2.5       | Insufficient SIMD VGPRs |  0.04 |  0.04 |  0.04 | Pct    |
--------------------------------------------------------------------------

--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
------------------------------------------------------------
| Index   | Metric   |    Avg |    Min |    Max | Unit      |
------------------------------------------------------------
| 7.1.5   | VGPRs    |  92.00 |  92.00 |  92.00 | Registers |
------------------------------------------------------------
| 7.1.6   | AGPRs    | 132.00 | 132.00 | 132.00 | Registers |
------------------------------------------------------------
| 7.1.7   | SGPRs    |  64.00 |  64.00 |  64.00 | Registers |
------------------------------------------------------------

Looking at this data, we see:

Insufficient SIMD VGPRs (6.2.5) is nonzero: a percentage of scheduling attempts stalled for lack of VGPRs, which indicates our kernel is occupancy limited by VGPR usage.
VGPRs (7.1.5) shows we are using a lot of VGPRs (92).
AGPRs (7.1.6) shows we are using a lot of AGPRs (132), which, in the absence of MFMA instructions, can indicate low-cost register spills.

We suspect this is due to a call to assert that checks whether our result is zeroed out on the device.
To check this hypothesis, let's build and run the solution code, which removes the assert:

cd solution
make
srun ./solution.exe

(simulated output)

yAx time: 69 ms

Our runtime gets better from removing the assert, but we should also check that omniperf reports that our limiters are gone:

srun omniperf profile -n solution --no-roof -- ./solution.exe

(omitted output)

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 2.1.15 6.2.5 7.1.5 7.1.6 7.1.7

The output of this command should look something like:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
--------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 69815871.00 | 69815871.00 |  69815871.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |
--------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
--------------------------------------------------------------------
| Index   | Metric         |   Value | Unit       |    Peak |   PoP |
--------------------------------------------------------------------
| 2.1.15  | Wave Occupancy |  444.10 | Wavefronts | 3328.00 | 13.34 |
--------------------------------------------------------------------


--------------------------------------------------------------------------------
6. Shader Processor Input (SPI)
6.2 SPI Resource Allocation
--------------------------------------------------------------------
| Index   | Metric                  |   Avg |   Min |   Max | Unit   |
--------------------------------------------------------------------
| 6.2.5   | Insufficient SIMD VGPRs |  0.00 |  0.00 |  0.00 | Pct    |
--------------------------------------------------------------------


--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
------------------------------------------------------------
| Index   | Metric   |    Avg |    Min |    Max | Unit      |
------------------------------------------------------------
| 7.1.5   | VGPRs    |  32.00 |  32.00 |  32.00 | Registers |
------------------------------------------------------------
| 7.1.6   | AGPRs    |   0.00 |   0.00 |   0.00 | Registers |
------------------------------------------------------------
| 7.1.7   | SGPRs    | 112.00 | 112.00 | 112.00 | Registers |
------------------------------------------------------------

Looking at this data, we see:

Insufficient SIMD VGPRs (6.2.5) shows we are no longer occupancy limited by VGPR usage.
VGPRs (7.1.5) and AGPRs (7.1.6) show considerably fewer vector registers in use (32 VGPRs, down from 92), and no AGPRs being used.
SGPRs (7.1.7) shows a 1.75x increase over the previous implementation (112, up from 64), which suggests our memory access is more uniform here.
Wave Occupancy (2.1.15) shows our occupancy increased only slightly from the previous implementation, despite the large decrease in 6.2.5.
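
To see why register counts matter for occupancy, the Python sketch below estimates the register-imposed ceiling on wavefronts per SIMD for both versions of the kernel. The register counts are the 7.1.5 and 7.1.6 values from the problem and solution profiles. The hardware figures (512 vector registers per lane shared between VGPRs and AGPRs, an allocation granule of 8, and 8 wave slots per SIMD) are assumptions about MI200-class hardware, and the function names are hypothetical.

```python
# Sketch only: register-limited wavefronts per SIMD.
# Assumed hardware figures (not from the exercise text):
TOTAL_VECTOR_REGS = 512   # registers per lane, shared by VGPRs and AGPRs
ALLOC_GRANULE = 8         # registers are allocated in multiples of this
MAX_WAVES_PER_SIMD = 8    # wave slots per SIMD


def round_up(value, multiple):
    """Round value up to the next multiple."""
    return -(-value // multiple) * multiple


def max_waves_per_simd(vgprs, agprs):
    """Upper bound on resident wavefronts per SIMD from vector register use."""
    regs = round_up(vgprs + agprs, ALLOC_GRANULE)
    if regs == 0:
        return MAX_WAVES_PER_SIMD
    return min(MAX_WAVES_PER_SIMD, TOTAL_VECTOR_REGS // regs)


problem_waves = max_waves_per_simd(92, 132)  # register counts from problem.exe
solution_waves = max_waves_per_simd(32, 0)   # register counts from solution.exe
print(problem_waves, solution_waves)
```

Under these assumptions the problem kernel fits at most 2 wavefronts per SIMD and the solution fits the full 8. This is only a ceiling: the measured Wave Occupancy changed far less, so other factors presumably dominate the achieved occupancy here.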

More generally, you can use this command to look at all SPI "insufficient resource" stats in the same screen, to determine if any resource is currently limiting occupancy:

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 6.2

Which will show output similar to this (note, fields 6.2.4 to 6.2.8 show resources which currently limit occupancy):

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
---------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
---------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 76983902.00 | 76983902.00 |  76983902.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |
---------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
-----------------------------------------------------------------------------------------
| Metric_ID   | Metric                                 |   Avg |   Min |   Max | Unit   |
-----------------------------------------------------------------------------------------
| 6.2.0       | Not-scheduled Rate (Workgroup Manager) |  0.02 |  0.02 |  0.02 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.1       | Not-scheduled Rate (Scheduler-Pipe)    |  0.03 |  0.03 |  0.03 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.2       | Scheduler-Pipe Stall Rate              |  0.01 |  0.01 |  0.01 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.3       | Scratch Stall Rate                     |  0.00 |  0.00 |  0.00 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.4       | Insufficient SIMD Waveslots            |  0.00 |  0.00 |  0.00 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.5       | Insufficient SIMD VGPRs                |  0.04 |  0.04 |  0.04 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.6       | Insufficient SIMD SGPRs                |  0.00 |  0.00 |  0.00 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.7       | Insufficient CU LDS                    |  0.00 |  0.00 |  0.00 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.8       | Insufficient CU Barriers               |  0.00 |  0.00 |  0.00 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.9       | Reached CU Workgroup Limit             |  0.00 |  0.00 |  0.00 | Pct    |
-----------------------------------------------------------------------------------------
| 6.2.10      | Reached CU Wavefront Limit             |  0.00 |  0.00 |  0.00 | Pct    |
-----------------------------------------------------------------------------------------

Solution Roofline

Let's see how the solution stacks up in the roofline:

Roofline Type	Roofline Legend	Roofline Plot
FP32
FP16/INT8

You can generate these plots with:

srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe

The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.

We see there is still room for improvement in the solution, as this kernel is not getting the maximum achievable bandwidth.

Roofline Comparison

Roofline Type	Problem Roofline	Solution Roofline
FP32
FP16/INT8

The most notable change between these rooflines is that the L1/L2 arithmetic intensity spread is more pronounced in the problem, which shows that the call to assert was causing more data to be moved to the L1, while not adding floating-point operations.

Note: Arithmetic Intensity is computed as (total floating point operations)/(total data movement)
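
The effect described above follows directly from this formula: moving more data without adding floating-point operations lowers the arithmetic intensity and shifts the roofline point to the left. A minimal Python sketch, using made-up operation and byte counts purely for illustration:

```python
# Sketch only: arithmetic intensity with hypothetical counts.
def arithmetic_intensity(total_flops, total_bytes_moved):
    """(total floating point operations) / (total data movement)."""
    return total_flops / total_bytes_moved


flops = 2_000_000           # hypothetical FLOP count, same for both cases
bytes_lean = 8_000_000      # hypothetical data movement without extra traffic
bytes_heavy = 16_000_000    # hypothetical data movement with extra traffic

ai_lean = arithmetic_intensity(flops, bytes_lean)
ai_heavy = arithmetic_intensity(flops, bytes_heavy)
print(ai_lean, ai_heavy)
```

Doubling the data moved while keeping the operation count fixed halves the arithmetic intensity, here from 0.25 to 0.125 FLOP per byte.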

Summary and Take-aways

Function calls inside kernels can have surprisingly adverse performance side-effects. Calling assert, printf and even excessive use of math functions (e.g. pow, sin, cos) can limit performance in difficult-to-predict ways. If you see unexpected resource usage, try eliminating these sorts of function calls.

Exercise 4: Strided Data Access Patterns (and how to find them)

This exercise uses a simple implementation of a yAx kernel to show how difficult strided data access patterns can be to spot in code,
and demonstrates how to use omniperf to begin to diagnose them.

Background: Acronyms and terms used in this exercise

L1: Level 1 Cache, the first level cache local to the Compute Unit (CU). If requested data is not found in the L1, the request goes to the L2
L2: Level 2 Cache, the second level cache, which is shared by all Compute Units (CUs) on a GPU. If requested data is not found in the L2, the request goes to HBM
HBM: High Bandwidth Memory is globally accessible from the GPU, and is a level of memory above the L2 cache
CU: The Compute Unit is responsible for executing the User's kernels
yAx: a vector-matrix-vector product, y*A*x, where y and x are vectors, and A is a matrix
FP(32/16): 32- or 16-bit Floating Point numeric types

Background: What is a "Strided Data Access Pattern"?

Strided data patterns happen when each thread in a wavefront has to access data locations which have a lot of space between them. For instance, in the algorithm we've been using, each thread works on a row, and those rows are contiguous in device memory. This scenario is depicted below:

[Image not available: strided access pattern, with each thread of a wavefront reading from a different row of A]

Here the memory addresses accessed by threads at each step of the computation have a lot of space between them, which is suboptimal for memory systems, especially on GPUs. To fix this, we have to re-structure the matrix A so that the columns of the matrix are contiguous, which will result in the rows striding, as seen below:

[Image not available: restructured layout of A, with the columns contiguous in device memory]

This new data layout has each block of threads accessing a contiguous chunk of device memory, and will use the memory system of the device much more efficiently. Importantly, the only thing that changed is the physical layout of the memory, so the result of this computation will be the same as the result of the previous data layout.
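
The difference between the two layouts can be sketched on the host in a few lines of Python. The example counts how many cache lines one wavefront touches in a single step when thread t works on row t, first with the rows contiguous and then with the columns contiguous. The matrix dimensions, the 64-byte cache line, and the helper names are assumptions chosen for illustration and are not taken from the exercise code.

```python
# Sketch only: cache lines touched by one wavefront in one step of the loop.
DOUBLE_BYTES = 8        # the kernel operates on doubles
WAVEFRONT_SIZE = 64     # threads per wavefront
CACHE_LINE_BYTES = 64   # assumed cache line size


def row_contiguous_index(row, col, n_cols):
    """Flat index when each row of A is contiguous in memory."""
    return row * n_cols + col


def col_contiguous_index(row, col, n_rows):
    """Flat index when each column of A is contiguous in memory."""
    return col * n_rows + row


def cache_lines_touched(indices):
    """Number of distinct cache lines covering the given element indices."""
    return len({(i * DOUBLE_BYTES) // CACHE_LINE_BYTES for i in indices})


n_rows, n_cols = 1024, 1024   # hypothetical matrix dimensions
step = 0                      # every thread reads column `step` of its own row

strided = [row_contiguous_index(t, step, n_cols) for t in range(WAVEFRONT_SIZE)]
contiguous = [col_contiguous_index(t, step, n_rows) for t in range(WAVEFRONT_SIZE)]

strided_lines = cache_lines_touched(strided)
contiguous_lines = cache_lines_touched(contiguous)
print(strided_lines, contiguous_lines)
```

With these assumptions the strided layout touches 64 separate cache lines per step, one per thread, while the restructured layout touches 8, because eight adjacent doubles share a line.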

Initial Roofline Analysis

To start, we want to check the roofline of problem.exe, to make sure we are able to improve it.
These plots can be generated with:

srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe

The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200 directory, if generated on MI200 hardware.

They are also provided below for easy reference:

Roofline Type	Roofline Legend	Roofline Plot
FP32
FP16/INT8

We have plenty of room to improve this kernel. The next step is profiling.

Exercise Instructions:

To start, let's build and run the problem executable:

make
srun ./problem.exe

(simulated output)

yAx time: 70 ms

From our other experiments, this time seems reasonable. Let's look closer at the memory system usage with omniperf:

srun omniperf profile -n problem --no-roof -- ./problem.exe

(omitted output)

omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 16.1 17.1

Previous examples have used specific fields inside metrics, but we can also request a whole group of metrics with just two numbers (e.g. 16.1 instead of 16.1.1).

These requested metrics are:

16.1 L1 memory speed-of-light stats
17.1 L2 memory speed-of-light stats

The speed-of-light stats give a broader overview of how the memory systems are used throughout the execution of your kernel.
As such, they're great statistics for seeing if the memory system is generally being used efficiently or not.
Output from the analyze command should look like this:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
--------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 69768072.00 | 69768072.00 |  69768072.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |
--------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.1 Speed-of-Light
---------------------------------------------------
| Metric_ID   | Metric      |   Avg | Unit        |
---------------------------------------------------
| 16.1.0      | Hit rate    |  0.00 | Pct of peak |
---------------------------------------------------
| 16.1.1      | Bandwidth   |  8.26 | Pct of peak |
---------------------------------------------------
| 16.1.2      | Utilization | 83.57 | Pct of peak |
---------------------------------------------------
| 16.1.3      | Coalescing  | 25.00 | Pct of peak |
---------------------------------------------------


--------------------------------------------------------------------------------
17. L2 Cache
17.1 Speed-of-Light
-----------------------------------------------------------------
| Metric_ID   | Metric                        |    Avg | Unit   |
-----------------------------------------------------------------
| 17.1.0      | Utilization                   |  97.67 | Pct    |
-----------------------------------------------------------------
| 17.1.1      | Bandwidth                     |  28.41 | Pct    |
-----------------------------------------------------------------
| 17.1.2      | Hit Rate                      |  93.45 | Pct    |
-----------------------------------------------------------------
| 17.1.3      | L2-Fabric Read BW             | 126.27 | Gb/s   |
-----------------------------------------------------------------
| 17.1.4      | L2-Fabric Write and Atomic BW |   0.00 | Gb/s   |
-----------------------------------------------------------------

Looking at this data, we see:

L1 Cache Hit (16.1.0) is 0%, so the kernel's memory requests are never found in the L1.
L2 Cache Hit (17.1.2) is 93.45%, so most requests are found in the L2, with about 7% needing to go out to HBM.
We never find data in the L1 and generate a lot of requests to the L2, so restructuring our data accesses should provide better performance.

Since our implementation of yAx simply uses 1 for all values in y, A, and x, we do not have to change how we populate our data.
Since A is implemented as a flat array, we don't need to change our allocation either.

In real-world use-cases, these considerations add non-trivial development overhead, so data access patterns may be non-trivial to change.

To observe the performance effects of a different data access pattern, we simply need to change our indexing scheme.
Let's see how this performs by running solution:

cd solution
make
srun ./solution.exe

(simulated output)

yAx time: 13 ms

We see the runtime here is significantly better than our previous kernel, but we need to check how the caches behave now:

srun omniperf profile -n solution --no-roof -- ./solution.exe

(output omitted)

omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 16.1 17.1

The output from this analyze command should look like:

--------
Analyze
--------


--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
|    | KernelName                               |   Count |     Sum(ns) |    Mean(ns) |   Median(ns) |    Pct |
--------------------------------------------------------------------------------------------------------------
|  0 | yax(double*, double*, double*, int, int, |    1.00 | 12464570.00 | 12464570.00 |  12464570.00 | 100.00 |
|    |  double*)                                |         |             |             |              |        |
--------------------------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.1 Speed-of-Light
---------------------------------------------------
| Metric_ID   | Metric      |   Avg | Unit        |
---------------------------------------------------
| 16.1.0      | Hit rate    | 49.98 | Pct of peak |
---------------------------------------------------
| 16.1.1      | Bandwidth   | 10.88 | Pct of peak |
---------------------------------------------------
| 16.1.2      | Utilization | 98.15 | Pct of peak |
---------------------------------------------------
| 16.1.3      | Coalescing  | 25.00 | Pct of peak |
---------------------------------------------------


--------------------------------------------------------------------------------
17. L2 Cache
17.1 Speed-of-Light
-----------------------------------------------------------------
| Metric_ID   | Metric                        |    Avg | Unit   |
-----------------------------------------------------------------
| 17.1.0      | Utilization                   |  98.60 | Pct    |
-----------------------------------------------------------------
| 17.1.1      | Bandwidth                     |   9.40 | Pct    |
-----------------------------------------------------------------
| 17.1.2      | Hit Rate                      |   0.52 | Pct    |
-----------------------------------------------------------------
| 17.1.3      | L2-Fabric Read BW             | 650.84 | Gb/s   |
-----------------------------------------------------------------
| 17.1.4      | L2-Fabric Write and Atomic BW |   0.00 | Gb/s   |
-----------------------------------------------------------------

Looking at this data, we see:

L1 Cache Hit (16.1.0) is around 50%, so half the requests to the L1 need to go to the L2.
L2 Cache Hit (17.1.2) is 0.52%, so almost all the requests to the L2 have to go out to HBM.
L2-Fabric Read BW (17.1.3) has increased significantly (from 126.27 to 650.84 Gb/s), due to the increase in L2 cache misses requiring HBM reads.
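
As a rough back-of-the-envelope check on these numbers, the Python sketch below computes the speedup from the two Top Stat kernel times and compares the L2-Fabric Read BW values (17.1.3) from the problem and solution profiles. Multiplying each bandwidth by its kernel duration gives an approximate read volume in the unit the tool reports. This treats the reported bandwidth as an average over the whole kernel, which is an assumption.

```python
# Sketch only: arithmetic on the values reported by omniperf above.
problem_ns = 69768072.0    # Top Stat kernel time, problem.exe
solution_ns = 12464570.0   # Top Stat kernel time, solution.exe
problem_bw = 126.27        # 17.1.3 L2-Fabric Read BW, problem (unit as reported)
solution_bw = 650.84       # 17.1.3 L2-Fabric Read BW, solution (unit as reported)

speedup = problem_ns / solution_ns
bw_ratio = solution_bw / problem_bw

# Approximate volume read through the fabric: bandwidth * kernel duration.
problem_volume = problem_bw * problem_ns * 1e-9
solution_volume = solution_bw * solution_ns * 1e-9

print(f"speedup: {speedup:.2f}x, fabric read BW ratio: {bw_ratio:.2f}x")
print(f"approx. read volume: {problem_volume:.2f} vs {solution_volume:.2f}")
```

Under that assumption, the two runs read a broadly similar volume through the fabric, and the solution does so in roughly a fifth of the time, which is consistent with the higher bandwidth figure.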

Solution Roofline Analysis

We should check where our new kernel stands on the roofline.
These plots can be generated with:

srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe

The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.

They are also provided below for easy reference:

Roofline Type	Roofline Legend	Roofline Plot
FP32
FP16/INT8

Judging from the FP32 roofline, we appear to be very close to being bound by HBM bandwidth.
To get more performance we need to look closer at our algorithm.

Roofline Comparison

Roofline Type	Problem Roofline	Solution Roofline
FP32
FP16/INT8

We see that the HBM roofline point moves up, while the L1/L2 points move up and to the right from problem to solution. This means that our arithmetic intensity is increasing for the caches, so we are moving less data through the caches to do the same computation.

Summary and Take-aways

This exercise illustrates the sometimes insidious nature of strided data access patterns.
They can be difficult to spot in code, but profiling readily shows when adverse
access patterns occur, by showing poor cache hit rates, low cache bandwidth, and potentially low utilization.
Data access patterns can be non-trivial to change, so these sorts of optimizations can involve significant development and validation overhead.
