module load CrayEnv
module load buildtools/23.09
module load PrgEnv-cray/8.4.0
module load cce/16.0.1
module load craype-accel-amd-gfx90a
module load craype-x86-trento
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
module load rocm/5.4.3 omniperf/2.0.1-rocm-5.4.x
You can set up the following environment variables for the project you want to use:
export SALLOC_ACCOUNT=project_<your project ID>
export SBATCH_ACCOUNT=project_<your project ID>
The allocation can be set up with:
salloc -N 1 --gpus=8 -p standard-g --exclusive -t 20:00 --reservation <reservation name>
These exercises are meant to provide extra insight on the tuning of kernels. The exercise files are included in:
git clone https://github.com/amd/HPCTrainingExamples.git
cd HPCTrainingExamples/OmniperfExamples
Just navigate the different folders, each corresponding to a section of these exercises.
Simple kernel implementing a version of yAx, to demonstrate effects of Launch Parameters on kernel
execution time.
The roofline model is a way to gauge kernel performance in terms of maximum achievable bandwidth and floating-point operations.
It can be used to determine how efficiently a kernel makes use of the available hardware. It is a key tool in initially determining
which kernels are performing well, and which kernels should be able to perform better. Below are roofline plots for the yAx kernel in problem.cpp:
[Figure: FP32 and FP16/INT8 roofline plots, with legend, for the yAx kernel in problem.cpp]
These plots were generated by running:
srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe
The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200
directory, if generated on MI200 hardware.
We see that the kernel's performance is well below the achievable bandwidth of the hardware, which makes it a good candidate for
optimization.
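As a rough illustration of how a roofline is read, the sketch below computes the attainable performance at a given arithmetic intensity as the minimum of the compute roof and the bandwidth-limited rate. The function name and both peak values are placeholders chosen for illustration; they are not measured or published numbers for any particular device.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline bound: min(compute roof, arithmetic intensity * memory bandwidth)."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)

# Hypothetical device peaks (placeholders, not MI200 specifications).
PEAK_GFLOPS = 20000.0     # compute roof in GFLOP/s
PEAK_BANDWIDTH = 1500.0   # memory roof in GB/s

# The ridge point is the arithmetic intensity at which the two roofs meet.
ridge = PEAK_GFLOPS / PEAK_BANDWIDTH

for ai in (0.25, 1.0, 50.0):  # FLOPs per byte moved
    bound = attainable_gflops(ai, PEAK_GFLOPS, PEAK_BANDWIDTH)
    regime = "bandwidth-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:6.2f} FLOP/B -> at most {bound:8.1f} GFLOP/s ({regime})")
```

A kernel whose measured point sits far below this bound at its own arithmetic intensity is leaving performance on the table, which is the situation described for problem.cpp.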
From the roofline we were able to see that there is room for improvement in this kernel. One of the
first things to check is whether or not we have reasonable launch parameters for this kernel.
To get started, build and run the problem code:
make
srun ./problem.exe
(simulated output)
yAx time: 771.765559 milliseconds
The problem code should run very slowly, due to sub-optimal launch parameters.
Let's confirm this hypothesis by looking at the omniperf profile. Start by running:
srun omniperf profile -n problem --no-roof -- ./problem.exe
This command requires omniperf to run your code a few times to collect all the necessary hardware counters.
- The -n problem argument names the workload, meaning that the profile will appear in the ./workloads/problem/MI200/ directory, if you are profiling on an MI200 device.
- The --no-roof argument turns off the roofline, which will save some profiling time by avoiding the collection of achievable bandwidths and FLOPs on the device.
- The -- separator marks the start of the command that will be profiled, here ./problem.exe.

After the profiling data is collected, we can view the profile by using this command:
omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 7.1.0 7.1.1 7.1.2
This allows us to view nicely formatted profiling data directly in the command line.
The command given here has a few arguments that are noteworthy:
- The -p workloads/problem/MI200 argument must point to the output directory of your profile run. For the above omniperf profile command, this will be workloads/problem/MI200.
- The --dispatch 1 argument filters kernel statistics by dispatch ID. In this case kernel 0 was a "warm-up" kernel, and kernel 1 is what the code reports timings for.
- The --block argument displays only the requested blocks of metrics. In this case we want metrics specific to Launch Parameters:
  - 7.1.0 is the Grid Size
  - 7.1.1 is the Workgroup Size
  - 7.1.2 is the Total Wavefronts Launched

The output of the omniperf analyze command should look something like this:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
----------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
----------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 755935180.00 | 755935180.00 | 755935180.00 | 100.00 |
| | double*) | | | | | |
----------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
---------------------------------------------------------------------
| Index | Metric | Avg | Min | Max | Unit |
---------------------------------------------------------------------
| 7.1.0 | Grid Size | 256.00 | 256.00 | 256.00 | Work items |
---------------------------------------------------------------------
| 7.1.1 | Workgroup Size | 64.00 | 64.00 | 64.00 | Work items |
---------------------------------------------------------------------
| 7.1.2 | Total Wavefronts | 4.00 | 4.00 | 4.00 | Wavefronts |
---------------------------------------------------------------------
Looking through this data we see:
- Workgroup Size (7.1.1) is 64 threads, which matches the size of a wavefront.
- Total Wavefronts (7.1.2) shows that we are launching only 4 Wavefronts.

We can definitely get better performance by adjusting the launch parameters of our kernel.
Either try out some new values for the launch bounds, or run the provided solution to see its performance:
cd solution
make
srun ./solution.exe
(simulated output)
yAx time: 70 ms
We get much better performance with the new launch parameters. Note that in general it can be difficult to find the
most optimal launch parameters for a given kernel due to the many factors that impact performance, so determining
launch parameters experimentally is usually necessary.
We should also confirm that omniperf reports our updated launch parameters. To do so, run:
srun omniperf profile -n solution --no-roof -- ./solution.exe
This command is the same as before, except the workload name has changed to solution.
Once the profile command has completed, run:
omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 7.1.0 7.1.1 7.1.2
Again, this command largely uses the same arguments as before, except for the workload name.
The output should look something like this:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
--------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 70115217.00 | 70115217.00 | 70115217.00 | 100.00 |
| | double*) | | | | | |
--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
------------------------------------------------------------------------------
| Index | Metric | Avg | Min | Max | Unit |
------------------------------------------------------------------------------
| 7.1.0 | Grid Size | 131072.00 | 131072.00 | 131072.00 | Work items |
------------------------------------------------------------------------------
| 7.1.1 | Workgroup Size | 64.00 | 64.00 | 64.00 | Work items |
------------------------------------------------------------------------------
| 7.1.2 | Total Wavefronts | 2048.00 | 2048.00 | 2048.00 | Wavefronts |
------------------------------------------------------------------------------
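The three launch statistics in the table above are tied together by simple arithmetic. The sketch below assumes a one-dimensional launch and a wavefront size of 64 work items. The workgroup counts of 4 and 2048 are inferred from the reported wavefront totals rather than read from the source code, and the function name is illustrative.

```python
import math

WAVEFRONT_SIZE = 64  # work items per wavefront, as noted for this hardware

def launch_stats(num_workgroups, workgroup_size):
    """Return (grid size in work items, total wavefronts) for a 1D launch."""
    grid_size = num_workgroups * workgroup_size
    # A workgroup always occupies a whole number of wavefronts.
    waves_per_workgroup = math.ceil(workgroup_size / WAVEFRONT_SIZE)
    return grid_size, num_workgroups * waves_per_workgroup

print(launch_stats(4, 64))     # (256, 4), the problem launch
print(launch_stats(2048, 64))  # (131072, 2048), the solution launch
```

With only 4 wavefronts in flight, most of the device has nothing to do, which is consistent with the large runtime gap between the two launches.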
Looking through this data we see:
- Workgroup Size (7.1.1) corresponds to the first argument of the block launch parameter
- Total Wavefronts (7.1.2) corresponds to the first index of the grid launch parameter
- Grid Size (7.1.0) is Workgroup Size (7.1.1) times Total Wavefronts (7.1.2)

On releases newer than Omniperf 1.0.10, the comparison feature of omniperf can be used to quickly compare two profiles.
To use this feature, use the command:
omniperf analyze -p workloads/problem/MI200 -p solution/workloads/solution/MI200 --dispatch 1 --block 7.1.0 7.1.1 7.1.2
This feature sets the first -p argument as the baseline, and the second as the comparison workload.
In this case, problem is set as the baseline and is compared to solution.
The output should look like:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Count | Sum(ns) | Sum(ns) | Mean(ns) | Mean(ns) | Median(ns) | Median(ns) | Pct | Pct |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 1.0 (0.0%) | 754934306.50 | 69702016.5 (-90.77%) | 754934306.50 | 69702016.5 (-90.77%) | 754934306.50 | 69702016.5 (-90.77%) | 100.00 | 100.0 (0.0%) |
| | double*) | | | | | | | | | | |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
---------------------------------------------------------------------------------------------------------------------------------------
| Index | Metric | Avg | Avg | Min | Min | Max | Max | Unit |
---------------------------------------------------------------------------------------------------------------------------------------
| 7.1.0 | Grid Size | 256.00 | 131072.0 (51100.0%) | 256.00 | 131072.0 (51100.0%) | 256.00 | 131072.0 (51100.0%) | Work items |
---------------------------------------------------------------------------------------------------------------------------------------
| 7.1.1 | Workgroup Size | 64.00 | 64.0 (0.0%) | 64.00 | 64.0 (0.0%) | 64.00 | 64.0 (0.0%) | Work items |
---------------------------------------------------------------------------------------------------------------------------------------
| 7.1.2 | Total Wavefronts | 4.00 | 2048.0 (51100.0%) | 4.00 | 2048.0 (51100.0%) | 4.00 | 2048.0 (51100.0%) | Wavefronts |
---------------------------------------------------------------------------------------------------------------------------------------
Note that the comparison workload shows the percentage difference from the baseline.
This feature can be used to quickly compare filtered stats to make sure code changes fix known issues.
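The percentages in the comparison output can be reproduced by hand. The sketch below assumes they are plain relative differences against the baseline. That assumption matches the numbers shown above, but it is not taken from omniperf's source, and the function name is illustrative.

```python
def pct_diff(baseline, comparison):
    """Relative change of the comparison value with respect to the baseline, in percent."""
    return (comparison - baseline) / baseline * 100.0

# Grid Size: 256 work items in problem, 131072 in solution.
print(round(pct_diff(256.0, 131072.0), 2))           # 51100.0

# Kernel time: Sum(ns) for problem versus solution.
print(round(pct_diff(754934306.50, 69702016.5), 2))  # -90.77
```

A negative percentage on a time metric therefore means the comparison workload is faster than the baseline.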
For this exercise, it is appropriate to filter the omniperf analyze command with the --dispatch 1 argument.
This --dispatch 1 argument filters the data shown to only include the kernel invocation with dispatch ID 1, or the second kernel run during profiling.
However, there is another way to filter kernels that may be more applicable in real use-cases.
Typically real codes launch many kernels, and only a few of them take most of the overall kernel runtime.
To see a ranking of the top kernels that take up most of the kernel runtime in your code, you can run:
omniperf analyze -p workloads/problem/MI200 --list-stats
This command will output something like:
--------
Analyze
--------
--------------------------------------------------------------------------------
Detected Kernels
----------------------------------------------------------
| | KernelName |
----------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, double*) |
----------------------------------------------------------
With Omniperf versions newer than 1.0.10, --list-stats will list all kernels launched by your code, in order of runtime (largest runtime first).
The number displayed beside the kernel in the output can be used to filter omniperf analyze commands.
Note that this will display aggregated stats for kernels of the same name, meaning that the invocations could differ in terms of launch parameters, and vary widely in terms of work completed.
This filtering is accomplished with the -k argument:
omniperf analyze -p workloads/problem/MI200 -k 0 --block 7.1.0 7.1.1 7.1.2
This should show something like:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
-----------------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct | S |
-----------------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 2.00 | 1508098228.00 | 754049114.00 | 754049114.00 | 100.00 | * |
| | double*) | | | | | | |
-----------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
----------------------------------------------------------------------
| Index | Metric | Avg | Min | Max | Unit |
----------------------------------------------------------------------
| 7.1.0 | Grid Size | 256.00 | 256.00 | 256.00 | Work items |
----------------------------------------------------------------------
| 7.1.1 | Workgroup Size | 64.00 | 64.00 | 64.00 | Work items |
----------------------------------------------------------------------
| 7.1.2 | Total Wavefronts | 4.00 | 4.00 | 4.00 | Wavefronts |
----------------------------------------------------------------------
Note that the Count field in Top Stat is 2 here, where filtering by dispatch ID displays a count of 1, indicating that filtering with -k returns aggregated stats for two kernel invocations in this case.
Also note that the Top Stat table will still show all the top kernels, but the rightmost column titled "S" will have an asterisk beside the kernel for which data is being displayed.
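To make the difference between dispatch filtering and kernel-name filtering concrete, the sketch below aggregates per-dispatch durations by kernel name and ranks the result by total runtime. The records, the second kernel name, and the function name are all hypothetical. The code only mimics the shape of the Top Stat table and does not use omniperf itself.

```python
from collections import defaultdict

# Hypothetical (dispatch ID, kernel name, duration in ns) records.
dispatches = [
    (0, "yax", 752_000_000),   # warm-up invocation
    (1, "yax", 756_000_000),   # timed invocation
    (2, "init", 4_000_000),    # made-up second kernel
]

def top_stats(records):
    """Aggregate durations per kernel name, largest total runtime first."""
    durations = defaultdict(list)
    for _dispatch_id, name, duration_ns in records:
        durations[name].append(duration_ns)
    total = sum(sum(d) for d in durations.values())
    rows = [
        {"kernel": name, "count": len(d), "sum_ns": sum(d),
         "mean_ns": sum(d) / len(d), "pct": 100.0 * sum(d) / total}
        for name, d in durations.items()
    ]
    return sorted(rows, key=lambda row: row["sum_ns"], reverse=True)

for index, row in enumerate(top_stats(dispatches)):
    print(index, row)
```

Filtering by kernel index selects one of these aggregated rows, so its count covers every invocation of that kernel, while a dispatch filter selects a single record.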
We've demonstrated better performance than problem.cpp in solution.cpp, but could we potentially do better?
To answer that we again turn to the roofline model:
[Figure: FP32 and FP16/INT8 roofline plots, with legend, for the yAx kernel in solution.cpp]
These plots were generated with:
srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe
The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200
directory, if generated on MI200 hardware.
We see that the solution is solidly in the bandwidth-bound regime, but there still seems to be room for improvement. Further performance improvements will be a topic for later exercises.
[Figure: problem and solution roofline plots side by side, for FP32 and FP16/INT8]
We see that the solution has drastically increased performance over the problem code, as shown by the solution points moving up closer to the line plotted by the bandwidth limit.
Note: on statically generated roofline images, it is possible for the L1, L2, or HBM points to overlap and hide one another.
Launch parameters should be the first check in optimizing performance, due to the fact that they are
usually easy to change, but can have a large performance impact if they aren't tuned to your workload.
It is difficult to predict the optimal launch parameters for any given kernel, so some experimentation
may be required to achieve the best performance.
Simple kernel implementing a version of yAx, to demonstrate the downside of allocating a large
amount of LDS, and the benefit of using a smaller amount of LDS due to occupancy limits.
In this exercise we're using a problem code that is slightly different than where we left off in Exercise 1.
Regardless, to get started we need to get a roofline by running:
srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe
The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200
directory, if generated on MI200 hardware.
For convenience, the resulting plots on a representative system are below:
[Figure: FP32 and FP16/INT8 roofline plots, with legend, for the yAx kernel in problem.cpp]
We see that there looks to be room for improvement here. We'll use omniperf to see what the current limiters are.
First, we should get an idea of the code's runtime:
make
srun ./problem.exe
(simulated output)
yAx time: 140 ms
This problem.cpp uses LDS allocations to move the x vector closer to the compute resources, a common optimization.
However, we see that it ends up slower than the previous solution that didn't use LDS at all.
In kernels that request a lot of LDS, it is common to see that the LDS usage limits the occupancy of the kernel.
That is, more wavefronts cannot be resident on the device, because all of them need more LDS than is available.
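A back-of-the-envelope model shows how an LDS request caps the number of resident wavefronts. The sketch below assumes 64 KiB of LDS and 32 wavefront slots per compute unit, and the allocation sizes in the examples are hypothetical rather than taken from problem.cpp. It ignores every other limiter, such as registers, so treat the result as an upper bound only.

```python
LDS_PER_CU_BYTES = 64 * 1024  # assumed LDS capacity per compute unit
MAX_WAVES_PER_CU = 32         # assumed wavefront slots per compute unit
WAVEFRONT_SIZE = 64           # work items per wavefront

def waves_per_cu_limited_by_lds(lds_bytes_per_workgroup, workgroup_size):
    """Upper bound on resident wavefronts per CU when only LDS is considered."""
    waves_per_workgroup = -(-workgroup_size // WAVEFRONT_SIZE)  # ceiling division
    if lds_bytes_per_workgroup == 0:
        return MAX_WAVES_PER_CU
    workgroups_per_cu = LDS_PER_CU_BYTES // lds_bytes_per_workgroup
    return min(MAX_WAVES_PER_CU, workgroups_per_cu * waves_per_workgroup)

# Hypothetical allocations for a 64-thread workgroup.
for lds_bytes in (64 * 1024, 16 * 1024, 1024, 0):
    waves = waves_per_cu_limited_by_lds(lds_bytes, 64)
    print(f"{lds_bytes:6d} bytes of LDS per workgroup -> at most {waves} waves per CU")
```

Under these assumptions, a workgroup that requests the whole LDS leaves room for a single wavefront per compute unit, while a small request no longer constrains occupancy at all.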
To confirm this hypothesis, let's start by running:
srun omniperf profile -n problem --no-roof -- ./problem.exe
The usage of omniperf profile arguments is described in the Omniperf documentation, or can be shown by running omniperf profile --help.
This omniperf profile command will take a minute or two to run, as omniperf must run your code a few times to collect all the hardware counters.
Note: For large scientific codes, it can be useful to profile a small representative workload if possible, as profiling a full run may take prohibitively long.
Once the profiling run completes, let's take a look at the occupancy stats related to LDS allocations:
omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 2.1.15 6.2.7
The metrics we're looking at are:
- 2.1.15 Wavefront occupancy – a measure of how many wavefronts, on average, are active on the device
- 6.2.7 SPI: Insufficient CU LDS – indicates whether wavefronts are not able to be scheduled due to insufficient LDS

The SPI section (6.2) generally shows what resources limit occupancy, while Wavefront occupancy (2.1.15) shows how severely occupancy is limited in general.
The SPI 'insufficient' fields are typically either zero or very large numbers (on the order of 1 million), with large numbers indicating some resource preventing wavefronts from scheduling.
If more than one field is nonzero, the relative magnitudes of the nonzero fields correspond to how severely each resource is limiting occupancy, but if only one field is nonzero it is difficult to say how severely that field is limiting occupancy.
The output of the omniperf analyze
command should look similar to this:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
-----------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
-----------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 175427205.00 | 175427205.00 | 175427205.00 | 100.00 |
| | double*) | | | | | |
-----------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
---------------------------------------------------------------------
| Index | Metric | Value | Unit | Peak | PoP |
---------------------------------------------------------------------
| 2.1.15 | Wave Occupancy | 102.70 | Wavefronts | 3328.00 | 3.09 |
---------------------------------------------------------------------
--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
______________________________________________________________________
| Metric_ID | Metric | Avg | Min | Max | Unit |
______________________________________________________________________
| 6.2.7 | Insufficient CU LDS | 69.53 | 69.53 | 69.53 | Pct |
______________________________________________________________________
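The PoP (percent of peak) column in the Speed-of-Light table can be checked by hand. The sketch below assumes PoP is simply the measured value divided by the listed peak. That matches the row shown above, but it is an inference from the output rather than something taken from omniperf's source, and the function name is illustrative.

```python
def percent_of_peak(value, peak):
    """Measured value as a percentage of the listed peak, rounded like the table."""
    return round(100.0 * value / peak, 2)

# Wave Occupancy row: 102.70 wavefronts against a peak of 3328.
print(percent_of_peak(102.70, 3328.00))  # 3.09
```

Reading the metric this way, only about 3 in every 100 available wavefront slots are occupied on average while the kernel runs.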
Looking through this data we see:
- Wave Occupancy (2.1.15) is around 3% of peak, which is very low
- Insufficient CU LDS (6.2.7) is far from zero (about 70%), which indicates our occupancy is currently limited by LDS allocations.

There are two solution directories, which correspond to two ways that this occupancy limit can be addressed.
First, we have solution-no-lds, which completely removes the LDS usage. Let's build and run this solution:
cd solution-no-lds
make
srun ./solution.exe
(simulated output)
yAx time: 70 ms
We see that the runtime is much better for this solution than for the problem. Let's see if removing LDS did indeed increase occupancy:
srun omniperf profile -n solution --no-roof -- ./solution.exe
(output omitted)
Once the profile command completes, run:
omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 2.1.15 6.2.7
The output should look something like:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
--------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 70512671.00 | 70512671.00 | 70512671.00 | 100.00 |
| | double*) | | | | | |
--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
---------------------------------------------------------------------
| Index | Metric | Value | Unit | Peak | PoP |
---------------------------------------------------------------------
| 2.1.15 | Wave Occupancy | 445.33 | Wavefronts | 3328.00 | 13.38 |
---------------------------------------------------------------------
--------------------------------------------------------------------------------
6. Shader Processor Input (SPI)
6.2 SPI Resource Allocation
------------------------------------------------------------------
| Index | Metric | Avg | Min | Max | Unit |
------------------------------------------------------------------
| 6.2.7 | Insufficient CU LDS | 0.00 | 0.00 | 0.00 | Pct. |
------------------------------------------------------------------
Looking through this data we see:
- Wave Occupancy (2.1.15) is about 10 percentage points higher than in problem.cpp
- Insufficient CU LDS (6.2.7) is now zero, indicating solution-no-lds is not occupancy limited by LDS allocations.

Can we get some runtime advantage from using smaller LDS allocations?
This is the solution implemented in the solution directory:
cd ../solution
make
srun ./solution.exe
(simulated output)
yAx time: 50 ms
This solution, rather than removing the LDS allocation, simply reduces the amount of LDS requested to address the occupancy limit.
This gives us the benefit of having some data pulled closer than it was in solution-no-lds, which is validated through the speedup we see.
But is this solution still occupancy limited by LDS?
srun omniperf profile -n solution --no-roof -- ./solution.exe
(output omitted)
Once the profile command completes, run:
omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 2.1.15 6.2.7
The output should look something like:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
--------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 50366185.00 | 50366185.00 | 50366185.00 | 100.00 |
| | double*) | | | | | |
--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
---------------------------------------------------------------------
| Index | Metric | Value | Unit | Peak | PoP |
---------------------------------------------------------------------
| 2.1.15 | Wave Occupancy | 487.32 | Wavefronts | 3328.00 | 14.64 |
---------------------------------------------------------------------
--------------------------------------------------------------------------------
6. Shader Processor Input (SPI)
6.2 SPI Resource Allocation
------------------------------------------------------------------
| Index | Metric | Avg | Min | Max | Unit |
------------------------------------------------------------------
| 6.2.7 | Insufficient CU LDS | 0.00 | 0.00 | 0.00 | Pct. |
------------------------------------------------------------------
Looking at this data we see:
- Wave Occupancy (2.1.15) is even higher than before
- Insufficient CU LDS (6.2.7) shows we are not occupancy limited by LDS allocations.

Pulling some data from global device memory to LDS can be an effective optimization strategy, if occupancy limits are carefully avoided.
Let's take a look at the roofline for solution, which can be generated with:
srun omniperf profile -n solution_roof_only --roof-only -- ./solution.exe
The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200
directory, if generated on MI200 hardware.
The plots are shown here:
[Figure: FP32 and FP16/INT8 roofline plots, with legend, for the yAx kernel in solution.cpp]
We see that there is still room to move the solution roofline up towards the bandwidth limit.
[Figure: problem and solution roofline plots side by side, for FP32 and FP16/INT8]
Again, we see that the solution's optimizations have resulted in the kernel moving up in the roofline, meaning the solution executes more efficiently than the problem.
Using LDS can be very helpful in reducing global memory reads where you have repeated use of the same data.
However, large LDS allocations can also negatively impact performance by limiting the number of
wavefronts that can be resident in the device at any given time. Be wary of LDS usage, and check
the SPI stats to ensure your LDS usage is not negatively impacting occupancy.
More complex yAx implementation to demonstrate a register limited kernel using an innocuous looking
function call.
This kernel is slightly different from the one we used in previous exercises. Let's see how well it measures up in the roofline:
[Figure: FP32 and FP16/INT8 roofline plots, with legend, for the yAx kernel in problem.cpp]
You can generate these plots by running:
srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe
The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200
directory, if generated on MI200 hardware.
We see that the kernel is still a considerable amount below the maximum achievable bandwidth, so there should still be room for improvement.
Let's get an idea of the runtime of this code:
make
srun ./problem.exe
(simulated output)
yAx time 79 ms
We see that this kernel seems to be on par with some of our other exercises, but let's see what omniperf shows us:
srun omniperf profile -n problem --no-roof -- ./problem.exe
(lots of output from this command)
omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 2.1.15 6.2.5 7.1.5 7.1.6 7.1.7
- 2.1.15 shows Wavefront occupancy
- 6.2.5 shows Insufficient SIMD VGPRs – indicating if this kernel is occupancy limited by VGPR usage
- 7.1.5-7 shows the register usage: VGPRs, AGPRs, and SGPRs

The output of this command should look something like:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
--------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 76983902.00 | 76983902.00 | 76983902.00 | 100.00 |
| | double*) | | | | | |
--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
--------------------------------------------------------------------
| Index | Metric | Value | Unit | Peak | PoP |
--------------------------------------------------------------------
| 2.1.15 | Wave Occupancy | 438.00 | Wavefronts | 3328.00 | 13.16 |
--------------------------------------------------------------------
--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
__________________________________________________________________________
| Metric_ID | Metric | Avg | Min | Max | Unit |
__________________________________________________________________________
| 6.2.5 | Insufficient SIMD VGPRs | 0.04 | 0.04 | 0.04 | Pct |
__________________________________________________________________________
--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
------------------------------------------------------------
| Index | Metric | Avg | Min | Max | Unit |
------------------------------------------------------------
| 7.1.5 | VGPRs | 92.00 | 92.00 | 92.00 | Registers |
------------------------------------------------------------
| 7.1.6 | AGPRs | 132.00 | 132.00 | 132.00 | Registers |
------------------------------------------------------------
| 7.1.7 | SGPRs | 64.00 | 64.00 | 64.00 | Registers |
------------------------------------------------------------
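The register counts in the table above bound how many wavefronts can share a SIMD. The sketch below assumes a unified budget of 512 vector registers (VGPRs plus AGPRs) per lane and 8 wavefront slots per SIMD. It ignores allocation granularity and every other limiter, so it gives a rough upper bound rather than a prediction of measured occupancy. The function name is illustrative.

```python
REGISTERS_PER_LANE = 512  # assumed unified VGPR + AGPR budget per SIMD lane
MAX_WAVES_PER_SIMD = 8    # assumed wavefront slots per SIMD

def waves_per_simd_limited_by_registers(vgprs, agprs):
    """Upper bound on resident wavefronts per SIMD when only vector registers are considered."""
    used = vgprs + agprs
    if used == 0:
        return MAX_WAVES_PER_SIMD
    return min(MAX_WAVES_PER_SIMD, REGISTERS_PER_LANE // used)

# Register usage reported above: 92 VGPRs and 132 AGPRs.
print(waves_per_simd_limited_by_registers(92, 132))  # 2

# A kernel using 32 VGPRs and no AGPRs would hit the slot limit instead.
print(waves_per_simd_limited_by_registers(32, 0))    # 8
```

Under these assumptions, the 224 vector registers per work item leave room for only two wavefronts per SIMD, which is why the vector register metrics are worth checking.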
Looking at this data, we see:
- Insufficient SIMD VGPRs (6.2.5) shows a nonzero percentage of scheduling attempts that stalled due to lack of VGPRs, which indicates our kernel is occupancy limited by VGPR usage.
- VGPRs (7.1.5) shows we are using a lot of VGPRs, and AGPRs (7.1.6) shows we are using a lot of AGPRs, which can indicate low-cost register spills in the absence of MFMA instructions.

This is due to a call to assert that checks if our result is zeroed out on device.
To check this hypothesis, let's look at the solution code:
cd solution
make
srun ./solution.exe
(simulated output)
yAx time: 69 ms
Our runtime gets better from removing the assert, but we should also check that omniperf reports that our limiters are gone:
srun omniperf profile -n solution --no-roof -- ./solution.exe
(omitted output)
omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 2.1.15 6.2.5 7.1.5 7.1.6 7.1.7
The output of this command should look something like:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
--------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 69815871.00 | 69815871.00 | 69815871.00 | 100.00 |
| | double*) | | | | | |
--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
--------------------------------------------------------------------
| Index | Metric | Value | Unit | Peak | PoP |
--------------------------------------------------------------------
| 2.1.15 | Wave Occupancy | 444.10 | Wavefronts | 3328.00 | 13.34 |
--------------------------------------------------------------------
--------------------------------------------------------------------------------
6. Shader Processor Input (SPI)
6.2 SPI Resource Allocation
--------------------------------------------------------------------
| Index | Metric | Avg | Min | Max | Unit |
--------------------------------------------------------------------
| 6.2.5 | Insufficient SIMD VGPRs | 0.00 | 0.00 | 0.00 | Pct |
--------------------------------------------------------------------
--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
----------------------------------------------------------
| Index | Metric | Avg | Min | Max | Unit |
----------------------------------------------------------
| 7.1.5 | VGPRs | 32.00 | 32.00 | 32.00 | Registers |
----------------------------------------------------------
| 7.1.6 | AGPRs | 0.00 | 0.00 | 0.00 | Registers |
----------------------------------------------------------
| 7.1.7 | SGPRs | 112.00 | 112.00 | 112.00 | Registers |
----------------------------------------------------------
Looking at this data, we see:
- Insufficient SIMD VGPRs (6.2.5) shows we are no longer occupancy limited by VGPR usage.
- VGPRs (7.1.5) and AGPRs (7.1.6) show considerably fewer vector registers, and no AGPRs being used.
- SGPRs (7.1.7) shows a nearly 2x increase over the previous implementation (64 to 112), which shows our memory access is more uniform here.
- Wave Occupancy (2.1.15) shows our occupancy increased only slightly from the previous implementation, despite the large decrease in Insufficient SIMD VGPRs (6.2.5).
More generally, you can use this command to look at all SPI "insufficient resource" stats in the same screen, to determine if any resource is currently limiting occupancy:
omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 6.2
Which will show output similar to this (note, fields 6.2.4 to 6.2.8 show resources which currently limit occupancy):
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
---------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
---------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 76983902.00 | 76983902.00 | 76983902.00 | 100.00 |
| | double*) | | | | | |
---------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
_________________________________________________________________________________________
_ Metric_ID _ Metric _ Avg _ Min _ Max _ Unit _
_________________________________________________________________________________________
_ 6.2.0 _ Not-scheduled Rate (Workgroup Manager) _ 0.02 _ 0.02 _ 0.02 _ Pct _
_________________________________________________________________________________________
_ 6.2.1 _ Not-scheduled Rate (Scheduler-Pipe) _ 0.03 _ 0.03 _ 0.03 _ Pct _
_________________________________________________________________________________________
_ 6.2.2 _ Scheduler-Pipe Stall Rate _ 0.01 _ 0.01 _ 0.01 _ Pct _
_________________________________________________________________________________________
_ 6.2.3 _ Scratch Stall Rate _ 0.00 _ 0.00 _ 0.00 _ Pct _
_________________________________________________________________________________________
_ 6.2.4 _ Insufficient SIMD Waveslots _ 0.00 _ 0.00 _ 0.00 _ Pct _
_________________________________________________________________________________________
_ 6.2.5 _ Insufficient SIMD VGPRs _ 0.04 _ 0.04 _ 0.04 _ Pct _
_________________________________________________________________________________________
_ 6.2.6 _ Insufficient SIMD SGPRs _ 0.00 _ 0.00 _ 0.00 _ Pct _
_________________________________________________________________________________________
_ 6.2.7 _ Insufficient CU LDS _ 0.00 _ 0.00 _ 0.00 _ Pct _
_________________________________________________________________________________________
_ 6.2.8 _ Insufficient CU Barriers _ 0.00 _ 0.00 _ 0.00 _ Pct _
_________________________________________________________________________________________
_ 6.2.9 _ Reached CU Workgroup Limit _ 0.00 _ 0.00 _ 0.00 _ Pct _
_________________________________________________________________________________________
_ 6.2.10 _ Reached CU Wavefront Limit _ 0.00 _ 0.00 _ 0.00 _ Pct _
_________________________________________________________________________________________
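When many kernels or dispatches are involved, it can help to scan the "insufficient resource" rows programmatically. The following is a small sketch under stated assumptions: the struct and function names are hypothetical, the values are copied by hand from rows 6.2.4 to 6.2.8 of the table above, and nothing here parses omniperf output.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct SpiMetric {
    std::string name;  // metric label as printed in the table
    double avg_pct;    // "Avg" column, in percent
};

// Return the names of metrics whose average percentage exceeds `threshold`.
std::vector<std::string> active_limiters(const std::vector<SpiMetric>& metrics,
                                         double threshold = 0.0) {
    assert(threshold >= 0.0);
    std::vector<std::string> out;
    for (const auto& m : metrics) {
        if (m.avg_pct > threshold) out.push_back(m.name);
    }
    return out;
}

// Rows 6.2.4 to 6.2.8 from the problem profile shown above (copied by hand).
const std::vector<SpiMetric> kProblemSpi = {
    {"Insufficient SIMD Waveslots", 0.00},
    {"Insufficient SIMD VGPRs", 0.04},
    {"Insufficient SIMD SGPRs", 0.00},
    {"Insufficient CU LDS", 0.00},
    {"Insufficient CU Barriers", 0.00},
};
```

Calling active_limiters(kProblemSpi) singles out the VGPR row, which matches the reading of the table: vector registers are the only resource reported as limiting occupancy for this dispatch.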
Let's see how the solution stacks up in the roofline:
[Figures: roofline legend and roofline plot for the solution kernel, FP32 and FP16/INT8]
You can generate these plots with:
srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe
The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.
We see there is still room for improvement in the solution, as this kernel is not getting the maximum achievable bandwidth.
[Figures: problem roofline and solution roofline side by side, FP32 and FP16/INT8]
The most notable change between these rooflines is that the L1/L2 arithmetic intensity spread is more pronounced in the problem, which shows that the call to assert was causing more data to be moved to the L1, while not adding floating-point operations.
Note: Arithmetic Intensity is computed as (total floating point operations)/(total data movement)
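As a worked illustration of that ratio, the sketch below estimates an idealized arithmetic intensity for a yAx computation on an n-by-m matrix of doubles. The FLOP and byte counts are simplifying assumptions (every element of A, x and y read exactly once, no cache effects), and the function names are hypothetical; omniperf derives its roofline points from hardware counters, not from a formula like this.

```cpp
#include <cassert>
#include <cstddef>

// Arithmetic intensity in FLOP per byte: floating-point work divided by data moved.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// Idealized FLOP count for y^T * A * x: each of the n rows needs m multiplies and
// m adds for the A*x dot product, plus one multiply and one add to apply y.
double yax_flops(std::size_t n, std::size_t m) {
    return static_cast<double>(n) * (2.0 * static_cast<double>(m) + 2.0);
}

// Idealized minimum traffic: every double of A, x and y is read exactly once.
double yax_min_bytes(std::size_t n, std::size_t m) {
    return sizeof(double) * (static_cast<double>(n) * m + n + m);
}
```

For large matrices this model tends toward 0.25 FLOP per byte. It also shows why extra data movement lowers a roofline point: if the bytes moved at some cache level double while the FLOP count stays the same, the arithmetic intensity at that level halves.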
Function calls inside kernels can have surprisingly adverse performance side-effects. Calling assert, printf and even excessive use of math functions (e.g. pow, sin, cos) can limit performance in difficult-to-predict ways. If you see unexpected resource usage, try eliminating these sorts of function calls.
This exercise uses a simple implementation of a yAx kernel to show how difficult strided data access patterns can be to spot in code,
and demonstrates how to use omniperf to begin to diagnose them.
Background: What is a "Strided Data Access Pattern"?
Strided data patterns happen when each thread in a wavefront has to access data locations which have a lot of space between them.
For instance, in the algorithm we've been using, each thread works on a row, and those rows are contiguous in device memory.
This scenario is depicted below:
[Figure: illustration of the strided access pattern]
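To put a number on the cost of striding, the sketch below counts how many distinct cache lines one load instruction touches when consecutive threads are a given number of elements apart. The 64-thread wavefront and 64-byte cache line used in the note that follows are assumptions for illustration, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count the distinct cache lines touched when each of `threads` lanes loads one
// double and consecutive lanes are `stride_elems` elements apart in memory.
std::size_t cache_lines_touched(std::size_t threads, std::size_t stride_elems,
                                std::size_t line_bytes) {
    assert(line_bytes > 0);
    std::set<std::size_t> lines;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t byte_addr = t * stride_elems * sizeof(double);
        lines.insert(byte_addr / line_bytes);  // index of the line holding this address
    }
    return lines.size();
}
```

With 64 threads and 64-byte lines, a stride of 1 element touches 8 lines, while a stride of 1024 elements (one full row of a 1024-column matrix) touches 64 lines, one per thread. Each of those lines carries only 8 useful bytes, which is the waste that strided access introduces.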
To start, we want to check the roofline of problem.exe, to make sure we are able to improve it.
These plots can be generated with:
srun omniperf profile -n problem_roof_only --roof-only --kernel-names -- ./problem.exe
The plots will appear as PDF files in the ./workloads/problem_roof_only/MI200 directory, if generated on MI200 hardware.
They are also provided below for easy reference:
[Figures: roofline legend and roofline plot for the problem kernel, FP32 and FP16/INT8]
There is plenty of room to improve this kernel, so the next step is profiling.
To start, let's build and run the problem executable:
make
srun ./problem.exe
(simulated output)
yAx time: 70 ms
From our other experiments, this time seems reasonable. Let's look closer at the memory system usage with omniperf:
srun omniperf profile -n problem --no-roof -- ./problem.exe
(omitted output)
omniperf analyze -p workloads/problem/MI200 --dispatch 1 --block 16.1 17.1
Previous examples have used specific fields inside metrics, but we can also request a group of metrics with just two numbers (e.g. 16.1 vs. 16.1.1).
These requested metrics are:
- 16.1: L1 memory speed-of-light stats
- 17.1: L2 memory speed-of-light stats
The speed-of-light stats are a broader overview of how the memory systems are used throughout execution of your kernel.
As such, they're great statistics for seeing if the memory system is generally being used efficiently or not.
Output from the analyze command should look like this:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
--------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 69768072.00 | 69768072.00 | 69768072.00 | 100.00 |
| | double*) | | | | | |
--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.1 Speed-of-Light
___________________________________________________
_ Metric_ID _ Metric _ Avg _ Unit _
___________________________________________________
_ 16.1.0 _ Hit rate _ 0.00 _ Pct of peak _
___________________________________________________
_ 16.1.1 _ Bandwidth _ 8.26 _ Pct of peak _
___________________________________________________
_ 16.1.2 _ Utilization _ 83.57 _ Pct of peak _
___________________________________________________
_ 16.1.3 _ Coalescing _ 25.00 _ Pct of peak _
___________________________________________________
--------------------------------------------------------------------------------
17. L2 Cache
17.1 Speed-of-Light
_________________________________________________________________
_ Metric_ID _ Metric _ Avg _ Unit _
_________________________________________________________________
_ 17.1.0 _ Utilization _ 97.67 _ Pct _
_________________________________________________________________
_ 17.1.1 _ Bandwidth _ 28.41 _ Pct _
_________________________________________________________________
_ 17.1.2 _ Hit Rate _ 93.45 _ Pct _
_________________________________________________________________
_ 17.1.3 _ L2-Fabric Read BW _ 126.27 _ Gb/s _
_________________________________________________________________
_ 17.1.4 _ L2-Fabric Write and Atomic BW _ 0.00 _ Gb/s _
_________________________________________________________________
Looking at this data, we see:
- L1 Hit rate (16.1.0) is 0%, so the kernel's memory requests are never found in the L1.
- L2 Hit Rate (17.1.2) is 93.45%, so most requests are found in the L2, with about 7% needing to go out to HBM.
Since our implementation of yAx simply uses 1 for all values in y, A, and x, we do not have to change how we populate our data.
Since A is implemented as a flat array, we don't need to change our allocation either.
In real-world use-cases, these considerations add non-trivial development overhead, so data access patterns may be non-trivial to change.
To observe the performance effects of a different data access pattern, we simply need to change our indexing scheme.
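The change in indexing scheme can be sketched on the host as follows. This is not the code from problem.cpp or solution.cpp: the function names are hypothetical, the loop over rows stands in for the GPU threads, and the assumption is that the problem indexes A with rows contiguous while the solution indexes it with columns contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-contiguous layout: neighbouring rows (threads) are m elements apart.
inline std::size_t idx_row_contiguous(std::size_t row, std::size_t col,
                                      std::size_t /*n*/, std::size_t m) {
    return row * m + col;
}

// Column-contiguous layout: neighbouring rows (threads) are adjacent in memory.
inline std::size_t idx_col_contiguous(std::size_t row, std::size_t col,
                                      std::size_t n, std::size_t /*m*/) {
    return col * n + row;
}

// Serial reference for y^T * A * x; `row` plays the role of the GPU thread index.
template <typename Index>
double yax_host(const std::vector<double>& y, const std::vector<double>& A,
                const std::vector<double>& x, std::size_t n, std::size_t m,
                Index index) {
    assert(y.size() == n && x.size() == m && A.size() == n * m);
    double result = 0.0;
    for (std::size_t row = 0; row < n; ++row) {
        double temp = 0.0;
        for (std::size_t col = 0; col < m; ++col) {
            temp += A[index(row, col, n, m)] * x[col];
        }
        result += temp * y[row];
    }
    return result;
}
```

Because every value is 1, both layouts give the same result (n times m) without re-populating A. What changes is the distance between the elements that neighbouring threads read at the same loop step: m elements with the first scheme, 1 element with the second.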
Let's see how this performs by running solution:
cd solution
make
srun ./solution.exe
(simulated output)
yAx time: 13 ms
We see the runtime here is significantly better than our previous kernel, but we need to check how the caches behave now:
srun omniperf profile -n solution --no-roof -- ./solution.exe
(output omitted)
omniperf analyze -p workloads/solution/MI200 --dispatch 1 --block 16.1 17.1
The output from this analyze command should look like:
--------
Analyze
--------
--------------------------------------------------------------------------------
0. Top Stat
--------------------------------------------------------------------------------------------------------------
| | KernelName | Count | Sum(ns) | Mean(ns) | Median(ns) | Pct |
--------------------------------------------------------------------------------------------------------------
| 0 | yax(double*, double*, double*, int, int, | 1.00 | 12464570.00 | 12464570.00 | 12464570.00 | 100.00 |
| | double*) | | | | | |
--------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.1 Speed-of-Light
___________________________________________________
_ Metric_ID _ Metric _ Avg _ Unit _
___________________________________________________
_ 16.1.0 _ Hit rate _ 49.98 _ Pct of peak _
___________________________________________________
_ 16.1.1 _ Bandwidth _ 10.88 _ Pct of peak _
___________________________________________________
_ 16.1.2 _ Utilization _ 98.15 _ Pct of peak _
___________________________________________________
_ 16.1.3 _ Coalescing _ 25.00 _ Pct of peak _
___________________________________________________
--------------------------------------------------------------------------------
17. L2 Cache
17.1 Speed-of-Light
_________________________________________________________________
_ Metric_ID _ Metric _ Avg _ Unit _
_________________________________________________________________
_ 17.1.0 _ Utilization _ 98.60 _ Pct _
_________________________________________________________________
_ 17.1.1 _ Bandwidth _ 9.40 _ Pct _
_________________________________________________________________
_ 17.1.2 _ Hit Rate _ 0.52 _ Pct _
_________________________________________________________________
_ 17.1.3 _ L2-Fabric Read BW _ 650.84 _ Gb/s _
_________________________________________________________________
_ 17.1.4 _ L2-Fabric Write and Atomic BW _ 0.00 _ Gb/s _
_________________________________________________________________
Looking at this data, we see:
- L1 Hit rate (16.1.0) is around 50%, so half the requests to the L1 need to go to the L2.
- L2 Hit Rate (17.1.2) is 0.52%, so almost all the requests to the L2 have to go out to HBM.
- L2-Fabric Read BW (17.1.3) has increased significantly, due to the increase in L2 cache misses requiring HBM reads.
We should check where our new kernel stands on the roofline.
These plots can be generated with:
srun omniperf profile -n solution_roof_only --roof-only --kernel-names -- ./solution.exe
The plots will appear as PDF files in the ./workloads/solution_roof_only/MI200 directory, if generated on MI200 hardware.
They are also provided below for easy reference:
[Figures: roofline legend and roofline plot for the solution kernel, FP32 and FP16/INT8]
From the FP32 roofline, we appear to be very close to being bound by HBM bandwidth.
To get more performance we need to look closer at our algorithm.
[Figures: problem roofline and solution roofline side by side, FP32 and FP16/INT8]
We see that the HBM roofline point moves up, while the L1/L2 points move up and to the right from problem to solution. This means that our arithmetic intensity is increasing for the caches, so we are moving less data through the caches to do the same computation.
This exercise illustrates how insidious strided data access patterns can be.
They can be difficult to spot in code, but profiling more readily reveals adverse
access patterns, by showing poor cache hit rates, low cache bandwidth, and potentially low utilization.
Data access patterns can be non-trivial to change, so these sorts of optimizations can involve significant development and validation overhead.