module load CrayEnv
module load buildtools/23.09
module load PrgEnv-cray/8.4.0
module load cce/16.0.1
module load craype-accel-amd-gfx90a
module load craype-x86-trento
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
module load rocm/5.4.3 omnitrace/1.11.2-rocm-5.4.x omniperf/2.0.1-rocm-5.4.x
You can set up the following environment variables for the project you want to use:
export SALLOC_ACCOUNT=project_<your project ID>
export SBATCH_ACCOUNT=project_<your project ID>
Set up the allocation:
salloc -N 1 --gpus=8 -p standard-g --exclusive -t 20:00 --reservation <reservation name>
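To confirm the allocation is active you can use a standard Slurm query (not specific to this course), for example:
squeue -u $USER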
Download the examples repo and navigate to the HIPIFY exercises:
git clone https://github.com/amd/HPCTrainingExamples.git
cd HPCTrainingExamples/HIPIFY/mini-nbody/hip/
Compile and run one case. We are on the front-end node, so we have two ways to compile for the
GPU that we want to run on.
hipcc -I../ -DSHMOO --offload-arch=gfx90a nbody-orig.hip -o nbody-orig
or, compiling through the scheduler on a GPU node:
export ROCM_GPU=gfx90a
srun hipcc -I../ -DSHMOO nbody-orig.hip -o nbody-orig
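Optionally, run the binary once without the profiler to check that it works; 65536 is the same problem size used with rocprof below:
srun ./nbody-orig 65536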
Now run rocprof on nbody-orig to obtain a list of hotspots:
srun rocprof --stats nbody-orig 65536
Check the results:
cat results.csv
Check the statistics result file; there is one line per kernel, sorted in descending order of duration:
cat results.stats.csv
Using --basenames on will show only kernel names without their parameters:
srun rocprof --stats --basenames on nbody-orig 65536
Check the statistics result file; there is one line per kernel, sorted in descending order of duration:
cat results.stats.csv
Trace HIP calls with --hip-trace
srun rocprof --stats --hip-trace nbody-orig 65536
Check the new file results.hip_stats.csv
cat results.hip_stats.csv
Also profile the HSA API with --hsa-trace:
srun rocprof --stats --hip-trace --hsa-trace nbody-orig 65536
Check the new file results.hsa_stats.csv
cat results.hsa_stats.csv
On your laptop, download results.json
scp -i <HOME_DIRECTORY>/.ssh/<public ssh key file> <username>@lumi.csc.fi:<path_to_file>/results.json results.json
You could open a browser and go to https://ui.perfetto.dev/ to use the latest version of the tool, but we recommend an older version that is known to work well with traces generated by rocprof. For that, make sure you start your session connecting to LUMI as:
ssh -i <HOME_DIRECTORY>/.ssh/<public ssh key file> <username>@lumi.csc.fi -L10000:uan02:10000
and then connect to http://localhost:10000.
Alternatively, if you have Docker installed, you can run Perfetto on your laptop with:
docker run -it --rm -p 10000:10000 --name myperfetto sfantao/perfetto4rocm
Click on Open trace file in the top-left corner and navigate to the results.json you just downloaded.
Use the W, A, S, D keys to navigate the GUI:
w/s  Zoom in/out
a/d  Pan left/right
Your trace should look like:
Read about the hardware counters available for the GPU on this system (look for the gfx90a section):
less $ROCM_PATH/lib/rocprofiler/gfx_metrics.xml
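For example, to jump straight to that section you can first locate it with grep:
grep -n gfx90a $ROCM_PATH/lib/rocprofiler/gfx_metrics.xml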
Create a rocprof_counters.txt file with the counters you would like to collect:
vi rocprof_counters.txt
Content of rocprof_counters.txt:
pmc : Wavefronts VALUInsts
pmc : SALUInsts SFetchInsts GDSInsts
pmc : MemUnitBusy ALUStalledByLDS
Execute with the counters we just added:
srun rocprof --timestamp on -i rocprof_counters.txt nbody-orig 65536
You'll notice that rocprof runs 3 passes, one for each set of counters we have in that file.
Contents of rocprof_counters.csv
cat rocprof_counters.csv
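If you want to narrow what gets collected, the rocprof input file also accepts filter lines such as gpu: and kernel:. A minimal sketch is below; the kernel name is only an assumption here, take the real one from results.stats.csv:
# counters plus filters: collect only on GPU 0 and only for the named kernel
pmc : Wavefronts VALUInsts
gpu: 0
kernel: bodyForce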
Omnitrace is known to work better with ROCm versions more recent than 5.2.3, so we use a ROCm 5.4.3 installation here. For certain features we may need to roll back to a version closer to the LUMI driver. Hopefully that won't be much of an issue from September 2024 on, thanks to the scheduled upgrade of LUMI.
Check the options Omnitrace provides, first just their names and values, then with a brief description of each:
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace
srun -n 1 --gpus 1 omnitrace-avail --categories omnitrace --brief --description
Generate a default configuration file:
srun -n 1 omnitrace-avail -G ~/.omnitrace.cfg --all
Declare which configuration file to use:
export OMNITRACE_CONFIG_FILE=~/.omnitrace.cfg
This path is the default anyway, so you actually only need this variable if you prefer the Omnitrace configuration file to live elsewhere.
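For orientation, by the end of these exercises the relevant entries of ~/.omnitrace.cfg will look roughly like the sketch below (illustrative only; the generated file contains many more options, and each setting is introduced step by step later):
OMNITRACE_PROFILE       = true
OMNITRACE_FLAT_PROFILE  = true
OMNITRACE_USE_SAMPLING  = true
OMNITRACE_SAMPLING_FREQ = 100
OMNITRACE_ROCM_EVENTS   = GPUBusy,Wavefronts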
Compile and execute saxpy
cd HPCTrainingExamples/HIP/saxpy
hipcc --offload-arch=gfx90a -O3 -o saxpy saxpy.hip
time srun -n 1 ./saxpy
Check the duration
Compile and execute Jacobi
cd HPCTrainingExamples/HIP/jacobi
Now build the code
nice make -f Makefile.cray -j
time srun -n 1 --gpus 1 Jacobi_hip -g 1 1
Check the duration
Execute dynamic instrumentation and check the duration:
time srun -n 1 --gpus 1 omnitrace-instrument -- ./saxpy
For the Jacobi example, dynamic instrumentation would take a long time, so first check what the binary calls and what would get instrumented:
nm --demangle Jacobi_hip | egrep -i ' (t|u) '
List the functions available for instrumentation:
srun -n 1 --gpus 1 omnitrace-instrument -v 1 --simulate --print-available functions -- ./Jacobi_hip -g 1 1
List the functions available for binary rewriting:
srun -n 1 --gpus 1 omnitrace-instrument -v -1 --print-available functions -o jacobi.inst -- ./Jacobi_hip
Do the binary rewriting, instrumenting only our user function Jacobi_t::Run:
srun -n 1 --gpus 1 omnitrace-instrument -o jacobi.inst -I Jacobi_t::Run -- ./Jacobi_hip
Confirm we instrumented our user function Jacobi_t::Run:
cat omnitrace-jacobi.inst-output/TIMESTAMP/instrumentation/instrumented.txt
Let's enable the collection of numeric profiling data. Edit ~/.omnitrace.cfg to include:
OMNITRACE_PROFILE = true
Execute the new instrumented binary and check the duration:
time srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1
See the list of the instrumented GPU calls:
cat omnitrace-jacobi.inst-output/TIMESTAMP/roctracer.txt
Download omnitrace-jacobi.inst-output/TIMESTAMP/perfetto-trace-0.proto to your laptop, open the web page https://ui.perfetto.dev/, click to open the trace, and select the file.
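You can reuse the same scp pattern as for results.json, for example:
scp -i <HOME_DIRECTORY>/.ssh/<public ssh key file> <username>@lumi.csc.fi:<path_to_file>/perfetto-trace-0.proto .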
See all the metrics Omnitrace can collect, including the GPU hardware counters:
srun -n 1 --gpus 1 omnitrace-avail --all
Declare the counters you want in your configuration file, for example:
OMNITRACE_ROCM_EVENTS = GPUBusy,Wavefronts
Execute the instrumented binary again:
srun -n 1 --gpus 1 omnitrace-run -- ./jacobi.inst -g 1 1
Then copy the Perfetto file to your laptop and visualize it as before.
Activate sampling in your configuration file with OMNITRACE_USE_SAMPLING = true and OMNITRACE_SAMPLING_FREQ = 100, then execute and visualize again.
This will sample the call stack, which you can see at the bottom of your profile.
Check the file omnitrace-binary-output/timestamp/wall_clock.txt (replace binary and timestamp with your information). Then set OMNITRACE_PROFILE = true and OMNITRACE_FLAT_PROFILE = true in your configuration file, execute the code, and open again the file omnitrace-jacobi.inst-output/TIMESTAMP/wall_clock.txt.
Reserve a GPU, compile the exercise, and execute Omniperf; observe how many times the code is executed.
Let's build a double-precision general matrix multiply example - DGEMM.
cd HPCTrainingExamples/HIP/dgemm/
mkdir build
cd build
cmake ..
nice make -j
cd bin
srun -n 1 omniperf profile -n dgemm -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv
To see all the options, run:
srun -n 1 --gpus 1 omniperf profile -h
A workload has now been created in the workloads directory under the name dgemm (the argument of -n), so we can analyze it:
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/MI200/ &> dgemm_analyze.txt
To obtain a roofline-only profile:
srun -n 1 omniperf profile -n dgemm --roof-only -- ./dgemm -m 8192 -n 8192 -k 8192 -i 1 -r 10 -d 0 -o dgemm.csv
There is no need for srun to analyze, but we want to avoid everybody using the login node. Explore the file dgemm_analyze.txt.
To focus the analysis on a specific IP block you need to know its code (here 7.1.2):
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/MI200/ -b 7.1.2
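If you do not know the codes, recent Omniperf versions can list the metrics and their numbering (check omniperf analyze -h for your installation, as this option is an assumption about the version in use):
srun -n 1 --gpus 1 omniperf analyze -p workloads/dgemm/MI200/ --list-metrics gfx90a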
To explore the results in a GUI instead, launch:
omniperf analyze -p workloads/dgemm/MI200/ --gui